# F2LLM-v2-0.6B — FP32 ONNX (dynamo export)
ONNX export of codefuse-ai/F2LLM-v2-0.6B, a general-purpose multilingual embedding model from the F2LLM-v2 family, trained on 60M high-quality multilingual examples supporting 200+ languages.
This is the full-precision (FP32) reference export. For production use, prefer the INT8 or INT4 variants, which are 2–3× smaller with negligible quality loss.
## Export method

Exported with `torch.onnx.export(dynamo=True)` (PyTorch 2.9, opset 20).
The dynamo exporter traces at the FX-graph / symbolic level. All internal tensor shapes β including the Qwen3 causal attention mask β carry symbolic batch and sequence dimensions throughout. Dynamic batch verified: batch = 1, 2, 4, 8 all produce correct output shapes.
## Model details
| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Parameters | ~600 M |
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Languages | 200+ (multilingual) |
| Inputs | input_ids [batch, seq], attention_mask [batch, seq] |
| Output | last_hidden_state [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation (applied by inference runtime) |
| File size | ~2.4 GB (model.onnx + model.onnx.data) |
## Inference
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [batch, seq, 1024]

# Last-token pooling: take the embedding at the last non-padding position
last_idx = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), last_idx]

# L2 normalise
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)

print(embeddings.shape)  # (2, 1024)
```
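Because the embeddings are L2-normalised, cosine similarity between two texts reduces to a plain dot product. A minimal sketch with hand-made vectors standing in for real model outputs:

```python
import numpy as np

# Hand-made stand-ins for two L2-normalised embeddings.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a = a / np.linalg.norm(a)  # [0.6, 0.8]
b = b / np.linalg.norm(b)  # [0.8, 0.6]

# For unit vectors, cosine similarity == dot product.
sim = float(a @ b)  # 0.6*0.8 + 0.8*0.6 = 0.96
print(sim)
```

For a batch of normalised embeddings, the full pairwise similarity matrix is simply `embeddings @ embeddings.T`.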
**Query prefix.** For asymmetric retrieval, prepend `Instruct: <task description>\nQuery: ` to query strings. Documents are encoded without a prefix.
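For example — the task description below is a hypothetical placeholder, to be replaced with your own retrieval task:

```python
# Hypothetical task description; substitute your own retrieval task.
task = "Given a web search query, retrieve relevant passages"
query = "how does last-token pooling work"

# Queries get the instruction prefix; documents are encoded verbatim.
prefixed_query = f"Instruct: {task}\nQuery: {query}"
document = "Last-token pooling takes the hidden state at the final non-padding position."

print(prefixed_query)
```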
## Files

| File | Size | Description |
|---|---|---|
| `model.onnx` | ~4 MB | ONNX graph (opset 20, no weights) |
| `model.onnx.data` | ~2.38 GB | External weight data |
| `tokenizer.json` | 8 MB | HuggingFace fast tokenizer |
| `config.json` | — | Model config |
## Quantized variants
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | This repo — reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | Recommended for most use |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |
## Citation

```bibtex
@misc{f2llm-v2,
  title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
  author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  year={2026},
  eprint={2603.19223},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.19223},
}
```
## License

Apache 2.0 — same as codefuse-ai/F2LLM-v2-0.6B.