# F2LLM-v2-0.6B – FP32 ONNX (dynamo export)

ONNX export of `codefuse-ai/F2LLM-v2-0.6B`, a general-purpose multilingual embedding model from the F2LLM-v2 family, trained on 60M high-quality multilingual examples and supporting 200+ languages.

This is the full-precision (FP32) reference export. For production use, prefer the INT8 or INT4 variants, which are 2–3× smaller with negligible quality loss.

## Export method

Exported with `torch.onnx.export(dynamo=True)` (PyTorch 2.9, opset 20).

The dynamo exporter traces at the FX-graph / symbolic level, so all internal tensor shapes, including the Qwen3 causal attention mask, carry symbolic batch and sequence dimensions throughout. Dynamic batching was verified: batch sizes 1, 2, 4, and 8 all produce correct output shapes.

## Model details

| Property | Value |
|---|---|
| Base model | `codefuse-ai/F2LLM-v2-0.6B` |
| Architecture | Qwen3 decoder |
| Parameters | ~600M |
| Embedding dim | 1024 |
| Max context | 32,768 tokens |
| Languages | 200+ (multilingual) |
| Inputs | `input_ids` [batch, seq], `attention_mask` [batch, seq] |
| Output | `last_hidden_state` [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation (applied by the inference runtime) |
| File size | ~2.4 GB (`model.onnx` + `model.onnx.data`) |
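Last-token pooling selects the hidden state at the final non-padding position of each sequence (assuming right padding, as in the inference snippet below). A minimal numpy sketch on a synthetic hidden-state tensor, with an illustrative embedding dim of 4 instead of 1024:

```python
import numpy as np

# Synthetic stand-in for last_hidden_state: 2 sequences, padded to length 5
hidden = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)  # [batch, seq, dim]
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=np.int64)                # right padding

# Index of the last non-padding token in each sequence
last = mask.sum(axis=1) - 1                      # [2]
pooled = hidden[np.arange(hidden.shape[0]), last]  # [2, dim]

# L2 normalisation, as applied by the inference runtime
pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(pooled.shape)  # (2, 4)
```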

## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc  = tokenizer.encode_batch(texts)
ids  = np.array([e.ids            for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [batch, seq, 1024]

# Last-token pooling: take the embedding at the last non-padding position
seq_lens   = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), seq_lens]

# L2 normalise
norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
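Because the embeddings above are L2-normalised, cosine similarity reduces to a plain dot product. A small numpy sketch with synthetic vectors standing in for real model output:

```python
import numpy as np

# Synthetic stand-ins for two L2-normalised embedding vectors
a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)
b = np.array([4.0, 3.0]); b /= np.linalg.norm(b)

# Dot product of unit vectors == cosine similarity
cos_sim = float(a @ b)
print(round(cos_sim, 4))  # 0.96
```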

**Query prefix:** For asymmetric retrieval, prepend `"Instruct: <task description>\nQuery: "` to query strings. Documents are encoded without a prefix.
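The prefix convention above can be expressed as a tiny helper; the task wording here is an illustrative assumption, not one prescribed by the model card:

```python
def build_query(task: str, query: str) -> str:
    """Prepend the instruction prefix used for asymmetric-retrieval queries."""
    return f"Instruct: {task}\nQuery: {query}"

# Hypothetical task description; documents are encoded as-is, without a prefix.
print(build_query("Given a web search query, retrieve relevant passages",
                  "what is last-token pooling?"))
```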

## Files

| File | Size | Description |
|---|---|---|
| `model.onnx` | ~4 MB | ONNX graph (opset 20, no weights) |
| `model.onnx.data` | ~2.38 GB | External weight data |
| `tokenizer.json` | ~8 MB | HuggingFace fast tokenizer |
| `config.json` | – | Model config |

## Quantized variants

| Repo | Precision | Size | Notes |
|---|---|---|---|
| `cstr/F2LLM-v2-0.6B-ONNX` | FP32 | 2.4 GB | This repo (reference) |
| `cstr/F2LLM-v2-0.6B-ONNX-INT8` | INT8 per-channel | 1.1 GB | Recommended for most uses |
| `cstr/F2LLM-v2-0.6B-ONNX-INT4` | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
| `cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL` | INT8 incl. embeddings | 0.6 GB | Smallest file |

## Citation

```bibtex
@misc{f2llm-v2,
      title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
      author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
      year={2026},
      eprint={2603.19223},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.19223},
}
```

## License

Apache 2.0, same as `codefuse-ai/F2LLM-v2-0.6B`.
