F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Paper: arXiv:2603.19223
INT8-quantized ONNX export of codefuse-ai/F2LLM-v2-0.6B. Recommended variant: 2.2× smaller than FP32 with negligible quality loss.
| Property | Value |
|---|---|
| Method | onnxruntime.quantization.quantize_dynamic |
| Granularity | Per-channel (1 scale per output channel per MatMul) |
| Op | Standard QLinearMatMul (supported on all ORT execution providers) |
| Ops quantized | MatMul only; Gather (embedding table) left in FP32 |
| Input | FP32 dynamo export (cstr/F2LLM-v2-0.6B-ONNX) |
| Dynamic batch | Yes (batch = 1, 2, 4, 8, …) |

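
To make the "per-channel" granularity concrete, here is a minimal NumPy sketch of the weight-quantization arithmetic (not the actual `onnxruntime.quantization` internals): each output channel of a MatMul weight gets its own int8 scale, computed from that channel's maximum absolute value. The toy shape `(8, 4)` is an illustrative assumption; the real model's MatMul weights are far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy FP32 MatMul weight: 8 input features, 4 output channels.
W = rng.standard_normal((8, 4)).astype(np.float32)

# Per-channel quantization: one scale per output channel (column),
# mapping each column's max absolute value onto the symmetric int8 range.
scales = np.abs(W).max(axis=0) / 127.0                      # shape (4,)
W_int8 = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize to estimate the rounding error introduced per channel;
# it is bounded by half a quantization step (scale / 2).
W_deq = W_int8.astype(np.float32) * scales
max_err = np.abs(W - W_deq).max()
print(W_int8.dtype, scales.shape)
```

Because outlier values in one channel no longer inflate the quantization step of every other channel, per-channel scales typically lose less accuracy than a single per-tensor scale.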
| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Pooling | Last-token pooling + L2 normalisation |
| File size | ~1.1 GB (model.int8.onnx + model.int8.onnx.data) |

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Forward pass: first output is the last hidden state, shape (batch, seq, 1024)
lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: take the hidden state at the final non-padding position
seq_lens = mask.sum(axis=1) - 1
embeddings = lhs[np.arange(len(texts)), seq_lens]

# L2-normalise so that dot products equal cosine similarities
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
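
Since the pooled embeddings are L2-normalised, cosine similarity between texts reduces to a plain matrix product. A self-contained sketch with random unit vectors standing in for real model outputs (shapes match the model's `(batch, 1024)` embeddings):

```python
import numpy as np

# Random stand-ins for two L2-normalised 1024-d embeddings;
# with real model outputs the same matrix product applies.
rng = np.random.default_rng(0)
emb = rng.standard_normal((2, 1024)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Rows are unit-norm, so the dot product is the cosine similarity.
sims = emb @ emb.T
print(sims.shape)  # (2, 2)
```

The diagonal of `sims` is 1 (each vector's similarity with itself); off-diagonal entries rank how close the texts are in embedding space.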
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | This repo (recommended) |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |
```bibtex
@misc{f2llm-v2,
  title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
  author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  year={2026},
  eprint={2603.19223},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.19223},
}
```
Apache 2.0, same as codefuse-ai/F2LLM-v2-0.6B.
Base model: Qwen/Qwen3-0.6B-Base