# F2LLM-v2-0.6B – INT8 ONNX (per-channel QLinearMatMul)

INT8-quantized ONNX of codefuse-ai/F2LLM-v2-0.6B. Recommended variant: 2.2× smaller than FP32 with negligible quality loss.

## Quantization details

| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.quantize_dynamic` |
| Granularity | Per-channel (one scale per output channel per MatMul) |
| Op | Standard `QLinearMatMul` (supported on all ORT execution providers) |
| Ops quantized | MatMul only; Gather (embedding table) left in FP32 |
| Input | FP32 dynamo export (cstr/F2LLM-v2-0.6B-ONNX) |
| Dynamic batch | ✓ (batch = 1, 2, 4, 8, …) |

## Model details

| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Embedding dim | 1024 |
| Max context | 32,768 tokens |
| Pooling | Last-token pooling + L2 normalisation |
| File size | ~1.1 GB (`model.int8.onnx` + `model.int8.onnx.data`) |
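The pooling step in the table is easy to state in NumPy; a minimal sketch assuming right padding (the function name is ours, not from the repo):

```python
import numpy as np

def last_token_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Last-token pooling + L2 normalisation, assuming right padding.

    hidden: (batch, seq, dim) last hidden state; mask: (batch, seq) of 0/1.
    """
    last = mask.sum(axis=1) - 1                     # index of final real token
    emb = hidden[np.arange(hidden.shape[0]), last]  # (batch, dim)
    return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-8)
```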

## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc  = tokenizer.encode_batch(texts)
ids  = np.array([e.ids            for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Forward pass: first output is the last hidden state, shape (batch, seq, 1024)
hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: index of the final non-pad token (right padding)
seq_lens   = mask.sum(axis=1) - 1
embeddings = hidden[np.arange(len(texts)), seq_lens]

# L2-normalise
norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
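Because the embeddings come out L2-normalised, cosine similarity reduces to a matrix product; a small usage sketch, with random unit vectors standing in for real model output:

```python
import numpy as np

# Dummy stand-ins for model output: two L2-normalised 1024-d vectors
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Rows are unit vectors, so dot products are cosine similarities in [-1, 1]
sims = embeddings @ embeddings.T
print(sims.shape)  # (2, 2)
```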

## Variants

| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | This repo (recommended) |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |

## Citation

```bibtex
@misc{f2llm-v2,
      title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
      author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
      year={2026},
      eprint={2603.19223},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.19223},
}
```

## License

Apache 2.0, same as codefuse-ai/F2LLM-v2-0.6B.
