# F2LLM-v2-0.6B – INT8 ONNX (per-channel QLinearMatMul)

INT8-quantized ONNX of codefuse-ai/F2LLM-v2-0.6B. Recommended variant: 2.2× smaller than FP32 with negligible quality loss.

## Quantization details

| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.quantize_dynamic` |
| Granularity | Per-channel (one scale per output channel per MatMul) |
| Op | Standard `QLinearMatMul` (supported on all ORT execution providers) |
| Ops quantized | MatMul only; Gather (embedding table) left in FP32 |
| Input | FP32 dynamo export (cstr/F2LLM-v2-0.6B-ONNX) |
| Dynamic batch | ✓ (batch = 1, 2, 4, 8, …) |

## Model details

| Property | Value |
|---|---|
| Base model | codefuse-ai/F2LLM-v2-0.6B |
| Architecture | Qwen3 decoder |
| Embedding dim | 1024 |
| Max context | 32,768 tokens |
| Pooling | Last-token pooling + L2 normalisation |
| File size | ~1.1 GB (`model.int8.onnx` + `model.int8.onnx.data`) |
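The pooling step in the table is easy to state in NumPy; a minimal sketch assuming right padding (the function name is ours, not from the repo):

```python
import numpy as np

def last_token_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Last-token pooling + L2 normalisation, assuming right padding.

    hidden: (batch, seq, dim) last hidden state; mask: (batch, seq) of 0/1.
    """
    last = mask.sum(axis=1) - 1                     # index of final real token
    emb = hidden[np.arange(hidden.shape[0]), last]  # (batch, dim)
    return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-8)
```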

## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0, direction="right")
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc  = tokenizer.encode_batch(texts)
ids  = np.array([e.ids            for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Forward pass: first output is the last hidden state, shape (batch, seq, 1024)
hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: index of the final non-pad token (right padding)
seq_lens   = mask.sum(axis=1) - 1
embeddings = hidden[np.arange(len(texts)), seq_lens]

# L2-normalise
norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / np.maximum(norms, 1e-8)
print(embeddings.shape)  # (2, 1024)
```
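Because the embeddings come out L2-normalised, cosine similarity reduces to a matrix product; a small usage sketch, with random unit vectors standing in for real model output:

```python
import numpy as np

# Dummy stand-ins for model output: two L2-normalised 1024-d vectors
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 1024)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Rows are unit vectors, so dot products are cosine similarities in [-1, 1]
sims = embeddings @ embeddings.T
print(sims.shape)  # (2, 2)
```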

## Variants

| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/F2LLM-v2-0.6B-ONNX | FP32 | 2.4 GB | Reference |
| cstr/F2LLM-v2-0.6B-ONNX-INT8 | INT8 per-channel | 1.1 GB | This repo (recommended) |
| cstr/F2LLM-v2-0.6B-ONNX-INT4 | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
| cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL | INT8 incl. embeddings | 0.6 GB | Smallest file |

## Citation

```bibtex
@misc{f2llm-v2,
      title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
      author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
      year={2026},
      eprint={2603.19223},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.19223},
}
```

## License

Apache 2.0, same as codefuse-ai/F2LLM-v2-0.6B.
