# pplx-embed-v1-0.6B — ONNX INT8 Standard (Transformers.js compatible)

Re-export of `perplexity-ai/pplx-embed-v1-0.6B` with standard ONNX operators for browser and Node.js compatibility.
## Why this model exists

The official pplx-embed-v1 ONNX exports use `GatherBlockQuantized` (a Microsoft-custom ONNX operator), which is not supported by:

- Transformers.js (browser WASM/WebGPU)
- ONNX Runtime Web
- ONNX Runtime Node.js

This re-export uses `torch.onnx.export` -> `quantize_dynamic(QInt8)`, producing only standard ONNX operators (`DynamicQuantizeLinear`, `QLinearMatMul`) that work everywhere.
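To illustrate roughly what those two operators compute, here is a minimal NumPy sketch of dynamic INT8 quantization followed by an integer matmul. The tensors are toy stand-ins; the real graph is generated by ONNX Runtime's quantizer, which also uses per-channel weight scales rather than the per-tensor scale shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4)).astype(np.float32)  # activations, FP32 at runtime
w = rng.standard_normal((4, 3)).astype(np.float32)  # weights, quantized offline

def dynamic_quantize_linear(t):
    """Per-tensor uint8 quantization from the runtime range (DynamicQuantizeLinear)."""
    lo, hi = min(float(t.min()), 0.0), max(float(t.max()), 0.0)
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(t / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

xq, x_scale, x_zp = dynamic_quantize_linear(x)

# Weights: symmetric signed INT8 (per-tensor here for brevity)
w_scale = float(np.abs(w).max()) / 127.0
wq = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

# QLinearMatMul: accumulate in INT32, then rescale back to FP32
acc = (xq.astype(np.int32) - x_zp) @ wq.astype(np.int32)
y = acc.astype(np.float32) * np.float32(x_scale * w_scale)

err = float(np.abs(y - x @ w).max())  # small quantization error vs the FP32 matmul
```

Because the activation scale is computed at runtime from the actual input range, no calibration data is needed, which is why this style of quantization works as a post-hoc export step.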
## Language support

60 languages in the training data (per the paper), including:
- MIRACL-evaluated (18): Arabic, Bengali, Chinese, English, Farsi, Finnish, French, German, Hindi, Indonesian, Japanese, Korean, Russian, Spanish, Swahili, Telugu, Thai, Yoruba
- Pretraining data (30 named): eng, rus, cmn, jpn, deu, spa, fra, ita, por, nld, pol, ind, vie, kor, tur, fas, ces, swe, ron, arb, nob, hun, dan, ukr, tha, ell, fin, hin, ben, zsm
- Base model: Qwen3 (100+ languages natively supported)
Training distribution: 65.6% English, 26.7% multilingual (60 languages), 6.7% cross-lingual, 1% code.
## Model details
| Property | Value |
|---|---|
| Base model | perplexity-ai/pplx-embed-v1-0.6B (Qwen3-based) |
| Parameters | 596M |
| Native dimensions | 1024 |
| MRL truncation | 256D (Matryoshka, compatible with existing embeddings) |
| Languages | 60 (trained), 18 evaluated on MIRACL |
| Quantization | INT8 dynamic (quantize_dynamic, per-channel) |
| Cosine vs FP32 | 0.971 |
| ONNX opset | 18 |
| License | MIT (same as base model) |
## Performance (Node.js ORT, CPU)
| Metric | Value |
|---|---|
| Latency per embedding | 24ms |
| RAM | 698MB |
| Model download | 596MB (cached after first load) |
| Cold start | ~40s (includes download, cached afterwards) |
## Usage — Transformers.js (browser or Node.js)

```js
import { pipeline } from '@huggingface/transformers'

const embedder = await pipeline(
  'feature-extraction',
  'tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard'
)

const result = await embedder('Your text here', {
  pooling: 'mean',
  normalize: true,
})

// MRL Matryoshka truncation to 256D
const full = Array.from(result.data) // 1024D
const trunc = full.slice(0, 256) // 256D
const norm = Math.sqrt(trunc.reduce((s, v) => s + v * v, 0))
const embedding = trunc.map(v => v / norm) // L2 normalized
```
## Usage — Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard")
session = ort.InferenceSession("onnx/model.onnx")  # path inside a local copy of this repo

inputs = tokenizer("Your text", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})

# Mean pooling + MRL truncation + L2 normalize
mask = inputs["attention_mask"][:, :, None]
emb = (outputs[0] * mask).sum(1) / inputs["attention_mask"].sum(1, keepdims=True)
emb = emb[:, :256]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
```
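Since the resulting vectors are unit-norm, cosine similarity reduces to a dot product. A small follow-on sketch, using random stand-ins for real model outputs; `mrl_truncate` is an illustrative helper, not part of the model's API:

```python
import numpy as np

def mrl_truncate(emb, dims=256):
    """Keep the first `dims` Matryoshka dimensions, then re-normalize to unit length."""
    t = emb[:, :dims]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

# Random stand-ins for two (1, 1024) mean-pooled model outputs
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 1, 1024))

qa, qb = mrl_truncate(a), mrl_truncate(b)
sim = (qa @ qb.T).item()  # cosine similarity: dot product of unit vectors
```

Re-normalizing after truncation matters: slicing a unit vector to its first 256 dimensions leaves it with norm below 1, so dot products would no longer be cosines.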
## File structure

- `onnx/model.onnx` — 6.2 MB (graph, small constants inline)
- `onnx/model.onnx_data` — 596 MB (weights, external data format)
- `config.json` — model config + `transformers.js_config`
- `tokenizer.json` — tokenizer vocabulary
- `tokenizer_config.json` — tokenizer settings

**Important:** `config.json` contains `transformers.js_config.use_external_data_format = {"model.onnx": 1}`, which tells Transformers.js to load `model.onnx_data` alongside the graph.
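For reference, the relevant fragment of `config.json` looks roughly like this (abridged; the model's other config keys are omitted):

```json
{
  "transformers.js_config": {
    "use_external_data_format": {
      "model.onnx": 1
    }
  }
}
```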
## Export process

1. Load the PyTorch model from `perplexity-ai/pplx-embed-v1-0.6B`
2. `torch.onnx.export()` with `opset_version=17`, `dynamic_axes`, `do_constant_folding`
3. `onnxruntime.quantization.quantize_dynamic()` with `QInt8`, `per_channel=True`
4. `onnx.save_model()` with `save_as_external_data=True`, `size_threshold=1024`
5. Patch `config.json` with `transformers.js_config`

Colab notebook: `pplx_embed_v1_onnx_standard_export.ipynb`
## Limitations

- **Cosine 0.971 vs FP32** — about 3% quality loss from INT8 quantization via ORT `quantize_dynamic`. The official INT8 export (`GatherBlockQuantized`) achieves 0.9997 but is not compatible with Transformers.js.
- **596 MB download** — the first browser visit downloads the model; it is cached in IndexedDB afterwards.
- **~700 MB RAM** — may not work on low-memory mobile devices (< 2 GB RAM).
## Citation
Base model: pplx-embed-v1 by Perplexity AI. Paper: arxiv.org/abs/2602.11151
Re-export for Transformers.js compatibility by Deposium.