pplx-embed-v1-0.6B — ONNX INT8 Standard (Transformers.js compatible)

Re-export of perplexity-ai/pplx-embed-v1-0.6B with standard ONNX operators for browser and Node.js compatibility.

Why this model exists

The official pplx-embed-v1 ONNX exports use GatherBlockQuantized, a Microsoft-custom ONNX operator that is not supported by:

  • Transformers.js (browser WASM/WebGPU)
  • ONNX Runtime Web
  • ONNX Runtime Node.js

This re-export instead uses torch.onnx.export followed by ONNX Runtime's quantize_dynamic (QInt8), producing only standard ONNX operators (DynamicQuantizeLinear, QLinearMatMul) that run in all of the environments above.
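The practical difference: GatherBlockQuantized reads pre-baked per-block quantization metadata, while DynamicQuantizeLinear computes a scale and zero-point at inference time from the tensor's observed range. A minimal numpy sketch of the uint8 semantics described in the ONNX operator spec (function name is ours, not part of any API):

```python
import numpy as np

def dynamic_quantize_linear(x):
    """Approximate the ONNX DynamicQuantizeLinear op: derive a uint8
    scale/zero-point from the tensor's observed range, then quantize."""
    qmin, qmax = 0, 255
    # The spec adjusts the range so it always includes zero.
    rmin = min(x.min(), 0.0)
    rmax = max(x.max(), 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = int(np.clip(round(qmin - rmin / scale), qmin, qmax))
    y = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return y, scale, zero_point

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = dynamic_quantize_linear(x)
# Dequantizing recovers the input to within one quantization step.
x_hat = (q.astype(np.float32) - zp) * scale
```

Because the parameters are recomputed per tensor at runtime, no custom operator or packed metadata format is needed, which is what makes the graph portable across runtimes.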

Language support

The training data covers 60 languages (per the paper), including:

  • MIRACL-evaluated (18): Arabic, Bengali, Chinese, English, Farsi, Finnish, French, German, Hindi, Indonesian, Japanese, Korean, Russian, Spanish, Swahili, Telugu, Thai, Yoruba
  • Pretraining data (30 named): eng, rus, cmn, jpn, deu, spa, fra, ita, por, nld, pol, ind, vie, kor, tur, fas, ces, swe, ron, arb, nob, hun, dan, ukr, tha, ell, fin, hin, ben, zsm
  • Base model: Qwen3 (100+ languages natively supported)

Training distribution: 65.6% English, 26.7% multilingual (60 languages), 6.7% cross-lingual, 1% code.

Model details

Property           Value
-----------------  -----------------------------------------------------------
Base model         perplexity-ai/pplx-embed-v1-0.6B (Qwen3-based)
Parameters         596M
Native dimensions  1024
MRL truncation     256D (Matryoshka, compatible with existing embeddings)
Languages          60 (trained), 18 evaluated on MIRACL
Quantization       INT8 dynamic (quantize_dynamic, per-channel)
Cosine vs FP32     0.971
ONNX opset         18
License            MIT (same as base model)

Performance (Node.js ORT, CPU)

Metric                 Value
---------------------  -------------------------------------------
Latency per embedding  24ms
RAM                    698MB
Model download         596MB (cached after first load)
Cold start             ~40s (includes download, cached afterwards)

Usage — Transformers.js (browser or Node.js)

import { pipeline } from '@huggingface/transformers'

const embedder = await pipeline('feature-extraction',
  'tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard'
)

const result = await embedder('Your text here', {
  pooling: 'mean',
  normalize: true,
})

// MRL Matryoshka truncation to 256D
const full = Array.from(result.data)          // 1024D
const trunc = full.slice(0, 256)              // 256D
const norm = Math.sqrt(trunc.reduce((s, v) => s + v * v, 0))
const embedding = trunc.map(v => v / norm)    // L2 normalized

Usage — Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard")
session = ort.InferenceSession("onnx/model.onnx")

inputs = tokenizer("Your text", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})

# Mean pooling + MRL truncation + L2 normalize
emb = (outputs[0] * inputs["attention_mask"][:,:,None]).sum(1) / inputs["attention_mask"].sum(1, keepdims=True)
emb = emb[:, :256]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
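The pooling, truncation, and normalization steps above can be wrapped in one reusable helper (the function name is ours; it works on any [batch, seq, hidden] hidden-state array):

```python
import numpy as np

def pool_truncate_normalize(hidden_states, attention_mask, dim=256):
    """Masked mean pooling over the sequence axis, Matryoshka truncation
    to `dim` dimensions, then L2 normalization."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    emb = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
    emb = emb[:, :dim]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Shape check on random data standing in for real model output.
h = np.random.randn(2, 7, 1024).astype(np.float32)
m = np.ones((2, 7), dtype=np.int64)
e = pool_truncate_normalize(h, m)  # shape (2, 256), rows unit-norm
```

Masked pooling matters when batching: padded positions would otherwise dilute the mean for shorter sequences.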

File structure

onnx/model.onnx        — 6.2 MB  (graph, small constants inline)
onnx/model.onnx_data   — 596 MB  (weights, external data format)
config.json            — Model config + transformers.js_config
tokenizer.json         — Tokenizer vocabulary
tokenizer_config.json  — Tokenizer settings

Important: config.json contains transformers.js_config.use_external_data_format = {"model.onnx": 1} which tells Transformers.js to load model.onnx_data alongside the graph.
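In config.json, that entry looks roughly like this (all other keys elided):

```json
{
  "transformers.js_config": {
    "use_external_data_format": {
      "model.onnx": 1
    }
  }
}
```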

Export process

  1. Load PyTorch model from perplexity-ai/pplx-embed-v1-0.6B
  2. torch.onnx.export() with opset_version=17, dynamic_axes, do_constant_folding
  3. onnxruntime.quantization.quantize_dynamic() with QInt8, per_channel=True
  4. onnx.save_model() with save_as_external_data=True, size_threshold=1024
  5. Patch config.json with transformers.js_config

Colab notebook: pplx_embed_v1_onnx_standard_export.ipynb

Limitations

  • Cosine 0.971 vs FP32 — 3% quality loss from INT8 quantization via ORT quantize_dynamic. The official INT8 (GatherBlockQuantized) achieves 0.9997 but is not compatible with Transformers.js.
  • 596MB download — first browser visit downloads the model. Cached in IndexedDB afterwards.
  • ~700MB RAM — may not work on low-memory mobile devices (< 2GB RAM).
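The cosine figure in the first bullet compares INT8 and FP32 embeddings of the same inputs. A sketch of that check (the embedding matrices below are random stand-ins for real model outputs):

```python
import numpy as np

def mean_cosine(a, b):
    """Mean pairwise cosine similarity between two [n, d] embedding
    matrices produced from the same inputs by two model variants."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Identical embeddings score 1.0; the INT8 export is reported at
# ~0.971 against FP32 on real text.
x = np.random.randn(4, 1024)
```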

Citation

Base model: pplx-embed-v1 by Perplexity AI. Paper: arxiv.org/abs/2602.11151

Re-export for Transformers.js compatibility by Deposium.
