pplx-embed-v1-0.6B — ONNX INT8 Standard (Transformers.js compatible)

Re-export of perplexity-ai/pplx-embed-v1-0.6B with standard ONNX operators for browser and Node.js compatibility.

Why this model exists

The official pplx-embed-v1 ONNX exports use GatherBlockQuantized, a Microsoft-custom ONNX operator that is not supported by:

  • Transformers.js (browser WASM/WebGPU)
  • ONNX Runtime Web
  • ONNX Runtime Node.js

This re-export instead uses torch.onnx.export followed by ONNX Runtime's quantize_dynamic (QInt8), producing only standard ONNX operators (DynamicQuantizeLinear, QLinearMatMul) that run in all of the environments above.
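The practical difference: GatherBlockQuantized reads pre-baked per-block quantization metadata, while DynamicQuantizeLinear computes a scale and zero-point at inference time from the tensor's observed range. A minimal numpy sketch of the uint8 semantics described in the ONNX operator spec (function name is ours, not part of any API):

```python
import numpy as np

def dynamic_quantize_linear(x):
    """Approximate the ONNX DynamicQuantizeLinear op: derive a uint8
    scale/zero-point from the tensor's observed range, then quantize."""
    qmin, qmax = 0, 255
    # The spec adjusts the range so it always includes zero.
    rmin = min(x.min(), 0.0)
    rmax = max(x.max(), 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = int(np.clip(round(qmin - rmin / scale), qmin, qmax))
    y = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return y, scale, zero_point

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = dynamic_quantize_linear(x)
# Dequantizing recovers the input to within one quantization step.
x_hat = (q.astype(np.float32) - zp) * scale
```

Because the parameters are recomputed per tensor at runtime, no custom operator or packed metadata format is needed, which is what makes the graph portable across runtimes.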

Language support

The training data covers 60 languages (per the paper), including:

  • MIRACL-evaluated (18): Arabic, Bengali, Chinese, English, Farsi, Finnish, French, German, Hindi, Indonesian, Japanese, Korean, Russian, Spanish, Swahili, Telugu, Thai, Yoruba
  • Pretraining data (30 named): eng, rus, cmn, jpn, deu, spa, fra, ita, por, nld, pol, ind, vie, kor, tur, fas, ces, swe, ron, arb, nob, hun, dan, ukr, tha, ell, fin, hin, ben, zsm
  • Base model: Qwen3 (100+ languages natively supported)

Training distribution: 65.6% English, 26.7% multilingual (60 languages), 6.7% cross-lingual, 1% code.

Model details

Property           Value
-----------------  -----------------------------------------------------------
Base model         perplexity-ai/pplx-embed-v1-0.6B (Qwen3-based)
Parameters         596M
Native dimensions  1024
MRL truncation     256D (Matryoshka, compatible with existing embeddings)
Languages          60 (trained), 18 evaluated on MIRACL
Quantization       INT8 dynamic (quantize_dynamic, per-channel)
Cosine vs FP32     0.971
ONNX opset         18
License            MIT (same as base model)

Performance (Node.js ORT, CPU)

Metric                 Value
---------------------  -------------------------------------------
Latency per embedding  24ms
RAM                    698MB
Model download         596MB (cached after first load)
Cold start             ~40s (includes download, cached afterwards)

Usage — Transformers.js (browser or Node.js)

import { pipeline } from '@huggingface/transformers'

const embedder = await pipeline('feature-extraction',
  'tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard'
)

const result = await embedder('Your text here', {
  pooling: 'mean',
  normalize: true,
})

// MRL Matryoshka truncation to 256D
const full = Array.from(result.data)          // 1024D
const trunc = full.slice(0, 256)              // 256D
const norm = Math.sqrt(trunc.reduce((s, v) => s + v * v, 0))
const embedding = trunc.map(v => v / norm)    // L2 normalized

Usage — Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tss-deposium/pplx-embed-v1-0.6b-onnx-int8-standard")
session = ort.InferenceSession("onnx/model.onnx")

inputs = tokenizer("Your text", return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})

# Mean pooling + MRL truncation + L2 normalize
emb = (outputs[0] * inputs["attention_mask"][:,:,None]).sum(1) / inputs["attention_mask"].sum(1, keepdims=True)
emb = emb[:, :256]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
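The pooling, truncation, and normalization steps above can be wrapped in one reusable helper (the function name is ours; it works on any [batch, seq, hidden] hidden-state array):

```python
import numpy as np

def pool_truncate_normalize(hidden_states, attention_mask, dim=256):
    """Masked mean pooling over the sequence axis, Matryoshka truncation
    to `dim` dimensions, then L2 normalization."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    emb = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
    emb = emb[:, :dim]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Shape check on random data standing in for real model output.
h = np.random.randn(2, 7, 1024).astype(np.float32)
m = np.ones((2, 7), dtype=np.int64)
e = pool_truncate_normalize(h, m)  # shape (2, 256), rows unit-norm
```

Masked pooling matters when batching: padded positions would otherwise dilute the mean for shorter sequences.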

File structure

onnx/model.onnx        — 6.2 MB  (graph, small constants inline)
onnx/model.onnx_data   — 596 MB  (weights, external data format)
config.json            — Model config + transformers.js_config
tokenizer.json         — Tokenizer vocabulary
tokenizer_config.json  — Tokenizer settings

Important: config.json contains transformers.js_config.use_external_data_format = {"model.onnx": 1} which tells Transformers.js to load model.onnx_data alongside the graph.
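In config.json, that entry looks roughly like this (all other keys elided):

```json
{
  "transformers.js_config": {
    "use_external_data_format": {
      "model.onnx": 1
    }
  }
}
```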

Export process

  1. Load PyTorch model from perplexity-ai/pplx-embed-v1-0.6B
  2. torch.onnx.export() with opset_version=17, dynamic_axes, do_constant_folding
  3. onnxruntime.quantization.quantize_dynamic() with QInt8, per_channel=True
  4. onnx.save_model() with save_as_external_data=True, size_threshold=1024
  5. Patch config.json with transformers.js_config

Colab notebook: pplx_embed_v1_onnx_standard_export.ipynb

Limitations

  • Cosine 0.971 vs FP32 — 3% quality loss from INT8 quantization via ORT quantize_dynamic. The official INT8 (GatherBlockQuantized) achieves 0.9997 but is not compatible with Transformers.js.
  • 596MB download — first browser visit downloads the model. Cached in IndexedDB afterwards.
  • ~700MB RAM — may not work on low-memory mobile devices (< 2GB RAM).
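The cosine figure in the first bullet compares INT8 and FP32 embeddings of the same inputs. A sketch of that check (the embedding matrices below are random stand-ins for real model outputs):

```python
import numpy as np

def mean_cosine(a, b):
    """Mean pairwise cosine similarity between two [n, d] embedding
    matrices produced from the same inputs by two model variants."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Identical embeddings score 1.0; the INT8 export is reported at
# ~0.971 against FP32 on real text.
x = np.random.randn(4, 1024)
```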

Citation

Base model: pplx-embed-v1 by Perplexity AI. Paper: arxiv.org/abs/2602.11151

Re-export for Transformers.js compatibility by Deposium.
