static-retrieval-cve-en-v2-longtail

A 1024-dim StaticEmbedding CVE retrieval model trained on real CVEs augmented with synthetic CVEs for rare CWE classes, using CWE-balanced sampling. Built as a Pareto improvement over oneryalcin/static-retrieval-cve-en-v1 on the long tail of CWE classes.


Architecture	`StaticEmbedding(EmbeddingBag(40522, 1024))`
Output dim	1024 (Matryoshka-truncatable to 768 / 512 / 256 / 128 / 64 / 32)
Vocab	bert-base-uncased (30,522) + 10,000 corpus-mined security tokens
Training data	388k pairs from real CVEs + 4,417 synthetic CVEs for 150 rare CWEs
Key change from v1	Synthetic data + CWE-balanced sampler (cap 5000→1200, upsample low-count to 100)
Best for	applications where rare-class recall matters (vulnerability triage, similarity search)
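
For intuition, a `StaticEmbedding` layer behaves like a token-bag: each token indexes one row of the embedding table, and the sentence vector is the mean of those rows. The sketch below is a pure-Python toy, not the sentence-transformers implementation; the three-token table and its values are made up for illustration.

```python
from typing import Dict, List

# Toy embedding table: token -> vector. Values are invented for illustration;
# the real table has 40,522 rows of 1,024 floats each.
TOY_TABLE: Dict[str, List[float]] = {
    "buffer":   [1.0, 0.0, 0.0, 2.0],
    "overflow": [0.0, 1.0, 0.0, 2.0],
    "in":       [0.0, 0.0, 1.0, 0.0],
}


def encode_bag(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Mean of the token vectors; unknown tokens are skipped in this toy."""
    rows = [table[t] for t in tokens if t in table]
    if not rows:
        raise ValueError("no known tokens in input")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Same tokens in a different order produce the same vector.
a = encode_bag(["buffer", "overflow", "in"], TOY_TABLE)
b = encode_bag(["in", "overflow", "buffer"], TOY_TABLE)
print(a == b)
```

Because the mean ignores token order and there is no attention step, encoding is a lookup plus an average, which is why it is cheap on CPU and why word order and negation carry no signal.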

⚠️ Scope and limitations (read first)

This model is specialized for one narrow task and does NOT generalize to all CVE retrieval use cases. Please read this section before adopting it.

What this IS for

Given an NVD-formatted CVE description (~300-char paragraph, NVD prose style), retrieve other CVEs sharing the same MITRE CWE class from a corpus of CVE descriptions. On that task it dominates general-purpose models — see numbers below.

Real deployment fits include:

- First-stage CWE-tagging: untagged CVE in, same-CWE peers out, vote on class
- Active learning / borderline-case mining for CWE classifiers
- CWE-balanced retrieval when you want diverse vulnerability classes rather than a flood of one type
- Cheap full-corpus embedding on CPU (~185× faster than mpnet; encodes 1M CVEs in ~70 sec, or the ~300k-entry NVD catalog in under 25 sec)
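
The first-stage CWE-tagging fit can be sketched as a nearest-neighbour vote. This is a minimal illustration under stated assumptions: the 2-dim vectors stand in for `model.encode(...)` output, and `vote_cwe` is a hypothetical helper, not part of this repo.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def vote_cwe(query_vec: Sequence[float],
             corpus: List[Tuple[Sequence[float], str]],
             k: int = 10) -> str:
    """Return the most common CWE label among the k nearest corpus vectors."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    labels = [cwe for _, cwe in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]


# Toy corpus of (embedding, CWE label) pairs.
corpus = [
    ([1.0, 0.1], "CWE-79"),
    ([0.9, 0.2], "CWE-79"),
    ([0.1, 1.0], "CWE-89"),
    ([0.2, 0.9], "CWE-89"),
    ([0.8, 0.3], "CWE-79"),
]
predicted = vote_cwe([1.0, 0.0], corpus, k=3)
print(predicted)
```

In a real pipeline the brute-force sort would likely be replaced by a vector index, and ties or low-agreement votes could be routed to a human reviewer.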

What this is NOT for

We measured these explicitly and our model underperforms off-the-shelf alternatives here:

Use case	Our model	Off-the-shelf alternative	Recommendation
Famous-vulnerability lookup ("log4shell", "heartbleed")	0% Hit@10 on CVE-ID names	All models fail (need a glossary, not embeddings)	Use a CVE-ID lookup index, not embedding retrieval
Product/keyword search ("wordpress XSS", "apache RCE")	42% Acc@1	`static-retrieval-mrl-en-v1`: 79% • mpnet: 84%	Use the off-the-shelf models
Short informal queries (security analyst typing)	41-45% overall Acc@1	`static-retrieval-mrl-en-v1`: 54% • mpnet: 54%	Use the off-the-shelf models
Cross-format retrieval (GHSA / vendor advisories with short summary → long markdown)	0.87-0.89 NDCG@10	`static-retrieval-mrl-en-v1`: 0.91	Use the off-the-shelf static model

If your use case matches the product/keyword, informal-query, or cross-format rows in the table above, use sentence-transformers/static-retrieval-mrl-en-v1 or BAAI/bge-large-en-v1.5 instead. Both are competitive or better on out-of-distribution security text.

Why the specialization trade-off exists

We trained this model on contrastive pairs of (CVE_i, CVE_j) where both descriptions share the same MITRE CWE class. The corpus-mined special tokens that boost in-domain performance (e.g. "stack-based buffer overflow" as a single token) help cluster CVEs by vulnerability class but reduce the model's ability to distinguish them by product, vendor, or topic. This is a classic narrow-task fine-tuning trade-off: dramatic gain on the trained metric, modest regression on adjacent tasks.
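
As a rough sketch of what "contrastive pairs sharing a CWE class" means in practice, the snippet below groups records by CWE and emits (anchor, positive) pairs within each group. The record keys `description` and `CWE-ID` follow the synthetic dataset fields used later in this card; `same_cwe_pairs` and its `max_pairs_per_cwe` parameter are hypothetical names, and train.py may build pairs differently.

```python
import itertools
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def same_cwe_pairs(records: List[Dict[str, str]],
                   max_pairs_per_cwe: int,
                   seed: int = 42) -> List[Tuple[str, str]]:
    """Build (anchor, positive) pairs from descriptions that share a CWE-ID."""
    by_cwe = defaultdict(list)
    for r in records:
        by_cwe[r["CWE-ID"]].append(r["description"])

    rng = random.Random(seed)
    pairs: List[Tuple[str, str]] = []
    for _cwe, descs in sorted(by_cwe.items()):
        # All unordered pairs within one class; quadratic, so only for small toys.
        candidates = list(itertools.combinations(descs, 2))
        rng.shuffle(candidates)
        pairs.extend(candidates[:max_pairs_per_cwe])
    return pairs


records = [
    {"description": "Stored XSS in plugin A", "CWE-ID": "CWE-79"},
    {"description": "Reflected XSS in theme B", "CWE-ID": "CWE-79"},
    {"description": "DOM XSS in widget C", "CWE-ID": "CWE-79"},
    {"description": "SQL injection in login form D", "CWE-ID": "CWE-89"},
]
pairs = same_cwe_pairs(records, max_pairs_per_cwe=2)
print(len(pairs))
```

Note that a class with a single CVE yields no pairs at all, which is one way to see why rare CWEs contribute so little signal without augmentation.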

A v3 with mixed training objectives (same-CWE + product-similarity + cross-format pairs) could likely close the OOD gap, but does not exist yet.

TL;DR vs the v1 baseline

Metric	v1 (this model's parent)	v2-longtail (this model)
Aggregate NDCG@10	0.5811	0.5680 (–0.013)
Aggregate Acc@1	0.6128	0.6195 (+0.007)
Rare-CWE Acc@1 (50-200 train pairs)	0.083	0.167 (+100%, doubled)
Very-rare-CWE Acc@1 (<50 train pairs)	0.143	0.286 (+100%, doubled)
Rare-CWE Hit@10	0.083	0.500 (+500%, 6×)
Head-class Acc@1 (≥1000 pairs)	0.693	0.693 (preserved)

The aggregate NDCG@10 drop hides the per-CWE picture: this model doubles Acc@1 on rare classes while leaving head-class Acc@1 unchanged at 0.693. Use this model unless you are optimizing an aggregate-only benchmark, where v1 scores higher.

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v2-longtail")

cves = [
    "Improper input validation in Acme InventoryManager 3.2.1 leads to integer overflow when processing large stock quantities, causing memory corruption.",
    "OAuth2 token leakage in MyApp 2.0 due to insufficient session entropy allows attackers to predict session tokens.",
]
embeddings = model.encode(cves)
print(embeddings.shape)  # (2, 1024)

# Truncate to 256-dim for faster downstream retrieval at minimal quality cost
model = SentenceTransformer("oneryalcin/static-retrieval-cve-en-v2-longtail", truncate_dim=256)
embeddings = model.encode(cves)
print(embeddings.shape)  # (2, 256)
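
To illustrate what Matryoshka truncation amounts to, the sketch below keeps the first `dim` components of a vector and re-normalizes to unit length so that dot products behave like cosine similarity. It is a pure-Python toy with made-up numbers; whether `truncate_dim` re-normalizes for you is not claimed here, so the explicit normalization is an assumption of this example.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components, then scale to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 12.0, 0.5]           # stand-in for a 1024-dim embedding
short = truncate_and_normalize(full, 2)  # stand-in for a 256-dim truncation
print(short)
```

Truncation works because the training loss scores every prefix length listed in the recipe, so the leading dimensions are trained to be useful on their own.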

Performance — per-CWE breakdown

Same eval as v1 (297 held-out queries, 5,000-document corpus, seed=42; the eval set contains real CVEs only, no synthetic data):

Class size                #CWEs  #queries  Acc@1   NDCG@10   Hit@10
very rare (<50 pairs)        7        7    0.286   0.337     0.429
rare (50-200 pairs)         12       12    0.167   0.159     0.500
med (200-1000 pairs)        28       40    0.375   0.296     0.725
large (≥1000 pairs)         39      238    0.693   0.641     0.866
                                          ──────  ───────   ──────
Aggregate                   86      297    0.620   0.568     0.840
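
The aggregate row can be approximately recovered from the bucket rows. The sketch below assumes the aggregate is a query-weighted mean of the per-bucket scores (an assumption; eval_per_cwe.py may instead average per-query scores directly) and uses only the Acc@1 and NDCG@10 values copied from the table.

```python
# (bucket, n_queries, Acc@1, NDCG@10) copied from the per-CWE table.
buckets = [
    ("very rare", 7, 0.286, 0.337),
    ("rare", 12, 0.167, 0.159),
    ("med", 40, 0.375, 0.296),
    ("large", 238, 0.693, 0.641),
]

total = sum(n for _, n, _, _ in buckets)
agg_acc1 = sum(n * acc for _, n, acc, _ in buckets) / total
agg_ndcg = sum(n * ndcg for _, n, _, ndcg in buckets) / total
large_share = 238 / total  # fraction of eval queries in the large bucket

print(total, round(agg_acc1, 3), round(agg_ndcg, 3), round(large_share, 2))
```

The weighted means come out at about 0.619 and 0.568, matching the aggregate row up to rounding of the bucket values. The large bucket holds roughly 80% of the queries, so aggregate scores mostly reflect head-class behavior.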

The rare-class lift in context

Class	Metric	Off-shelf	v1	v2-longtail
very rare	Acc@1	0.143	0.143	0.286
very rare	Hit@10	0.143	0.429	0.429
rare	Acc@1	0.083	0.083	0.167
rare	Hit@10	0.083	0.083	0.500

On Acc@1, v1 was no better than the off-the-shelf baseline on rare and very-rare classes despite its head-class lift, and on the rare bucket its Hit@10 was unchanged as well: the model had almost no signal for rare CWEs. Synthetic CVE augmentation plus balanced sampling fixed this.

How it was built

Diagnosis (what v1 got wrong)

Per-CWE analysis of v1 revealed the aggregate NDCG@10 of 0.5811 was almost entirely from the 39 large CWE classes that scored Acc@1 = 0.693. The 19 rare/very-rare classes in the eval (standing in for roughly 590 rare CWE classes in the training pool) scored Acc@1 between 0.083 and 0.143, the same as the off-the-shelf baseline. v1 had no real signal for rare CWE classes because there weren't enough same-CWE pairs to form a useful contrast.

Intervention

1. Synthetic data for rare CWEs. Generated 4,417 plausible CVE descriptions covering 150 under-represented CWE classes using deepseek-v4-flash. Each synthetic CVE matches NVD style (terse, factual, vendor + product + version + mechanism + consequence). Cost: ~$0.90 total. See oneryalcin/cve-synthetic-rare-cwes-v1.

2. CWE-balanced sampling. Lowered per-CWE pair cap from 5000 → 1200, added a minimum floor of 100 pairs per CWE via sample-with-replacement upsampling. This prevents the top-4 CWE classes from dominating gradient updates (in v1 they each had ~25,000 pairs vs ~50 for a rare CWE).
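
The cap-and-floor idea in step 2 can be sketched as follows. This is a minimal illustration, not the sampler in train.py: `balance_pairs` is a hypothetical helper, the toy class sizes are invented, and the per-iteration behavior mentioned in the training-data summary is not modeled.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def balance_pairs(pairs_by_cwe: Dict[str, List[Pair]],
                  cap: int = 1200,
                  floor: int = 100,
                  seed: int = 42) -> Dict[str, List[Pair]]:
    """Downsample classes above `cap`; upsample (with replacement) below `floor`."""
    rng = random.Random(seed)
    balanced: Dict[str, List[Pair]] = {}
    for cwe, pairs in pairs_by_cwe.items():
        if not pairs:
            continue  # nothing to sample from
        if len(pairs) > cap:
            balanced[cwe] = rng.sample(pairs, cap)
        elif len(pairs) < floor:
            extra = rng.choices(pairs, k=floor - len(pairs))
            balanced[cwe] = list(pairs) + extra
        else:
            balanced[cwe] = list(pairs)
    return balanced


toy = {
    "CWE-79": [(f"a{i}", f"b{i}") for i in range(5000)],   # head class
    "CWE-1234": [(f"c{i}", f"d{i}") for i in range(7)],    # very rare class
    "CWE-22": [(f"e{i}", f"f{i}") for i in range(300)],    # mid-sized class
}
sizes = {cwe: len(p) for cwe, p in balance_pairs(toy).items()}
print(sizes)
```

Upsampling with replacement repeats the same few pairs, which is one reason the synthetic CVEs from step 1 matter: they add new positives instead of duplicates.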

Final training data

Same-CWE pairs (cap=1200, 5 iter):  303k
Upsampled low-count pairs:           29k
CWE-definition pairs (MITRE 1000):   56k
─────────────────────────────────────────
Total:                              388k pairs

per-CWE pair distribution: min=1  p25=100  median=100  p75=123  max=6000

(Compare v1: 521k total pairs, distribution heavily skewed: top-4 CWEs at ~25k, rare CWEs at ~50.)

Training recipe

Architecture:        StaticEmbedding(40522 × 1024) ≈ 41M params
Base tokenizer:      google-bert/bert-base-uncased + 10k mined tokens (same as v1)
Loss:                MatryoshkaLoss(MultipleNegativesRankingLoss, dims=[1024,768,512,256,128,64,32])
Batch:               512
Sampler:             NO_DUPLICATES
LR:                  2e-1 (random init, standard for StaticEmbedding)
Epochs:              3 (best checkpoint loaded at end)
Warmup:              5% linear, linear decay
Precision:           fp32 (MPS-safe)
Hardware:            M4 Max MPS, 6 min wall
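
For readers unfamiliar with the loss line in the recipe, here is a pure-Python sketch of the idea: an in-batch-negatives ranking loss (each anchor should score its own positive above every other positive in the batch), summed over several prefix lengths. The cosine scoring, the scale of 20, and equal weighting across dims are assumptions of this sketch; the library classes named in the recipe are the actual implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def in_batch_loss(anchors: List[List[float]],
                  positives: List[List[float]],
                  scale: float = 20.0) -> float:
    """Mean cross-entropy where anchors[i] must pick positives[i] from the batch."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors: List[List[float]],
                    positives: List[List[float]],
                    dims: Sequence[int]) -> float:
    """Sum the in-batch loss over each truncated prefix length in `dims`."""
    return sum(
        in_batch_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )


anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
swapped = list(reversed(aligned))

loss_good = matryoshka_loss(anchors, aligned, dims=(4, 2))
loss_bad = matryoshka_loss(anchors, swapped, dims=(4, 2))
print(loss_good < loss_bad)
```

With batch size 512, each anchor sees 511 in-batch negatives, which is why a no-duplicates sampler matters: a duplicate positive in the batch would be treated as a negative.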

Retrieval quality vs general-purpose encoders

Same held-out CVE→same-CWE eval (297 queries, 5,000 corpus, seed=42), comparing to popular general-English sentence-transformers:

Model	Params	Dim	NDCG@10	MRR@10	Acc@1
`sentence-transformers/static-retrieval-mrl-en-v1` (reference static)	31M	1024	0.355	0.498	0.428
`sentence-transformers/all-mpnet-base-v2` (typical transformer)	110M	768	0.363	0.487	0.414
`sentence-transformers/all-MiniLM-L6-v2` (small transformer)	22M	384	0.374	0.505	0.434
`BAAI/bge-large-en-v1.5` (strong general-purpose, SOTA-class)	335M	1024	0.408	0.548	0.488
`oneryalcin/static-retrieval-cve-en-v1` (sibling — parent, before augmentation)	41M	1024	0.581 (+42% over bge)	0.681	0.613
`oneryalcin/static-retrieval-cve-en-v2-longtail` (THIS MODEL)	41M	1024	0.568 (+39% over bge)	0.678	0.620

Domain training beats general-purpose architecture here. Both our 41M-param token-bag models outperform a 335M-param SOTA general-English encoder (bge-large-en-v1.5) by a wide margin on in-domain CVE retrieval, and run ~185× faster on CPU than mpnet-base-v2 (see the speed table below). This is not a quality-vs-speed trade-off: for CVE retrieval by CWE, the domain-tuned static model dominates on both axes.

Pick between v1 (best aggregate NDCG, strong on common CWEs) and v2-longtail (best Acc@1, doubles rare-class precision via synthetic-data augmentation) — see comparison details below.

Benchmark reproducer: quality_bench.py in this repo.
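
For reference, the metrics in the table can be sketched for a single query as below, where `rel` is the ranked result list marked True when a result shares the query's CWE. This assumes binary relevance and the standard definitions; it is not the contents of quality_bench.py, which may differ in details.

```python
import math
from typing import List


def acc_at_1(rel: List[bool]) -> float:
    return 1.0 if rel and rel[0] else 0.0


def hit_at_k(rel: List[bool], k: int = 10) -> float:
    return 1.0 if any(rel[:k]) else 0.0


def mrr_at_k(rel: List[bool], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, is_rel in enumerate(rel[:k], start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(rel: List[bool], n_relevant: int, k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, is_rel in enumerate(rel[:k], start=1) if is_rel)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(n_relevant, k) + 1))
    return dcg / ideal if ideal else 0.0


# One query with two relevant documents in the corpus, found at ranks 2 and 4.
rel = [False, True, False, True] + [False] * 6
scores = (acc_at_1(rel), hit_at_k(rel), mrr_at_k(rel), ndcg_at_k(rel, n_relevant=2))
print(scores)
```

Acc@1 only looks at the top result while NDCG@10 rewards filling the whole top 10 with same-CWE results, so the two metrics can move in opposite directions between v1 and v2-longtail.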

Inference speed (CPU)

Benchmarked on Apple M4 Max CPU, encoding 1000 real CVE descriptions (avg 328 chars), batch=128, best of 3 runs:

Model	Dim	q/s on CPU	Time for 1M CVEs	vs mpnet-base-v2
This model	1024	~14,000	~70 sec	~185×
`sentence-transformers/static-retrieval-mrl-en-v1` (reference static)	1024	~13,700	~73 sec	~178×
`sentence-transformers/all-MiniLM-L6-v2` (small transformer)	384	~600	~27 min	~8×
`sentence-transformers/all-mpnet-base-v2` (typical transformer)	768	~77	~3.6 hours	1×

Same architecture as the reference static model so the speed matches it. The entire current NVD CVE catalog (~300k entries) encodes in under 25 seconds on a single CPU core — no GPU required for deployment. (On Apple Silicon, CPU outperforms MPS for this model since StaticEmbedding is memory-bound rather than compute-bound — the GPU transfer overhead exceeds the parallelism gain.)
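
The time columns follow directly from the measured throughput. A quick check, using the rounded rates from the table (so results are approximate):

```python
def encode_seconds(n_docs: int, docs_per_second: float) -> float:
    """Wall-clock estimate assuming constant throughput."""
    return n_docs / docs_per_second


# Rounded q/s values copied from the speed table.
rates = {
    "this model": 14_000,
    "all-MiniLM-L6-v2": 600,
    "all-mpnet-base-v2": 77,
}

for name, qps in rates.items():
    per_million = encode_seconds(1_000_000, qps)
    nvd = encode_seconds(300_000, qps)
    print(f"{name}: {per_million:,.0f} s per 1M CVEs, {nvd:,.0f} s for ~300k CVEs")
```

At ~14,000 q/s this gives roughly 71 seconds per million descriptions and about 21 seconds for a ~300k-entry catalog, consistent with the figures quoted above.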

Benchmark reproducer: speed_bench.py in this repo.

Out-of-distribution generalization (measured)

To check whether the in-distribution wins above generalize, we ran the model on two out-of-distribution tests. The results are reported in full so you can judge fit for your use case.

Test 1: GHSA cross-corpus, summary → description (asymmetric, short-query)

GitHub Security Advisories (gh api /advisories) have a different style than NVD (Markdown sections, ~10× longer descriptions, ~64-char summaries). We sampled 10k GHSAs, took 300 short summaries as queries against a 5,000-doc corpus of GHSA descriptions, same-CWE retrieval:

Model	NDCG@10	Acc@1
`sentence-transformers/static-retrieval-mrl-en-v1` (baseline)	0.910	0.910
`sentence-transformers/all-mpnet-base-v2`	0.903	0.903
v2-longtail (ours)	0.887	0.887
`sentence-transformers/all-MiniLM-L6-v2`	0.873	0.873
v1 (ours)	0.867	0.867

The off-the-shelf static baseline wins. v2-longtail places third of five and v1 places last, so the lift we get on in-distribution evals (NVD descriptions) does not transfer to short-query GHSA retrieval.

Reproducer: eval_ghsa.py + fetch_ghsa.py in this repo.

Test 2: Hand-built informal queries against NVD corpus

104 short informal queries representing how a security analyst, SOC operator, or threat-intel researcher would actually search. Categories: famous-vuln names (log4shell), CWE-class descriptions (SQL injection in login form), product mentions (wordpress plugin XSS), question form (how does TOCTOU work), and typos. For each, an "expected" target (CVE-ID, CWE-ID, or keyword) was hand-specified. Queries were searched against 30,000 real NVD CVEs.

Model	Acc@1 (top-1 hit)	Hit@10 (in top-10)
`sentence-transformers/static-retrieval-mrl-en-v1` (baseline)	0.538	0.721
`sentence-transformers/all-mpnet-base-v2`	0.538	0.731
v1 (ours)	0.452	0.654
v2-longtail (ours)	0.413	0.635

Breakdown by query type:

Query type (n)	v1	v2-longtail	baseline	mpnet
Famous vuln names ("log4shell") (27)	0.000	0.000	0.000	0.000
CWE-class queries ("SQL injection") (39)	0.692 (best)	0.692	0.667	0.615
Product/keyword ("wordpress XSS") (38)	0.421	0.421	0.789	0.842

- Famous vulnerability names: every model scores zero. Popular names like "log4shell", "heartbleed", and "zerologon" don't appear in CVE description text. This needs a name → CVE-ID lookup table, not embedding retrieval.
- CWE-class queries: our models lead (0.692 vs 0.667 for the baseline and 0.615 for mpnet). This is their strongest category: class-level semantic queries.
- Product/keyword queries: our models lose by 37 to 42 percentage points. Our embedding geometry clusters by CWE class, not by product/vendor. Asking "find wordpress vulnerabilities" returns 3-4 of 10 actual WordPress CVEs (the rest are same-vuln-class CVEs in other products). The baseline returns 7-8.

Reproducer: informal_queries.py in this repo.
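
One way to handle the famous-name failure mode is to route queries through a lookup step before falling back to embedding retrieval. The sketch below is hypothetical: `ALIASES`, `route_query`, and the `embed_search` callback are illustrative names, and the glossary holds only three well-known public mappings.

```python
import re
from typing import Callable, Dict, List

# Hypothetical glossary; a real deployment would maintain a much larger table.
ALIASES: Dict[str, str] = {
    "log4shell": "CVE-2021-44228",
    "heartbleed": "CVE-2014-0160",
    "zerologon": "CVE-2020-1472",
}

CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def route_query(query: str, embed_search: Callable[[str], List[str]]) -> List[str]:
    """Exact CVE-ID match first, then alias lookup, then embedding retrieval."""
    explicit = CVE_ID.findall(query)
    if explicit:
        return [cve.upper() for cve in explicit]
    lowered = query.lower()
    hits = [cve for name, cve in ALIASES.items() if name in lowered]
    return hits or embed_search(query)


# Stand-in for a real embedding search function.
def fake_search(query: str) -> List[str]:
    return ["<embedding results>"]


print(route_query("is log4shell patched?", fake_search))
print(route_query("sql injection in login form", fake_search))
```

The same router also covers queries that contain a literal CVE-ID, which the tokenizer does not treat as a single token.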

Takeaway

Our model is overspecialized to NVD-style same-CWE retrieval. The specialization that lets us beat bge-large by 40% on the trained task also makes the model less useful for general security search. For programmatic CVE→CWE retrieval pipelines this is a feature. For human-typed security search this is a bug. Pick accordingly.

Reproducibility

The full training script train.py is included in this repo. Reproduces deterministically (seed=42) on comparable hardware to within ±0.01 NDCG@10 of 0.5680.

pip install "sentence-transformers[train]>=5.5" "datasets>=2.19" "tokenizers>=0.20" "accelerate>=0.26"

# Get the synthetic data first (auto-loaded from data/synthetic_cves.json by the script)
python -c "
import json
import os

from datasets import load_dataset

# Download the synthetic CVEs and keep only the two fields written to the JSON file
ds = load_dataset('oneryalcin/cve-synthetic-rare-cwes-v1', split='train')
records = [{'description': r['description'], 'CWE-ID': r['CWE-ID']} for r in ds]

os.makedirs('data', exist_ok=True)
with open('data/synthetic_cves.json', 'w') as f:
    json.dump(records, f)
"

python train.py

The diagnostic script eval_per_cwe.py (also included) reproduces the per-CWE breakdown tables above.

When to choose this model vs v1

Use v1 if:
- you only care about aggregate NDCG / MRR on a benchmark
- your queries are heavily concentrated in common CWE classes (XSS, SQLi, RCE, etc.)
- you need maximum head-class NDCG@10 (v1: 0.654 vs v2-longtail: 0.641)

Use v2-longtail if:
- your application encounters rare CWEs (~590 of ~944 total are rare in real CVE distribution)
- you need top-1 precision across all class sizes — v2-longtail is the only model that doesn't regress on common while doubling on rare
- you want the model to find related vulnerabilities in less-common categories (Hit@10 on rare went from 8% to 50%)

Related artifacts

Repo	Purpose
`oneryalcin/static-retrieval-cve-en-v1`	Direct parent — the baseline this model was built to improve on the long tail
`oneryalcin/cve-synthetic-rare-cwes-v1`	The 4,417 synthetic CVEs added during training

Limitations

Inherits all of v1's limitations:

- No negation / sentiment handling (token-bag fundamental).
- English only.
- CVE-ID strings not tokenized whole.
- Some confusions reflect inherent CWE label ambiguity.

Plus v2-longtail-specific:

- Synthetic-data bias. 4,417 synthetic CVEs follow a slightly more templated style than real CVEs (LLM artifact). The model is robust to this in eval (eval is real-only) but may slightly over-fit boilerplate patterns when encoding very informal queries.
- Aggregate NDCG@10 lower than v1 (0.568 vs 0.581). If your benchmark aggregates without per-class weighting, v1 looks better on paper despite worse per-class behavior.
- Med-class NDCG@10 regressed (v1 0.362 → v2-longtail 0.296). The med bucket (200-1000-pair CWEs) was the only group not directly intervened on. See oneryalcin/static-retrieval-cve-en-v1 if med-class NDCG matters.

License

Apache-2.0.

Citation

@misc{oneryalcin_cve_static_p61,
  author = {Mehmet Oner Yalcin},
  title  = {static-retrieval-cve-en-v2-longtail: synthetic-augmented static-embedding CVE retrieval},
  year   = {2026},
  howpublished = {Hugging Face model},
  url    = {https://huggingface.co/oneryalcin/static-retrieval-cve-en-v2-longtail},
}

