TTC-L2V-2 (Danish, Swedish and Norwegian)

Model Description

Supervised model for sentence embeddings.

Developed by: Jesper Alkestrup, The Tech Collective
Model type: Embedding model
Language(s) (NLP): Danish, Swedish and Norwegian
Finetuned from model: AI-Sweden-Models/Llama-3-8B-instruct
Finetuning procedure: LLM2Vec

Trained with the approach described in the paper "LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders", in three stages:

MNTP finetuning on: https://huggingface.co/datasets/jealk/scandi-wiki-combined
Initial Supervised SimCSE on: https://huggingface.co/datasets/jealk/supervised-da
Followed by Supervised SimCSE with in-batch hard-negatives and instructions on: https://huggingface.co/datasets/DDSC/nordic-embedding-training-data
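
To illustrate what "in-batch hard-negatives" means in the final stage listed above, the sketch below computes a contrastive (InfoNCE-style) loss in which each query is scored against its own positive, the positives of the other queries in the batch, and every hard negative. This is a toy, pure-Python illustration and not the training code used for this model; the function names, the temperature value, and the example vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    Candidates for every query are all positives in the batch followed by
    all hard negatives; the correct candidate for query i is positives[i].
    The temperature is an illustrative value, not the one used in training.
    """
    candidates = positives + hard_negatives
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, cand) / temperature for cand in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-dimensional embeddings for a batch of two examples.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.6, 0.4]]

aligned = contrastive_loss(queries, positives, hard_negatives)
swapped = contrastive_loss(queries, positives[::-1], hard_negatives)
print(aligned < swapped)  # matching pairs should give the lower loss
```

The loss is small when each query is closest to its own positive and grows when another candidate in the batch, including a hard negative, scores higher.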

Usage

This is a sentence-transformers model: it loads directly, and the basic usage shown here needs no packages beyond sentence-transformers. The bidirectional Llama encoder and the LLM2Vec instruction-aware mean pooling are provided by small custom modules in this repo, loaded via trust_remote_code.

pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jealk/TTC-L2V-supervised-2", trust_remote_code=True)

# Queries are encoded with an instruction; documents without one.
instruction = "Givet et spørgsmål, find relevante tekstudsnit, der besvarer det:"
queries = [
    "Hvordan påvirker søvn vores koncentrationsevne",
    "Hvad skal man være opmærksom på, når man køber brugt cykel",
]
documents = [
    "Forskning viser, at for lidt søvn kan nedsætte både koncentration og hukommelse. "
    "Allerede efter én nat med dårlig søvn kan man opleve problemer med at fokusere.",
    "Når du køber en brugt cykel, bør du tjekke, om stellet har skader eller rust, og "
    "om gear og bremser fungerer korrekt.",
]

q_emb = model.encode(queries, prompt=instruction)   # or prompt_name="query"
d_emb = model.encode(documents)

# Cosine similarity
print(model.similarity(q_emb, d_emb))
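
The printed matrix has one row per query and one column per document. As a rough sketch of what to do with it, the helper below ranks documents for each query from such a score matrix. The helper name and the numbers are hypothetical (they are not output from this model), and the scores are passed as plain nested lists.

```python
def rank_documents(score_matrix, top_k=1):
    """For each query row, return (doc_index, score) pairs, best first."""
    ranked = []
    for row in score_matrix:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:top_k]])
    return ranked


# Hypothetical scores for 2 queries x 2 documents (not real model output).
toy_scores = [[0.71, 0.12], [0.09, 0.65]]
for query_index, hits in enumerate(rank_documents(toy_scores)):
    print(query_index, hits)
```

With the real model, calling .tolist() on the tensor returned by model.similarity should give nested lists of the same shape.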

The instruction is also registered as a named prompt, so model.encode(queries, prompt_name="query") applies the default Danish instruction shown above. Documents are encoded without an instruction.

Embeddings are produced by mean pooling over the content tokens only: the instruction tokens and the beginning-of-text token are excluded, as in LLM2Vec.
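
The pooling rule can be sketched on toy data as a masked mean. This is only an illustration of the idea, not the pooling module shipped in this repo; the function name, the mask layout, and the vectors are assumptions.

```python
def masked_mean_pool(token_vectors, content_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, content_mask) if keep]
    if not kept:
        raise ValueError("content_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: [BOS, instruction, instruction, content, content]
token_vectors = [[9.0, 9.0], [5.0, 5.0], [7.0, 7.0], [1.0, 2.0], [3.0, 4.0]]
content_mask = [0, 0, 0, 1, 1]  # BOS and instruction tokens are masked out

print(masked_mean_pool(token_vectors, content_mask))  # → [2.0, 3.0]
```

Because the masked positions never enter the average, changing the instruction text changes the embedding only through its effect on the content tokens' hidden states.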

Model size and quantization

The model has ~8B parameters and loads in roughly 14 GB (bfloat16). To run on smaller GPUs, it can be loaded quantized via bitsandbytes (a separate package, installed with pip install bitsandbytes):

from sentence_transformers import SentenceTransformer
from transformers import BitsAndBytesConfig

model = SentenceTransformer(
    "jealk/TTC-L2V-supervised-2",
    trust_remote_code=True,
    # model_kwargs is forwarded to the underlying transformers model loader.
    model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_8bit=True)},
)

Approximate memory footprint: bfloat16 ≈ 14 GB, 8-bit ≈ 8 GB, 4-bit ≈ 5 GB (load_in_4bit=True).
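
As a sanity check on these figures, a weights-only estimate is simply parameter count times bits per parameter. The sketch below assumes a rounded count of 8e9 parameters and reports GiB; it ignores activations and runtime overhead, so it is a rough back-of-the-envelope bound and not a measurement. Quantized loads typically land somewhat above the estimate because bitsandbytes quantizes the linear layers while embeddings and norms stay in higher precision.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Weights-only memory estimate in GiB (no activations, no overhead)."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumption: rounded parameter count, not an exact figure

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```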

The quantized variants have not been evaluated on benchmarks, so embedding quality may differ from the full-precision model.

Notes

trust_remote_code=True is required: it loads the bidirectional Llama encoder and the LLM2Vec-style pooling module shipped in this repo.

Credits

Approach from LLM2Vec (McGill-NLP). Related model: https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised
