Access to this model is gated: by requesting access you confirm that you will use it exclusively for academic research, will not use it in commercial products, and will not redistribute it.
multilingual-e5-small-RoD-TAL
Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.
- Base model: intfloat/multilingual-e5-small
- Fine-tuned model: unstpb-nlp/multilingual-e5-small-RoD-TAL
- Dataset: unstpb-nlp/RoD-TAL
- Primary task: dense retrieval of Romanian legal references for exam-style questions
Model details
This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.
Setup highlights:
- Contrastive training (MultipleNegativesRankingLoss / InfoNCE)
- Positive pairs: question ↔ correct legal references
- Hard negatives mined from top candidates of a base retriever
- Document encoding with 512-token truncation
Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-style prefixes: "query: " for questions, "passage: " for documents.
# Romanian example query: "In what situation must I yield to pedestrians?"
queries = [
    "query: În ce situație trebuie să acord prioritate pietonilor?"
]
passages = [
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    "passage: Art. Y - ..."
]

# With normalized embeddings, the dot product is cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]
print(scores)
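Because the embeddings are L2-normalized, ranking candidate passages reduces to an argsort of the dot-product scores. A toy sketch with stand-in vectors (no model download needed):

```python
import numpy as np

# Stand-in embeddings: 1 query and 3 candidate passages, dimension 4.
q_emb = np.array([[1.0, 0.0, 0.0, 0.0]])
p_emb = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
])
# L2-normalize so dot products are cosine similarities.
p_emb /= np.linalg.norm(p_emb, axis=1, keepdims=True)

scores = q_emb @ p_emb.T            # shape (1, 3)
ranking = np.argsort(-scores[0])    # best-matching passage first
print(ranking)                      # -> [0 2 1]
```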
Query formatting recommendation
This is an E5-family model, so explicit query: / passage: prefixes are recommended at inference time.
For RoD-TAL-style retrieval, best reported performance used question + answer options as query text (concatenated), rather than question-only.
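That query construction can be sketched as follows; the helper name is illustrative, not part of the released code:

```python
def build_query(question: str, options: list[str]) -> str:
    """Concatenate a question and its answer options into one E5-style query."""
    return "query: " + question + " " + " ".join(options)

# Romanian example: "What must the driver do at a pedestrian crossing?"
q = build_query(
    "Ce obligație are conducătorul la trecerea de pietoni?",
    ["a) să oprească", "b) să accelereze", "c) să claxoneze"],
)
print(q)
```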
Intended uses
- Legal article retrieval in Romanian traffic-law assistants
- Retrieval stage for Romanian legal RAG systems
- Domain-specific benchmarking on RoD-TAL IR/VIR tasks
Limitations
- Truncation at 512 tokens can reduce recall for long legal documents
- Performance is domain-focused (Romanian driving-law); transfer to other legal domains may drop
- Reported gains depend on retrieval protocol and corpus preparation
Training and evaluation summary
Reported IR test metrics (percentages) for the fine-tuned retriever on RoD-TAL:
- Recall@10: 88.14
- Precision@10: 23.28
- nDCG@10: 81.41
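Recall@10 here is the fraction of a query's gold references that appear among the top-10 retrieved passages, averaged over queries. A toy computation with illustrative data:

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int = 10) -> float:
    """Fraction of relevant document ids found in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# One query: gold references are docs {3, 7}; the retriever ranked docs as below.
print(recall_at_k([7, 1, 4, 3, 9], {3, 7}, k=10))  # -> 1.0 (both found)
print(recall_at_k([7, 1, 4, 3, 9], {3, 7}, k=2))   # -> 0.5 (only doc 7 in top 2)
```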
Citation
If you use this model, please cite:
@misc{man2025rodtalbenchmarkansweringquestions,
title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
year={2025},
eprint={2507.19666},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.19666}
}