Access to this model is gated: by requesting access you confirm that you will use it exclusively for academic research, will not use it in commercial products, and will not redistribute it.
multilingual-e5-small-RoD-TAL
Fine-tuned sentence embedding model for Romanian legal retrieval in the driving-license domain.
- Base model: intfloat/multilingual-e5-small
- Fine-tuned model: unstpb-nlp/multilingual-e5-small-RoD-TAL
- Dataset: unstpb-nlp/RoD-TAL
- Primary task: dense retrieval of Romanian legal references for exam-style questions
Model details
This checkpoint is domain-adapted for the RoD-TAL benchmark (Romanian driving-law QA/retrieval). It is intended for retrieval and RAG pipelines where legal passages must be matched to user questions.
Setup highlights:
- Contrastive training (MultipleNegativesRankingLoss / InfoNCE)
- Positive pairs: question ↔ correct legal references
- Hard negatives mined from top candidates of a base retriever
- Document encoding with 512-token truncation
Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("unstpb-nlp/multilingual-e5-small-RoD-TAL")

# E5-style prefixes: "query: " for questions, "passage: " for documents.
# Romanian example query: "In what situation must I yield to pedestrians?"
queries = [
    "query: În ce situație trebuie să acord prioritate pietonilor?"
]
passages = [
    "passage: Art. X - Conducătorul este obligat să acorde prioritate pietonilor angajați regulamentar în traversare.",
    "passage: Art. Y - ..."
]

# With normalized embeddings, the dot product is cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]
print(scores)
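Because the embeddings are L2-normalized, ranking candidate passages reduces to an argsort of the dot-product scores. A toy sketch with stand-in vectors (no model download needed):

```python
import numpy as np

# Stand-in embeddings: 1 query and 3 candidate passages, dimension 4.
q_emb = np.array([[1.0, 0.0, 0.0, 0.0]])
p_emb = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
])
# L2-normalize so dot products are cosine similarities.
p_emb /= np.linalg.norm(p_emb, axis=1, keepdims=True)

scores = q_emb @ p_emb.T            # shape (1, 3)
ranking = np.argsort(-scores[0])    # best-matching passage first
print(ranking)                      # -> [0 2 1]
```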
Query formatting recommendation
This is an E5-family model, so explicit query: / passage: prefixes are recommended at inference time.
For RoD-TAL-style retrieval, best reported performance used question + answer options as query text (concatenated), rather than question-only.
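That query construction can be sketched as follows; the helper name is illustrative, not part of the released code:

```python
def build_query(question: str, options: list[str]) -> str:
    """Concatenate a question and its answer options into one E5-style query."""
    return "query: " + question + " " + " ".join(options)

# Romanian example: "What must the driver do at a pedestrian crossing?"
q = build_query(
    "Ce obligație are conducătorul la trecerea de pietoni?",
    ["a) să oprească", "b) să accelereze", "c) să claxoneze"],
)
print(q)
```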
Intended uses
- Legal article retrieval in Romanian traffic-law assistants
- Retrieval stage for Romanian legal RAG systems
- Domain-specific benchmarking on RoD-TAL IR/VIR tasks
Limitations
- Truncation at 512 tokens can reduce recall for long legal documents
- Performance is domain-focused (Romanian driving-law); transfer to other legal domains may drop
- Reported gains depend on retrieval protocol and corpus preparation
Training and evaluation summary
Reported IR test metrics (percentages) for the fine-tuned retriever on RoD-TAL:
- Recall@10: 88.14
- Precision@10: 23.28
- nDCG@10: 81.41
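Recall@10 here is the fraction of a query's gold references that appear among the top-10 retrieved passages, averaged over queries. A toy computation with illustrative data:

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int = 10) -> float:
    """Fraction of relevant document ids found in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# One query: gold references are docs {3, 7}; the retriever ranked docs as below.
print(recall_at_k([7, 1, 4, 3, 9], {3, 7}, k=10))  # -> 1.0 (both found)
print(recall_at_k([7, 1, 4, 3, 9], {3, 7}, k=2))   # -> 0.5 (only doc 7 in top 2)
```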
Citation
If you use this model, please cite:
@misc{man2025rodtalbenchmarkansweringquestions,
title={RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams},
author={Andrei Vlad Man and Răzvan-Alexandru Smădu and Cristian-George Craciun and Dumitru-Clementin Cercel and Florin Pop and Mihaela-Claudia Cercel},
year={2025},
eprint={2507.19666},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.19666}
}