mpnet-use-combined-no-pt

This model is a fine-tuned version of paraphrase-multilingual-mpnet-base-v2, trained on the Ukrainian text corpus UberText 2.0 with combined data augmentation strategies but without pool targets. It is part of the Ukrainian Sentence Embeddings collection, which explores the effect of different training strategies on sentence embedding quality for Ukrainian.

Model Description

The model was fine-tuned using a contrastive objective on UberText 2.0, combining multiple data augmentation techniques to compensate for the skewed distribution of polysemous words in the corpus. Compared to mpnet-use-ubertext-no-pt, this variant applies augmentation strategies during training, which improves sense-level distinctions for underrepresented homonyms while keeping pool targets disabled.
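
The card does not say which contrastive loss was used. As a rough illustration only, the sketch below shows one common contrastive formulation (InfoNCE with in-batch negatives) in plain Python. The function names, the `temperature=0.05` value, and the toy 2-D vectors are all assumptions made for this example and are not taken from the model's training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (hypothetical helper).

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    The temperature value is an assumption, not the card's setting.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D vectors standing in for sentence embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is closest to its own positive and large when it is closer to another item in the batch, which is the pressure that pulls paired sentences together during fine-tuning.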

Collection Overview

Model	Description
mpnet-use-ubertext-no-pt	Raw UberText 2.0, no augmentation, no pool targets
mpnet-use-combined-no-pt (this model)	Combined augmentation strategies, no pool targets
mpnet-use-markov-pt	Markov-based augmentation with pool targets

Usage

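from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("victormuryn/mpnet-use-combined-no-pt")

# Both sentences contain the homonym "мати".
sentences = [
    # "A mother sees her son off to defend his native land" ("мати" = mother)
    "Проводжає сина мати захищати рідний край",
    # "He dreamed of having at least a tiny house above the Dnipro" ("мати" = to have)
    "Хоч би малесеньку хатину він мріяв мати над Дніпром",
]

# encode() returns one embedding vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (number of sentences, embedding dimension)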
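
The two example sentences use the word "мати" in different senses (the noun "mother" and the verb "to have"), so a natural next step is to compare their embeddings. The sketch below computes a pairwise cosine-similarity matrix in plain Python. The helper name `pairwise_cosine` and the toy vectors are assumptions made for this example; the toy vectors stand in for real embeddings so the snippet runs without downloading the model.

```python
import math

def pairwise_cosine(embeddings):
    """Return the full cosine-similarity matrix for a sequence of vectors."""
    # L2-normalise each row so that a dot product equals cosine similarity.
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(float(x) ** 2 for x in vec))
        unit.append([float(x) / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Stand-in vectors (an assumption for illustration); in practice, pass the
# array returned by model.encode(sentences) instead.
toy_embeddings = [[1.0, 2.0, 2.0], [2.0, 1.0, 2.0]]

similarity = pairwise_cosine(toy_embeddings)
print(round(similarity[0][1], 4))
```

With the library installed, `sentence_transformers.util.cos_sim(embeddings, embeddings)` gives the same matrix directly. A lower off-diagonal value for the two sentences would suggest the model separates the two senses, though the card reports no specific scores.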

Training Details

Base model: paraphrase-multilingual-mpnet-base-v2
Training corpus: UberText 2.0
Augmentation: Combined
Pool targets: No

Citation

To be added

License

Apache 2.0

Safetensors

Model size: 0.3B params
Tensor type: F32
