gte-micro
This is a distilled version of gte-small.
Intended purpose
This model is designed for use in semantic-autocomplete (a demo is available).
Usage (same as gte-small)
Use it in semantic-autocomplete or directly in code:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("Mihaiii/gte-micro")
model = AutoModel.from_pretrained("Mihaiii/gte-micro")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the first text against the other three (cosine similarity scaled by 100)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
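To make the pooling step concrete, here is a minimal pure-Python sketch of the same masked mean that average_pool computes on tensors. The toy numbers and the nested-list layout are assumptions chosen for illustration; they are not model output.

```python
from typing import List


def average_pool(hidden: List[List[List[float]]],
                 mask: List[List[int]]) -> List[List[float]]:
    """Masked mean over the token axis, mirroring the tensor version above."""
    pooled = []
    for token_vecs, token_mask in zip(hidden, mask):
        total = [0.0] * len(token_vecs[0])
        for vec, keep in zip(token_vecs, token_mask):
            if keep:  # padded positions (mask == 0) contribute nothing
                for i, value in enumerate(vec):
                    total[i] += value
        count = sum(token_mask)  # number of real (non-padding) tokens
        pooled.append([t / count for t in total])
    return pooled


# Toy batch: 2 sequences, 3 token slots, hidden size 2 (made-up numbers).
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]

pooled = average_pool(hidden, mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

The padded slot holding 9.0 values does not shift the first average, which is why the attention mask is applied before summing.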
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('Mihaiii/gte-micro')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
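An autocomplete-style use plausibly comes down to ranking candidate strings by their similarity to what the user typed. The sketch below shows that pattern in plain Python. It is an assumption about the general approach, not the semantic-autocomplete implementation; rank_suggestions is a hypothetical helper, and the 3-dimensional vectors are made-up stand-ins for what model.encode would return.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_suggestions(query_vec: Sequence[float],
                     candidates: List[str],
                     candidate_vecs: List[Sequence[float]],
                     top_k: int = 3) -> List[Tuple[str, float]]:
    """Hypothetical helper: return the top_k candidates by cosine similarity."""
    scored = [(text, cosine(query_vec, vec))
              for text, vec in zip(candidates, candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for real embeddings.
candidates = ["sorting algorithms", "capital cities", "sunny weather"]
candidate_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.2, 0.0]

top = rank_suggestions(query_vec, candidates, candidate_vecs, top_k=2)
print([text for text, _ in top])  # ['sorting algorithms', 'capital cities']
```

In a real setup, candidate embeddings would typically be computed once and cached, so that only the query needs encoding on each keystroke.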
Limitation (same as gte-small)
This model supports English text only, and any input longer than 512 tokens is truncated to the first 512.
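One common workaround for the 512-token limit, which the model card itself does not describe, is to split a long token sequence into windows and embed each window separately. The sketch below covers only the splitting step; split_into_windows is a hypothetical helper, and the overlap parameter is an assumption. Since the tokenizer also adds special tokens to each input, a real window would need to be slightly smaller than 512.

```python
from typing import List


def split_into_windows(token_ids: List[int],
                       max_tokens: int = 512,
                       overlap: int = 0) -> List[List[int]]:
    """Hypothetical helper: cut token ids into windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows


# Fake token ids for a 1200-token document.
ids = list(range(1200))
windows = split_into_windows(ids, max_tokens=512, overlap=32)
print([len(w) for w in windows])  # [512, 512, 240]
```

How the per-window embeddings are then combined (for example, by averaging them) is a separate design choice, and its effect on quality for this model is untested here.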
Evaluation results
All scores are self-reported on the MTEB test sets.
- AmazonCounterfactualClassification (en): accuracy 68.821, ap 31.261, f1 62.702
- AmazonPolarityClassification: accuracy 77.115, ap 71.290, f1 77.023
- AmazonReviewsClassification (en): accuracy 40.936, f1 39.246
- ArxivClusteringP2P: v_measure 35.237
- ArxivClusteringS2S: v_measure 31.087
- AskUbuntuDupQuestions: map 55.312, mrr 69.508
- Banking77Classification: accuracy 69.562, f1 68.147
- BiorxivClusteringP2P: v_measure 30.201
- BiorxivClusteringS2S: v_measure 27.380