✨ Overview

Boogr is derived from BAAI's bge-small-en-v1.5, part of the BGE (BAAI General Embedding) family.

The upstream model family is designed for dense retrieval and text embedding tasks such as:

  • semantic search
  • document retrieval
  • chunk similarity
  • passage ranking
  • clustering
  • sentence-level representation learning

Within Chonky, Boogr is the lightweight local English embedding option and is best suited for:

  • default local installations
  • offline embedding workflows
  • rapid experimentation
  • development and testing
  • vectorizing chunked corpora on lower-resource systems

⚙️ Code Repository

🧰 Streamlit UI


🧠 Why Boogr Exists

Chonky supports both hosted and local embedding workflows. Boogr exists to give Chonky users a fully local, low-friction embedding path that avoids dependence on hosted provider APIs for common semantic-search tasks.

Boogr is especially useful when you want:

  • local-only embeddings
  • offline or restricted-network operation
  • lower memory use than larger embedding models
  • an English-first default embedder
  • a model that is straightforward to distribute with the application

🔬 Base Model Lineage

Boogr is derived from:

  • Upstream base model: BAAI/bge-small-en-v1.5
  • Model family: BGE / FlagEmbedding
  • Primary task family: feature extraction / text embeddings
  • Language focus: English
  • License: MIT

The v1.5 revision of the BGE family was introduced to improve retrieval behavior and address similarity-distribution issues observed in earlier releases.


Specs

  • Model

Model Name                 Dimension   Sequence Length   Introduction
boogr                      384         512               English model; derived from BAAI/bge-small-en-v1.5
BAAI/bge-m3                1024        8192              multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised
BAAI/bge-m3-unsupervised   1024        8192              multilingual; contrastive learning from bge-m3-retromae
BAAI/bge-m3-retromae       --          8192              multilingual; extend the max_length of xlm-roberta to 8192 and further pretrained via retromae
BAAI/bge-large-en-v1.5     1024        512               English model
BAAI/bge-base-en-v1.5      768         512               English model
BAAI/bge-small-en-v1.5     384         512               English model

  • Data

Dataset       Introduction
MLDR          Document Retrieval Dataset, covering 13 languages
bge-m3-data   Fine-tuning data used by bge-m3
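The Dimension and Sequence Length columns translate directly into storage and chunking budgets. The following is a rough sketch, not a measurement: it assumes uncompressed float32 vectors (4 bytes per value), ignores index overhead, and treats the sequence limit as a plain token count without reserving room for special tokens. The helper names `index_size_bytes` and `split_into_windows` are hypothetical.

```python
def index_size_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for a flat vector index (assumes float32, no overhead)."""
    return num_vectors * dimension * bytes_per_value


def split_into_windows(token_ids, max_len=512):
    """Split a token sequence into consecutive windows of at most max_len.

    Simplification: a real tokenizer also spends a couple of positions on
    special tokens, so the usable window is slightly smaller than max_len.
    """
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


# One million vectors at the 384-dimension and 1024-dimension table rows.
small = index_size_bytes(1_000_000, 384)
large = index_size_bytes(1_000_000, 1024)
ratio = large / small

# A 1200-token document against a 512-token sequence limit.
windows = split_into_windows(list(range(1200)))
window_lengths = [len(w) for w in windows]

print(small, large, round(ratio, 2), window_lengths)
```

Under these assumptions a 384-dimension index needs well under half the raw storage of a 1024-dimension one, which is the main reason a small model suits lower-resource systems.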

FAQ

1. Introduction to different retrieval methods

  • Dense retrieval: maps the text into a single embedding, e.g., DPR, BGE-v1.5.
  • Sparse retrieval (lexical matching): represents the text as a vector the size of the vocabulary, with most positions set to zero and a weight computed only for tokens present in the text, e.g., BM25, uniCOIL, and SPLADE.
  • Multi-vector retrieval: uses multiple vectors to represent a text, e.g., ColBERT.
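The three methods differ mainly in how a query is scored against a document. The sketch below illustrates each scoring rule on toy, hand-written vectors and weights; it is not the scoring code of any particular library, and the function names are hypothetical. Cosine similarity is used for the dense case and a ColBERT-style MaxSim sum for the multi-vector case.

```python
import math


def dense_score(query_vec, doc_vec):
    """Dense retrieval: one vector per text, scored by cosine similarity."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(
        sum(d * d for d in doc_vec)
    )
    return dot / norm if norm else 0.0


def sparse_score(query_weights, doc_weights):
    """Sparse retrieval: sum weight products over tokens found in both texts."""
    return sum(
        w * doc_weights[tok] for tok, w in query_weights.items() if tok in doc_weights
    )


def multi_vector_score(query_vecs, doc_vecs):
    """Multi-vector retrieval (MaxSim): for each query vector, take its best
    dot product against all document vectors, then sum those maxima."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


# Toy inputs (illustrative values only, not model outputs).
dense = dense_score([1.0, 0.0], [1.0, 1.0])

sparse = sparse_score(
    {"local": 1.0, "embedding": 2.0},
    {"embedding": 0.5, "model": 1.0},
)

multi = multi_vector_score(
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0]],
)

print(dense, sparse, multi)
```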

2. How to use boogr in other projects?

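For embedding retrieval, you can use boogr the same way as the other BGE models. Note that only BGE-M3 drops the need to add instructions to queries; the BGE v1.5 English models, including boogr's upstream base, still document an optional query instruction for short-query-to-passage retrieval.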
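As an illustration of the query-instruction convention, the sketch below prefixes queries with the instruction string documented for the upstream BGE v1.5 English models and leaves passages untouched. The helper names are hypothetical. The `embed_with_upstream` function assumes the `sentence-transformers` package is installed and loads the upstream checkpoint rather than boogr itself; loading boogr's GGUF file would need a GGUF-capable runtime, which is not shown here.

```python
# Instruction string documented for the upstream BGE English v1.5 models.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_queries(queries, use_instruction=True):
    """Prefix short retrieval queries; passages are never prefixed."""
    prefix = QUERY_INSTRUCTION if use_instruction else ""
    return [prefix + q.strip() for q in queries]


def embed_with_upstream(texts, model_name="BAAI/bge-small-en-v1.5"):
    """Encode texts with sentence-transformers (assumed to be installed).

    This loads the upstream checkpoint by name; it is an assumption that the
    same embedding convention carries over to boogr.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True)


prepared = prepare_queries(["  how do I run embeddings offline?  "])
plain = prepare_queries(["how do I run embeddings offline?"], use_instruction=False)
print(prepared[0])
```

Because the embeddings are normalized, a dot product between a query vector and a passage vector is the cosine similarity.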

For hybrid retrieval, you can use Vespa or Milvus.
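Hybrid retrieval combines a dense ranking with a lexical one. Vespa and Milvus each provide their own mechanisms for this, which are not reproduced here. As an engine-independent illustration, the sketch below fuses two ranked lists with reciprocal rank fusion (RRF), a generic technique swapped in for demonstration; the document IDs and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) search and a lexical search.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
lexical_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
print(fused)
```

RRF only needs rank positions, so it avoids having to calibrate dense similarity scores against lexical scores that live on a different scale.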

3. How to fine-tune boogr?

To fine-tune the dense embedding, you can follow the common approach shown in this example.

To fine-tune all embedding functions of M3 (dense, sparse, and ColBERT), refer to the unified_fine-tuning example.

GGUF

  • Model size: 33.2M params
  • Architecture: bert
  • Quantization: 8-bit
