Hi everyone,
I'm currently trying to use the bge-m3 model to create dense and sparse embeddings for a hybrid retrieval setting. Unfortunately, the OpenAI embedding endpoint of the model server only outputs the dense embeddings. Is there a way to output all vectors?
Best,
Florian
As you've mentioned, an OpenAI-compatible embeddings endpoint returns dense embeddings, which are continuous vector representations of the input text.
These dense embeddings are optimized for similarity search, clustering, and other downstream tasks, but the endpoint does not expose sparse embeddings directly.
Hybrid Retrieval
A hybrid retrieval system typically combines both dense embeddings (which are more semantic and capture richer information) and sparse embeddings (which can be more traditional, like TF-IDF or BM25-based embeddings).
Using both types of embeddings together can improve retrieval accuracy, as dense embeddings capture deep semantic meaning and sparse embeddings preserve exact token matching.
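For instance, a common way to combine the two is score-level fusion: normalize each retriever's scores so they sit on a comparable scale, then take a weighted sum. A minimal sketch (the weight alpha is a made-up knob you would tune on your own data):

import numpy as np

def fuse_scores(dense_scores, sparse_scores, alpha=0.5):
    # Min-max normalize each score list so the two scales are comparable,
    # then blend: alpha=1.0 uses only dense scores, alpha=0.0 only sparse.
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense_scores) + (1 - alpha) * norm(sparse_scores)

# Example: cosine similarities from the dense side, BM25 scores from the sparse side
fused = fuse_scores([0.91, 0.40, 0.73], [12.1, 3.4, 8.8], alpha=0.6)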
Dense vs Sparse Embeddings:
Dense embeddings: Continuous vectors of fixed size (e.g., 768 dimensions for BERT-based models).
Sparse embeddings: High-dimensional vectors with mostly zero values, typically used for traditional information retrieval (e.g., TF-IDF vectors).
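To make the contrast concrete, here is a toy illustration (the corpus and dimensions are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Dense: every dimension carries a value (4 dims here; real models use 768 or more)
dense = np.array([0.12, -0.48, 0.33, 0.90])

# Sparse: one dimension per vocabulary term, mostly zeros
corpus = ["hybrid retrieval is powerful",
          "dense vectors capture semantics",
          "sparse vectors keep exact terms"]
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.shape, tfidf.nnz)  # (3, 12) with only 13 of 36 entries non-zero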
Solution Options:
Generating Sparse Embeddings Separately: The OpenAI-style embeddings endpoint doesn't return sparse embeddings, so you would need to generate them separately. Here's how:
TF-IDF or BM25: You can use libraries like scikit-learn or gensim for TF-IDF, or a BM25 implementation such as rank_bm25, to generate sparse representations. These can be used in conjunction with the dense embeddings returned by the endpoint.
FAISS: FAISS is built for dense nearest-neighbor search rather than sparse retrieval. You can index the dense embeddings in FAISS, run the sparse retrieval (TF-IDF or BM25) separately, and merge the two ranked lists for hybrid retrieval.
Using BGE-M3 for All Vector Types: BGE-M3 itself is designed to produce dense, sparse (lexical weight), and multi-vector (ColBERT) representations. The limitation here is the serving layer: the OpenAI embeddings response format only carries a single dense vector per input, so an OpenAI-compatible endpoint will typically return dense embeddings only. If you want to access all vectors, you might need to:
Check Documentation: Look through your model server's documentation to see if there is a native (non-OpenAI) endpoint or parameter that exposes the sparse output. If you can load the model directly, the FlagEmbedding library returns all three output types (see the sketch after this list).
Use Hybrid Approaches: Combine the dense vectors with sparse vectors generated using traditional retrieval methods.
Generating Sparse Embeddings Without Endpoint Support: If your endpoint exposes only the dense output, you can approximate BGE-M3's hybrid behavior with the following workaround:
First, generate dense embeddings via the OpenAI-compatible endpoint serving BGE-M3.
Then, extract sparse representations of the same corpus with a traditional retrieval model like TF-IDF or BM25 (from libraries like scikit-learn or rank_bm25).
Finally, combine the two into a hybrid retrieval system, either through a multi-stage pipeline or through vector concatenation.
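As a sketch of the direct route mentioned under "Check Documentation" above: if you can load BGE-M3 locally instead of going through the serving endpoint, the FlagEmbedding library returns dense, sparse (lexical weight), and ColBERT vectors from a single encode call. A minimal sketch, assuming the FlagEmbedding package is installed:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 speeds up GPU inference

output = model.encode(
    ["This is a test document.", "Hybrid retrieval is powerful."],
    return_dense=True,
    return_sparse=True,         # lexical (sparse) weights
    return_colbert_vecs=False,  # set True for the multi-vector output as well
)

dense_vecs = output["dense_vecs"]            # array of shape (num_docs, 1024)
lexical_weights = output["lexical_weights"]  # one {token: weight} dict per document

This sidesteps the OpenAI-compatible schema entirely, which only has room for one dense vector per input.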
Example of Using Dense and Sparse Embeddings Together:
If you decide to use dense embeddings from the endpoint and sparse embeddings (e.g., from TF-IDF), here's a conceptual approach:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from openai import OpenAI

# Point the OpenAI-compatible client at your model server
# (base_url and model name below are placeholders; adjust to your setup)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Example text corpus
documents = ["This is a test document.", "Another example sentence.", "Hybrid retrieval is powerful."]

# Step 1: Generate dense embeddings via the OpenAI-compatible endpoint serving BGE-M3
response = client.embeddings.create(model="BAAI/bge-m3", input=documents)
dense_embeddings = np.array([item.embedding for item in response.data])  # Shape: (num_documents, embedding_size)

# Step 2: Generate sparse embeddings (using TF-IDF)
vectorizer = TfidfVectorizer()
sparse_embeddings = vectorizer.fit_transform(documents)  # Shape: (num_documents, num_features)

# Step 3: Combine both types of embeddings
# You can use various methods to combine them, e.g., concatenation or weighted fusion
combined_embeddings = np.concatenate([dense_embeddings, sparse_embeddings.toarray()], axis=1)

# Now you can use `combined_embeddings` for your hybrid retrieval model
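If you prefer BM25 to TF-IDF for the sparse side (BM25 is usually the stronger lexical baseline), here is a sketch using the rank_bm25 package, one BM25 implementation among several:

from rank_bm25 import BM25Okapi

documents = ["This is a test document.", "Another example sentence.", "Hybrid retrieval is powerful."]

# BM25 operates on tokenized text; naive whitespace tokenization for illustration
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

query = "hybrid retrieval"
sparse_scores = bm25.get_scores(query.lower().split())  # one relevance score per document

Rather than concatenating vectors, BM25 scores are usually fused with the dense similarity scores at ranking time, e.g. with the weighted fusion sketched earlier.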
Hi Florian,
Thanks for reaching out to us.
Looking at the Embeddings API, there are no additional parameters to specify the embedding outputs.
As such, I have to check with the development team on this matter and get back to you.
Regards,
Peh
Thank you very much for the kind explanation. This was exactly what I had in mind. I worked through various papers and compared a lot of benchmarks; BM25 + dense embeddings in a hybrid retrieval setting should do fine. I understand that bge-m3 is currently the only model which can produce all three types of vectors and that there is no standard for serving them at the moment.
