Choosing the Right Embedding Model for RAG
Author: Venkata Sudhakar
The embedding model is the foundation of a RAG pipeline. It converts your documents and user queries into vectors, and the quality of those vectors determines how well the retrieval step finds the right content. Choosing the wrong model means even a perfect LLM cannot save your RAG system: if retrieval fails, generation fails. The choice comes down to three dimensions: quality (how accurately the model captures semantic meaning), speed and cost (how fast and cheap it is to embed), and deployment (whether it must run locally or can use a cloud API).

OpenAI's text-embedding-3-small is the best starting point for most applications. It is fast, cheap ($0.02 per million tokens), produces 1536-dimensional vectors, and performs well on general-purpose English retrieval. text-embedding-3-large (3072 dimensions) gives higher accuracy on complex technical content at higher cost.

For applications that must run fully on-premises, cannot send data to external APIs, or need multilingual support, open-source models such as sentence-transformers' all-MiniLM-L6-v2 (384 dimensions, very fast, runs on CPU) or BGE-large-en (1024 dimensions, higher quality) are strong alternatives that run locally for free. The example below shows how to benchmark candidate embedding models on your own data using retrieval accuracy over a small labelled query set, so you can make an evidence-based choice rather than guessing.
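The benchmark can be sketched as a small harness: embed the documents once, embed each test query, and count how often the correct document ranks first by cosine similarity. The harness below is a minimal illustration; the corpus, queries, and the toy bag-of-words embedder are placeholder assumptions so it runs anywhere, and in practice you would swap `bow_embed` for a real model (an OpenAI embeddings call, or `SentenceTransformer("all-MiniLM-L6-v2").encode`).

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as Counters
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bow_embed(text):
    # Toy bag-of-words embedder used as a stand-in; replace with a real
    # embedding model to compare candidates on your own data
    return Counter(text.lower().split())

def benchmark(embed_fn, docs, queries):
    """queries is a list of (query_text, index_of_correct_doc) pairs.
    Returns (queries answered correctly at top-1, total queries)."""
    doc_vecs = [embed_fn(d) for d in docs]
    correct = 0
    for query, gold_idx in queries:
        q_vec = embed_fn(query)
        best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))
        correct += best == gold_idx
    return correct, len(queries)

# Placeholder corpus and labelled queries; use your own domain data here
docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our API rate limit is 100 requests per minute per key.",
    "Shipping takes 5-7 business days for standard delivery.",
]
queries = [
    ("how many days to get a refund", 0),
    ("what is the request rate limit", 1),
    ("how long does standard delivery take", 2),
]

correct, total = benchmark(bow_embed, docs, queries)
print(f"bag-of-words baseline: {correct}/{total} correct")
```

Running the same `benchmark` function with each candidate model's embed function gives directly comparable scores.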
Run against the three candidate models, it gives output like the following:

OpenAI text-embedding-3-small: 3/3 correct
all-MiniLM-L6-v2 (local): 2/3 correct
BGE-large-en (local, higher quality): 3/3 correct

Decision guide for this corpus:
- OpenAI small: 3/3, fast API, $0.02/M tokens, internet required
- MiniLM: 2/3, instant, free, runs on CPU - good enough for simple queries
- BGE-large: 3/3, free, runs locally, slower than MiniLM but better quality
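The decision guide quotes $0.02 per million tokens for the API option; a quick back-of-envelope estimate shows what that means for a realistic corpus. The tokens-per-word factor of 1.3 is a rough heuristic, not an exact tokenizer count, and the corpus size below is an illustrative assumption:

```python
def estimate_embedding_cost(num_docs, avg_words_per_doc, price_per_million_tokens):
    # Approximate token count at ~1.3 tokens per English word (rough heuristic)
    tokens = num_docs * avg_words_per_doc * 1.3
    return tokens / 1_000_000 * price_per_million_tokens

# 100,000 documents averaging 500 words, at $0.02 per million tokens
cost = estimate_embedding_cost(100_000, 500, 0.02)
print(f"estimated one-time embedding cost: ${cost:.2f}")  # → $1.30
```

At this price even a sizeable corpus embeds for pocket change, which is why the trade-off usually hinges on data privacy and latency rather than cost.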
Practical selection guide: start with text-embedding-3-small for English general-purpose RAG - it covers most use cases. Switch to text-embedding-3-large if retrieval accuracy on technical or specialised content is not good enough. Use all-MiniLM-L6-v2 for high-throughput local embedding where speed matters more than maximum accuracy. Use BGE-large or E5-large for high-quality local embedding when you cannot use external APIs. Always benchmark on your own domain data - generic benchmarks may not reflect how models perform on your specific content.