
RAG Re-ranking with a Cross-Encoder

Author: Venkata Sudhakar

Retrieval in a RAG pipeline has two stages: recall (quickly find the top-K potentially relevant documents) and precision (pass only the most relevant of them to the LLM). Vector search handles recall well but is weaker at precision, because a bi-encoder embeds the query and each document separately and only compares the resulting vectors. A cross-encoder re-ranker improves precision by feeding each (query, document) pair through the model together, so attention spans both texts and the relevance score is usually more accurate than an embedding comparison. The trade-off is speed: a cross-encoder needs one forward pass per pair, so you use it to re-rank a small candidate set (e.g. top-20) down to the final context (e.g. top-3) that goes to the LLM.

The typical RAG pipeline with re-ranking is: retrieve the top-20 candidates from the vector store (fast), score each with the cross-encoder (accurate but slower), take the 3 highest-scoring documents, and pass those to the LLM. This two-stage approach is sometimes called retrieve-then-rerank or bi-encoder + cross-encoder. The sentence-transformers library provides several pretrained cross-encoder models, and cross-encoder/ms-marco-MiniLM-L-6-v2 is a popular choice that balances speed and accuracy for English text.
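
The retrieve, score, and select steps above can be sketched as one small function that takes the retriever and the pair scorer as arguments. This is only an illustration under stated assumptions: the names retrieve_then_rerank, toy_retrieve, and toy_score_pairs are hypothetical, and the toy scorer simply counts shared words so the sketch runs without a model or a vector store. In a real pipeline the scorer would be something like the predict method of a CrossEncoder.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for this sketch (not part of any library).
Retriever = Callable[[str, int], List[str]]  # (query, k) -> candidate docs
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]  # pairs -> scores


def retrieve_then_rerank(
    query: str,
    retrieve: Retriever,
    score_pairs: PairScorer,
    retrieve_k: int = 20,
    final_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the final_k best (doc, score) pairs, highest score first."""
    # Stage 1: cheap, recall-oriented retrieval.
    candidates = retrieve(query, retrieve_k)
    if not candidates:
        return []
    # Stage 2: precision-oriented scoring of each (query, doc) pair.
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:final_k]]


# Toy stand-ins so the sketch runs offline.
CORPUS = [
    "Kafka consumer groups distribute partition reads.",
    "Debezium requires REPLICATION SLAVE privileges on MySQL.",
    "MySQL binlog must be in ROW format for Debezium.",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    # Stand-in for a vector store query: returns the first k docs, unranked.
    return CORPUS[:k]


def toy_score_pairs(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    # Stand-in for a cross-encoder: counts lowercase words shared by query and doc.
    def overlap(q: str, d: str) -> float:
        return float(len(set(q.lower().split()) & set(d.lower().split())))

    return [overlap(q, d) for q, d in pairs]


top = retrieve_then_rerank(
    "What MySQL privileges does Debezium need?",
    toy_retrieve,
    toy_score_pairs,
    retrieve_k=3,
    final_k=2,
)
for doc, score in top:
    print(f"[{score:.1f}] {doc}")
```

Keeping the retriever and scorer as parameters means the same function works whether the candidates come from ChromaDB or another store, and whether the scorer is a local model or a hosted API.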

The example below retrieves 5 candidates from ChromaDB and re-ranks them with a cross-encoder to select only the 2 most relevant for the LLM context.

# pip install sentence-transformers chromadb openai
from sentence_transformers import CrossEncoder
from openai import OpenAI
import chromadb

client = OpenAI(api_key="your-api-key")
chroma = chromadb.Client()
coll = chroma.create_collection("migration_docs")

docs = [
    "Debezium reads the MySQL binlog to capture CDC events without polling.",
    "CDC replication lag measures seconds the consumer is behind the source.",
    "MySQL binlog must be in ROW format for Debezium to capture changes.",
    "Kafka consumer groups distribute partition reads across multiple consumers.",
    "The Debezium MySQL connector requires a dedicated replication user with REPLICATION SLAVE privileges.",
]

# Index documents with embeddings
for i, doc in enumerate(docs):
    emb = client.embeddings.create(model="text-embedding-3-small", input=doc).data[0].embedding
    coll.add(ids=[f"d{i}"], embeddings=[emb], documents=[doc])

# Load cross-encoder model (runs locally, no API needed)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Re-ranking candidates before passing them to the LLM:

def rag_with_reranking(query: str, retrieve_k: int = 5, final_k: int = 2) -> str:
    # Stage 1: Fast vector retrieval - get the top retrieve_k candidates
    q_emb = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
    results = coll.query(query_embeddings=[q_emb], n_results=retrieve_k)
    candidates = results["documents"][0]

    print("Retrieved candidates (vector search):")
    for i, doc in enumerate(candidates):
        print(f"  {i+1}. {doc[:70]}")

    # Stage 2: Cross-encoder re-ranking - score each (query, doc) pair together
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one float score per pair

    # Sort by cross-encoder score (highest first), keep the top final_k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    top_docs = [doc for doc, score in ranked[:final_k]]

    print(f"\nAfter re-ranking (top {final_k} sent to LLM):")
    for doc, score in ranked[:final_k]:
        print(f"  [{score:.3f}] {doc[:70]}")

    # Stage 3: Generate answer with re-ranked context
    context = "\n".join(top_docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the context provided."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        max_tokens=150
    )
    return response.choices[0].message.content

answer = rag_with_reranking("What MySQL permissions does Debezium need?")
print("\nAnswer:", answer)

Running it produces output similar to the following (exact scores and the wording of the answer may vary):

Retrieved candidates (vector search):
  1. Debezium reads the MySQL binlog to capture CDC events without polling.
  2. The Debezium MySQL connector requires a dedicated replication user...
  3. MySQL binlog must be in ROW format for Debezium to capture changes.
  4. CDC replication lag measures seconds the consumer is behind the source.
  5. Kafka consumer groups distribute partition reads across multiple consumers.

After re-ranking (top 2 sent to LLM):
  [9.823] The Debezium MySQL connector requires a dedicated replication user with REPLICATION SLAVE privileges.
  [7.241] MySQL binlog must be in ROW format for Debezium to capture changes.

Answer: Debezium requires a MySQL user with REPLICATION SLAVE privileges.
The MySQL binlog must also be configured in ROW format for Debezium to
capture row-level changes.

# Re-ranker correctly promoted the permissions doc from rank 2 to rank 1
# The Kafka doc (irrelevant) dropped out of the final context entirely
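
To see this kind of movement at a glance, a small helper can report each document's rank before and after re-ranking. The sketch below is an illustration only: rank_changes is a hypothetical name, and the labels and scores in the usage lines are made-up placeholder values, not output from the model above.

```python
from typing import List, Sequence, Tuple


def rank_changes(
    before: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, int, int]]:
    """Return (doc, old_rank, new_rank) tuples ordered by new rank (1-based).

    `before` is the candidate list in vector-search order and `scores`
    holds one re-ranker score per candidate, in the same order.
    """
    if len(before) != len(scores):
        raise ValueError("need exactly one score per candidate")
    # Indices of candidates sorted by score, highest first.
    order = sorted(range(len(before)), key=lambda i: scores[i], reverse=True)
    return [(before[i], i + 1, new_rank) for new_rank, i in enumerate(order, start=1)]


# Placeholder labels and scores, chosen only to exercise the helper.
candidates = ["binlog doc", "permissions doc", "ROW format doc"]
placeholder_scores = [1.0, 3.0, 2.0]

changes = rank_changes(candidates, placeholder_scores)
for doc, old_rank, new_rank in changes:
    print(f"{doc}: rank {old_rank} -> {new_rank}")
```

Inside rag_with_reranking, the same helper could be called with candidates and scores right after the reranker.predict step.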

Re-ranking is most impactful when your vector search returns plausible-looking but slightly off-topic documents. Retrieve generously (top-20 or more) to ensure recall, then re-rank aggressively to ensure precision. The latency cost is typically 50-200ms for a batch of 20 documents with a small cross-encoder model like MiniLM, which is acceptable for most applications, though the exact figure depends on hardware and document length. For production use, running the cross-encoder on a GPU or using a hosted re-ranking API such as Cohere Rerank can reduce latency to under 50ms.
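
Because latency depends on your own hardware and documents, it is worth measuring rather than relying on a quoted range. The sketch below times any pair-scoring callable over several runs. It is a minimal illustration: measure_rerank_latency_ms and dummy_score_pairs are hypothetical names, and the dummy scorer exists only so the code runs without a model. With the real model you would pass the cross-encoder's predict method instead.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def measure_rerank_latency_ms(
    score_pairs: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: Sequence[Pair],
    runs: int = 5,
) -> List[float]:
    """Time score_pairs(pairs) `runs` times; return each duration in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        score_pairs(pairs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def dummy_score_pairs(pairs: Sequence[Pair]) -> List[float]:
    # Stand-in scorer (assumption): real code would call a cross-encoder here.
    return [float(len(doc)) for _, doc in pairs]


# A batch of 20 (query, document) pairs, matching the top-20 case in the text.
pairs = [("example query", f"example document {i}") for i in range(20)]
timings = measure_rerank_latency_ms(dummy_score_pairs, pairs)
print(f"median: {statistics.median(timings):.3f} ms over {len(timings)} runs")
```

Reporting the median rather than the mean keeps a slow first run, which may include one-off warm-up costs, from dominating the result.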
