
RAG Re-ranking with a Cross-Encoder

Author: Venkata Sudhakar

Retrieval in a RAG pipeline has two stages: recall (quickly find the top-K potentially relevant documents) and precision (keep only the most relevant ones to send to the LLM). Vector search handles recall well but is weaker at precision, because a bi-encoder embeds the query and each document separately and only compares the resulting vectors. A cross-encoder re-ranker improves precision by passing each (query, document) pair through the model together, so attention spans both texts and the relevance score is usually much more accurate than a bi-encoder similarity. The trade-off is speed: the cross-encoder has to run once per pair, so you use it to re-rank a small candidate set (e.g. top-20) down to the final context (e.g. top-3) that goes to the LLM.
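The difference between the two scoring styles can be sketched in a few lines. This is a minimal illustration, not a library API: the helper names are hypothetical, `encoder` is assumed to expose `encode(texts)` the way `sentence_transformers.SentenceTransformer` does, and `scorer` is assumed to expose `predict(pairs)` the way `sentence_transformers.CrossEncoder` does.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return float(dot / norm) if norm else 0.0


def bi_encoder_scores(encoder, query, docs):
    """Embed the query and the documents independently, then compare vectors.

    Assumption: encoder.encode(list_of_texts) returns one vector per text.
    """
    query_vec = encoder.encode([query])[0]
    doc_vecs = encoder.encode(docs)  # in practice precomputed and indexed
    return [cosine(query_vec, doc_vec) for doc_vec in doc_vecs]


def cross_encoder_scores(scorer, query, docs):
    """Score each (query, document) pair jointly.

    Assumption: scorer.predict(list_of_pairs) returns one score per pair.
    """
    pairs = [(query, doc) for doc in docs]
    return [float(score) for score in scorer.predict(pairs)]
```

The document vectors in `bi_encoder_scores` do not depend on the query, which is why they can be indexed ahead of time. The pairs in `cross_encoder_scores` do, so nothing can be precomputed and the cost grows with the number of candidates.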

The typical RAG pipeline with re-ranking is: retrieve the top-20 candidates from the vector store (fast), score each candidate against the query with the cross-encoder (more accurate but slower), keep the 3 highest-scoring documents, and pass those to the LLM. This two-stage approach is often called retrieve-then-rerank, or bi-encoder + cross-encoder. The sentence-transformers library provides several pretrained cross-encoder models; cross-encoder/ms-marco-MiniLM-L-6-v2 is a common choice that balances speed and accuracy well for English text.

The example below retrieves 5 candidates from ChromaDB and re-ranks them with a cross-encoder, keeping only the 2 most relevant for the LLM context.


Re-ranking candidates before passing to the LLM,
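A minimal sketch of such a script, assuming the `chromadb` and `sentence-transformers` packages are installed. It is an illustration rather than the author's exact listing: the collection name `cdc_docs`, the query string, and the helper names are hypothetical, and the final LLM call is left out. The scores it produces may differ from the output shown below.

```python
def retrieve_and_rerank(collection, cross_encoder, query, n_candidates=5, top_n=2):
    """Two-stage retrieval: vector search for recall, cross-encoder for precision.

    Returns (candidates, reranked), where reranked is a list of (score, doc)
    tuples sorted from most to least relevant and truncated to top_n.
    """
    # Stage 1: fast vector search. Chroma returns one result list per query text.
    result = collection.query(query_texts=[query], n_results=n_candidates)
    candidates = result["documents"][0]
    if not candidates:
        return [], []

    # Stage 2: score every (query, document) pair jointly.
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    ranked = sorted(
        zip((float(s) for s in scores), candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return candidates, ranked[:top_n]


def build_context(reranked):
    """Join the surviving documents into the context string for the LLM prompt."""
    return "\n".join(doc for _, doc in reranked)


def demo():
    """Wire the helpers to the real libraries and print the re-ranked documents."""
    # Imported here so the helpers above stay usable without these packages.
    import chromadb
    from sentence_transformers import CrossEncoder

    docs = [
        "Debezium reads the MySQL binlog to capture CDC events without polling.",
        "The Debezium MySQL connector requires a dedicated replication user "
        "with REPLICATION SLAVE privileges.",
        "MySQL binlog must be in ROW format for Debezium to capture changes.",
        "CDC replication lag measures seconds the consumer is behind the source.",
        "Kafka consumer groups distribute partition reads across multiple consumers.",
    ]
    client = chromadb.Client()  # in-memory instance
    collection = client.create_collection("cdc_docs")  # hypothetical name
    collection.add(documents=docs, ids=[f"doc-{i}" for i in range(len(docs))])

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    query = "What permissions does Debezium need on MySQL?"  # hypothetical query
    candidates, top = retrieve_and_rerank(collection, reranker, query)

    print("Retrieved candidates (vector search):")
    for rank, doc in enumerate(candidates, start=1):
        print(f"  {rank}. {doc}")
    print("After re-ranking:")
    for score, doc in top:
        print(f"  [{score:.3f}] {doc}")
    return build_context(top)
```

Calling `demo()` downloads the embedding and cross-encoder models on first use. The string it returns is what you would place in the LLM prompt as context.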


The example produces the following output:

Retrieved candidates (vector search):
  1. Debezium reads the MySQL binlog to capture CDC events without polling.
  2. The Debezium MySQL connector requires a dedicated replication user...
  3. MySQL binlog must be in ROW format for Debezium to capture changes.
  4. CDC replication lag measures seconds the consumer is behind the source.
  5. Kafka consumer groups distribute partition reads across multiple consumers.

After re-ranking (top 2 sent to LLM):
  [9.823] The Debezium MySQL connector requires a dedicated replication user with REPLICATION SLAVE privileges.
  [7.241] MySQL binlog must be in ROW format for Debezium to capture changes.

Answer: Debezium requires a MySQL user with REPLICATION SLAVE privileges.
The MySQL binlog must also be configured in ROW format for Debezium to
capture row-level changes.

# Re-ranker correctly promoted the permissions doc from rank 2 to rank 1
# The Kafka doc (irrelevant) dropped out of the final context entirely

Re-ranking helps most when your vector search returns plausible-looking but slightly off-topic documents. Retrieve generously (top-20 or more) to protect recall, then re-rank aggressively to improve precision. The latency cost is typically 50-200ms for a batch of 20 documents with a small cross-encoder model like MiniLM, depending on hardware and document length, which is acceptable for most applications. For production use, run the cross-encoder on a GPU or use a hosted re-ranking API such as Cohere Rerank, which can reduce latency to under 50ms.
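Latency figures like these vary with hardware, batch size, and document length, so it is worth measuring on your own setup. The helper below is a small sketch for doing that; `time_rerank_ms` is a hypothetical name, and `predict` is assumed to be any callable that takes a list of (query, document) pairs, such as the `predict` method of a sentence-transformers `CrossEncoder`.

```python
import statistics
import time


def time_rerank_ms(predict, pairs, repeats=5, warmup=1):
    """Return the median wall-clock time, in milliseconds, of predict(pairs)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):  # the first call often pays one-off setup costs
        predict(pairs)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(pairs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

For example, `time_rerank_ms(reranker.predict, [(query, doc) for doc in candidates])` reports the median re-ranking time for one batch of candidates. The median is used rather than the mean so that a single slow run does not skew the result.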


 
  


  