Hybrid Search - Combining Keyword and Vector Search
Author: Venkata Sudhakar
Vector search excels at semantic matching: it finds documents with the same meaning even when they use different words. But it can miss exact keyword matches. If a user searches for "Debezium 2.4 release notes", pure vector search might return general CDC articles instead of the document for that specific version. BM25 keyword search (the ranking algorithm behind many traditional search engines) is the opposite: it reliably finds exact terms but misses paraphrases and synonyms. Hybrid search combines the two to get semantic understanding plus exact term matching.
A common way to combine them is Reciprocal Rank Fusion (RRF). You run a vector search and a BM25 keyword search in parallel, producing two ranked lists of documents. RRF gives each document a combined score by summing 1/(k + rank) over every list it appears in, where rank is the document's position in that list (starting at 1) and k is a constant, typically 60. A document that ranks high in both lists gets a high combined score. A document that appears in only one list still receives a score, but a lower one. Because RRF uses only ranks, the raw BM25 and vector similarity scores never have to be put on a common scale, so it is simple to implement and requires no training data or thresholds.
The example below implements hybrid search with RRF, using ChromaDB for vector search and the rank_bm25 library for keyword search.
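The fusion step on its own can be sketched in plain Python, without any search library. This is a minimal illustration under stated assumptions: `rrf_fuse` is a hypothetical helper name, and the document IDs and the two ranked lists are made up rather than taken from a real ChromaDB or rank_bm25 query.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first. A document's fused score is the sum of
    1 / (k + rank) over every list it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical results for one query (illustrative IDs only).
vector_ranking = ["doc_release_notes", "doc_cdc_overview", "doc_kafka_logs"]
bm25_ranking = ["doc_release_notes", "doc_connector_config", "doc_cdc_overview"]

fused = rrf_fuse([vector_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{score:.5f}  {doc_id}")
```

In this sketch the document ranked first by both searches ends up on top, the document found by both but lower down comes next, and the two documents found by only one search trail behind.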
Running hybrid search with Reciprocal Rank Fusion:
It gives the following output:
[RRF=0.03150] vec_rank=1 bm25_rank=1
Debezium 2.4 adds native support for PostgreSQL 15 logical replication.
[RRF=0.01573] vec_rank=2 bm25_rank=3
CDC with Debezium captures row-level changes by reading the MySQL binlog.
[RRF=0.01538] vec_rank=3 bm25_rank=5
Apache Kafka stores messages in immutable, ordered, partitioned logs.
# "Debezium 2.4 adds native support for PostgreSQL 15" ranked 1st in BOTH searches
# -> highest RRF score by a wide margin
# Pure vector search alone might have ranked a generic Debezium doc higher
When to use hybrid search: whenever your corpus contains proper nouns, product names, version numbers, acronyms, or technical identifiers that vector search might miss. If users search for "error code ORA-01000" or "model gpt-4o-mini", BM25 will find exact matches that vector search might not rank highly. Start with pure vector search for simplicity, then switch to hybrid search if you see retrieval misses on exact-term queries. The balance between keyword and vector results can also be tuned by multiplying each list's RRF contribution by a weight before summing.
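One way to apply such weights is sketched below. The function name `weighted_rrf`, its parameters, and the two-document rankings are assumptions made for illustration; the weight values are arbitrary and would need tuning against real queries.

```python
def weighted_rrf(vector_ranking, keyword_ranking,
                 vector_weight=1.0, keyword_weight=1.0, k=60):
    """Return document IDs ordered by weighted RRF score, best first.

    Each list contributes weight / (k + rank) per document, rank starting at 1.
    """
    scores = {}
    weighted_lists = (
        (vector_weight, vector_ranking),
        (keyword_weight, keyword_ranking),
    )
    for weight, ranking in weighted_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical case where the two searches disagree on the top result.
vector_ranking = ["doc_paraphrase", "doc_exact_term"]
keyword_ranking = ["doc_exact_term", "doc_paraphrase"]

favor_vector = weighted_rrf(vector_ranking, keyword_ranking, vector_weight=2.0)
favor_keyword = weighted_rrf(vector_ranking, keyword_ranking, keyword_weight=2.0)
print(favor_vector)
print(favor_keyword)
```

With the vector list weighted more heavily, the paraphrase match comes first; weighting the keyword list more heavily flips the order in favor of the exact-term match.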