RAG with Metadata Filtering

Author: Venkata Sudhakar

A RAG system becomes much more useful when retrieval is scoped to the right subset of documents for each user. Imagine a company HR chatbot with policies for multiple countries - India, UAE, and Singapore. Without filtering, a question about maternity leave might pull documents from all three countries and confuse the LLM. With metadata filtering, you attach a country field to every document at index time, then filter at query time so that only documents matching the current employee's country are retrieved. The LLM gets clean, relevant context and can give a precise answer scoped to that employee.

Metadata is stored alongside each document in the vector store as a simple dictionary. When you query, you pass a filter condition in addition to the query vector. Conceptually, the vector store applies the filter first - eliminating non-matching documents - and then runs similarity search only within the filtered subset; how the filter is interleaved with the index internally varies by store. This combines exact metadata matching with semantic vector search. ChromaDB, Pinecone, Qdrant, and pgvector all support this pattern, each with slightly different filter syntax (with pgvector, the filter is an ordinary SQL WHERE clause).
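The filter-then-search order can be sketched in a few lines of plain Python. This is a toy in-memory illustration, not any vector store's real implementation: the three-dimensional vectors, the record layout, and the `filtered_search` helper are all made up for the example.

```python
import math

# Hypothetical records: a hand-made 3-d vector plus a metadata dictionary.
# Real embeddings come from a model and have hundreds of dimensions.
RECORDS = [
    {"id": "in-maternity", "vector": [0.9, 0.1, 0.0], "metadata": {"country": "India"}},
    {"id": "ae-maternity", "vector": [0.8, 0.2, 0.1], "metadata": {"country": "UAE"}},
    {"id": "sg-maternity", "vector": [0.7, 0.3, 0.0], "metadata": {"country": "Singapore"}},
    {"id": "ae-annual", "vector": [0.1, 0.9, 0.2], "metadata": {"country": "UAE"}},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(records, query_vector, where, k=1):
    # Step 1: keep only records whose metadata matches every key in `where`.
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    # Step 2: rank the survivors by cosine similarity to the query vector.
    candidates.sort(key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [r["id"] for r in candidates[:k]]

query = [1.0, 0.0, 0.0]  # stands in for an embedded "maternity leave" question
unfiltered = filtered_search(RECORDS, query, where={}, k=1)
uae_only = filtered_search(RECORDS, query, where={"country": "UAE"}, k=1)
print(unfiltered, uae_only)
```

With an empty filter the India record is the closest match to the query; with the UAE filter it is never a candidate, so the UAE record is returned instead.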

The example below builds an HR policy chatbot in which the same knowledge base serves employees from different countries and seniority levels, each getting answers scoped only to the policies that apply to them.
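A minimal sketch of the indexing side, assuming each policy is tagged with a `country` and a `level` field. The `level` values, the `PolicyDoc` class, and the `build_filter` and `matches` helpers are hypothetical, and the car allowance text is a placeholder. The filter dictionary borrows the `$and`/`$eq`/`$in` operator style that ChromaDB accepts in its `where` argument, but here it is evaluated by a small stand-in matcher rather than a real vector store, so check your store's documentation for its exact syntax.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyDoc:
    text: str
    metadata: dict = field(default_factory=dict)

# Metadata is attached once, at index time. The "level" values are illustrative.
KNOWLEDGE_BASE = [
    PolicyDoc("Maternity leave: 26 weeks paid (Maternity Benefit Act).",
              {"country": "India", "level": "all"}),
    PolicyDoc("Maternity leave: 60 days paid (UAE Labour Law).",
              {"country": "UAE", "level": "all"}),
    PolicyDoc("Car allowance policy (placeholder text).",
              {"country": "India", "level": "senior"}),
]

def build_filter(country, level):
    """Per-employee filter: own country, and policies for everyone or for their level."""
    return {"$and": [
        {"country": {"$eq": country}},
        {"level": {"$in": ["all", level]}},
    ]}

def matches(metadata, where):
    """Evaluate only the small operator subset used above ($and, $eq, $in)."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    (key, condition), = where.items()
    if "$eq" in condition:
        return metadata.get(key) == condition["$eq"]
    if "$in" in condition:
        return metadata.get(key) in condition["$in"]
    raise ValueError(f"unsupported operator in {condition!r}")

def visible_docs(country, level):
    """Texts of the documents an employee with this profile may retrieve."""
    where = build_filter(country, level)
    return [d.text for d in KNOWLEDGE_BASE if matches(d.metadata, where)]

india_senior = visible_docs("India", "senior")
india_junior = visible_docs("India", "junior")
uae_senior = visible_docs("UAE", "senior")
```

Building the filter from the employee profile, rather than from anything in the question text, keeps the scoping decision out of the user's hands.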


Querying with metadata filters to scope results by country:
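One way the query step might look, sketched without a real vector store or LLM. The `retrieve` and `build_prompt` helpers are hypothetical, word overlap stands in for embedding similarity, and the car allowance text is a placeholder. The sketch stops at prompt assembly, so it does not itself produce LLM answers.

```python
import re

DOCS = [
    {"text": "Maternity leave is 26 weeks paid under the Maternity Benefit Act.", "country": "India"},
    {"text": "Maternity leave is 60 days paid as per UAE Labour Law.", "country": "UAE"},
    {"text": "Car allowance policy (placeholder text).", "country": "India"},
]

def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, country, k=2):
    # Filter on metadata first, then score only the surviving documents.
    scoped = [d for d in DOCS if d["country"] == country]
    scored = [(len(tokens(question) & tokens(d["text"])), d) for d in scoped]
    # Drop documents that share no words with the question.
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [d["text"] for _, d in relevant[:k]]

def build_prompt(question, country):
    context = retrieve(question, country)
    if not context:
        # Nothing in scope: signal the caller instead of sending an empty context.
        return None
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

india_prompt = build_prompt("How much maternity leave do I get?", "India")
uae_prompt = build_prompt("How much maternity leave do I get?", "UAE")
uae_car = build_prompt("Am I eligible for the car allowance?", "UAE")
```

A caller would pass a non-empty prompt to the LLM, and reply with a fixed message when `build_prompt` returns `None`, since no in-scope document exists for that employee.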


It gives the following output:

=== India employee asks about maternity leave ===
You are entitled to 26 weeks of paid maternity leave under the Maternity Benefit Act.

=== UAE employee asks about maternity leave ===
You are entitled to 60 days of paid maternity leave as per UAE Labour Law.

=== India employee asks about annual leave ===
You are entitled to 12 days of paid casual leave and 12 days of sick leave per year.

=== UAE employee asks India-only allowance question ===
The policy does not apply here.

# The UAE employee never sees India policy documents - the filter
# eliminates them before semantic search even runs.
# The car allowance question correctly returns nothing for UAE
# because that document is tagged country=India.

Metadata filtering is essential whenever your knowledge base spans multiple scopes that should never mix: country-specific regulations, department-specific procedures, product-specific FAQs, or role-specific access levels. The alternative - retrieving everything and asking the LLM to ignore irrelevant documents - is unreliable and wastes tokens. Tag documents richly at index time (country, department, product, version, date): many stores let you update metadata on existing records, but backfilling a new field means revisiting every document already indexed, and a filter cannot match a field that was never set. Good metadata design at the start saves significant rework as your knowledge base grows.
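A small guard at indexing time can help keep tags consistent. The sketch below assumes a hypothetical list of required fields and allowed country values; both are illustrative and would need to match your own schema.

```python
REQUIRED_FIELDS = ("country", "department", "version")  # illustrative schema
ALLOWED_COUNTRIES = {"India", "UAE", "Singapore"}

def validate_metadata(metadata):
    """Return a list of problems; an empty list means the document is safe to index."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metadata]
    country = metadata.get("country")
    if country is not None and country not in ALLOWED_COUNTRIES:
        # Exact-match filters are case-sensitive here, so "india" would never match "India".
        problems.append(f"unknown country: {country}")
    return problems

good = validate_metadata({"country": "India", "department": "HR", "version": "2024-01"})
bad = validate_metadata({"country": "india"})
print(good, bad)
```

Rejecting or fixing a document before it is indexed is cheaper than discovering later that a filter silently excludes it.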
