RAG with Metadata Filtering

Author: Venkata Sudhakar

A RAG system becomes much more useful when retrieval is scoped to the right subset of documents for each user. Imagine a company HR chatbot with policies for multiple countries: India, UAE, and Singapore. Without filtering, a question about maternity leave might pull documents from all three countries and confuse the LLM. With metadata filtering, you attach a country field to every document at index time, then filter at query time so that only documents matching the employee's country are retrieved. The LLM receives only relevant context and is far more likely to give a precise, correct answer.

Metadata is stored alongside each document in the vector store as a simple dictionary. When you query, you pass a filter condition in addition to the query vector. Conceptually, the vector store restricts the search to documents that match the filter and ranks only those by similarity; whether the filter is applied before, during, or after the vector search depends on the store and its index type. Either way, the result combines exact metadata matching with semantic vector search. ChromaDB, Pinecone, Qdrant, and pgvector all support this pattern, each with slightly different filter syntax.
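The filter-then-rank idea can be sketched without any vector store. The snippet below is a minimal illustration only, not how any particular store is implemented: the three-dimensional vectors are invented stand-ins for real embeddings, and `filtered_search` is a hypothetical helper name.

```python
import math

# Toy index of (text, metadata, embedding). The 3-d vectors are made up
# for illustration; real embeddings have hundreds of dimensions.
INDEX = [
    ("India: 26 weeks maternity leave", {"country": "India"}, [0.9, 0.1, 0.0]),
    ("UAE: 60 days maternity leave", {"country": "UAE"}, [0.8, 0.2, 0.0]),
    ("India: quarterly sales incentive", {"country": "India"}, [0.1, 0.9, 0.1]),
    ("UAE: 30 days annual leave", {"country": "UAE"}, [0.2, 0.1, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec, where, k=2):
    # Step 1: keep only entries whose metadata matches every filter field exactly.
    candidates = [
        (text, vec) for text, meta, vec in INDEX
        if all(meta.get(key) == value for key, value in where.items())
    ]
    # Step 2: rank the surviving entries by similarity to the query.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Pretend this is the embedding of a maternity leave question.
query = [1.0, 0.0, 0.0]

print(filtered_search(query, {"country": "UAE"}))  # UAE documents only
print(filtered_search(query, {}))                  # no filter: countries mix
```

With the filter, both results come from UAE documents. Without it, the India and UAE maternity policies both rank at the top, which is exactly the mixed context the filter is meant to prevent.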

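The example below builds an HR policy chatbot in which one knowledge base serves employees from different countries, with each employee getting answers scoped to the policies that apply to them. Every document is tagged with a country, and some also carry department or level tags; the queries shown here filter by country.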

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", api_key="your-api-key", temperature=0)
emb = OpenAIEmbeddings(model="text-embedding-3-small", api_key="your-api-key")

# HR policy documents - each tagged with metadata for filtering
policy_docs = [
    Document(
        page_content="All employees in India are entitled to 26 weeks of paid maternity leave under the Maternity Benefit Act.",
        metadata={"country": "India", "topic": "maternity_leave"}
    ),
    Document(
        page_content="Employees in the UAE receive 60 days of paid maternity leave as per UAE Labour Law.",
        metadata={"country": "UAE", "topic": "maternity_leave"}
    ),
    Document(
        page_content="India Sales team members are eligible for a quarterly incentive of up to 20 percent of base salary.",
        metadata={"country": "India", "topic": "incentives", "department": "Sales"}
    ),
    Document(
        page_content="Senior Managers in India receive an annual car allowance of Rs 1,20,000 in addition to base salary.",
        metadata={"country": "India", "topic": "allowances", "level": "senior"}
    ),
    Document(
        page_content="All India employees are entitled to 12 days paid casual leave and 12 days sick leave per year.",
        metadata={"country": "India", "topic": "leave"}
    ),
    Document(
        page_content="UAE employees receive 30 days of annual leave after completing one year of service.",
        metadata={"country": "UAE", "topic": "leave"}
    ),
]

vectorstore = Chroma.from_documents(policy_docs, embedding=emb)

Querying with metadata filters to scope results by country:

rag_prompt = ChatPromptTemplate.from_template(
    "Answer the HR policy question using only the context below.\n"
    "If the answer is not in the context, say the policy does not apply here.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(
        f"[{d.metadata.get('topic', 'policy')}] {d.page_content}" for d in docs
    )

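def hr_chatbot(question: str, employee_country: str, department: str | None = None) -> str:
    # Build the metadata filter so only docs for this employee's context are retrieved
    where_filter = {"country": employee_country}
    if department is not None:
        # Chroma needs an explicit $and to combine conditions on several fields.
        # Note: this also excludes documents that carry no department tag.
        where_filter = {"$and": [where_filter, {"department": department}]}

    # The filter is passed through to the vector store on every search
    retriever = vectorstore.as_retriever(
        search_kwargs={"k": 3, "filter": where_filter}
    )

    # Retrieve -> format context -> fill prompt -> call LLM -> plain string
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(question)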

# Same question - different answers based on employee country
print("=== India employee asks about maternity leave ===")
print(hr_chatbot("How many weeks of maternity leave am I entitled to?", "India"))
print()
print("=== UAE employee asks about maternity leave ===")
print(hr_chatbot("How many weeks of maternity leave am I entitled to?", "UAE"))
print()
print("=== India employee asks about annual leave ===")
print(hr_chatbot("How many days of leave do I get per year?", "India"))
print()
print("=== UAE employee asks India-only allowance question ===")
print(hr_chatbot("Do I get a car allowance?", "UAE"))

It produces output similar to the following (the exact wording depends on the model):

=== India employee asks about maternity leave ===
You are entitled to 26 weeks of paid maternity leave under the Maternity Benefit Act.

=== UAE employee asks about maternity leave ===
You are entitled to 60 days of paid maternity leave as per UAE Labour Law.

=== India employee asks about annual leave ===
You are entitled to 12 days of paid casual leave and 12 days of sick leave per year.

=== UAE employee asks India-only allowance question ===
The policy does not apply here.

The UAE employee never sees India policy documents because the filter excludes them from retrieval. For the car allowance question, only UAE documents are retrieved, and none of them mentions a car allowance (that document is tagged country=India), so the model falls back to the "does not apply" answer.

Metadata filtering is essential whenever your knowledge base spans multiple scopes that should never mix: country-specific regulations, department-specific procedures, product-specific FAQs, or role-specific access levels. The alternative, retrieving everything and asking the LLM to ignore irrelevant documents, is unreliable and wastes tokens. Tag documents richly at index time (country, department, product, version, date), because adding metadata later means revisiting every stored document and, depending on the store, re-indexing it. Good metadata design at the start saves significant rework as your knowledge base grows.
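One way to keep tags consistent is to validate metadata before anything is indexed. The sketch below is illustrative only: the required keys, the allowed country list, and the `validate_metadata` helper are assumptions made for this example and are not part of LangChain or Chroma.

```python
# Assumed minimum schema for this example; adjust to your own knowledge base.
REQUIRED_KEYS = {"country", "topic"}
ALLOWED_COUNTRIES = {"India", "UAE", "Singapore"}

def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks usable."""
    problems = [
        f"missing required key: {key}"
        for key in sorted(REQUIRED_KEYS - set(metadata))
    ]
    country = metadata.get("country")
    # Filters are exact matches, so "india" would never match "India".
    if country is not None and country not in ALLOWED_COUNTRIES:
        problems.append(f"unknown country: {country!r}")
    return problems

print(validate_metadata({"country": "India", "topic": "leave"}))
print(validate_metadata({"country": "india"}))
```

The first call reports no problems. The second reports a missing topic and an unknown country, catching a casing mistake that would otherwise make the document invisible to every country filter.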
