LangChain Document Loaders and Text Splitters

Author: Venkata Sudhakar

Before you can build a RAG pipeline or feed documents to an LLM, you need to load and split them correctly. LangChain provides a rich library of document loaders that read from PDFs, web pages, CSV files, databases, and more, and text splitters that divide content into chunks that fit within LLM context windows while preserving semantic coherence.

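The choice of chunk size and overlap significantly impacts RAG quality. Small chunks (200-400 tokens) improve retrieval precision but lose surrounding context. Large chunks (1000-2000 tokens) preserve context but can reduce retrieval precision, because a single chunk may cover several topics. A chunk overlap of 50-100 tokens reduces the risk of losing information at chunk boundaries. Note that RecursiveCharacterTextSplitter measures chunk_size and chunk_overlap in characters by default, not tokens, so these figures need converting before use. RecursiveCharacterTextSplitter is the recommended default: it tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters), so chunks break at natural boundaries wherever possible.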
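To make the effect of overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration only, not how LangChain's splitters are implemented: the function name chunk_with_overlap is hypothetical, sizes are counted in characters rather than tokens, and windows are cut at fixed positions without regard to sentence boundaries.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `overlap` characters.

    Hypothetical helper for illustration; sizes are in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so we do not
        # emit a trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "Refunds are processed within 5-7 business days to the original payment method."
for chunk in chunk_with_overlap(text, chunk_size=40, overlap=10):
    print(repr(chunk))
```

Each chunk begins with the last 10 characters of the previous one, so text near a boundary appears in both neighbouring chunks. A larger overlap repeats more text, which increases the number of chunks to embed and store.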

The example below splits a ShopMax India policy document for a RAG pipeline using LangChain's RecursiveCharacterTextSplitter. To stay self-contained, it builds the Document in memory; the TextLoader lines show how the same text could be loaded from a file instead.

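# Import paths for recent LangChain releases; older versions expose the same
# classes through langchain.text_splitter and langchain.schema.
from langchain_community.document_loaders import TextLoader, CSVLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document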

# Option 1: Load from a text file (uncomment load() once shopmax_policy.txt exists)
loader = TextLoader("shopmax_policy.txt", encoding="utf-8")
# docs = loader.load()

# Option 2: Build Document objects directly (used here to keep the demo self-contained)
raw_docs = [
    Document(page_content="""ShopMax India Electronics Store - Customer Policy Guide.
Return Policy: All electronics can be returned within 10 days of purchase.
Items must be in original packaging with all accessories included.
To initiate a return, visit any ShopMax store or use the ShopMax app.
Refunds are processed within 5-7 business days to the original payment method.
Warranty Policy: All products carry a minimum 1-year manufacturer warranty.
Extended warranty of 2 years is available for an additional 5% of the product price.
For warranty claims, bring your invoice and the product to any service centre.
ShopMax has authorised service centres in Mumbai, Delhi, Bangalore, Hyderabad, and Pune.""")
]

# RecursiveCharacterTextSplitter: recommended for most use cases
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(raw_docs)

print(f"Original docs: {len(raw_docs)}, Chunks after splitting: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk.page_content)} chars):")
    print(chunk.page_content[:100] + "...")

It gives the following output:

Original docs: 1, Chunks after splitting: 4

Chunk 1 (198 chars):
ShopMax India Electronics Store - Customer Policy Guide.
Return Policy: All electronics can be retur...

Chunk 2 (148 chars):
To initiate a return, visit any ShopMax store or use the ShopMax app.
Refunds are processed within 5...

Chunk 3 (160 chars):
Warranty Policy: All products carry a minimum 1-year manufacturer warranty.
Extended warranty of 2 y...

Chunk 4 (167 chars):
For warranty claims, bring your invoice and the product to any service centre.
ShopMax has authorise...
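
The way a recursive splitter arrives at chunk boundaries like these can be approximated in a few lines of plain Python. The sketch below is a simplified illustration, not LangChain's actual implementation: recursive_split is a hypothetical name, it ignores chunk_overlap, and it discards empty pieces. It splits on the highest-priority separator found in the text, packs the pieces greedily up to chunk_size, and falls back to the next separator only for pieces that are still too long.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified separator-priority splitter (hypothetical, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first (highest-priority) non-empty separator present in the text.
    for i, sep in enumerate(separators):
        if sep and sep in text:
            pieces = text.split(sep)
            remaining = separators[i + 1:]
            break
    else:
        # No usable separator: fall back to hard cuts at chunk_size.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)  # close the chunk that is full
            current = ""
        if len(piece) <= chunk_size:
            current = piece  # piece starts the next chunk
        else:
            # Piece alone is too long: retry with lower-priority separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


policy_text = "\n".join([
    "ShopMax India Electronics Store - Customer Policy Guide.",
    "Return Policy: All electronics can be returned within 10 days of purchase.",
    "Items must be in original packaging with all accessories included.",
    "To initiate a return, visit any ShopMax store or use the ShopMax app.",
    "Refunds are processed within 5-7 business days to the original payment method.",
    "Warranty Policy: All products carry a minimum 1-year manufacturer warranty.",
    "Extended warranty of 2 years is available for an additional 5% of the product price.",
    "For warranty claims, bring your invoice and the product to any service centre.",
    "ShopMax has authorised service centres in Mumbai, Delhi, Bangalore, Hyderabad, and Pune.",
])

for i, chunk in enumerate(recursive_split(policy_text, 200, ["\n\n", "\n", ". ", " ", ""])):
    print(f"Chunk {i + 1}: {len(chunk)} chars")
```

Because every line of the policy text is shorter than chunk_size, the sketch never needs to go below the line-break separator; the ". " and " " separators only come into play for lines that exceed the limit on their own.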

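The document was split into 4 chunks that follow line boundaries, each within the 200-character limit. Because the splitter counts characters by default, chunk_overlap=40 allows up to 40 characters of overlap, not 40 tokens. The splitter carries over only whole pieces that fit within that budget, so chunks built from long lines, like these, may share no text at all. For ShopMax production RAG systems, start with chunk_size=500 for policy documents and chunk_size=200 for product specifications. Always inspect your chunks visually before embedding to ensure no critical information is cut mid-sentence. Load CSVs with CSVLoader for product catalogues and use WebBaseLoader to index ShopMax help centre pages directly from URLs.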
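
Visual inspection can be backed by a quick automated check. The following is a minimal sketch under stated assumptions: find_mid_sentence_chunks is a hypothetical helper, it works on plain strings (with LangChain chunks you would pass [c.page_content for c in chunks]), and it uses a crude heuristic that treats any chunk not ending in sentence-final punctuation as possibly cut mid-sentence.

```python
# Endings treated as a complete sentence; an assumption, extend as needed.
SENTENCE_ENDINGS = (".", "!", "?")


def find_mid_sentence_chunks(chunks: list[str]) -> list[int]:
    """Return indices of chunks that do not end with sentence-final punctuation."""
    return [
        i for i, chunk in enumerate(chunks)
        if not chunk.rstrip().endswith(SENTENCE_ENDINGS)
    ]


sample_chunks = [
    "Refunds are processed within 5-7 business days.",
    "Extended warranty of 2 years is available for an",
    "additional 5% of the product price.",
]

for i in find_mid_sentence_chunks(sample_chunks):
    print(f"Chunk {i + 1} may be cut mid-sentence: {sample_chunks[i]!r}")
```

A flagged chunk is only a hint to look closer. Headings, list items, and table rows often end without punctuation and are perfectly valid chunks.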
