Building RAG with LangChain & Chroma: Two Hidden Pitfalls That Cost Me 6 Hours


At 4 PM, my product manager dropped 200 PDFs in my lap: “We need to demo an internal knowledge base Q&A for the boss tomorrow morning—super urgent.” I thought, “RAG? I know this; LangChain plus a vector database, done in minutes.” I started coding right away and barely finished by 10 PM—not because the pipeline didn’t run, but because two subtle traps dragged the accuracy below 40% and had me debugging for six straight hours. In this article, I’ll walk you through the full RAG system build and pull the two pitfalls out by their roots.

Why you can’t just dump documents into GPT

The simplest idea for a system that answers questions like “What is the company holiday policy?” or “What were the conclusions of project X’s retrospective?” is to concatenate all the documents into one giant prompt and send it to GPT. Reality hits fast: 200 PDFs add up to over 800,000 characters. Even GPT-4’s 128K context window chokes, and the per‑call cost will make your finance team come after you. Fine‑tuning is even less realistic—the documents change daily, and you’re not going to burn thousands of dollars every time they do.
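A quick back-of-the-envelope check makes that concrete (the roughly 4-characters-per-token ratio is my own assumption for English text, not a figure from the project):

total_chars = 800_000                      # Combined length of the 200 PDFs
approx_tokens = total_chars // 4           # Rough rule of thumb: ~4 characters per token
print(approx_tokens)                       # ~200,000 tokens — past the 128K window before you even add the question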

That leaves retrieval‑augmented generation (RAG) as the only viable path: split documents into small chunks, embed each chunk with an embedding model and store the vectors in a vector database. At query time, retrieve the most relevant chunks, stuff them into the prompt as context, and let the LLM generate an answer. The pattern looks simple, but every step—“how to split,” “how to store,” “how to search”—has its own sharp edges. The two that wrecked me were buried deep in the interaction between LangChain and Chroma.

Design choices: why Chroma over Pinecone or FAISS

Before picking a vector store, I asked myself three questions: does it cost money, does it support metadata filtering, and can I start/stop it locally with a single command?

Pinecone costs money and requires data to go to the cloud—immediately ruled out for internal documents. Weaviate is powerful, but deploying it means at least 30 minutes of Docker tinkering, a non‑starter when the demo is “tomorrow morning.” FAISS is blazing fast, but it doesn’t support metadata filtering (like filtering by document type or date range)—a feature we’d need as soon as the business side piles on more requirements. I landed on Chroma: it runs locally, installs with a single pip install chromadb, and has persistence, metadata filtering, and similarity search built right in. It also integrates with LangChain more smoothly than any other option.
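As a sketch of what that metadata filtering looks like through LangChain’s Chroma wrapper (the file path and the question are illustrative; vectordb is the store built in the scripts below):

# Restrict retrieval to chunks whose metadata matches a filter (field/value are illustrative)
results = vectordb.similarity_search(
    "What is the company holiday policy?",
    k=4,
    filter={"source": "./docs/hr_handbook.pdf"}   # The PDF loader records each file's path under the "source" key
)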

The overall architecture is straightforward: load documents → split text → generate embeddings → write to Chroma → when a user asks, retrieve top‑k chunks → stuff into a prompt → LLM generates an answer. LangChain chains these steps together; you just need to manage the parameters and edge cases for each stage.

Core implementation: two scripts to run the full RAG pipeline

Script 1’s job: turn scattered PDFs into searchable vector chunks and persist them in Chroma so you don’t have to re‑index everything on the next run.

import os
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# 1. Load PDF directory
loader = PyPDFDirectoryLoader("./docs")          # Auto-scan all PDFs
documents = loader.load()
print(f"Loaded documents: {len(documents)} pages")

# 2. Split: chunk_size and overlap are the source of two major pitfalls, detailed later
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # Max characters per chunk
    chunk_overlap=200,       # Overlap to avoid cutting key info across boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority: paragraphs, then lines, then sentences, then words
)
chunks = text_splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")

# 3. Generate embeddings and store in Chroma (auto-persist to local dir)
embeddings = OpenAIEmbeddings()                  # Default: text-embedding-ada-002
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"              # Reusable after restart, saves re-embedding cost
)
vectordb.persist()
print("Vector store built and persisted to ./chroma_db")
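Before wiring up the LLM, it’s worth a quick smoke test that retrieval returns sensible chunks—this snippet is my own addition, not part of the original script:

# Retrieval smoke test: print where the top-3 chunks for a sample question came from
for doc in vectordb.similarity_search("What is the company holiday policy?", k=3):
    print(doc.metadata.get("source"), doc.page_content[:100])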

Script 2’s job: using the stored vector store, build the full “ask → retrieve → generate” chain and force the LLM to answer strictly from the provided documents—no hallucinations.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Custom prompt: force LLM to base answers only on the given context
prompt_template = """You are a rigorous internal knowledge base assistant. Answer the question strictly based on the context below.
If the answer cannot be found in the context, simply say "No relevant information found." Do not make anything up.

Context:
{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, 
    input_variables=["context", "question"]
)

# 2. Load the persisted vector store, connecting to the same embedding model
vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings()
)

# 3. Create the QA chain: retrieves top-4 chunks by default, using our custom prompt
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)  # temperature=0 for deterministic results
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                               # Stuffs retrieved chunks directly into the prompt
    retriever=vectordb.as_retriever(),                # Default retriever returns the top-4 most similar chunks
    chain_type_kwargs={"prompt": PROMPT},             # Plug in the custom prompt defined above
    return_source_documents=True                      # Also return the retrieved chunks for traceability
)

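To close the loop, here is a minimal way to query the finished chain; the question is just the earlier example, and return_source_documents=True is what makes the source listing possible:

result = qa_chain.invoke({"query": "What is the company holiday policy?"})
print(result["result"])                        # The generated answer
for doc in result["source_documents"]:         # The chunks the answer was grounded in
    print("Source:", doc.metadata.get("source"))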

