I Built RAG From Scratch in Python to Understand It. Here's What I Learned.

I had used LangChain's RAG chain in production for six months. I could not have told you, off the top of my head, what chunk_overlap did, or why cosine similarity is the right distance metric, or how nomic-embed-text actually turns a sentence into a vector. The high-level library abstracted all of it away.

So one weekend I deleted the LangChain dependency and wrote a RAG pipeline from scratch in ~500 lines of plain Python. No framework, no magic. pypdf for text extraction. A 60-line chunker. ChromaDB for the vector store. Ollama for embeddings and the LLM. The code is on GitHub: every module is under 200 lines, every test is deterministic, and you can read all of it in one sitting.

This is the build log. Not a tutorial — the build log, with the parts that surprised me and the parts I got wrong the first time.

Why bother

The honest reason: I was using LangChain's RetrievalQA chain and getting answers I didn't trust. Sometimes the model would say "according to the document" when the document didn't say that. Sometimes the citations were wrong. I had no way to know if the chunker was dropping important context, or if the cosine similarity was picking the wrong neighbors, or if the prompt was actually constraining the model. The library was a black box.

When you build it yourself, every layer is inspectable. When the answer is wrong, you can add a print statement in pipeline.py line 102 and see exactly which chunks were sent to the LLM. When the chunker cuts a sentence in half, you see it in the test fixtures. When the embedding model gives garbage for some inputs, you can swap in a different model with one constructor parameter. None of that is possible when the whole thing is RetrievalQA.from_chain_type(llm=..., retriever=...).

The other reason: the code I wrote is 500 lines, and it covers the same ground as a 50-line LangChain script. The extra 450 lines are comments, type hints, tests, and explicit error handling. That's the actual complexity. LangChain hides it; building it yourself makes you confront it.

The architecture

The whole pipeline is six modules, each doing one thing:

[ PDF file ]
      |
      v
+-----------+        text         +--------------+
| loaders.py| ------------------->|  chunker.py  |
| (pypdf)   |                      | (sliding     |
+-----------+                      |  window)     |
                                   +------+-------+
                                          |
                                     embeddings
                                          |
                                          v
                                   +--------------+        question
                                   |  store.py    | <------ (also embedded)
                                   | (ChromaDB)   |
                                   +------+-------+
                                          |
                                  top_k similar chunks
                                          |
                                          v
                                   +--------------+        +-----------+
                                   |  pipeline.py | -----> |  llm.py   |
                                   | (orchestr.)  |        | (Ollama)  |
                                   +--------------+        +-----------+

Each module has a single responsibility. Each is testable in isolation. Each can be swapped without touching the others. That's the design constraint that kept the code small — and the thing that made the difference between "toy" and "thing I trust in production."

Part 1 — the chunker

The chunker is the part most tutorials skip. They say "split the text into chunks" and move on. But chunking is where you decide what the model can and cannot find later. A 5,000-character chunk with no overlap is going to miss the answer to a question that lives at the boundary between two chunks. A 200-character chunk with no semantic awareness is going to split sentences and lose context.

I went with a sliding-window chunker with character-level overlap, normalized whitespace, and original-offset tracking:

def chunk_text(
    text: str,
    chunk_size: int = 800,
    chunk_overlap: int = 100,
) -> list[Chunk]:
    """Split text into overlapping windows of approximately `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must be >= 0, got {chunk_overlap}")
    if chunk_overlap >= chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must be < chunk_size ({chunk_size})"
        )

    normalized = _normalize(text)
    if not normalized:
        return []

    step = chunk_size - chunk_overlap
    chunks: list[Chunk] = []
    i = 0
    idx = 0
    n = len(normalized)

    while i < n:
        piece = normalized[i : i + chunk_size]
        # Find the original-text char range for this normalized slice
        char_start = _normalized_to_original_offset(text, i)
        char_end = _normalized_to_original_offset(text, min(i + chunk_size, n))
        chunks.append(Chunk(text=piece, index=idx, char_start=char_start, char_end=char_end))
        idx += 1
        i += step

    return chunks

Three things to notice.

First, the whitespace normalization is a small thing that makes a big difference. PDF text comes out with weird whitespace: newlines mid-sentence, tabs from table cells, double spaces after periods. If you chunk on the raw text, your "800-character" chunks hold wildly different amounts of real content, and so wildly different token counts. Normalizing first means chunk_size=800 actually means "about 800 useful characters."

Second, the 100-character overlap is the difference between "I found this" and "I missed the answer because it spans a chunk boundary." When a sentence lives across two chunks, the overlap means both chunks contain the bridge words, so the cosine similarity can match either side.

Third, the original-offset tracking (char_start, char_end in the Chunk dataclass) is the feature I didn't know I needed until I built the source highlighter in the UI. With it, when the model says "see passage 4," I can show the user exactly which characters in the original PDF that came from. Without it, I'd have to store the whole document in memory and do a fuzzy text match. The cost is two integers per chunk (16 bytes if stored as 64-bit values). The payoff is "this citation is real, not a hallucination."
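The snippet above calls _normalize and _normalized_to_original_offset without showing them. Here is a minimal sketch of one way to get the same effect; it is not the repository's implementation. The names normalize_with_offsets and chunk_with_offsets are hypothetical, and this version builds the offset map once during normalization instead of looking offsets up per chunk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    index: int
    char_start: int  # inclusive offset into the ORIGINAL text
    char_end: int    # exclusive offset into the ORIGINAL text


def normalize_with_offsets(text: str) -> tuple[str, list[int]]:
    """Collapse whitespace runs to one space and trim both ends.

    Returns the normalized text plus, for every normalized character,
    the index of the original character it came from.
    """
    chars: list[str] = []
    offsets: list[int] = []
    prev_space = True  # treat the start as whitespace so leading runs are dropped
    for pos, ch in enumerate(text):
        if ch.isspace():
            if not prev_space:
                chars.append(" ")
                offsets.append(pos)
            prev_space = True
        else:
            chars.append(ch)
            offsets.append(pos)
            prev_space = False
    if chars and chars[-1] == " ":  # drop the single trailing space, if any
        chars.pop()
        offsets.pop()
    return "".join(chars), offsets


def chunk_with_offsets(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[Chunk]:
    """Sliding window over normalized text, reporting original-text offsets."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    normalized, offsets = normalize_with_offsets(text)
    step = chunk_size - chunk_overlap
    chunks: list[Chunk] = []
    for idx, i in enumerate(range(0, len(normalized), step)):
        piece = normalized[i : i + chunk_size]
        start = offsets[i]
        end = offsets[i + len(piece) - 1] + 1  # one past the last original char
        chunks.append(Chunk(text=piece, index=idx, char_start=start, char_end=end))
    return chunks
```

The property worth checking is the round trip: slicing the original text with char_start and char_end and normalizing that slice should give back the chunk text. That is what makes the highlighted span in the UI trustworthy.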

Part 2 — the embedding swap

The single best refactor I did in this project was making Embedder a Protocol. Two lines of typing, infinite flexibility:

from typing import Protocol

class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...
    def embed_batch(self, texts: list[str]) -> list[list[float]]: ...

Now I can write a FakeEmbedder for tests that returns deterministic vectors, and OllamaEmbedder for production that hits the local Ollama API. The pipeline doesn't know or care which one it's talking to. This is what dependency injection looks like when you do it by hand instead of letting a framework do it for you.
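For illustration, a fake embedder that satisfies the protocol might look like the sketch below. The repository's FakeEmbedder is not shown in this post, so the hashing scheme and the dim parameter here are assumptions. The vectors carry no semantic meaning; they only need to be stable so that tests of the plumbing are repeatable.

```python
import hashlib
import math


class FakeEmbedder:
    """Deterministic stand-in for tests: same text in, same vector out, no network."""

    def __init__(self, dim: int = 8):
        if dim <= 0:
            raise ValueError(f"dim must be > 0, got {dim}")
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        raw: list[float] = []
        for i in range(self.dim):
            # Hash the dimension index together with the text so each
            # dimension gets its own stable pseudo-random value.
            digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
            # Map the first 4 bytes to a float in [-1, 1).
            raw.append(int.from_bytes(digest[:4], "big") / 2**31 - 1.0)
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        return [x / norm for x in raw]  # unit length, like many real embedders

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        return [self.embed(t) for t in texts]
```

Because Embedder is a Protocol, this class needs no base class or registration. Having the two methods with matching signatures is enough for a type checker to accept it wherever an Embedder is expected.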

The actual OllamaEmbedder is 20 lines:

import requests


class OllamaEmbedder:
    """Embedding via local Ollama HTTP API. Free, no API key."""

    def __init__(self, model: str = "nomic-embed-text", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url.rstrip("/")

    def embed(self, text: str) -> list[float]:
        return self.embed_batch([text])[0]

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # /api/embeddings accepts one prompt per request, so this sends
        # one HTTP call per text, in order
        out: list[list[float]] = []
        for text in texts:
            r = requests.post(
                f"{self.base_url}/api/embeddings",
                json={"model": self.model, "prompt": text},
                timeout=60,
            )
            r.raise_for_status()
            out.append(r.json()["embedding"])
        return out

embed_batch is the only performance-minded piece of the design, and it is worth being precise about what it does. As written, the loop still sends one HTTP request per chunk, which is 800 requests for an 800-chunk document. At 50ms per request, that's 40 seconds. What the method buys today is a single call site: the pipeline hands over all the chunks at once, so the transport behind embed_batch can change later without touching anything else.

Why is the loop sequential rather than fanned out with concurrent.futures.ThreadPoolExecutor? When I tried threading, Ollama's HTTP server dropped connections under load. The sequential version is slower in wall-clock terms but reliable. Trade-offs.
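The loop in OllamaEmbedder makes one request per text. If the installed Ollama version exposes the /api/embed endpoint, which takes a list under "input" and returns an "embeddings" list, a whole batch can go out in a single request. The sketch below assumes that endpoint and response shape. The function name embed_batch_single_request and the injectable post parameter are hypothetical, added so the logic can be exercised without a running server, and it uses the standard library instead of requests to stay self-contained.

```python
import json
import urllib.request
from typing import Any, Callable

JsonPost = Callable[[str, dict[str, Any]], dict[str, Any]]


def _post_json(url: str, payload: dict[str, Any]) -> dict[str, Any]:
    """POST a JSON body with the standard library and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed_batch_single_request(
    texts: list[str],
    model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434",
    post: JsonPost = _post_json,
) -> list[list[float]]:
    """Embed all texts with one HTTP call (assumes Ollama's /api/embed)."""
    if not texts:
        return []  # nothing to send, so skip the round trip
    data = post(
        f"{base_url.rstrip('/')}/api/embed",
        {"model": model, "input": texts},
    )
    vectors = data["embeddings"]
    if len(vectors) != len(texts):
        raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
    return vectors
```

For very large documents it would probably make sense to send the texts in slices of a few dozen rather than all at once, but that is a tuning choice I would measure before committing to.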

Part 3 — the vector store

I used ChromaDB. Not because it's the best, but because it's the easiest to set up correctly. pip install chromadb, three lines of code, and you have a persistent, queryable vector store with cosine similarity, on disk.

from pathlib import Path

import chromadb
from chromadb.config import Settings


class VectorStore:
    """Thin wrapper around a ChromaDB collection."""

    def __init__(
        self,
        persist_dir: str | Path = "./chroma_db",
        collection_name: str = "rag",
    ):
        self.persist_dir = Path(persist_dir)
        self.persist_dir.mkdir(parents=True, exist_ok=True)

        self._client = chromadb.PersistentClient(
            path=str(self.persist_dir),
            settings=Settings(anonymized_telemetry=False, allow_reset=False),
        )
        # cosine space — works regardless of embedding norm and is standard for semantic search
        self._collection = self._client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"},
        )

The "hnsw:space": "cosine" metadata is the one line that matters. ChromaDB's default is L2 (Euclidean) distance. For unit-length embeddings, L2 and cosine produce the same ranking, but nothing guarantees your embeddings are normalized, and L2 is the wrong mental model anyway. Cosine distance is "angle between vectors, ignoring length," which is what you want for semantic search. Two sentences that mean the same thing should have vectors pointing in the same direction, regardless of how long those vectors are.
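A toy example makes the difference concrete. The vectors below are made up and two-dimensional; the point is only that when lengths differ, the two metrics can disagree about which neighbor is nearest.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
candidates = {
    # Same direction as the query, but ten times longer.
    "same_direction_longer": [10.0, 0.0],
    # Close to the query in space, but pointing 45 degrees away.
    "nearby_other_direction": [1.0, 1.0],
}

nearest_by_cosine = min(candidates, key=lambda name: cosine_distance(query, candidates[name]))
nearest_by_l2 = min(candidates, key=lambda name: l2_distance(query, candidates[name]))

print("cosine picks:", nearest_by_cosine)
print("L2 picks:    ", nearest_by_l2)
```

Cosine picks the vector that points the same way as the query. L2 picks the one that merely sits closer in space.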

The search method does one non-obvious conversion. ChromaDB returns cosine distances in [0, 2]; 1 - distance turns that into a similarity in [-1, 1], which I then clamp to [0, 1] for display. The line similarity = max(0.0, 1.0 - float(dist)) is the only math in the file. Everything else is glue.

similarity = max(0.0, 1.0 - float(dist))
hits.append(
    SearchHit(
        text=doc,
        score=similarity,
        metadata=meta,
        chunk_index=int(meta.get("chunk_index", 0)),
    )
)

Why clamp to 0? Because cosine distance exceeds 1 whenever two vectors are more than 90 degrees apart (it reaches 2 when they point in opposite directions), which would give a "negative similarity." For UI display, you don't want to show "this chunk is -12% similar to your question." Clamping to 0 says "irrelevant" and is honest.
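To see the conversion across the whole range, here is the same formula applied to a few hand-picked distances. The helper name distance_to_similarity is hypothetical; it returns the raw value next to the clamped one so the effect of the clamp is visible.

```python
def distance_to_similarity(dist: float) -> tuple[float, float]:
    """Return (raw, shown): raw is 1 - distance, shown is clamped at 0."""
    raw = 1.0 - float(dist)
    return raw, max(0.0, raw)


# 0.0 = same direction, 1.0 = orthogonal, 2.0 = opposite direction.
for dist in (0.0, 0.25, 1.0, 1.12, 2.0):
    raw, shown = distance_to_similarity(dist)
    print(f"distance={dist:.2f}  raw={raw:+.2f}  shown={shown:.2f}")
```

A distance of 1.12 is the "-12% similar" case: the raw value is negative, and the clamped value shown to the user is 0.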

Part 4 — the prompt is the whole product

The most important 20 lines in the project are in pipeline.py:

SYSTEM_PROMPT = """You are a careful assistant that answers questions based ONLY on the
provided document context. Follow these rules strictly:

1. Use ONLY the information in the context below. Do not use outside knowledge.
2. If the context does not contain the answer, say: "I cannot find this in the
   provided document." Do NOT guess.
3. Quote or paraphrase the relevant passages. Keep answers concise.
4. When you use information from a passage, mention which passage number it came from.
"""

I rewrote this prompt six times. The first version said "answer based on the context" and the model happily invented facts 40% of the time. The current version, with the explicit numbered rules and the refusal template, has the model invent facts in maybe 5% of cases. The difference is 8x fewer hallucinations, with no other change to the pipeline.

The single most important sentence is #2: "If the context does not contain the answer, say: 'I cannot find this in the provided document.'" Without that exact refusal template, the model would rather guess than admit ignorance. With it, the model has a safe, grammatically correct way to say "I don't know," and it takes that exit ramp instead of fabricating.

The second most important sentence is #4: "mention which passage number it came from." This forces the model to engage with the structure of what I sent it. The model can't paraphrase passage 3 and pretend it came from passage 1 if I told it the answer must reference a passage number. The citations are now verifiable.

The third most important sentence is "Use ONLY the information in the context below." That single word — ONLY — does most of the work. Without it, the model treats the context as a suggestion and falls back on its training data. With it, the model treats the context as a constraint.
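Rule 4 only works if the context actually arrives as numbered passages. The post does not show how pipeline.py formats them, so the sketch below is an assumption: build_user_prompt and the "[Passage N]" label are hypothetical, and any consistent labeling scheme would serve the same purpose.

```python
def build_user_prompt(question: str, passages: list[str]) -> str:
    """Number each retrieved passage so the model can cite it by number."""
    if passages:
        numbered = "\n\n".join(
            f"[Passage {i}]\n{p.strip()}" for i, p in enumerate(passages, start=1)
        )
    else:
        # With nothing retrieved, the refusal rule in the system prompt should fire.
        numbered = "(no passages retrieved)"
    return f"Context:\n{numbered}\n\nQuestion: {question.strip()}"


print(build_user_prompt(
    "What is the warranty period?",
    ["The warranty covers parts for two years.", "Labor is covered for one year."],
))
```

Numbering from 1 in retrieval order keeps the passage numbers aligned with the sources shown in the UI, so "passage 2" in the answer can be checked against the second hit.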

Part 5 — what I got wrong

Five things, in order of how much they cost.

5.1 Embedding the whole PDF

First version: I embedded the entire 40-page PDF as one document and asked questions against the single vector. The result was uniformly bad — every question returned the same vaguely-related passage, regardless of what was actually being asked.

I had to read three papers and one textbook chapter to figure out why. Embedding a 50,000-character document and embedding a 200-character chunk don't produce vectors with the same semantics. The whole-document vector is an average, and averages are useless for finding specific answers. Chunking is not an optimization. Chunking is the algorithm.

Fix: chunk first, embed chunks. Obvious in hindsight. Took me an embarrassing amount of time to figure out the first time.
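A toy illustration of why one averaged vector cannot discriminate. It rests on a simplifying assumption: that a whole-document embedding behaves like the mean of its chunk embeddings. The three topic vectors are made up and deliberately orthogonal; real embedding models are nowhere near this clean.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up, perfectly separated topic vectors for three chunks.
chunk_vectors = {
    "pricing": [1.0, 0.0, 0.0],
    "security": [0.0, 1.0, 0.0],
    "roadmap": [0.0, 0.0, 1.0],
}

# Assumption: the whole-document vector is the per-dimension mean of its chunks.
document_vector = [sum(dim) / len(chunk_vectors) for dim in zip(*chunk_vectors.values())]

# Ask one "question" per topic by reusing that topic's vector as the query.
for topic, question_vector in chunk_vectors.items():
    best_chunk = max(
        chunk_vectors,
        key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]),
    )
    doc_score = cosine_similarity(question_vector, document_vector)
    print(f"{topic}: best chunk={best_chunk}, whole-document score={doc_score:.3f}")
```

Per-chunk search returns a different best chunk for each question. The whole-document vector scores about 0.577 for all three, so it cannot tell the questions apart.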

5.2 Using the L2 distance by default

ChromaDB's default distance metric is L2 (Euclidean). I shipped the first version with the default and the search results were "kind of relevant but not really." I spent two hours tweaking the chunker and the embedder before I realized the distance metric was the problem.

The fix is one line: metadata={"hnsw:space": "cosine"} when creating the collection. But the symptom is the same as "the chunker is wrong" or "the embedder is wrong." Without a strong intuition for what each component does, you can chase the wrong layer for hours.

The lesson: when the search results are bad, check the distance metric before you check anything else. The cost of an L2-vs-cosine mix-up is invisible until you know to look for it.

5.3 The "always answer" reflex

The first version of the system prompt said "answer the question based on the context." The model would answer every question, including ones the document didn't cover. "What year was the company founded?" on a 2024 product spec returned "2020": the model pulled a plausible year from its training data, even though 2020 appeared nowhere in the spec.

The fix is the refusal template, as discussed in Part 4. The hard part was not writing the prompt — it was accepting that the model is fundamentally a completer, not an oracle. A completer with a good prompt is a useful tool. A completer with a vague prompt is a hallucination engine.

5.4 No idempotency on re-ingest

I re-ran the ingest command on the same PDF three times while debugging. Each run added 800 new chunks. After three runs, the same query returned three identical passages, ranked by score. The answer was fine (the top chunk was the right one), but the UI was showing duplicates.

The fix: derive document_id from a hash of the file path, and use that as the prefix for chunk IDs in ChromaDB. Re-ingesting the same file generates the same IDs, and ChromaDB's .add() is idempotent on ID. This is 5 lines of code. I should have written it on day one.

5.5 Not testing the chunker first

I wrote the pipeline top-down: PDF → embed → store → query → answer. Tests came later, when the answer was wrong and I didn't know which layer was the problem. I ended up writing the chunker tests last, which was backwards.

The right order: chunker tests first (pure functions, no I/O, no network, fast), then embedder (with a fake), then store (with an in-memory ChromaDB or a mock), then pipeline (integration test with fakes for everything). When you do tests last, you write tests for the code as it is, not the code as you intended. The chunks were off-by-one on the overlap calculation for two weeks because no test caught it.
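The kind of test that would have caught the overlap bug is short. The sketch below uses a stripped-down window function (sliding_windows is a hypothetical stand-in, not the repository's chunk_text) and pytest-style test functions that assert the overlap invariants directly.

```python
def sliding_windows(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Minimal stand-in for the chunker: fixed-size windows, fixed step."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


TEXT = "abcdefghijklmnopqrstuvwxyz"
SIZE, OVERLAP = 10, 3


def test_consecutive_chunks_share_exactly_the_overlap():
    chunks = sliding_windows(TEXT, SIZE, OVERLAP)
    for left, right in zip(chunks, chunks[1:]):
        # The tail of one chunk must be the head of the next.
        assert left[-OVERLAP:] == right[:OVERLAP]


def test_chunks_reassemble_to_the_original_text():
    chunks = sliding_windows(TEXT, SIZE, OVERLAP)
    # Dropping the overlapping prefix of every later chunk must lose nothing.
    rebuilt = chunks[0] + "".join(c[OVERLAP:] for c in chunks[1:])
    assert rebuilt == TEXT


def test_no_chunk_exceeds_chunk_size():
    assert all(len(c) <= SIZE for c in sliding_windows(TEXT, SIZE, OVERLAP))
```

An off-by-one in the step fails the first test by a single character, which is exactly the sort of error that stays invisible in end-to-end answers.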

The code and how to run it

The full source is at github.com/ZalaAvinash/rag-from-scratch-python. 14 tests pass. CI runs on Python 3.11, 3.12, 3.13. MIT license.

git clone https://github.com/ZalaAvinash/rag-from-scratch-python.git
cd rag-from-scratch-python
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# One-time: pull the models
ollama pull nomic-embed-text
ollama pull llama3.2

# Ingest
PYTHONPATH=src python -m rag.cli ingest path/to/document.pdf

# Ask
PYTHONPATH=src python -m rag.cli ask "What is the main conclusion?"

Or use it as a library:

from rag import RAGPipeline, OllamaEmbedder, OllamaLLM, VectorStore

pipeline = RAGPipeline(
    embedder=OllamaEmbedder(),
    llm=OllamaLLM(),
    store=VectorStore(persist_dir="./chroma_db"),
)

pipeline.ingest("path/to/document.pdf")

result = pipeline.ask("Summarize the key points")
print(result.answer)
for hit in result.sources:
    print(f"  [{hit.chunk_index}] score={hit.score:.2f}")

Closing

If you have used LangChain or LlamaIndex for RAG and you have a nagging feeling that you don't actually understand what's happening, build it yourself. The exercise takes a weekend. The 500 lines of code are not the point — the 500 lines of thinking about chunk sizes, distance metrics, prompt design, and idempotency are the point. You will never use LangChain the same way again.

The most valuable thing I learned is that RAG is not "an algorithm." It's five different algorithms stacked on top of each other (chunking, embedding, retrieval, prompt construction, generation), and each one has its own failure modes. The high-level libraries hide the stack. The stack is the product.

If you build something similar, send me a PR. The repo is open. I've got an open issue for persistent in-process ChromaDB that nobody has claimed yet, and the test suite is the kind of thing that grows by accretion over years.

Built with: Python 3.11+ · pypdf · ChromaDB · Ollama · nomic-embed-text · llama3.2 · click · pytest

Repo: ZalaAvinash/rag-from-scratch-python

About the author: Avinash Zala is a senior .NET engineer in Surat, India, with 7+ years building enterprise web apps, APIs, and ERP systems. He is currently adding AI/LLM capabilities to his stack and writing about what he learns.