My eval harness paid for itself on the first run: 0.57 to 0.96, two bugs no unit test could catch

I almost shipped a RAG pipeline that, on certain questions, cited exactly the right document — and then told the user the answer wasn't in it.

Every unit test was green. The retrieval returned the correct chunk. The API returned 200. The citation was attached to the response. By every check I had, it worked. The first run of my eval harness scored it 0.57, and that number is the only reason I found out before users did.

This is the story of those two bugs, why no unit test I could have written would have caught them, and why I now believe an eval harness belongs in a GenAI project from day one — not "once it's stable."

What the eval harness actually does

For the RAG starter I was building, "chat with your documents," I wanted a test that exercised the thing users actually do, end to end. So the harness:

Ingests a small fixture corpus (a few public-domain documents) with real embeddings into a dedicated database — no mocks.
Runs the real agent loop for each question: the same tool-calling, the same retrieval, the same prompts as production.
Grades every answer with an LLM-as-judge, scoring two things: faithfulness (is the answer supported by the retrieved context?) and citation correctness (do the citations point to the right places?).
Includes negative cases — questions whose honest answer is "that's not in these documents" — where the correct behavior is a refusal, not a confident guess.

It's deterministic-ish (temperature 0, a fixed judge prompt), and the score is recorded with every commit, so a regression in retrieval visibly drops the number in a PR.
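To make the shape concrete, here is a minimal sketch of that loop, not the starter's actual code. The names EvalCase, Verdict, and run_eval are hypothetical, the agent and judge are passed in as plain callables, and averaging the two metrics into one headline number is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expect_refusal: bool = False  # True for negative cases


@dataclass
class Verdict:
    faithfulness: float          # 0.0 to 1.0, assigned by the judge
    citation_correctness: float  # 0.0 to 1.0, assigned by the judge


def run_eval(
    cases: list[EvalCase],
    run_agent: Callable[[str], str],
    judge: Callable[[EvalCase, str], Verdict],
) -> dict[str, float]:
    """Run every case through the agent, grade each answer, average the scores."""
    verdicts = [judge(case, run_agent(case.question)) for case in cases]
    faithfulness = mean(v.faithfulness for v in verdicts)
    citations = mean(v.citation_correctness for v in verdicts)
    return {
        "faithfulness": round(faithfulness, 2),
        "citation_correctness": round(citations, 2),
        # Assumption: the headline score is the plain mean of the two metrics.
        "overall": round((faithfulness + citations) / 2, 2),
    }
```

In a real harness, run_agent would call the production agent loop and judge would call the LLM judge with its fixed prompt; passing them in as arguments just keeps the sketch self-contained.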

Then I ran it for the first time. 0.57.

Bug #1: the model was shown a preview, not the evidence

Digging into the low-faithfulness cases, I found answers where the agent retrieved the correct chunk, attached it as a citation, and still said the information wasn't there. It was citing a source while denying its contents. That makes no sense — until you look at what the model was actually handed.

When the agent called the search tool, the tool result it received back was the 400-character UI snippet — the little preview you show in a citation chip — not the full chunk text. The snippet was built for humans skimming a sidebar. The model was being asked to answer from the preview while the actual evidence sat in a field it never saw. If the answer lived past character 400 of the chunk, the model genuinely couldn't see it, and faithfully reported it wasn't there.

The fix was to separate the two concerns: a snippet for the UI, and the full content for the model.

class Citation:
    snippet: str   # short preview, for the UI chip
    content: str   # full chunk text, for the model to reason over
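A small sketch of why the split matters, under a few assumptions: Chunk, to_ui_chip, and to_tool_result are hypothetical names, and the preview is taken as a plain 400-character prefix. The point is only that the UI and the model get different projections of the same chunk.

```python
from dataclasses import dataclass

SNIPPET_CHARS = 400  # preview length of the UI snippet described above


@dataclass
class Chunk:
    doc_id: str
    text: str


def to_ui_chip(chunk: Chunk) -> dict[str, str]:
    """What the citation chip renders: a short preview only."""
    return {"doc_id": chunk.doc_id, "snippet": chunk.text[:SNIPPET_CHARS]}


def to_tool_result(chunk: Chunk) -> dict[str, str]:
    """What the search tool hands back to the model: the full chunk text."""
    return {"doc_id": chunk.doc_id, "content": chunk.text}


# The answer sits past character 400, so only the full content contains it.
chunk = Chunk("doc-1", "x" * 500 + " The warranty lasts 24 months.")
assert "24 months" not in to_ui_chip(chunk)["snippet"]
assert "24 months" in to_tool_result(chunk)["content"]
```

The original bug amounts to returning the first projection where the second was needed.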

Here's the part that matters for this post: every unit test still passed after I found this, and every unit test passed before. Retrieval returned the right chunk (✓). The citation was attached (✓). The endpoint returned 200 (✓). The bug wasn't in any unit — it was in the contract between units: what one component handed another, inside a loop, at runtime. The only way to see it was to look at the behavior of the whole system on a real question and ask "is this answer actually good?" That question is precisely what an eval harness asks and a unit test does not.

Bug #2: the agent answered from memory instead of searching

The negative cases surfaced the second bug. For general-knowledge questions, the agent would often just... answer. Confidently. From the model's own parametric memory, without ever calling the search tool. Sometimes it was even right — which is worse, because it means the behavior is unreliable in a way that looks fine in a demo.

For a "chat with your documents" product, that's a correctness bug: the whole value proposition is that answers are grounded in your documents, with citations. An answer that skips retrieval is off-contract even when it happens to be correct, and it's how you get confident hallucinations on the questions that matter.

The fix was prompt-level: a hardened, search-first system prompt and a sharper tool description that makes "search before you answer" the default, plus a relevance-score floor on citations so weak matches don't get dressed up as sources. The eval harness is what told me the fix worked rather than just felt better — the negative cases started passing without dragging down the answers that should be grounded.
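The relevance floor and the "did it search at all?" check are both easy to sketch. Everything named here is illustrative: ScoredCitation, the 0.35 threshold, and the search_documents tool name are assumptions, and the right floor depends on your embedding model and corpus.

```python
from dataclasses import dataclass


@dataclass
class ScoredCitation:
    doc_id: str
    score: float  # retrieval relevance, higher is better


MIN_RELEVANCE = 0.35  # hypothetical floor; tune it against your own eval cases


def keep_relevant(
    citations: list[ScoredCitation], floor: float = MIN_RELEVANCE
) -> list[ScoredCitation]:
    """Drop weak matches so they are never presented as sources."""
    return [c for c in citations if c.score >= floor]


def searched_before_answering(
    tool_calls: list[str], search_tool: str = "search_documents"
) -> bool:
    """Harness-side check: did the agent call the search tool at least once?"""
    return search_tool in tool_calls
```

The second function is the kind of assertion the harness can make per case: an answer produced with an empty tool-call trace is off-contract even when the judge finds it plausible.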

After both fixes, the next run came back at 0.96 (faithfulness 0.99, citation correctness 0.93). Same questions, same corpus: the harness measured the difference instead of me guessing at it.

Why this changed how I build GenAI features

Unit tests verify that a function does what you wrote it to do. They're necessary, but both of my bugs slipped past a green suite because every function did exactly what it was written to do; the system as a whole just behaved badly. LLM-powered features fail at a different layer: not "this function returned the wrong value" but "this non-deterministic pipeline produced an unfaithful answer." You can't assert your way to that with assertEqual.

An eval harness is the test for that layer. A few things I'd now treat as non-negotiable:

Run the real pipeline. Mocks would have hidden both bugs — the snippet/content split and the skipped search only exist in the real loop.
Grade behavior, not strings. "Is this answer faithful to the context?" is the question; an LLM judge with a fixed rubric and structured output is a practical way to ask it at scale.
Include negative cases. "The answer is not in the corpus" catches the confident-hallucination failure mode that positive cases never will.
Commit the score. A number in every PR turns "did retrieval regress?" from a vibe into a diff.
Do it on day one. The harness cost me an afternoon and caught two ship-blockers on its first run. "We'll add evals later" means shipping the 0.57 version.
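"Commit the score" can be as small as a JSON file checked into the repo plus a comparison in CI. A minimal sketch, assuming a hypothetical baseline file with a single "overall" key and a tolerance of 0.02 to absorb judge noise; none of these names come from the starter itself.

```python
import json
from pathlib import Path


def write_score(current: float, baseline_path: Path) -> None:
    """Record the latest score so it shows up as a diff in the PR."""
    baseline_path.write_text(json.dumps({"overall": current}, indent=2) + "\n")


def check_regression(
    current: float, baseline_path: Path, tolerance: float = 0.02
) -> bool:
    """Return True if the score has not dropped more than `tolerance` below baseline."""
    if not baseline_path.exists():
        return True  # first run: nothing to compare against yet
    baseline = json.loads(baseline_path.read_text())["overall"]
    return current >= baseline - tolerance
```

A CI step would call check_regression with the fresh score and fail the build on False, then call write_score so the updated number lands in the same commit.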

The whole harness — runner, fixture corpus, judge, and the per-commit results — is open source (MIT) in the starter, if you want the exact shape of it:

👉 github.com/delmalih/saas-genai-starter

The uncomfortable takeaway: my pipeline was "done" and fully green at 0.57. The only thing standing between that and production was one number from a test that actually asked whether the answers were any good. If you're building anything RAG- or agent-shaped, write that test first.