Why RAG Pipelines Fail on East African Documents — and What I Built to Fix It

go dev.to

I have been watching AI teams in Nairobi build retrieval systems on Kenyan regulatory and legal documents and hit the same invisible wall. They tune their prompts. They swap embedding models. They adjust retrieval parameters. The answers are still wrong.

The failure is not in the LLM. It is not in the vector store. It is happening one layer below all of that, before any of it runs.

It is in the chunking step.


The problem with generic chunking

Mainstream RAG frameworks such as LangChain, LlamaIndex, and Amazon Bedrock Knowledge Bases chunk documents in much the same way by default: fixed character or token counts, or sentence boundaries. This works reasonably well for English-language enterprise documents. It fails on East African regulatory and legal documents in ways that are hard to debug, because nothing errors out: the pipeline runs and the answers are simply wrong.

Here is a concrete example. Take a CBK (Central Bank of Kenya) circular. Its structure looks like this:

1. Purpose
   This circular provides guidelines on...

2. Scope
   This circular applies to all institutions...

3. Customer Data Protection Requirements
   3.1 All DCPs must implement end-to-end encryption...
   3.2 Customer data must be stored within Kenya...
   3.3 DCPs must appoint a dedicated Data Protection Officer...

A generic chunker splitting at 500 characters will cut somewhere inside section 3. The chunk boundary lands mid-requirement. The embedding model encodes an incomplete regulatory instruction. When a compliance team queries "what are the data protection requirements for DCPs", the retrieval system returns a fragment — missing the subsections that define the actual obligations.

The LLM fills in the gap with its training data. The answer looks confident and is wrong.
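To make the failure concrete, here is a minimal sketch of a structure-blind splitter. The function name fixedSizeChunks and the shortened sample text are illustrative assumptions, not code from any of the frameworks above, and the 80-character limit stands in for 500 characters on a full-length circular.

```go
package main

import "fmt"

// fixedSizeChunks splits text into pieces of at most size runes,
// ignoring any document structure. Hypothetical helper for illustration.
func fixedSizeChunks(text string, size int) []string {
	if size <= 0 {
		return nil
	}
	runes := []rune(text)
	var chunks []string
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
	}
	return chunks
}

func main() {
	circular := "3. Customer Data Protection Requirements\n" +
		"3.1 All DCPs must implement end-to-end encryption.\n" +
		"3.2 Customer data must be stored within Kenya.\n" +
		"3.3 DCPs must appoint a dedicated Data Protection Officer.\n"
	// The first boundary falls inside requirement 3.1, so no single
	// chunk carries the heading together with all of its subsections.
	for i, c := range fixedSizeChunks(circular, 80) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Running it shows the heading and its obligations scattered across separate chunks, which is exactly what the embedding model ends up encoding.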

The same failure mode hits every East African document type:

SACCO loan policies — penalty clauses separated from the grace period conditions they govern. A chunk containing "a penalty of 2% per month will be charged" with no surrounding context about when the grace period ends is not a useful retrieval unit.

Kenyan court judgments — the FINDINGS section separated from the ORDER that follows from them. A RAG system that retrieves the order without the findings cannot explain why the court ruled that way.

Kenyan legislation — the CBK Act (Cap. 491) has been amended repeatedly. It contains alphanumeric section numbers: 33N, 33O, 33U, 51B. A generic chunker treating these as arbitrary text has no way to know that 33U. is a section boundary and (1) is a subsection that should stay inside its parent chunk.


The root cause

Generic chunkers do not know what kind of document they are reading. They apply the same strategy to a CBK circular, a SACCO policy, a Kenyan Act, and a Silicon Valley terms-of-service document. Structure that is obvious to any lawyer or compliance officer — numbered sections, ALL-CAPS headings, legislative part headers — is invisible to a chunker that only sees character counts.

The fix is obvious once you see it: detect the document type first, then apply the correct structural cutting grammar for that type.


Building Hekima

I built Hekima to solve this. The name means "wisdom" in Swahili.

The architecture is simple:

Input: document (.txt or .pdf)
          ↓
[ Document Type Detector ]
          ↓
[ Structure-Aware Chunker ]
          ↓
Output: JSON chunks with section labels, token counts, metadata

Detection

Detection is deterministic and stateless, with no ML model. Each document type has a set of lexical fingerprints: phrases that are characteristic of that type and, taken together, rare in the others.

signatures := map[models.DocumentType][]string{
    models.TypeLegislation: {
        "Laws of Kenya",
        "An Act of Parliament",
        "Cap.",
        "Short title",
        "Interpretation",
        "PRELIMINARY",
    },
    models.TypeCBKCircular: {
        "Central Bank of Kenya",
        "Governor",
        "Ref. No. CBK",
        "pursuant to",
        "Banking Act",
        "all institutions",
    },
    // ...
}

The detector scores each type by counting phrase matches. A minimum score of 2 is required, which prevents a single accidental match from mislabeling a document. Ties return TypeUnknown. Because there is no randomness and no state, the same document always produces the same result.
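The scoring rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Hekima's actual detector: the detect function, the minScore constant, the local DocumentType definition, and the case-sensitive substring match are all mine, and only the two signature lists shown above are included.

```go
package main

import (
	"fmt"
	"strings"
)

// DocumentType and its values are local stand-ins for the models package.
type DocumentType string

const (
	TypeUnknown     DocumentType = "unknown"
	TypeLegislation DocumentType = "legislation"
	TypeCBKCircular DocumentType = "cbk_circular"
)

// minScore is the assumed name for the "at least 2 matches" rule.
const minScore = 2

var signatures = map[DocumentType][]string{
	TypeLegislation: {"Laws of Kenya", "An Act of Parliament", "Cap.",
		"Short title", "Interpretation", "PRELIMINARY"},
	TypeCBKCircular: {"Central Bank of Kenya", "Governor", "Ref. No. CBK",
		"pursuant to", "Banking Act", "all institutions"},
	// other document types are elided in the source
}

// detect counts matching phrases per type and returns the single best
// type, or TypeUnknown when the best score is below minScore or is tied.
func detect(text string) DocumentType {
	best, bestScore, tied := TypeUnknown, 0, false
	for docType, phrases := range signatures {
		score := 0
		for _, p := range phrases {
			if strings.Contains(text, p) {
				score++
			}
		}
		switch {
		case score > bestScore:
			best, bestScore, tied = docType, score, false
		case score == bestScore && score > 0:
			tied = true
		}
	}
	if bestScore < minScore || tied {
		return TypeUnknown
	}
	return best
}

func main() {
	fmt.Println(detect("Laws of Kenya. An Act of Parliament. Cap. 491"))
	fmt.Println(detect("A memo that mentions the Central Bank of Kenya once"))
}
```

Go randomizes map iteration order, but the result does not depend on it: the function tracks the maximum score and whether any other type shares it.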

One subtlety worth noting: CBK circulars and Kenyan legislation share vocabulary. A CBK circular cites "Central Bank of Kenya" and "Banking Act" because it is issued under those authorities. The CBK Act itself also contains those phrases. The disambiguation comes from the score gap: only legislation contains "Laws of Kenya", "An Act of Parliament", "Cap.", "Short title", "Interpretation", and "PRELIMINARY". A real Act scores 5-6 on TypeLegislation and 1-2 on TypeCBKCircular. A circular scores 4-5 on TypeCBKCircular and 0-1 on TypeLegislation. The gap is decisive.

Chunking strategies

Each document type has its own splitting strategy:

CBK Circulars — splitByNumberedSections

Splits at top-level integer sections only. isNumberedSection() uses rune iteration (not byte indexing) for unicode safety, and explicitly rejects subsections:

// "3. Requirements" → new chunk boundary
// "3.1 Specific requirement" → stays inside section 3's chunk
// Needs imports "strings" and "unicode".
func isNumberedSection(line string) bool {
    runes := []rune(line)
    i := 0
    // consume the leading digits
    for i < len(runes) && unicode.IsDigit(runes[i]) {
        i++
    }
    // must start with at least one digit, followed by a dot
    // (without the i == 0 check, a line such as "..." would pass)
    if i == 0 || i >= len(runes) || runes[i] != '.' {
        return false
    }
    i++
    // must NOT be followed by another digit (that would be a subsection)
    if i >= len(runes) || unicode.IsDigit(runes[i]) {
        return false
    }
    // require a heading title of more than one character after the dot
    return len(strings.TrimSpace(string(runes[i:]))) > 1
}
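A rough sketch of how a splitter might use that predicate follows. The section type and the body of splitByNumberedSections are assumptions for illustration. The block carries its own copy of isNumberedSection with a guard that requires at least one leading digit, and it simply drops any text that appears before the first heading.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// section is a hypothetical container for one top-level section.
type section struct {
	Heading string
	Body    string
}

// isNumberedSection reports whether line looks like "3. Title" but not "3.1 ...".
func isNumberedSection(line string) bool {
	runes := []rune(line)
	i := 0
	for i < len(runes) && unicode.IsDigit(runes[i]) {
		i++
	}
	if i == 0 || i >= len(runes) || runes[i] != '.' {
		return false
	}
	i++
	if i >= len(runes) || unicode.IsDigit(runes[i]) {
		return false
	}
	return len(strings.TrimSpace(string(runes[i:]))) > 1
}

// splitByNumberedSections starts a new section at every top-level heading
// and keeps subsections (3.1, 3.2, ...) inside their parent's body.
func splitByNumberedSections(text string) []section {
	var out []section
	var cur *section
	var body []string
	flush := func() {
		if cur != nil {
			cur.Body = strings.Join(body, "\n")
			out = append(out, *cur)
		}
		body = nil
	}
	for _, raw := range strings.Split(text, "\n") {
		line := strings.TrimSpace(raw)
		if isNumberedSection(line) {
			flush()
			cur = &section{Heading: line}
			continue
		}
		if cur != nil && line != "" {
			body = append(body, line)
		}
	}
	flush()
	return out
}

func main() {
	doc := "1. Purpose\n   This circular provides guidelines.\n\n" +
		"2. Scope\n   This circular applies to all institutions.\n\n" +
		"3. Customer Data Protection Requirements\n" +
		"   3.1 Encrypt data.\n   3.2 Store data locally.\n"
	for _, s := range splitByNumberedSections(doc) {
		fmt.Printf("%q -> %q\n", s.Heading, s.Body)
	}
}
```

The point of the design is that a query about section 3 retrieves the heading together with every obligation under it.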

Kenyan Legislation — splitLegislation

This was the most interesting to build. The CBK Act (Cap. 491) PDF, extracted via pdftotext -layout, contains:

  1. A Table of Contents with the same Part and Section patterns as the body — but every TOC entry has dot-leaders ("1. Short title ............... 1"). Detection: strings.Contains(line, "...")

  2. Part headers with Roman numerals and sub-part suffixes introduced by amendment: Part VIA, Part VIB, Part VIC, Part VID.

  3. Section numbers with uppercase letter suffixes: 33N, 33O, 33U, 51B.

  4. Repealed sections that match the section header regex but contain only a bracketed repeal notice below minChunkLength — correctly producing no chunk.

The regex that handles all of this:

// Handles: "1.", "4A.", "33U.", "51B."
var sectionHeader = regexp.MustCompile(`^(\d+[A-Z]*)\.\s+\S`)

// Handles: "Part I", "Part VIA", "Part VIB"
var partHeader = regexp.MustCompile(
    `(?i)^Part\s+[IVXLCDM]+(A|B|C|D)?\s+[–-]\s+\S`,
)

Part identity is preserved in Metadata["part"] on every section chunk beneath it, so a RAG pipeline can filter or boost by Part without needing a dedicated Part-only chunk.
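Put together, the four rules above might look roughly like this. It is a sketch under assumptions rather than the real splitLegislation: the chunk type, the value of minChunkLength, and the choice to measure only the body text against it are all illustrative, and the sample document is synthetic.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var sectionHeader = regexp.MustCompile(`^(\d+[A-Z]*)\.\s+\S`)
var partHeader = regexp.MustCompile(`(?i)^Part\s+[IVXLCDM]+(A|B|C|D)?\s+[–-]\s+\S`)

// minChunkLength is an assumed threshold; the real value is not given.
const minChunkLength = 40

// chunk is a hypothetical, trimmed-down version of the output record.
type chunk struct {
	Section  string
	Text     string
	Metadata map[string]string
}

func splitLegislation(text string) []chunk {
	var chunks []chunk
	var part, heading string
	var body []string
	// flush emits the pending section unless its body is too short,
	// which is how a bare repeal notice ends up producing no chunk.
	flush := func() {
		joined := strings.Join(body, "\n")
		if heading != "" && len(joined) >= minChunkLength {
			chunks = append(chunks, chunk{
				Section:  heading,
				Text:     joined,
				Metadata: map[string]string{"part": part},
			})
		}
		heading, body = "", nil
	}
	for _, raw := range strings.Split(text, "\n") {
		line := strings.TrimSpace(raw)
		switch {
		case line == "" || strings.Contains(line, "..."):
			continue // blank line, or a TOC entry with dot-leaders
		case partHeader.MatchString(line):
			flush()
			part = line // remembered for every section beneath it
		case sectionHeader.MatchString(line):
			flush()
			heading = line
		case heading != "":
			body = append(body, line)
		}
	}
	flush()
	return chunks
}

func main() {
	doc := "PART I – PRELIMINARY ........ 3\n1. Short title ........ 3\n" +
		"PART I – PRELIMINARY\n1. Short title\n" +
		"This Act may be cited as the Example Act, used here only as sample text.\n" +
		"PART VIA – SAMPLE HEADING\n33U. Sample section\n" +
		"(1) A subsection stays inside its parent section chunk in this sketch.\n" +
		"34. Deleted\n[Repealed.]\n"
	for _, c := range splitLegislation(doc) {
		fmt.Printf("%s | %s\n", c.Metadata["part"], c.Section)
	}
}
```

One detail worth knowing about the part regex: C and D are themselves Roman numeral characters, so "Part VIC" and "Part VID" match through the numeral class rather than the suffix group. The outcome is the same either way.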

The chunk output

Every chunk carries:

{
  "id": 3,
  "text": "3.1 All DCPs must implement end-to-end encryption...\n3.2 Customer data must be stored within Kenya...",
  "section": "3. Customer Data Protection Requirements",
  "doc_type": "cbk_circular",
  "filename": "cbk_circular.pdf",
  "token_count": 89,
  "metadata": {}
}

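token_count uses a word count × 1.3 heuristic, which roughly approximates BPE token counts for English prose. Swahili text often splits into more subword tokens per word, so treat the figure as an estimate and leave some headroom. Use it to enforce context window limits when batching chunks for an embedding model.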
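As a small sketch of the heuristic and the batching it enables: the names estimateTokens and batchByTokens are hypothetical, rounding up is my assumption (the source does not say how the product is rounded), and the greedy batching strategy is only one reasonable option.

```go
package main

import (
	"fmt"
	"strings"
)

// estimateTokens applies the words × 1.3 heuristic using integer
// arithmetic, rounding up (the rounding direction is an assumption).
func estimateTokens(text string) int {
	words := len(strings.Fields(text))
	return (words*13 + 9) / 10
}

// batchByTokens greedily groups texts so that each batch stays within
// limit estimated tokens. A single text larger than limit still gets
// a batch of its own rather than being dropped.
func batchByTokens(texts []string, limit int) [][]string {
	var batches [][]string
	var cur []string
	used := 0
	for _, t := range texts {
		n := estimateTokens(t)
		if len(cur) > 0 && used+n > limit {
			batches = append(batches, cur)
			cur, used = nil, 0
		}
		cur = append(cur, t)
		used += n
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	chunks := []string{
		"3.1 All DCPs must implement end-to-end encryption",
		"3.2 Customer data must be stored within Kenya",
		"3.3 DCPs must appoint a dedicated Data Protection Officer",
	}
	for i, b := range batchByTokens(chunks, 20) {
		fmt.Printf("batch %d holds %d chunk(s)\n", i, len(b))
	}
}
```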

overlap_words is available via the API. It repeats the last N words of each chunk at the start of the next. The overlap is applied as a single post-processing pass after splitting, so it is written once and shared across all document types. Recommended for dense regulatory prose: 15–25 words.
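A post-processing pass of that kind could be sketched as follows. The function name applyOverlap is hypothetical, and two details are assumptions on my part: the overlap is taken from the previous chunk's original text so that it does not compound, and the words are rejoined with single spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// applyOverlap returns a copy of chunks in which every chunk after the
// first is prefixed with the last n words of the chunk before it.
func applyOverlap(chunks []string, n int) []string {
	out := make([]string, len(chunks))
	copy(out, chunks)
	if n <= 0 {
		return out
	}
	for i := 1; i < len(chunks); i++ {
		// Read from the original slice so overlaps never accumulate.
		words := strings.Fields(chunks[i-1])
		if len(words) > n {
			words = words[len(words)-n:]
		}
		if len(words) > 0 {
			out[i] = strings.Join(words, " ") + " " + chunks[i]
		}
	}
	return out
}

func main() {
	chunks := []string{
		"a penalty of 2% per month will be charged",
		"after the grace period ends",
	}
	for _, c := range applyOverlap(chunks, 3) {
		fmt.Println(c)
	}
}
```

Because the pass only sees a slice of chunk texts, it works the same way whichever splitter produced them.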


The HTTP API

Hekima runs as an HTTP server. Integration into any RAG pipeline is one call:

curl -X POST https://hekima-production.up.railway.app/chunk \
  -F "file=@cbk_circular.pdf" \
  -F "overlap_words=20"

Response is a JSON array of chunks, ready to feed to any embedding model — Titan Embeddings, OpenAI Embeddings, Cohere Embed, whatever you are using.
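For pipelines written in Go, the same call can be made with the standard library. This is a sketch, not an official client: buildChunkRequest and decodeChunks are hypothetical helpers, the Chunk field types are inferred from the sample output (metadata is kept as a loose map because its value types are not specified), and the sketch assumes the endpoint answers 200 with a JSON array.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
	"strconv"
)

// Chunk mirrors the JSON fields in the sample output; types are assumed.
type Chunk struct {
	ID         int                    `json:"id"`
	Text       string                 `json:"text"`
	Section    string                 `json:"section"`
	DocType    string                 `json:"doc_type"`
	Filename   string                 `json:"filename"`
	TokenCount int                    `json:"token_count"`
	Metadata   map[string]interface{} `json:"metadata"`
}

// buildChunkRequest assembles the same multipart form as the curl call:
// a "file" part plus an "overlap_words" field.
func buildChunkRequest(endpoint, filename string, content []byte, overlapWords int) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filepath.Base(filename))
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	if err := w.WriteField("overlap_words", strconv.Itoa(overlapWords)); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// decodeChunks parses the JSON array returned by the endpoint.
func decodeChunks(data []byte) ([]Chunk, error) {
	var chunks []Chunk
	err := json.Unmarshal(data, &chunks)
	return chunks, err
}

func run(path string) error {
	content, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	req, err := buildChunkRequest("https://hekima-production.up.railway.app/chunk", path, content, 20)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	chunks, err := decodeChunks(data)
	if err != nil {
		return err
	}
	for _, c := range chunks {
		fmt.Printf("%d %s (%d tokens)\n", c.ID, c.Section, c.TokenCount)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run . <document.pdf>")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Given the rate limit described below, a batch job calling this in a loop should pace its requests.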

The server runs behind a per-IP rate limiter (10 requests/minute, burst of 5) with structured request logging. The Docker image is 19MB — multi-stage build, non-root user, poppler-utils included for PDF extraction.


Technical choices worth explaining

Why Go?

Single binary. No separate runtime to install. No dependency hell. The Docker image is 19MB including PDF extraction. A Python equivalent with FastAPI and its dependencies would likely be several times larger, often in the hundreds of megabytes. For infrastructure that needs to run reliably in constrained East African deployment environments, that matters.

Outside stdlib, Hekima has one dependency: golang.org/x/time/rate for the token bucket rate limiter. Everything else — HTTP server, JSON, multipart parsing, file I/O — is stdlib.

Why pdftotext and not a Go PDF library?

pdftotext -layout preserves the spatial layout of the document, which is critical for section detection. The -layout flag reconstructs the visual column structure from the PDF's character position data. A pure Go PDF parser that extracts only the raw text stream typically loses this layout information, and section headers become indistinguishable from body text.

The tradeoff: poppler-utils must be installed as a system dependency. The Dockerfile handles this. For CLI users on Debian or Ubuntu it is a single command: apt install poppler-utils.
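Shelling out to pdftotext from Go takes only a few lines. This is a sketch of one way to do it, with hypothetical helper names (pdftotextArgs, extractText), and it is not necessarily how Hekima invokes the tool. Passing "-" as the output file tells pdftotext to write the text to stdout.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pdftotextArgs builds the argument list: -layout keeps the physical
// layout, and "-" as the output file sends the text to stdout.
func pdftotextArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// extractText runs pdftotext and returns the extracted text.
func extractText(pdfPath string) (string, error) {
	if _, err := exec.LookPath("pdftotext"); err != nil {
		return "", fmt.Errorf("pdftotext not found (install poppler-utils): %w", err)
	}
	out, err := exec.Command("pdftotext", pdftotextArgs(pdfPath)...).Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext failed: %w", err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run . <document.pdf>")
		return
	}
	text, err := extractText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Checking for the binary up front turns a missing system dependency into a clear error message instead of an opaque exec failure.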

Why deterministic detection and not an ML classifier?

East African regulatory documents have stable structural conventions that have not changed significantly in decades. "Ref. No. CBK" appears on every CBK circular. "Laws of Kenya" appears on every revised statute. "REPUBLIC OF KENYA" appears in every court judgment header. These are not probabilistic signals — they are definitive identifiers.

A trained classifier would require a labeled dataset, introduce a dependency on a model file, add latency, and produce probabilistic outputs that need a confidence threshold. The lexical fingerprint approach is faster, requires no training data, produces deterministic results, and is trivially explainable. For this use case it is the right choice.


What is next

The engine covers the five highest-value East African document types. The pattern for adding new types is established and documented in CONTRIBUTING.md — if you work with NTSA forms, KRA notices, or county government circulars and have real document samples, contributions are welcome.

The live demo is at https://hekima-production.up.railway.app — upload any supported document type and see the chunks.

The repo is at https://github.com/the-veez/hekima — open source, MIT licensed, CI green.

If you are building RAG systems on East African documents and this solves a problem you have been fighting, I would like to hear about it.


Built in Nairobi. Solving African problems with African context.

Source: dev.to
