Symport is an AI document ingestion pipeline that turns a phone photo of any paper document — receipt, EOB, prescription, utility bill — into structured JSON, then stores it in Postgres with embeddings for semantic search. The full flow is: image upload → Sharp preprocessing → GPT-4o vision extraction → normalized JSON → Postgres + pgvector. I built it because I hate paper and I also lose paper.
This post walks through how the pipeline actually works, including the prompt engineering decisions that make extraction reliable enough to trust and the fallback layers that keep the app useful when extraction fails.
TL;DR
- Stack: Sharp for image preprocessing, GPT-4o for vision extraction, Prisma + Postgres + pgvector for storage and semantic search.
- The extraction prompt does most of the work: explicit date context to fight year hallucinations, constrained type/category enums for predictable downstream branching, and a strict "JSON only, no markdown" tail.
- User correction loop: users can add freeform feedback ("the drug name is metformin, not metFORMIN") and re-run extraction; the feedback gets injected back into the system prompt.
- Schema choice: a single extractedData JSON column instead of per-type tables, with a denormalized searchText field for fast keyword search and an embedding column for semantic search.
- Two fallback layers: the document still saves if there's no API key, and still saves with an error summary if extraction throws. Nothing is ever lost because AI had a bad day.
What the pipeline does
The flow is straightforward:
Image upload → sharpen + encode → GPT-4o vision → structured JSON → Postgres + embeddings
A user photographs a receipt, an insurance EOB, a prescription, a utility bill — anything on paper. The app returns a structured JSON object with the relevant fields extracted, tagged, and ready to query. No manual data entry.
Step 1: Image preprocessing
Raw phone photos are large and often noisy. Before sending to the vision model, every image gets sharpened and re-encoded using Sharp:
const rawBuffer = Buffer.from(await file.arrayBuffer());
const buffer = await sharpenAndEncode(rawBuffer);
Sharp handles resizing, sharpening, and JPEG re-encoding in one pass. This serves two purposes: it reduces the payload size for the API call, and sharpening improves OCR accuracy on text-heavy documents like receipts. A blurry photo of small print is genuinely harder for vision models — a little preprocessing pays off.
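sharpenAndEncode itself boils down to one chained Sharp call, roughly like this (a sketch; the width cap and quality values here are illustrative, not the repo's actual settings):
import sharp from "sharp";

// Sketch of sharpenAndEncode. The resize width and JPEG quality
// are illustrative guesses, not values from the Symport repo.
export async function sharpenAndEncode(input: Buffer): Promise<Buffer> {
  return sharp(input)
    .rotate() // auto-orient from EXIF so phone photos aren't sideways
    .resize({ width: 2048, withoutEnlargement: true }) // cap payload size
    .sharpen() // crisp up small print before it hits the vision model
    .jpeg({ quality: 80, mozjpeg: true }) // re-encode to a compact JPEG
    .toBuffer();
}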
The processed image gets saved to disk as the source of truth, then the buffer goes to the extraction pipeline:
const filename = `${randomBytes(12).toString("hex")}.jpg`;
await writeFile(fullPath, buffer);
const extracted = await extractFromImageBuffer(buffer);
Random hex filename prevents collisions and avoids leaking any metadata about the document in the path.
Step 2: The extraction prompt
This is where most of the real engineering lives. The system prompt does a lot of work to make the model's output consistent and parseable.
The prompt has three parts assembled at startup:
const EXTRACTION_SYSTEM_HEAD = `You are a document extraction assistant. Analyze the image and extract structured data.
Current date context: We are in 2025. Use 2025 (not 2023 or other past years) for any ambiguous or partial dates when no stronger clue is present.
Use context clues from the document text to infer the correct year:
- "2025 taxes due in 2026" → tax year 2025
- "Plan year 2025", "Coverage year 2025" → use 2025
- "Due in 2026" on a tax-related doc often refers to tax year 2025
Respond with a single JSON object. Include "type", "category", "title", and "tags" in every response.
- "type": one of rx_receipt, eob, utility_bill, general
- "category": one of receipt, financial, medical, government, legal, identity, general
- "title" (required, 2-5 words max): Short label only. No sentences.
- "tags": array of 3–8 short labels. No spaces; use underscores if needed.
`;
A few decisions worth calling out here:
Explicit date context. Vision models can hallucinate dates, especially on documents where the year is ambiguous. Anchoring the prompt with the current year and showing examples of how to reason about year context dramatically reduces date errors. Without this, a 2025 tax document might come back with 2023 dates because the model defaulted to its training data.
Constrained type and category values. Giving the model an explicit enum for type and category means you get predictable values you can branch on in code. Open-ended classification produces inconsistent strings that are annoying to handle downstream.
Short title constraint. "2-5 words max, no sentences" prevents the model from writing a summary disguised as a title. You want "Prescription receipt" not "This document appears to be a receipt from Walgreens for a prescription medication."
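One payoff of the constrained enums is that the response can be validated before anything touches the database. A sketch using zod (illustrative, not necessarily how Symport validates):
import { z } from "zod";

// Illustrative validation layer mirroring the prompt's enums.
const ExtractionSchema = z
  .object({
    type: z.enum(["rx_receipt", "eob", "utility_bill", "general"]),
    category: z.enum([
      "receipt", "financial", "medical",
      "government", "legal", "identity", "general",
    ]),
    title: z.string().min(1),
    tags: z.array(z.string()),
  })
  .passthrough(); // type-specific fields (drug_name, insurer, ...) ride along

// Throws if the model drifts outside the enums, so unexpected values
// never reach the code that branches on them.
export const parseExtraction = (jsonStr: string) =>
  ExtractionSchema.parse(JSON.parse(jsonStr));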
The prompt ends with a short tail:
const EXTRACTION_SYSTEM_TAIL = `
Use null for missing values. Amounts as numbers. Dates as YYYY-MM-DD; use context clues for year. Output only valid JSON, no markdown or explanation.`;
"Output only valid JSON, no markdown or explanation" is load-bearing. Without it, GPT-4o will frequently wrap the response in a markdown code block. The extraction code handles that case anyway, but telling the model not to do it reduces the cleanup work:
// Strip optional markdown code block
let jsonStr = raw;
const match = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
if (match) jsonStr = match[1].trim();
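For context, the call into the model is ordinary OpenAI SDK usage. Here's the rough shape (a sketch; the real extractFromImageBuffer parses and normalizes the JSON afterward, and its options may differ):
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper showing the raw vision call; parameter choices
// here are assumptions, not lifted from Symport's source.
async function callVisionModel(buffer: Buffer): Promise<string> {
  const dataUrl = `data:image/jpeg;base64,${buffer.toString("base64")}`;
  const res = await openai.chat.completions.create({
    model: process.env.OPENAI_EXTRACTION_MODEL || "gpt-4o",
    messages: [
      { role: "system", content: EXTRACTION_SYSTEM_HEAD + EXTRACTION_SYSTEM_TAIL },
      { role: "user", content: [{ type: "image_url", image_url: { url: dataUrl } }] },
    ],
  });
  return res.choices[0].message.content ?? "";
}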
Step 3: User feedback loop
One of the more useful features is the ability to correct extractions. If the model gets something wrong — misreads a drug name, gets the date wrong, miscategorizes the document — the user can add a correction note and re-run extraction. That feedback gets injected directly into the system prompt:
if (options?.userFeedback?.trim()) {
systemContent += `\n\nIMPORTANT - User feedback on this document (apply these corrections):
${options.userFeedback.trim()}`;
}
This means the model gets a second pass with explicit correction instructions. In practice it works well — "the drug name is metformin not metFORMIN" or "this is a 2025 EOB not 2024" gets applied reliably.
The feedback also gets stored in the database as extractionNotes on the document, so you have a record of what was corrected.
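End to end, a re-run is a fetch, re-extract, update cycle. A hypothetical handler (UPLOAD_DIR and the prisma import path are assumed names; the userFeedback option is the one shown above):
import { readFile } from "fs/promises";
import path from "path";
import { Prisma } from "@prisma/client";
import { prisma } from "@/lib/prisma"; // hypothetical import path

// Hypothetical re-extraction flow, stitched together from the pieces above.
export async function rerunExtraction(documentId: string, userFeedback: string) {
  const doc = await prisma.document.findUniqueOrThrow({ where: { id: documentId } });
  const buffer = await readFile(path.join(UPLOAD_DIR, doc.imagePath!));
  const extracted = await extractFromImageBuffer(buffer, { userFeedback });
  return prisma.document.update({
    where: { id: documentId },
    data: {
      extractedData: extracted as Prisma.InputJsonValue, // overwrite with the corrected result
      extractionNotes: userFeedback, // keep a record of what was corrected
    },
  });
}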
Step 4: The data model
The Prisma schema keeps things straightforward:
model Document {
id String @id @default(cuid())
imagePath String?
noteText String?
status String @default("pending")
extractedData Json
searchText String?
embedding Unsupported("vector(1536)")?
tags String[]
extractionNotes String?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
}
A few design choices here:
extractedData is a JSON blob. Rather than creating separate tables for each document type (receipts, EOBs, utility bills), all extracted data lives in a single JSON column. This makes the schema flexible — different document types have different fields, and a rigid relational schema would be a constant maintenance burden as new types are added.
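In TypeScript terms, the blob maps naturally onto a discriminated union keyed on type. A sketch using only fields mentioned in this post (real documents carry more):
// Sketch of the extracted-document shape; only fields named in this
// post are listed, and most are nullable in practice.
type BaseDoc = {
  category: string;
  title: string;
  tags: string[];
  summary?: string | null;
};

type ExtractedDoc = BaseDoc &
  (
    | { type: "rx_receipt"; drug_name?: string | null; pharmacy?: string | null }
    | { type: "eob"; insurer?: string | null; claim_id?: string | null }
    | { type: "utility_bill"; account_number?: string | null; service_period?: string | null }
    | { type: "general" }
  );

// Narrowing on "type" works the usual way:
function shortLabel(doc: ExtractedDoc): string {
  return doc.type === "rx_receipt" ? doc.drug_name ?? doc.title : doc.title;
}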
searchText is denormalized. After extraction, key fields get pulled out and concatenated into a single searchText string for full-text search. This is faster to query than parsing JSON at search time:
export function buildSearchText(data: ExtractedDoc): string {
const parts: string[] = [];
parts.push(effectiveTitle(data));
if ("summary" in data && data.summary) parts.push(String(data.summary));
if ("drug_name" in data && data.drug_name) parts.push(String(data.drug_name));
if ("insurer" in data && data.insurer) parts.push(String(data.insurer));
if ("tags" in data && Array.isArray(data.tags)) {
parts.push(...data.tags.map(t => String(t).trim()).filter(Boolean));
}
return parts.join(" ");
}
embedding for semantic search. After the document is saved, an embedding gets generated from searchText and stored in a pgvector column. This enables semantic search — finding "cholesterol medication" when the document says "lipitor" — without a separate vector database. Just pgvector as a Postgres extension.
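One wrinkle: Prisma can't write an Unsupported("vector") column through the regular client API, so the embedding write and the similarity query go through raw SQL. A sketch (the embedding model and import path are assumptions; text-embedding-3-small matches the 1536-dimension column):
import OpenAI from "openai";
import { prisma } from "@/lib/prisma"; // hypothetical import path

const openai = new OpenAI();

// Hypothetical embedding step; text-embedding-3-small returns
// 1536-dimension vectors, matching vector(1536) in the schema.
export async function embedDocument(id: string, searchText: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: searchText,
  });
  const vector = `[${res.data[0].embedding.join(",")}]`;
  await prisma.$executeRaw`
    UPDATE "Document" SET embedding = ${vector}::vector WHERE id = ${id}`;
}

// Semantic search: <=> is pgvector's cosine-distance operator.
export async function semanticSearch(queryVector: string, limit = 10) {
  return prisma.$queryRaw`
    SELECT id, "searchText" FROM "Document"
    ORDER BY embedding <=> ${queryVector}::vector
    LIMIT ${limit}`;
}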
Step 5: Graceful degradation
The pipeline has two fallback layers. First, if there's no API key configured, the document still gets saved — just without extraction:
if (!process.env.OPENAI_API_KEY) {
await prisma.document.create({
data: {
imagePath: filename,
status: "pending",
extractedData: { type: "general", title: "Document", summary: "Extraction skipped (no OPENAI_API_KEY)" },
searchText: "Extraction skipped",
tags: [],
},
});
}
Second, if extraction throws, the document still gets saved with an error summary rather than failing the whole request:
try {
const extracted = await extractFromImageBuffer(buffer);
extractedData = extracted as Record<string, unknown>;
} catch (err) {
extractedData = {
type: "general",
summary: "Extraction failed: " + (err instanceof Error ? err.message : "Unknown error"),
};
}
The image is always saved. The extraction is best-effort. Users can re-trigger extraction manually, or add correction notes and re-run. Nothing gets lost because AI had a bad day.
The model
The extraction model is configurable via environment variable with gpt-4o as the default:
const model = process.env.OPENAI_EXTRACTION_MODEL || "gpt-4o";
GPT-4o is the right choice here — it's genuinely better than smaller models at reading degraded document images, handwriting, and small print. For this specific task the quality difference is noticeable enough to justify the cost. Document extraction is a write-time operation (not a search-time one), so the latency and cost are acceptable.
What I'd do differently
A few things I'd change with hindsight:
Add a confidence score. The model sometimes hedges on fields it's uncertain about — a low-confidence flag on individual fields would let the UI highlight things that need user review rather than silently storing potentially wrong data.
Chunk large documents. A single-page receipt is fine. A multi-page insurance EOB or medical record is harder — the model gets less accurate as documents get longer or more complex. Chunking multi-page documents and merging the extracted JSON would improve accuracy on longer content.
Store the raw extraction response. Right now only the normalized result gets stored. Keeping the raw model output alongside it would make debugging extraction issues much easier.
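Concretely, that's one nullable column on the model (the field name here is hypothetical):
model Document {
  // ...existing fields...
  rawExtraction Json? // verbatim model output, kept for debugging
}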
The full source is on GitHub at github.com/meownoirsoft/symport. The extraction logic lives in lib/extract.ts and the ingestion endpoint is app/api/documents/route.ts if you want to dig in.
FAQ
Why GPT-4o for vision instead of a cheaper model or open-source alternative?
GPT-4o reads degraded phone photos, handwriting, and small print noticeably better than smaller or open-source vision models. For document extraction, getting the dates and amounts wrong is a much bigger problem than the per-call cost, so paying for the better model is worth it. Extraction runs once at write time, not on every read.
How do you stop the model from hallucinating dates or amounts?
The biggest wins are anchoring "current year" context in the system prompt with explicit examples, asking the model to use context clues from the document itself ("plan year 2025", "due in 2026" → tax year 2025), and constraining types/categories to enums so the model can't drift. The user-feedback loop catches anything that still slips through.
Why store extracted data as a single JSON column instead of typed tables?
Different document types have different fields — a prescription receipt has drug_name and pharmacy, an EOB has insurer and claim_id, a utility bill has account_number and service_period. A relational schema for every variant would be a constant migration treadmill. JSON keeps the schema flexible, and the denormalized searchText and embedding columns make queries fast where it matters.
What happens if the model returns invalid JSON?
The extraction code strips optional markdown code fences (```json ... ```) and parses the rest. If parsing still fails, the document saves with an error summary in the extractedData.summary field rather than throwing; the user can re-run extraction or add a correction note. The image and metadata are never lost.
Can I use this with Anthropic's Claude vision instead of GPT-4o?
Yes. The extraction prompt is provider-agnostic and the model is configurable via OPENAI_EXTRACTION_MODEL. Swap the SDK call for the Anthropic SDK (or route through OpenRouter to avoid a code change) and Claude's vision models work as a drop-in alternative. The "JSON only, no markdown" instruction is even more important on Claude — it likes to explain itself by default.
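Roughly, the swap looks like this (a sketch; the model alias and token limit are assumptions):
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

// Hedged sketch of the Claude swap; model name and max_tokens
// are assumptions, not part of Symport.
export async function extractWithClaude(jpegBase64: string, systemPrompt: string) {
  const res = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 2048,
    system: systemPrompt,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: jpegBase64 },
          },
        ],
      },
    ],
  });
  const block = res.content[0];
  return block.type === "text" ? block.text : "";
}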
How do you handle multi-page documents?
Today the pipeline treats each photo as a single document, which is fine for single-page items (receipts, prescriptions). For multi-page EOBs or medical records, the right next step is to chunk the document into pages, run extraction per page, and merge the resulting JSON into a single record. Adding that is on the to-do list.