Extracting T4 Data from PDFs in Python — A Canadian Developer's Guide


Cross-posted from caseonix.ca


Every Canadian fintech team eventually hits this problem. Users upload their T4 slips. Your backend gets a PDF. Somewhere between that PDF and your database you need to pull out box 14, box 22, the SIN, the employer name — correctly, reliably, across documents from dozens of different payroll software vendors.

The obvious tools get recommended: AWS Textract, LlamaParse, pdfplumber, PyMuPDF. They're good at what they do. But none of them know what a T4 is. They don't know that box 14 is employment income, that box 22 is income tax deducted, that a nine-digit formatted number is a Social Insurance Number, or that CRA publishes an XML specification for this document every year. They hand you text. The domain knowledge you write yourself.

That ends up being more work than people expect. I've seen it written three or four different ways at different companies, none with tests, none with audit trails, all slightly wrong at the edges. This is the guide I wish existed before I started.

What's a T4? For non-Canadian readers: a T4 (Statement of Remuneration Paid) is the Canadian equivalent of a US W-2. Every employer issues one annually to report employment income, CPP contributions, EI premiums, and income tax withheld. It's one of the most common documents in Canadian fintech, mortgage underwriting, and tax software.


Why Regex Isn't Enough

The first instinct is regex. T4s are standardized CRA forms — surely field positions are consistent?

import re
import pdfplumber

with pdfplumber.open("t4_2024.pdf") as pdf:
    text = pdf.pages[0].extract_text()

box_14 = re.search(r"14\s+[\$]?([\d,]+\.?\d*)", text)
if box_14:
    income = float(box_14.group(1).replace(",", ""))

This works on the T4 you tested it on. It breaks on the next one because a different payroll vendor laid the PDF out differently, the box number is on a different line than the value, or the document is a scanned image with no text layer.

Regex extraction of financial documents is essentially a parser that only works on documents you've already seen. Every new employer format becomes a special case. The maintenance cost compounds.
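To make the brittleness concrete, here's that same pattern run against two invented text layouts of the kind different payroll vendors produce — the sample strings are illustrative, not real T4 output:

```python
import re

PATTERN = re.compile(r"14\s+\$?([\d,]+\.?\d*)")

# Vendor A puts the box number and the value on the same line
vendor_a = "Employment income 14 87,500.00"
# Vendor B puts the value on a later line
vendor_b = "Employment income 14\nOther information\n87,500.00"

match_a = PATTERN.search(vendor_a)
match_b = PATTERN.search(vendor_b)

print(match_a.group(1) if match_a else None)  # → "87,500.00"
print(match_b)                                # → None — the layout change broke the pattern
```

Same document type, same box, same value — one layout matches, the other silently returns nothing. Multiply that by every payroll vendor in your user base.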


Step 1 — Get Clean Text with Docling

Docling is IBM's open-source document intelligence toolkit. It handles PDF text extraction, layout analysis, table recognition, and OCR fallback. Runs entirely locally, no API keys, MIT licensed.

pip install docling

Then convert any PDF:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("t4_2024.pdf")

# Clean markdown-formatted text, reading order preserved
text = result.document.export_to_markdown()
print(text)

What comes out is structured text with layout preserved. Docling understands the difference between a table cell and a paragraph. It handles scanned documents through an OCR pipeline and correctly orders multi-column layouts. For T4 PDFs — which vary significantly between payroll vendors — the output quality is consistent in a way raw pdfplumber isn't.

First-run note: Docling downloads its layout models from HuggingFace on first run (~500MB). This is expected — models are cached locally after that. For production, pre-pull them during your Docker build step.

Docling gives you clean text. It still doesn't know what box 14 means. That's the next layer.


Step 2 — Extract Fields with pydantic-ai

Once you have clean text, you need to pull out specific typed fields reliably. The right tool for this today is an LLM with structured output — you give it a Pydantic model and it fills it in. pydantic-ai handles this cleanly and is model-agnostic: Claude, OpenAI, and local Ollama all work behind the same interface.

pip install pydantic-ai

Define your T4 model and agent:

from pydantic import BaseModel
from pydantic_ai import Agent

class T4Fields(BaseModel):
    employer_name: str
    tax_year: int
    box_14_employment_income: float
    box_22_income_tax_deducted: float
    box_16_cpp_contributions: float | None = None
    box_18_ei_premiums: float | None = None
    box_52_pension_adjustment: float | None = None
    province_of_employment: str | None = None

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=T4Fields,
    system_prompt="""
    You are extracting fields from a Canadian T4 Statement of Remuneration Paid.
    Return monetary values as plain floats (87500.0, not "$87,500.00").
    Return null for any field not present in the document.
    Province of employment should be a 2-letter code (ON, BC, QC, etc.).
    Do not hallucinate values — if a field is not visible, return null.
    """,
)

result = agent.run_sync(f"Extract T4 fields:\n\n{text}")
fields = result.output

print(fields.box_14_employment_income)   # → 87500.0
print(fields.box_22_income_tax_deducted) # → 21340.0
print(fields.province_of_employment)     # → "ON"

To run fully locally with no external API calls, swap the model string:

from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

agent = Agent(
    OpenAIModel(
        "llama3.2",  # served by Ollama's OpenAI-compatible endpoint
        provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
    ),
    output_type=T4Fields,
    system_prompt=...,
)

For environments with data residency requirements — most regulated Canadian financial services — that matters. The document never leaves your infrastructure.


Step 3 — The Part Most Implementations Skip

Docling plus pydantic-ai gets you surprisingly far. In testing on T4 PDFs from major Canadian payroll providers, field extraction accuracy sits above 90% on the primary income and tax boxes.

But two things are missing that matter for production use in regulated industries.

Confidence scoring and a review queue

The LLM will be more certain about box 14 (employment income, usually prominent and clearly labeled) than about box 52 (pension adjustment, often blank or formatted inconsistently). If you're pre-filling a tax form with extracted values, you need to know which fields are safe to pass through automatically and which ones need a human to confirm.

Without confidence scores, low-quality extractions silently enter production. That's how incorrect T4 data gets submitted to CRA.
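The mechanics of a review queue are simple to sketch. Here's a hypothetical `partition_fields` helper (the name and the confidence numbers are invented for illustration) that splits extracted values at a threshold:

```python
def partition_fields(fields: dict[str, float], confidence: dict[str, float],
                     threshold: float = 0.85) -> tuple[dict, list]:
    """Split extracted fields into auto-approved values and a human review queue."""
    approved = {}
    review = []
    for name, value in fields.items():
        score = confidence.get(name, 0.0)  # no score = never auto-approve
        if score >= threshold:
            approved[name] = value
        else:
            review.append({"field": name, "value": value, "confidence": score})
    return approved, review

fields = {"box_14_employment_income": 87500.0, "box_52_pension_adjustment": 4200.0}
confidence = {"box_14_employment_income": 0.98, "box_52_pension_adjustment": 0.71}

approved, review = partition_fields(fields, confidence)
# box 14 passes through automatically; box 52 waits for a human
```

The hard part isn't this function — it's producing honest per-field confidence scores in the first place, which is where most DIY pipelines stop.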

PII handling and an audit trail

A T4 contains a Social Insurance Number. Before you send that document text to any external API, you should know what PII is in it. Canada's PIPEDA requires that organizations limit the collection, use, and disclosure of personal information to what's necessary for the identified purpose — sending a full T4 text to a US-based cloud LLM for extraction is hard to defend under that standard unless you've taken steps to identify and handle the PII.

⚠️ The SIN problem: A Canadian SIN in the format XXX-XXX-XXX is sensitive personal information under PIPEDA. Every T4 contains one. If you're sending raw T4 text to a US-based cloud API without detecting and handling this, you're creating a compliance exposure that most legal teams would not be comfortable with.
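One concrete payoff of knowing the field semantics: a SIN carries a Luhn check digit, so a pipeline can catch OCR misreads before they reach the database. A minimal validator sketch — `046-454-286` is a commonly used Luhn-valid test value, not a real person's SIN:

```python
def is_valid_sin(sin: str) -> bool:
    """Validate a Canadian SIN using the Luhn checksum."""
    digits = [int(c) for c in sin if c.isdigit()]
    if len(digits) != 9:
        return False
    total = 0
    for i, d in enumerate(digits):
        # Double every second digit from the left (2nd, 4th, 6th, 8th)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(is_valid_sin("046-454-286"))  # → True  (Luhn-valid test value)
print(is_valid_sin("046-454-287"))  # → False (a single misread digit fails the checksum)
```

A generic extraction tool has no reason to run this check; a T4-aware one should run it on every document.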


Putting It Together: Docling + pydantic-ai + Presidio

Microsoft Presidio is an open-source PII detection and anonymization library. It supports custom recognizers — you can teach it what a Canadian SIN looks like, what a CRA Business Number looks like, and what Canadian postal codes look like. None of these ship in Presidio's defaults.

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Then add the Canadian recognizers and scan:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()

# Add Canadian SIN recognizer — not in Presidio defaults
sin_recognizer = PatternRecognizer(
    supported_entity="CA_SIN",
    patterns=[Pattern("CA_SIN", r"\b\d{3}-\d{3}-\d{3}\b", score=0.9)],
    context=["sin", "social insurance"],
)
analyzer.registry.add_recognizer(sin_recognizer)

# Scan before sending to LLM
results = analyzer.analyze(text=document_text, language="en")
pii_found = [{"entity_type": r.entity_type, "score": r.score} for r in results]

# Optionally redact before the LLM call
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(
    text=document_text,
    analyzer_results=results,
    operators={"CA_SIN": OperatorConfig("replace", {"new_value": "***-***-***"})},
)
# Send redacted.text to the LLM instead

Now you know what PII was in the document before extraction ran, you have a record of it, and you can choose whether to redact before the LLM call.
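If you're wiring this yourself, the "record of it" can be as simple as a document hash plus the Presidio findings, written to an append-only log before the LLM call. A minimal sketch — the record shape here is an assumption, not a standard:

```python
import hashlib
import json
from datetime import datetime, timezone

def pii_audit_record(document_text: str, analyzer_results) -> dict:
    """Build an audit entry recording what PII was seen before extraction ran."""
    return {
        "event": "pii_detected",
        "sha256": hashlib.sha256(document_text.encode("utf-8")).hexdigest(),
        "entities": sorted({r.entity_type for r in analyzer_results}),
        "count": len(analyzer_results),
        "ts": datetime.now(timezone.utc).isoformat(),
    }

# Append as a JSON line so the log is immutable-by-convention:
# with open("audit.jsonl", "a") as f:
#     f.write(json.dumps(pii_audit_record(document_text, results)) + "\n")
```

Hashing the text rather than storing it means the audit log itself doesn't become another copy of the PII.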


The Full Stack in One Place: FinLit

Wiring Docling, pydantic-ai, Presidio, confidence scoring, audit logging, and CRA-specific schemas together is the kind of plumbing every team building on Canadian documents ends up writing. I built FinLit — an open-source Python library that does exactly this, with pre-built YAML schemas for T4, T5, T4A, NR4, and Canadian bank statements.

pip install finlit
python -m spacy download en_core_web_lg

Then run the pipeline:

from finlit import DocumentPipeline, schemas

pipeline = DocumentPipeline(
    schema=schemas.CRA_T4,
    extractor="claude",      # or "openai" or "ollama"
    audit=True,
    pii_redact=False,        # set True to redact SINs in audit log output
    review_threshold=0.85,
)

result = pipeline.run("john_doe_t4_2024.pdf")

The result object has everything:

# Typed, validated fields — monetary values are always float
result.fields["box_14_employment_income"]      # → 87500.0
result.fields["box_22_income_tax_deducted"]    # → 21340.0
result.fields["province_of_employment"]        # → "ON"

# Per-field confidence — box 52 came back uncertain
result.confidence["box_52_pension_adjustment"] # → 0.71

# Fields below the 0.85 threshold go here instead of silently through
result.needs_review    # → True
result.review_fields
# [{"field": "box_52_pension_adjustment", "confidence": 0.71, "raw": "4,200.00"}]

# Trace any extracted value back to its page and location
result.source_ref["box_14_employment_income"]
# {"page": 1, "bbox": [120, 340, 280, 360], "doc": "john_doe_t4_2024.pdf"}

# Immutable audit log — every event from load to completion
result.audit_log
# [
#   {"event": "document_loaded",     "sha256": "abc...", "ts": "..."},
#   {"event": "pii_detected",        "count": 1, "entities": ["CA_SIN"], "ts": "..."},
#   {"event": "extraction_complete", "fields_returned": 13, "ts": "..."},
#   {"event": "review_flagged",      "count": 1, "ts": "..."},
#   {"event": "pipeline_complete",   "fields_extracted": 13, "ts": "..."}
# ]

# Raw PII detections on the source document (Presidio output)
result.pii_entities
# [{"entity_type": "CA_SIN", "score": 0.9, "start": 142, "end": 153}]

For batch processing — say, a payroll integrator running hundreds of T4s at year-end:

from finlit import BatchPipeline, schemas
from glob import glob

batch = BatchPipeline(schema=schemas.CRA_T4, extractor="ollama", workers=8)

for path in glob("uploads/*.pdf"):
    batch.add(path)

results = batch.run()
results.export_csv("extracted/t4s_2024.csv")

print(f"Processed:    {results.total}")
print(f"Needs review: {results.review_count}")

The extractor="ollama" configuration means no document leaves your infrastructure. The pipeline runs entirely on-premises, which removes the PIPEDA third-party disclosure question entirely.


Build vs Buy vs Open-Source

| Approach | Time to first extraction | Canadian schemas | Audit trail | Data residency | Cost |
|---|---|---|---|---|---|
| Regex + pdfplumber | Hours | You write them | None | On-prem | Free |
| AWS Textract | Hours | None | Partial | US only | $1.50/1000 pages |
| LlamaParse | Minutes | None | None | US SaaS | $3–$10/1000 pages |
| Docling alone | Hours | You write them | None | On-prem | Free |
| FinLit | Minutes | T4, T5, T4A, NR4, bank statements | Built in | On-prem or cloud | Free + LLM costs |

What the Schema YAML Looks Like

Every built-in schema in FinLit is a versioned YAML file. The T4 schema maps directly to CRA's published XML specification. Here's a simplified excerpt:

name: cra_t4
version: "2024"
document_type: "CRA T4 Statement of Remuneration Paid"

fields:
  - name: box_14_employment_income
    dtype: float
    required: true
    description: "Box14:Totalemploymentincomebeforedeductions"

  - name: employee_sin
    dtype: str
    required: true
    pii: true
    regex: '^\d{3}-\d{3}-\d{3}$'
    description: "Employee'sSocialInsuranceNumber"

  - name: province_of_employment
    dtype: str
    required: false
    description: "Provinceorterritoryofemployment(2-lettercode)"

The pii: true flag tells the pipeline this field is sensitive — it gets flagged in the audit log and can be redacted depending on your pii_redact configuration. The regex field enforces format validation after extraction, so a malformed SIN raises a validation error rather than silently passing through.
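Outside FinLit, the same fail-loud behaviour is easy to get from plain Pydantic. A sketch of a model (the `T4Identity` name is invented for illustration) that rejects a malformed SIN at validation time instead of letting it pass through:

```python
from pydantic import BaseModel, Field, ValidationError

class T4Identity(BaseModel):
    # Format check only — a Luhn checksum test would be a separate step
    employee_sin: str = Field(pattern=r"^\d{3}-\d{3}-\d{3}$")

T4Identity(employee_sin="046-454-286")    # well-formed, passes

try:
    T4Identity(employee_sin="046454286")  # no dashes — rejected
except ValidationError as e:
    print("malformed SIN caught:", e.error_count(), "error")
```

The point is where the error surfaces: at the validation boundary, with a stack trace, instead of three systems downstream.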

Adding a new schema for a document type that isn't in the registry yet takes about 20 minutes if you know the document. Schema contributions are the highest-value PRs the project gets.


Practical Notes from Building This

A few things that aren't obvious until you've processed a few thousand real T4s:

  • Scanned T4s are common. Many smaller employers still print and scan. Docling's OCR pipeline handles these, but accuracy drops — budget for a higher review threshold (0.90 vs 0.85) on scanned documents.
  • Box 52 (pension adjustment) is almost always uncertain. It's blank for most employees, optionally present for others, and formatted inconsistently across payroll vendors. Flag it for review at any confidence below 0.95 if your use case relies on it.
  • Quebec T4s have additional fields. RL-1 slips carry Quebec provincial tax information that a standard T4 schema doesn't cover. If you're processing documents from Quebec employees, you'll want a separate RL-1 schema.
  • CRA updates its XML specification annually. Field names and codes are stable, but new boxes get added. Pin your schema version and test against new documents at the start of each tax year.
  • Multi-page T4s exist. Most T4s are single-page, but amended T4s can span two pages. Docling handles this correctly; regex approaches often don't.

The Short Version

Use Docling for parsing, pydantic-ai for field extraction, Presidio for PII detection, and either wire it together yourself or use FinLit to skip the plumbing. Run it locally with Ollama if you can't send documents to a cloud API. Build an audit log from the start — retrofitting one later is painful.


Built by Caseonix · Waterloo, Ontario 🍁

FinLit is the extraction engine inside LocalMind Sovereign, Caseonix's document intelligence platform for Canadian regulated industries.

