Strip PHI from Clinical Text Before It Reaches Your LLM


You're building a clinical AI tool. Your pipeline looks like this:

```
patient transcript → LLM → structured output
```

The problem: that transcript contains the patient's name, phone number, date of birth, and SSN. Every time you send it to an LLM, you're transmitting identifiers that HIPAA's Safe Harbor standard requires you to remove first.

Here's how to fix it in two lines.

The Problem

HIPAA Safe Harbor requires removing 18 types of identifiers before sharing patient data. Most teams handle this with:

  • Full de-identification libraries (heavy, slow to integrate)
  • Manual regex (fragile, incomplete)
  • Hoping the LLM won't memorize it (not the point)
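Regex in particular tends to under-match. A minimal sketch of the failure mode (the pattern and sample strings here are illustrative, not taken from any real pipeline):

```python
import re

# A typical hand-rolled phone pattern: it only matches one formatting style.
naive_phone = re.compile(r"\d{3}-\d{3}-\d{4}")

samples = [
    "Call 415-555-1234",      # caught
    "Call (415) 555-1234",    # missed: parentheses
    "Call 415.555.1234",      # missed: dots
    "Call +1 415 555 1234",   # missed: country code and spaces
]

caught = [s for s in samples if naive_phone.search(s)]
print(f"caught {len(caught)} of {len(samples)}")  # caught 1 of 4
```

Every new chart format, dictation style, or EHR export breaks another pattern — and that's just phone numbers.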

What you actually need is a fast pre-processing step that strips identifiers before the text leaves your system.

The Fix

```python
import requests

def scrub_phi(text: str) -> str:
    """Strip PHI before sending to any LLM."""
    resp = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': text},
        timeout=10,
    )
    resp.raise_for_status()  # fail loudly rather than forward unscrubbed text
    return resp.json()['scrubbed_text']

# Before: risky
response = llm.complete(raw_transcript)

# After: safe
clean = scrub_phi(raw_transcript)
response = llm.complete(clean)
```

What Gets Stripped

The API catches 12 identifier types, including:

| Input | Output |
| --- | --- |
| 415-555-1234 | [PHONE] |
| jane@hospital.com | [EMAIL] |
| SSN 123-45-6789 | SSN [SSN] |
| DOB 07/22/1960 | DOB [DOB] |
| MRN 98765 | MRN [MRN] |
| NPI 1234567890 | NPI [NPI] |
| 192.168.1.1 | [IP] |
| ZIP 94105 | [ZIP] |

Full example:

```python
raw = "Patient Jane called 415-555-1234. DOB 07/22/1960. MRN 98765. SSN 123-45-6789."
clean = scrub_phi(raw)
# → "Patient Jane called [PHONE]. DOB [DOB]. MRN [MRN]. SSN [SSN]."
```
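For offline tests or CI runs where you don't want a network call, a rough local stand-in can mirror the same tag format. To be clear: these patterns are my own illustrative approximations, not the service's actual detection rules, and they cover far fewer cases:

```python
import re

# Illustrative local fallback; order matters (SSN before phone, since both use dashes).
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DOB]"),
    (re.compile(r"\bMRN\s*\d+\b"), "MRN [MRN]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def scrub_phi_local(text: str) -> str:
    """Rough local stand-in for the /api/scrub endpoint — tests only."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

raw = "Patient Jane called 415-555-1234. DOB 07/22/1960. MRN 98765. SSN 123-45-6789."
print(scrub_phi_local(raw))
# → Patient Jane called [PHONE]. DOB [DOB]. MRN [MRN]. SSN [SSN].
```

Keep the real API in the production path; a fallback like this exists so your test suite never ships raw fixtures to an external endpoint.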

LangChain Integration

If you're using LangChain, wrap it as a preprocessing step:

```python
from langchain.schema.runnable import RunnableLambda
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
import requests

def scrub_phi(inputs: dict) -> dict:
    resp = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': inputs['text']},
        timeout=10,
    )
    resp.raise_for_status()
    return {'text': resp.json()['scrubbed_text']}

scrubber = RunnableLambda(scrub_phi)
llm = ChatOpenAI(model='gpt-4o-mini')
prompt = ChatPromptTemplate.from_template(
    'Summarize this clinical note: {text}'
)

# Chain: scrub → prompt → llm
# (the scrubber already emits {'text': ...}, so it feeds the prompt directly)
chain = scrubber | prompt | llm

result = chain.invoke({'text': raw_clinical_note})
```

Batch Processing

For bulk de-identification:

```python
import requests

def scrub_batch(texts: list[str]) -> list[str]:
    results = []
    for text in texts:
        resp = requests.post(
            'https://the-service.live/api/scrub',
            json={'text': text},
            timeout=10,
        )
        resp.raise_for_status()
        results.append(resp.json()['scrubbed_text'])
    return results

# De-identify 1000 notes before training
clean_notes = scrub_batch(raw_notes)
```
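The loop above makes one sequential round-trip per note, so 1,000 notes means 1,000 serial requests. A thread pool can overlap the network waits — this is a sketch, and the worker count is a guess you'd tune against the API's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def scrub_batch_concurrent(texts: list[str], scrub_fn, max_workers: int = 8) -> list[str]:
    """Scrub many notes in parallel threads, preserving input order.

    scrub_fn is any text -> text scrubber (e.g. the scrub_phi wrapper);
    taking it as a parameter keeps this helper testable without a network.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of completion order
        return list(pool.map(scrub_fn, texts))
```

Usage would be `scrub_batch_concurrent(raw_notes, scrub_phi)`. Since the work is I/O-bound (HTTP waits), threads help despite the GIL.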

Pricing

  • Free tier: 100 calls/day, no key required
  • Production: $0.005/call (pay-per-use, no monthly minimums)

Demo: the-service.live/playground

Docs: the-service.live/docs

What This Doesn't Do

This is not a complete HIPAA compliance solution. It's a fast pre-processing step for stripping structured identifiers from text. You still need:

  • Business Associate Agreements with your LLM providers
  • Proper audit logging
  • Access controls on who can query patient data
  • Expert legal review of your specific use case
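The audit-logging bullet, for instance, can be as small as one JSON line per scrub call. This is an illustrative minimum, not a compliance spec — the key idea is logging a digest of the input, never the PHI itself:

```python
import hashlib
import json
import time

def audit_entry(raw: str, scrubbed: str, user: str) -> str:
    """One JSON audit line per scrub call.

    Stores only a SHA-256 digest of the input so the log itself
    never becomes a second copy of the PHI.
    """
    return json.dumps({
        "ts": time.time(),
        "user": user,
        "input_sha256": hashlib.sha256(raw.encode()).hexdigest(),
        "output_len": len(scrubbed),
    })
```

Append each line to a write-once log store; what fields your auditors actually require is exactly the kind of question for the legal review above.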

But for teams building clinical AI tools that need a quick, reliable way to strip structured identifiers before LLM calls, this solves that specific problem.


Built by EnergenAI — autonomous AI infrastructure

Source: dev.to
