How I Built a Customer Support Auto-Responder with Confidence Scoring Using pydantic-ai and FastAPI

Support teams are drowning in tickets. Not because there are too many questions, but because the tooling makes it hard to automate the ones that should be automatic. Most tickets asking "how do I reset my password?" or "what are your refund terms?" get routed through the same queue as complex billing disputes. The answer to the first two exists in your docs. The answer to the third requires a human.

The gap between "we have docs" and "the AI reliably answers from docs without hallucinating" is where most support automation projects die.

This article walks through a production-grade pattern I built: a ticket ingestion system that uses RAG against your own documentation, scores its own confidence on every response, auto-replies when it's sure, and escalates to a human agent with a pre-drafted reply attached when it's not. Every decision is logged for audit.

The Problem: Manual Triage at Scale Is Not a Strategy

Here is the real scenario. Your support team gets 200 tickets per day. About 60% are answerable directly from your documentation. But your existing helpdesk either requires custom code per email format or relies on rigid keyword-matching rules that break the moment a user phrases something slightly differently.

The integration problem is worse than it looks. Most existing connectors expect emails in a predictable structure. Real users do not write like that. One person writes "how do I cancel," another writes "I need to stop my subscription immediately," and a third writes "billing is still happening after I closed my account." Same intent, wildly different phrasing.

Without structured output from the LLM, you cannot reliably extract three things: the intent, the relevant doc section, and the model's confidence in its answer. So you end up with one of two bad outcomes:

You auto-reply with a hallucinated answer and destroy user trust
You route everything to humans and waste their time on questions your docs already answer

What is missing is a structured decision layer that sits between raw LLM output and the action taken. That is exactly what pydantic-ai provides.

The Approach: Structured Outputs as the Decision Layer

The key insight is that pydantic-ai forces the LLM to return data in a validated schema rather than free text. This is not just cosmetic. When your model must produce a TicketResponse object with a confidence_score: float, a suggested_reply: str, and an escalate: bool, you can branch on those values programmatically. You are not parsing prose looking for signals. You have actual typed fields.

Here is why this architecture beats the alternatives:

vs. LangChain: LangChain is flexible but the abstractions leak constantly. Debugging why a chain behaved unexpectedly is painful. For a system where every decision must be auditable, you want to see exactly what the model returned and why. pydantic-ai keeps the model call and the output schema co-located. You can inspect the raw response and the validated output side by side.

vs. plain OpenAI/Anthropic requests: With OpenAI you can use response_format with JSON mode, and with Anthropic you can lean on tool calling, but either way you still hand-roll the Pydantic models and the validation logic. pydantic-ai handles that contract automatically.

vs. rigid rule engines: Rules break on phrasing variations. A hybrid approach where the LLM handles intent extraction and the rules handle routing based on structured fields is much more robust.

The architecture is:

FastAPI endpoint receives the ticket payload
ChromaDB retrieves the top-k relevant doc chunks via embedding similarity
pydantic-ai agent runs inference with the retrieved context
The structured output determines: auto-reply, escalate with draft, or flag for review
Every decision object is written to a PostgreSQL audit log
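
The fourth step lists three outcomes, while the snippet later in this article only branches two ways. Here is a minimal sketch of a three-way router, using a plain dataclass as a stand-in for the validated model output. The Decision enum, the route function, both threshold values, and my reading of "flag for review" as "no usable draft" are all illustrative assumptions, not part of the template.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_REPLY = "auto_reply"
    ESCALATE_WITH_DRAFT = "escalate_with_draft"
    FLAG_FOR_REVIEW = "flag_for_review"


@dataclass
class ModelOutput:
    """Stand-in for the structured fields the agent returns."""
    confidence_score: float
    escalate: bool
    suggested_reply: str


def route(
    output: ModelOutput,
    auto_threshold: float = 0.72,   # assumed value, tune per deployment
    draft_threshold: float = 0.4,   # assumed value, tune per deployment
) -> Decision:
    """Map structured fields to an action; no prose parsing involved."""
    if not output.escalate and output.confidence_score >= auto_threshold:
        return Decision.AUTO_REPLY
    # Only attach a draft when it is non-empty and moderately trusted
    if output.confidence_score >= draft_threshold and output.suggested_reply.strip():
        return Decision.ESCALATE_WITH_DRAFT
    return Decision.FLAG_FOR_REVIEW
```

Because the function only reads typed fields, it is trivial to unit test and to change without touching the prompt.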

The key design decision that makes this reliable is that the routing threshold lives in application logic, not in the prompt. confidence_score is a validated field the model must populate. The prompt can give the model a rough guide for scoring, but the cutoff that decides auto-reply versus escalation is a number in your code. This means you can tune it without touching the prompt.

The Code Pattern: Agent Definition and Confidence-Gated Routing

Here is the central pattern. This is simplified but structurally accurate:

from pydantic import BaseModel, Field
from pydantic_ai import Agent
import chromadb

# The structured output schema
class TicketResponse(BaseModel):
    intent: str = Field(description="Short label for ticket intent")
    suggested_reply: str = Field(description="Full draft reply to send or attach")
    confidence_score: float = Field(ge=0.0, le=1.0)
    escalate: bool
    escalation_reason: str | None = None
    doc_sources: list[str] = Field(default_factory=list)

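# Agent with the output schema enforced.
# Note: newer pydantic-ai releases rename result_type to output_type.
support_agent = Agent(
    model="anthropic:claude-3-5-sonnet-20241022",  # provider-prefixed model name
    result_type=TicketResponse,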
    system_prompt="""
    You are a support assistant. Use only the provided documentation context.
    If the answer is not clearly supported by context, set confidence_score below 0.7
    and set escalate to True. Always cite which doc sections informed your reply.
    """
)

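# route_to_human, send_auto_reply, and log_decision are application-specific
# coroutines defined elsewhere in the project.
async def handle_ticket(ticket_text: str, chroma_collection) -> TicketResponse:
    # Retrieve the top 4 matching doc chunks (note: query() is a blocking call)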
    results = chroma_collection.query(
        query_texts=[ticket_text],
        n_results=4
    )
    context_chunks = "\n\n".join(results["documents"][0])

    prompt = f"""
    TICKET:
    {ticket_text}

    DOCUMENTATION CONTEXT:
    {context_chunks}"""

    result = await support_agent.run(prompt)
    response = result.data  # Validated TicketResponse (result.output in newer pydantic-ai releases)

    # Confidence-gated routing -- no ambiguity
    if response.escalate or response.confidence_score < 0.72:
        await route_to_human(response)
    else:
        await send_auto_reply(response)

    await log_decision(ticket_text, response)
    return response

What each piece does and why it matters:

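result_type=TicketResponse (output_type in newer pydantic-ai releases) is the contract. pydantic-ai validates the model's output against this schema. When validation fails, it sends the errors back to the model and retries, and it raises an exception once the retry limit is exhausted, so your routing code never sees an object that does not fit the schema.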
confidence_score with ge=0.0, le=1.0 enforced by Pydantic means you never get a string like "high" that you need to interpret. It is a float you can threshold on.
doc_sources gives you audit traceability. You can show support managers which doc chunk informed which reply.
The routing logic lives outside the prompt. This is intentional. Prompts drift. Application logic is version controlled.

The 0.72 threshold is arbitrary in this snippet. In production you tune it based on your false-positive tolerance, with audit logs providing the data to make that call.
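
To make that tuning step concrete, here is a rough sketch of picking a threshold from audit data. It assumes you have reviewed a sample of logged decisions and labeled each one as correct or not; the record shape (score, was_correct), the function names, and the candidate grid are my assumptions for illustration.

```python
from __future__ import annotations

from typing import Iterable


def auto_reply_precision(
    records: Iterable[tuple[float, bool]], threshold: float
) -> tuple[float, int]:
    """Return (share of correct replies, count) for records at or above threshold."""
    selected = [correct for score, correct in records if score >= threshold]
    if not selected:
        return 1.0, 0
    return sum(selected) / len(selected), len(selected)


def pick_threshold(
    records: Iterable[tuple[float, bool]],
    target_precision: float,
    candidates: list[float] | None = None,
) -> float | None:
    """Lowest candidate threshold whose auto-replies meet the target precision."""
    records = list(records)
    # Default grid: 0.50, 0.55, ..., 0.95 (an arbitrary choice)
    candidates = candidates or [round(0.5 + 0.05 * i, 2) for i in range(10)]
    for threshold in sorted(candidates):
        precision, count = auto_reply_precision(records, threshold)
        if count > 0 and precision >= target_precision:
            return threshold
    return None  # no candidate is strict enough; keep escalating


# Hypothetical reviewed sample: (confidence_score, reply_was_correct)
reviewed = [
    (0.95, True), (0.90, True), (0.80, True), (0.75, False),
    (0.70, True), (0.60, False), (0.55, False),
]
chosen = pick_threshold(reviewed, target_precision=0.9)
```

With a sample this small the result is noise. In practice you would want a few hundred reviewed tickets before trusting the number, and you would re-run the check whenever the docs or the prompt change.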

Integration: Email Ingestion to Helpdesk to Slack Escalation

The data flow end to end looks like this:

Inbound: Emails arrive via a webhook from your email provider (Postmark, SendGrid, or similar). FastAPI receives the parsed payload with subject, body, sender, and any attachments.

Processing: The ticket body hits the RAG pipeline. ChromaDB stores your docs as embeddings loaded at startup. The retrieval step happens in under 100ms for most collections under 50k chunks.
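
One detail worth sketching here: with the standard synchronous Chroma client, query() blocks, which stalls the event loop inside an async FastAPI handler. A minimal way around that, assuming Python 3.9+, is to push the call onto a worker thread. FakeCollection below is a stand-in I made up so the example runs without a database; it only mimics the shape of the query result used in this article.

```python
import asyncio
import time


class FakeCollection:
    """Stand-in for a Chroma collection; mimics only the result shape."""

    def query(self, query_texts, n_results=4):
        time.sleep(0.05)  # simulate blocking retrieval work
        chunks = [f"chunk {i} for {query_texts[0]}" for i in range(n_results)]
        return {"documents": [chunks]}


async def retrieve_context(collection, ticket_text: str, n_results: int = 4) -> str:
    # Run the blocking call in a worker thread so other requests keep flowing
    results = await asyncio.to_thread(
        collection.query, query_texts=[ticket_text], n_results=n_results
    )
    return "\n\n".join(results["documents"][0])


context = asyncio.run(retrieve_context(FakeCollection(), "refund", n_results=2))
```

Inside a running FastAPI handler you would await retrieve_context directly instead of calling asyncio.run.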

Outbound: If auto-reply triggers, the reply goes back through your email provider API. If escalation triggers, a Slack message goes to your #support-escalations channel with the ticket details, the confidence score, and the pre-drafted reply attached. The agent did the work. The human just reviews and hits send (or edits first).
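
As a sketch of the escalation message, here is one way to build and post it with only the standard library. It assumes a Slack incoming webhook, which accepts a JSON body with a text field and posts to whichever channel the webhook was created for. The function names, the message layout, and the ticket_id argument are illustrative assumptions.

```python
from __future__ import annotations

import json
import urllib.request


def build_escalation_message(
    ticket_id: str, sender: str, confidence: float, reason: str | None, draft: str
) -> dict:
    """Build the JSON payload for a Slack incoming webhook."""
    quoted_draft = "\n".join("> " + line for line in draft.splitlines())
    lines = [
        f"*Escalated ticket* {ticket_id} from {sender}",
        f"Confidence: {confidence:.2f}",
        f"Reason: {reason or 'not given'}",
        "Draft reply:",
        quoted_draft,
    ]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the payload builder separate from the network call means you can test the message format without hitting Slack.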

Audit log: Every TicketResponse object is serialized to JSON and written to a ticket_decisions table. This includes the retrieved doc chunks used, the confidence score, whether it was auto-replied or escalated, and the timestamp.
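
Here is a rough sketch of that write path. To keep it runnable anywhere it uses an in-memory SQLite database as a stand-in for PostgreSQL, so the column types and the ? placeholders are SQLite's (a PostgreSQL driver would typically use %s and a JSONB column). The column names and the write_decision helper are my assumptions, not the template's schema.

```python
import json
import sqlite3
from datetime import datetime, timezone


def create_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ticket_decisions (
               id INTEGER PRIMARY KEY,
               created_at TEXT NOT NULL,
               ticket_text TEXT NOT NULL,
               action TEXT NOT NULL,
               confidence_score REAL NOT NULL,
               response_json TEXT NOT NULL,
               retrieved_chunks_json TEXT NOT NULL
           )"""
    )


def write_decision(
    conn: sqlite3.Connection,
    ticket_text: str,
    response: dict,            # e.g. response.model_dump() for a Pydantic v2 model
    retrieved_chunks: list,
    action: str,               # "auto_reply" or "escalated"
) -> int:
    """Insert one audit row and return its id."""
    cursor = conn.execute(
        "INSERT INTO ticket_decisions (created_at, ticket_text, action, "
        "confidence_score, response_json, retrieved_chunks_json) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            ticket_text,
            action,
            response["confidence_score"],
            json.dumps(response),
            json.dumps(retrieved_chunks),
        ),
    )
    conn.commit()
    return cursor.lastrowid
```

Storing confidence_score and action as their own columns, alongside the full JSON, makes the threshold analysis a plain SQL query instead of a JSON-parsing job.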

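Gotcha worth knowing: stored document vectors and query vectors must come from the same embedding model. If you swap from all-MiniLM-L6-v2 (ChromaDB's default) to text-embedding-3-small mid-deployment, you need to re-embed your entire document collection. Those two models produce vectors of different sizes, so a mixed setup fails with a dimension error. If you swap between models that happen to share a dimensionality, nothing errors and retrieval quality degrades silently. Build the embedding model name and a doc version hash into your collection name.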
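
The collection-naming advice above can be sketched as a small helper. This version folds the embedding model name into the name as well as the doc content, which is my extension of the idea. The function name, the docs prefix, and the 12-character digest length are arbitrary choices; the output stays lowercase alphanumerics and hyphens and well under 63 characters, which is my understanding of what Chroma accepts for collection names.

```python
import hashlib
import re


def collection_name(embedding_model: str, doc_texts: list, prefix: str = "docs") -> str:
    """Derive a collection name that changes when the model or the docs change."""
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    for text in sorted(doc_texts):  # sort so ordering does not affect the hash
        digest.update(b"\x00")      # separator between documents
        digest.update(text.encode("utf-8"))
    # Reduce the model name to lowercase alphanumerics and single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", embedding_model.lower()).strip("-")
    slug = slug[:30].strip("-")
    return f"{prefix}-{slug}-{digest.hexdigest()[:12]}"


old_name = collection_name("all-MiniLM-L6-v2", ["refund policy", "password reset"])
new_name = collection_name("text-embedding-3-small", ["refund policy", "password reset"])
```

At startup you would compute the name, create and populate the collection if it does not exist yet, and query only that collection. A model swap or a doc update then produces a fresh collection instead of quietly mixing vectors.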

Tradeoffs and Limitations

This architecture is not for every team. Honest assessment:

Latency: Each ticket goes through an embedding query plus an LLM call. Expect 1-3 seconds per ticket depending on model and collection size. For real-time chat this is borderline. For email-based support, it is fine.

RAG quality ceiling: If your docs are poorly structured, out of date, or missing coverage for common questions, no amount of prompt engineering fixes it. Garbage in, garbage out. Budget for doc maintenance.

Cost at volume: At 200 tickets per day with Claude Sonnet, you are spending a few dollars per day. At 2000 tickets, that is meaningful. If budget is the constraint, a smaller model for the first triage pass plus a larger model only for borderline cases is a sensible optimization.
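
For a back-of-the-envelope check on those numbers, here is a small estimator. Every input is an assumption you should replace with your own: roughly 2,000 input tokens per ticket (system prompt, ticket, four doc chunks), 300 output tokens, and prices of $3 and $15 per million input and output tokens for the large model. Verify prices against your provider's current price sheet.

```python
def cost_per_ticket_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one LLM call given token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


def daily_cost_usd(tickets_per_day: int, per_ticket_usd: float) -> float:
    return tickets_per_day * per_ticket_usd


def tiered_daily_cost_usd(
    tickets_per_day: int,
    borderline_share: float,   # fraction of tickets re-run on the large model
    small_per_ticket_usd: float,
    large_per_ticket_usd: float,
) -> float:
    """Every ticket hits the small model; only borderline ones hit the large one."""
    per_ticket = small_per_ticket_usd + borderline_share * large_per_ticket_usd
    return tickets_per_day * per_ticket


# Assumed token counts and prices, for illustration only
large = cost_per_ticket_usd(2000, 300, 3.0, 15.0)
at_200 = daily_cost_usd(200, large)
at_2000 = daily_cost_usd(2000, large)
```

Under these assumptions the single-model setup comes to about $2.10 per day at 200 tickets and about $21 per day at 2,000. The tiered function lets you plug in a cheaper model's per-ticket cost and your observed borderline rate to see whether the two-pass design actually saves money.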

When to skip this pattern: If your ticket types are genuinely narrow and you can enumerate them, a smaller fine-tuned classifier plus templated replies is cheaper, faster, and more predictable. This pattern earns its complexity when ticket phrasing is diverse and your docs are the source of truth.

Get the Code

I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-customer_support_ticket_ai_auto_responde

The scaffold shows the core agent setup, ChromaDB integration, and FastAPI routing. The full production version with test suite, error handling for malformed payloads, retry logic, Slack webhook integration, audit logging migrations, and deployment config is available here: https://reactance0083.gumroad.com/l/qbvpl

If you are running support at scale and have tried to automate it before, I am genuinely curious where it broke down for you. Was it retrieval quality, confidence calibration, the email parsing step, or something else entirely? Drop it in the comments. The edge cases in this space are worth discussing.