This is my first post on dev.to, so I'm still learning the ropes here.
At Stealth, we built AgentTrace, observability infrastructure for AI agent workflows. The premise sounds simple: capture what agents do, when, and why. In practice, getting there required three iterations of transport architecture, a schema pivot away from JSONB, and a production incident that silently lost 87% of trace data before anyone noticed.
Here's what we learned.
- The Data Model: Spans and Trace Missions
Traditional distributed tracing uses spans, where a span is a single unit of work with a parent reference, a timestamp, and a bag of attributes. We kept that primitive.
In AgentTrace, a span captures: parent span ID, parent trace ID, start time, and a set of type-dependent attributes including client IP, LLM IP, status, error codes, LLM system prompt, user prompt, and model response. The type determines which attributes are present.
But agent workflows aren't a flat chain of service calls. An agent run involves branching decisions, sub-agent invocations, tool calls, and multi-hop reasoning — all nested under a single intent. A raw span tree doesn't capture this.
We introduced the trace mission as the organizing unit: a complete record of one agent workflow, including which agent initiated it, what the agent's purpose was, which nodes were involved, and the full span tree underneath. Spans are leaves and branches; the trace mission is the trunk.
This distinction matters for querying. When an engineer asks "why did this agent fail?", they're querying at the trace mission level. When they ask "what exactly happened at this LLM call?", they're drilling into a span. The data model has to support both patterns without conflating them.
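The two-level model above can be sketched as types. This is a minimal sketch, not AgentTrace's actual schema: the field names (`spanId`, `purpose`, `nodes`, and so on) are illustrative, and the `rootSpan` helper is hypothetical.

```typescript
// Illustrative sketch of the data model -- not AgentTrace's real schema.
type SpanType = "llm_call" | "tool_call" | "sub_agent";

interface Span {
  spanId: string;
  parentSpanId: string | null; // null marks the root of the span tree
  traceId: string;             // the trace mission this span belongs to
  startTime: number;           // epoch ms
  type: SpanType;
  // Type-dependent attributes: only some of these are present per span type.
  attributes: {
    clientIp?: string;
    llmIp?: string;
    status?: string;
    errorCode?: string;
    systemPrompt?: string;
    userPrompt?: string;
    modelResponse?: string;
  };
}

interface TraceMission {
  traceId: string;
  agentId: string;   // which agent initiated the workflow
  purpose: string;   // the agent's intent for this run
  nodes: string[];   // nodes involved in the workflow
  spans: Span[];     // the full span tree, flattened by parent references
}

// Hypothetical helper: "why did this agent fail?" starts at the trunk.
function rootSpan(mission: TraceMission): Span | undefined {
  return mission.spans.find((s) => s.parentSpanId === null);
}
```

Keeping spans flat and reconstructing the tree via `parentSpanId` is the usual tradeoff: writes stay simple, and the mission-level query layer owns tree assembly.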
- Getting the Data In: Proxies, Not SDKs
The naive approach is to ship an SDK and ask developers to instrument their agent code. This creates friction: enterprise customers don't always control the agent runtimes, and requiring a code change for observability means it often doesn't happen.
We solved this with Envoy proxies at the network layer. Any agent communicating over standard protocols emits traces transparently because the proxy intercepts the traffic, extracts the relevant fields, and emits the span without the agent needing to know observability exists. For customers who could use the SDK, it integrated directly. For customers who couldn't, the proxy worked without code changes.
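Conceptually, the proxy's job reduces to a pure extraction step: take an intercepted request, pull out the observable fields, emit a span. In a real deployment this lives in Envoy filter configuration, not application code; the sketch below (with a hypothetical `InterceptedRequest` shape and an assumed `x-trace-id` propagation header) just shows the idea.

```typescript
// Hypothetical sketch of proxy-side field extraction. Real Envoy deployments
// express this as filter configuration; the shapes here are illustrative.
interface InterceptedRequest {
  sourceIp: string;      // the agent making the call
  destinationIp: string; // the LLM endpoint being called
  headers: Record<string, string>;
  body: { messages?: { role: string; content: string }[] };
}

function extractSpanFields(req: InterceptedRequest) {
  const messages = req.body.messages ?? [];
  return {
    // Assumed propagation header; present only if an upstream hop set it.
    traceId: req.headers["x-trace-id"] ?? null,
    clientIp: req.sourceIp,
    llmIp: req.destinationIp,
    systemPrompt: messages.find((m) => m.role === "system")?.content ?? null,
    userPrompt: messages.find((m) => m.role === "user")?.content ?? null,
  };
}
```

The agent never sees any of this: the extraction runs on traffic it was already sending, which is exactly why the zero-code-change path works.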
- The Pipeline: Collector and Processor
Between the proxy and the database sit two distinct services.
The Trace Collector is the ingestion side: it receives spans from sidecars and proxies, validates them, and publishes to the event transport layer. This is where the throughput and reliability work lives.
The Trace Processor is the consumption side: it reads events off the transport, assembles spans into their trace mission tree structure, handles detail records, and writes to PostgreSQL.
The split matters because the two services have different failure modes and scaling needs. The Collector needs to absorb bursty ingest without blocking; the Processor needs to write SQL correctly and maintain tree structure consistency. Coupling them would mean a slow write path backs up the ingest pipeline.
One non-obvious problem: how do you know when a trace is complete? If agent workflows don't send a guaranteed "done" signal, spans just keep arriving. The Collector needs to decide when to close a trace and hand it to the Processor, but closing too early means losing late-arriving spans.
The solution was a WebSocket hybrid approach: the Collector stays listening on a live connection for each active trace. A trace finalizes when either a terminal span arrives signaling the workflow is complete, or when no new spans have arrived within a timeout window — whichever comes first. This handles both clean completions and the messier cases: agent crashes, dropped connections, or slow nodes that simply stop sending without a shutdown signal. It also reduced load for span updates on existing traces, since the open connection handles incremental updates without re-establishing state.
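The close-on-terminal-or-idle rule is simple to state but easy to get subtly wrong (forgetting to push the idle deadline forward on every span, for instance). Here's a hedged sketch of the logic, assuming an illustrative `"terminal"` span type, a made-up `IDLE_TIMEOUT_MS` value, and an `onFinalize` callback standing in for the handoff to the Processor:

```typescript
// Sketch of the terminal-or-idle finalization rule. Names and the timeout
// value are assumptions, not AgentTrace's actual implementation.
const IDLE_TIMEOUT_MS = 30_000;

class ActiveTrace {
  private idleTimer?: ReturnType<typeof setTimeout>;

  constructor(
    readonly traceId: string,
    private onFinalize: (traceId: string, reason: "terminal" | "idle") => void,
  ) {
    // A trace with no spans at all still closes after the idle window.
    this.resetIdleTimer();
  }

  onSpan(span: { type: string }) {
    if (span.type === "terminal") {
      // Clean completion: the workflow explicitly signaled it's done.
      this.close("terminal");
    } else {
      // Any other span pushes the idle deadline forward.
      this.resetIdleTimer();
    }
  }

  private resetIdleTimer() {
    clearTimeout(this.idleTimer);
    this.idleTimer = setTimeout(() => this.close("idle"), IDLE_TIMEOUT_MS);
  }

  private close(reason: "terminal" | "idle") {
    clearTimeout(this.idleTimer);
    this.onFinalize(this.traceId, reason);
  }
}
```

The `"idle"` path is what covers crashed agents and silently dropped connections: nothing arrives, the deadline fires, and the trace is handed off with whatever spans it has.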
- The Throughput Problem: Three Iterations to Pub/Sub
In our local environment, 9 active agents generated 16,300 traces in a single hour. Scale to an enterprise deployment with 400+ concurrently active agents, and the volume becomes untenable without deliberate design.
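The arithmetic behind "untenable" is worth making explicit. A rough projection from those local numbers, assuming (optimistically) that the per-agent rate holds at enterprise scale:

```typescript
// Back-of-envelope projection from the local-environment numbers above.
const tracesPerAgentPerHour = 16_300 / 9;                  // ~1,811
const projectedTracesPerHour = tracesPerAgentPerHour * 400; // 400 concurrent agents
const projectedPerSecond = projectedTracesPerHour / 3600;

console.log(Math.round(projectedTracesPerHour)); // ~724,444 traces/hour
console.log(Math.round(projectedPerSecond));     // ~201 traces/second
```

And that's traces, not events: each trace carries many spans, so the event rate the pipeline actually absorbs is a multiple of this.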
Each iteration had to solve two problems in parallel: throughput (can the system keep up with event volume?) and error handling (what happens when something fails mid-pipeline?). The two dimensions are linked because a system that drops data under load has solved neither.
- Iteration 1: In-memory buffer
Throughput: adequate in development, collapsed under production load. Error handling: none. Any process restart wiped the buffer, and a traffic spike could overflow it; whatever hadn't been flushed was gone. For observability infrastructure, losing data on failure defeats the purpose, because you need the most complete picture exactly when things go wrong.
- Iteration 2: FIFO queue
Throughput: improved, but still hit a ceiling as enterprise traffic scaled. Error handling: meaningfully better. Events were processed in order, never deleted on failure, and persisted across process restarts. The durability problem was largely solved. The throughput problem wasn't.
- Iteration 3: GCP Pub/Sub
Both problems solved. Pub/Sub handles fan-out natively, decouples producers from consumers, and provides at-least-once delivery guarantees with built-in retry and dead-letter queue semantics. Throughput scales horizontally without the application managing that complexity. The result was 100K+ events per second in production.
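At-least-once delivery is the semantic that matters here, and it's worth seeing in miniature. The toy queue below is not Pub/Sub (which does this with ack deadlines, server-side retry, and configurable dead-letter topics); it just illustrates the contract: a message stays deliverable until the consumer acknowledges it, and only after exhausting redeliveries does it land in a dead-letter queue instead of vanishing.

```typescript
// Toy illustration of at-least-once delivery with dead-lettering.
// GCP Pub/Sub provides these semantics natively; this is not production code.
type Handler = (msg: string) => boolean; // true = ack, false = nack (failure)

class AtLeastOnceQueue {
  private pending: string[] = [];

  publish(msg: string) {
    this.pending.push(msg);
  }

  // Delivers every pending message, redelivering on failure up to a limit.
  // Returns the messages that exhausted their retries (the dead-letter set).
  deliverAll(handle: Handler, maxRedeliveries = 3): string[] {
    const deadLetter: string[] = [];
    for (const msg of this.pending.splice(0)) {
      let acked = false;
      for (let attempt = 0; attempt <= maxRedeliveries && !acked; attempt++) {
        acked = handle(msg); // a false return simulates a mid-pipeline failure
      }
      if (!acked) deadLetter.push(msg); // retries exhausted -> DLQ, not data loss
    }
    return deadLetter;
  }
}
```

Contrast this with Iteration 1: there, a failed handler simply meant the event was gone. Here, failure changes *where* the event ends up, never *whether* it exists.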
- The Schema Problem: Why We Moved Off JSONB
Agent trace data is deeply nested and hierarchical. The first instinct was JSONB: it handles complex relationships without forcing a rigid table structure, and it integrated cleanly with our TypeScript types.
It was the wrong call.
JSONB has real costs at scale: queries against nested fields are slower, storage footprint is larger, and the query planner has poor statistics for values buried inside a JSONB blob, so plans degrade as the data grows. A technical advisor who had run a similar experiment on Chrome's data layer put it directly: after a month under load, storage costs scaled faster than they could provision servers. Vertical scaling wasn't viable.
The replacement was a two-table normalized schema, driven by access patterns rather than data shape.
Table 1: Summary — the data users hit on dashboards and heatmaps. Fast, frequently queried. The composite primary key uses Trace ID as the primary dimension with Agent ID as secondary. Agent ID alone was rejected because agents represent an activity context, not a queryable unit: very few access patterns require pulling everything for an agent without first selecting a specific trace.
Table 2: Detail — the full trace drilldown: hierarchical span relationships, span metadata, prompts, responses, IP addresses. Only loaded when a user opens a specific trace.
The detail table avoids the wide-column trap. Instead of a separate column for every field type (parent_span_id, system_prompt, response, ip_address), it uses two columns: Detail_Name and Detail_Value. A parent span relationship is one row. A system prompt is another. A model response is another.
This is the Entity-Attribute-Value pattern. The tradeoff is intentional: you lose column-level type enforcement, but you gain schema flexibility — adding a new span detail type requires no migration, just a new row. Every row has the same shape regardless of what it holds, which makes horizontal distribution straightforward. Querying a full trace is a single indexed scan on Trace ID; filtering to a specific attribute is a WHERE on Detail Name.
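The write path for this pattern is a flattening step: take one span's type-dependent attributes and turn each present field into a Detail_Name / Detail_Value row. A sketch, where the `DetailRow` shape and the `toDetailRows` helper are illustrative (the column names come from the schema described above):

```typescript
// Sketch of flattening one span into EAV rows for the detail table.
// The Detail_Name / Detail_Value columns are from the schema above;
// the row type and helper names are illustrative.
interface DetailRow {
  traceId: string;
  spanId: string;
  detailName: string;  // maps to the Detail_Name column
  detailValue: string; // maps to the Detail_Value column
}

function toDetailRows(
  traceId: string,
  spanId: string,
  details: Record<string, string | null | undefined>,
): DetailRow[] {
  return Object.entries(details)
    .filter(([, value]) => value != null) // absent type-dependent fields emit no row
    .map(([detailName, value]) => ({
      traceId,
      spanId,
      detailName,
      detailValue: value as string,
    }));
}
```

Every row that comes out has the same shape, which is exactly the property that makes the indexed-scan-on-Trace-ID query and horizontal distribution straightforward. The cost, as noted, is that `detailValue` is a string regardless of what it holds.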
- The Incident: 87% of Records Orphaned
Several weeks after launch, 87% of trace records were being orphaned — captured but never assembled into their parent traces. No alerts had fired. No errors thrown. The system silently stopped linking spans.
The root cause was a Kubernetes interaction nobody had anticipated.
Agent connections were established to specific pods on specific ports. When a pod restarted — routine in Kubernetes — it came back on a different port. In-flight records had nowhere to land. They'd been captured by the proxy, sent down the pipeline, but the destination mapping was stale.
The fix was routing logic that detected port reassignment on pod restart and reattached active connections before in-flight records could be lost. Full tracking restored without downtime.
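The shape of that fix generalizes: route by stable pod identity and resolve the concrete host:port at send time, so a restart updates one mapping instead of stranding every in-flight record. A hypothetical sketch (the `PodRouter` class and its `generation` counter are illustrative, not the actual AgentTrace code):

```typescript
// Hypothetical sketch of restart-tolerant routing: key the routing table by
// pod identity, not by host:port, and re-resolve at send time.
interface Endpoint {
  host: string;
  port: number;
  generation: number; // bumped on each restart; useful for detecting staleness
}

class PodRouter {
  private routes = new Map<string, Endpoint>();

  // Called whenever a pod (re)announces itself, e.g. after a restart
  // brings it back on a different port.
  register(podId: string, host: string, port: number) {
    const prev = this.routes.get(podId);
    const generation = prev ? prev.generation + 1 : 0;
    this.routes.set(podId, { host, port, generation });
  }

  // In-flight records call this at send time, so a restarted pod's new
  // port is picked up transparently instead of the record being orphaned.
  resolve(podId: string): Endpoint | undefined {
    return this.routes.get(podId);
  }
}
```

The invariant that matters: no record ever holds a raw host:port long enough for a restart to invalidate it.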
The broader lesson: observability systems are the last to be monitored. If AgentTrace fails silently, nothing alerts you — because AgentTrace is the alerting infrastructure. Explicit health checks on the observability layer itself aren't optional.
- What We Learned
Agent observability is structurally different from service observability. The span tree isn't enough — you need a higher-order unit that captures agent intent and workflow context, not just execution steps.
Flexible schemas are expensive at scale. JSONB feels right for hierarchical data. It isn't, once query volume and storage pressure arrive.
Design for access patterns, not data shape. The two-table split came from asking how users actually query, not how the data is structured.
Monitor the monitoring. Silent data loss in observability infrastructure is worse than explicit failure.