You're running Claude in production. Requests are slow, costs are spiking, some responses are garbage — and you have no idea why.
That's the problem OpenTelemetry solves. And it works with AI agents just as well as with HTTP services.
Here's how to add proper distributed tracing to a Claude-powered agent.
## Why Tracing Matters for AI Agents
Traditional API monitoring breaks down for LLM applications because:
- Latency is multi-component: TTFT (time-to-first-token), generation speed, and post-processing are all distinct phases
- Cost attribution is opaque: Which prompt consumed 80% of your tokens last night?
- Errors are soft: A response that's factually wrong isn't a 500 — tracing is the only way to catch it
OpenTelemetry gives you spans, attributes, and traces that survive across async boundaries — exactly what you need for agents that make multiple LLM calls per request.
## The Setup

Install the packages:

```bash
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
```

Initialize the tracer (do this before anything else):

```typescript
// tracing.ts — import this first
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
## Wrapping Claude Calls in Spans

The Anthropic SDK doesn't auto-instrument. You wrap it manually:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const tracer = trace.getTracer('claude-agent', '1.0.0');
const client = new Anthropic();

export async function tracedCompletion(
  prompt: string,
  systemPrompt: string,
  model = 'claude-sonnet-4-6'
) {
  return tracer.startActiveSpan('claude.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_length': prompt.length,
      'llm.system_prompt_length': systemPrompt.length,
      'llm.vendor': 'anthropic',
    });
    try {
      const startTime = Date.now();
      const response = await client.messages.create({
        model,
        max_tokens: 4096,
        system: systemPrompt,
        messages: [{ role: 'user', content: prompt }],
      });
      const latency = Date.now() - startTime;
      span.setAttributes({
        'llm.input_tokens': response.usage.input_tokens,
        'llm.output_tokens': response.usage.output_tokens,
        'llm.total_tokens': response.usage.input_tokens + response.usage.output_tokens,
        'llm.latency_ms': latency,
        'llm.stop_reason': response.stop_reason ?? 'unknown',
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```
## Tracing Multi-Step Agent Flows

For agents with tool use or multi-turn conversations, nest your spans:

```typescript
// buildPrompt, SYSTEM_PROMPT, extractToolCalls, and executeTool are
// app-specific — substitute your own prompt builder and tool runner.
export async function runAgentTask(task: string) {
  return tracer.startActiveSpan('agent.task', async (taskSpan) => {
    taskSpan.setAttribute('agent.task', task);
    let step = 0;
    let continueLoop = true;

    while (continueLoop) {
      await tracer.startActiveSpan(`agent.step.${step}`, async (stepSpan) => {
        stepSpan.setAttribute('agent.step_number', step);

        const response = await tracedCompletion(
          buildPrompt(task, step),
          SYSTEM_PROMPT
        );

        const toolCalls = extractToolCalls(response);
        stepSpan.setAttribute('agent.tool_calls', toolCalls.length);

        for (const tool of toolCalls) {
          await tracer.startActiveSpan(`tool.${tool.name}`, async (toolSpan) => {
            toolSpan.setAttributes({
              'tool.name': tool.name,
              'tool.input_size': JSON.stringify(tool.input).length,
            });
            const result = await executeTool(tool);
            toolSpan.setAttribute('tool.success', result.success);
            toolSpan.end();
          });
        }

        continueLoop = toolCalls.length > 0 && step < 10; // cap runaway loops
        step++;
        stepSpan.end();
      });
    }

    taskSpan.setAttribute('agent.total_steps', step);
    taskSpan.end();
  });
}
```
## Sending to Langfuse (Recommended for LLM Tracing)

Langfuse is purpose-built for LLM observability and has native OTLP ingestion; point the exporter at it via environment variables:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64(pk:sk)>"
```
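If you'd rather build that header in code (for example, to pass it to the exporter's `headers` option instead of using env vars), the Basic value is just base64 of `publicKey:secretKey`. A minimal sketch — `langfuseAuthHeader` is a hypothetical helper, not part of any SDK:

```typescript
// Build the Basic auth value Langfuse's OTLP endpoint expects:
// base64("<publicKey>:<secretKey>"). Hypothetical helper, not an SDK API.
export function langfuseAuthHeader(publicKey: string, secretKey: string): string {
  const encoded = Buffer.from(`${publicKey}:${secretKey}`).toString('base64');
  return `Basic ${encoded}`;
}
```

You can then pass it as `headers: { Authorization: langfuseAuthHeader(pk, sk) }` when constructing the `OTLPTraceExporter`.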
This gives you:
- Per-model cost breakdowns
- Prompt/response logging with diff views
- Latency percentiles per agent step
- Score tracking (you can attach evals to traces)
Grafana Tempo and Honeycomb also work if you're already using those.
## The Attributes That Actually Matter

After running this in production, these are the attributes worth instrumenting:

| Attribute | Why |
|---|---|
| `llm.input_tokens` | Cost is dominated by input — track it per-call |
| `llm.cache_read_input_tokens` | If you're using prompt caching, this tells you your hit rate |
| `llm.latency_ms` | P99 matters more than average for user-facing flows |
| `agent.task_type` | Lets you slice costs and latency by what the agent is doing |
| `user.id` | Attribution for per-user cost limits or billing |
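The last two attributes (`agent.task_type`, `user.id`) come from your own request context, and they're only useful if every span spells them identically. One way to enforce that is a single helper every handler spreads into its span attributes — a sketch, with `taskType` and `userId` assumed to come from wherever your app stores request state:

```typescript
// Sketch: one place that defines the slicing attributes, so
// 'agent.task_type' and 'user.id' stay consistent across every span.
export function sliceAttributes(taskType: string, userId: string) {
  return {
    'agent.task_type': taskType,
    'user.id': userId,
  };
}
```

In a handler: `span.setAttributes(sliceAttributes('research', req.user.id))`.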
## Prompt Caching and Tracing

If you're using Anthropic's prompt caching (you should be — it cuts costs 80%+), the API returns cache hit data in the usage object:

```typescript
span.setAttributes({
  'llm.input_tokens': response.usage.input_tokens,
  'llm.output_tokens': response.usage.output_tokens,
  'llm.cache_creation_input_tokens': response.usage.cache_creation_input_tokens ?? 0,
  'llm.cache_read_input_tokens': response.usage.cache_read_input_tokens ?? 0,
});
```
Track your cache hit ratio over time. If it drops below 70%, your session architecture is broken.
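Tracking that ratio requires a consistent definition. A sketch, assuming Anthropic reports cached reads and writes separately from `input_tokens` (so total input = `input_tokens` + cache reads + cache writes):

```typescript
interface CacheUsage {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Hit ratio = tokens served from cache / all input tokens the API processed.
// Assumes input_tokens excludes cached reads/writes (see fields above).
export function cacheHitRatio(usage: CacheUsage): number {
  const reads = usage.cache_read_input_tokens ?? 0;
  const writes = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + reads + writes;
  return total === 0 ? 0 : reads / total;
}
```

Emit it as a span attribute (e.g. `llm.cache_hit_ratio`) and chart the average per deployment.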
## Cost Calculation in Spans

Add a derived cost attribute so you can alert on expensive sessions:

```typescript
// USD per million tokens — update these rates as pricing changes
const COSTS: Record<string, { input: number; output: number; cache_read: number }> = {
  'claude-sonnet-4-6': { input: 3.0, output: 15.0, cache_read: 0.30 },
  'claude-opus-4-7': { input: 15.0, output: 75.0, cache_read: 1.50 },
};

function estimateCost(model: string, usage: Anthropic.Usage): number {
  const rates = COSTS[model] ?? COSTS['claude-sonnet-4-6'];
  const inputCost = (usage.input_tokens / 1_000_000) * rates.input;
  const outputCost = (usage.output_tokens / 1_000_000) * rates.output;
  const cacheReadCost = ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * rates.cache_read;
  return inputCost + outputCost + cacheReadCost;
}

// Then in your span:
span.setAttribute('llm.estimated_cost_usd', estimateCost(model, response.usage));
```
Now you can alert in Grafana when a single agent task costs more than $0.50.
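You can also precompute the threshold into a boolean attribute, which keeps the alert rule trivial. A sketch with an assumed $0.50 budget; `budgetAttributes` is a hypothetical helper:

```typescript
const TASK_BUDGET_USD = 0.5; // assumed per-task budget

// Precompute an over-budget flag so the alert matches a boolean attribute
// instead of re-deriving the threshold in every backend.
export function budgetAttributes(estimatedCostUsd: number) {
  return {
    'llm.estimated_cost_usd': estimatedCostUsd,
    'llm.over_budget': estimatedCostUsd > TASK_BUDGET_USD,
  };
}
```

In the span: `span.setAttributes(budgetAttributes(estimateCost(model, response.usage)))`.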
## What This Unlocks
Once you have traces:
- Performance regression detection — new prompt version slowed P95 by 40%? You'll see it
- Cost attribution — "the research step costs 3x the writing step — should we use a cheaper model there?"
- Error correlation — "responses with empty tool outputs always follow cache miss patterns"
- User-level billing — if you're selling AI-powered features, you can charge per-token with accurate data
Building AI agents for production? The AI SaaS Starter Kit includes prompt caching, token tracking, and structured agent logging patterns baked in — skip the 2-week instrumentation ramp.