You're running Claude in production. Requests are slow, costs are spiking, some responses are garbage — and you have no idea why.
That's the problem OpenTelemetry solves. And it works with AI agents just as well as with HTTP services.
Here's how to add proper distributed tracing to a Claude-powered agent.
## Why Tracing Matters for AI Agents
Traditional API monitoring breaks down for LLM applications because:
- Latency is multi-component: TTFT (time-to-first-token), generation speed, and post-processing are all distinct phases
- Cost attribution is opaque: Which prompt consumed 80% of your tokens last night?
- Errors are soft: A response that's factually wrong isn't a 500 — tracing is the only way to catch it
OpenTelemetry gives you spans, attributes, and traces that survive across async boundaries — exactly what you need for agents that make multiple LLM calls per request.
## The Setup

Install the packages:

```bash
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
```

Initialize the tracer (do this before anything else):

```typescript
// tracing.ts — import this first
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
## Wrapping Claude Calls in Spans

The Anthropic SDK doesn't auto-instrument. You wrap it manually:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const tracer = trace.getTracer('claude-agent', '1.0.0');
const client = new Anthropic();

export async function tracedCompletion(
  prompt: string,
  systemPrompt: string,
  model = 'claude-sonnet-4-6'
) {
  return tracer.startActiveSpan('claude.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_length': prompt.length,
      'llm.system_prompt_length': systemPrompt.length,
      'llm.vendor': 'anthropic',
    });
    try {
      const startTime = Date.now();
      const response = await client.messages.create({
        model,
        max_tokens: 4096,
        system: systemPrompt,
        messages: [{ role: 'user', content: prompt }],
      });
      const latency = Date.now() - startTime;
      span.setAttributes({
        'llm.input_tokens': response.usage.input_tokens,
        'llm.output_tokens': response.usage.output_tokens,
        'llm.total_tokens': response.usage.input_tokens + response.usage.output_tokens,
        'llm.latency_ms': latency,
        'llm.stop_reason': response.stop_reason ?? 'unknown',
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```
## Tracing Multi-Step Agent Flows

For agents with tool use or multi-turn conversations, nest your spans:

```typescript
// buildPrompt, SYSTEM_PROMPT, extractToolCalls, and executeTool are
// app-specific — substitute your own prompt builder and tool runner.
export async function runAgentTask(task: string) {
  return tracer.startActiveSpan('agent.task', async (taskSpan) => {
    taskSpan.setAttribute('agent.task', task);
    let step = 0;
    let continueLoop = true;

    while (continueLoop) {
      await tracer.startActiveSpan(`agent.step.${step}`, async (stepSpan) => {
        stepSpan.setAttribute('agent.step_number', step);

        const response = await tracedCompletion(
          buildPrompt(task, step),
          SYSTEM_PROMPT
        );

        const toolCalls = extractToolCalls(response);
        stepSpan.setAttribute('agent.tool_calls', toolCalls.length);

        for (const tool of toolCalls) {
          await tracer.startActiveSpan(`tool.${tool.name}`, async (toolSpan) => {
            toolSpan.setAttributes({
              'tool.name': tool.name,
              'tool.input_size': JSON.stringify(tool.input).length,
            });
            const result = await executeTool(tool);
            toolSpan.setAttribute('tool.success', result.success);
            toolSpan.end();
          });
        }

        continueLoop = toolCalls.length > 0 && step < 10; // cap runaway loops
        step++;
        stepSpan.end();
      });
    }

    taskSpan.setAttribute('agent.total_steps', step);
    taskSpan.end();
  });
}
```
## Sending to Langfuse (Recommended for LLM Tracing)

Langfuse is purpose-built for LLM observability and has native OTLP ingestion; point the exporter at it via environment variables:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64(pk:sk)>"
```
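If you'd rather build that header in code (for example, to pass it to the exporter's `headers` option instead of using env vars), the Basic value is just base64 of `publicKey:secretKey`. A minimal sketch — `langfuseAuthHeader` is a hypothetical helper, not part of any SDK:

```typescript
// Build the Basic auth value Langfuse's OTLP endpoint expects:
// base64("<publicKey>:<secretKey>"). Hypothetical helper, not an SDK API.
export function langfuseAuthHeader(publicKey: string, secretKey: string): string {
  const encoded = Buffer.from(`${publicKey}:${secretKey}`).toString('base64');
  return `Basic ${encoded}`;
}
```

You can then pass it as `headers: { Authorization: langfuseAuthHeader(pk, sk) }` when constructing the `OTLPTraceExporter`.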
This gives you:
- Per-model cost breakdowns
- Prompt/response logging with diff views
- Latency percentiles per agent step
- Score tracking (you can attach evals to traces)
Grafana Tempo and Honeycomb also work if you're already using those.
## The Attributes That Actually Matter

After running this in production, these are the attributes worth instrumenting:

| Attribute | Why |
|---|---|
| `llm.input_tokens` | Cost is dominated by input — track it per-call |
| `llm.cache_read_input_tokens` | If you're using prompt caching, this tells you your hit rate |
| `llm.latency_ms` | P99 matters more than average for user-facing flows |
| `agent.task_type` | Lets you slice costs and latency by what the agent is doing |
| `user.id` | Attribution for per-user cost limits or billing |
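The last two attributes (`agent.task_type`, `user.id`) come from your own request context, and they're only useful if every span spells them identically. One way to enforce that is a single helper every handler spreads into its span attributes — a sketch, with `taskType` and `userId` assumed to come from wherever your app stores request state:

```typescript
// Sketch: one place that defines the slicing attributes, so
// 'agent.task_type' and 'user.id' stay consistent across every span.
export function sliceAttributes(taskType: string, userId: string) {
  return {
    'agent.task_type': taskType,
    'user.id': userId,
  };
}
```

In a handler: `span.setAttributes(sliceAttributes('research', req.user.id))`.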
## Prompt Caching and Tracing

If you're using Anthropic's prompt caching (you should be — it cuts costs 80%+), the API returns cache hit data in the usage object:

```typescript
span.setAttributes({
  'llm.input_tokens': response.usage.input_tokens,
  'llm.output_tokens': response.usage.output_tokens,
  'llm.cache_creation_input_tokens': response.usage.cache_creation_input_tokens ?? 0,
  'llm.cache_read_input_tokens': response.usage.cache_read_input_tokens ?? 0,
});
```
Track your cache hit ratio over time. If it drops below 70%, your session architecture is broken.
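Tracking that ratio requires a consistent definition. A sketch, assuming Anthropic reports cached reads and writes separately from `input_tokens` (so total input = `input_tokens` + cache reads + cache writes):

```typescript
interface CacheUsage {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Hit ratio = tokens served from cache / all input tokens the API processed.
// Assumes input_tokens excludes cached reads/writes (see fields above).
export function cacheHitRatio(usage: CacheUsage): number {
  const reads = usage.cache_read_input_tokens ?? 0;
  const writes = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + reads + writes;
  return total === 0 ? 0 : reads / total;
}
```

Emit it as a span attribute (e.g. `llm.cache_hit_ratio`) and chart the average per deployment.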
## Cost Calculation in Spans

Add a derived cost attribute so you can alert on expensive sessions:

```typescript
// USD per million tokens — update these rates as pricing changes
const COSTS: Record<string, { input: number; output: number; cache_read: number }> = {
  'claude-sonnet-4-6': { input: 3.0, output: 15.0, cache_read: 0.30 },
  'claude-opus-4-7': { input: 15.0, output: 75.0, cache_read: 1.50 },
};

function estimateCost(model: string, usage: Anthropic.Usage): number {
  const rates = COSTS[model] ?? COSTS['claude-sonnet-4-6'];
  const inputCost = (usage.input_tokens / 1_000_000) * rates.input;
  const outputCost = (usage.output_tokens / 1_000_000) * rates.output;
  const cacheReadCost = ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * rates.cache_read;
  return inputCost + outputCost + cacheReadCost;
}

// Then in your span:
span.setAttribute('llm.estimated_cost_usd', estimateCost(model, response.usage));
```
Now you can alert in Grafana when a single agent task costs more than $0.50.
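You can also precompute the threshold into a boolean attribute, which keeps the alert rule trivial. A sketch with an assumed $0.50 budget; `budgetAttributes` is a hypothetical helper:

```typescript
const TASK_BUDGET_USD = 0.5; // assumed per-task budget

// Precompute an over-budget flag so the alert matches a boolean attribute
// instead of re-deriving the threshold in every backend.
export function budgetAttributes(estimatedCostUsd: number) {
  return {
    'llm.estimated_cost_usd': estimatedCostUsd,
    'llm.over_budget': estimatedCostUsd > TASK_BUDGET_USD,
  };
}
```

In the span: `span.setAttributes(budgetAttributes(estimateCost(model, response.usage)))`.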
## What This Unlocks
Once you have traces:
- Performance regression detection — new prompt version slowed P95 by 40%? You'll see it
- Cost attribution — "the research step costs 3x the writing step — should we use a cheaper model there?"
- Error correlation — "responses with empty tool outputs always follow cache miss patterns"
- User-level billing — if you're selling AI-powered features, you can charge per-token with accurate data
Building AI agents for production? The AI SaaS Starter Kit includes prompt caching, token tracking, and structured agent logging patterns baked in — skip the 2-week instrumentation ramp.