Production Patterns for Claude API in Next.js Apps


Lessons from shipping two live AI SaaS products solo.


About This Guide

In early 2026, I built and shipped two production AI applications from scratch on the same stack: Next.js (App Router) + Supabase + Vercel + Anthropic Claude API.

  • OTONAMI — an AI-powered music pitch matching platform connecting Japanese independent artists with international curators. Multi-weight matching engine, bilingual pitch editor, curator dashboard, Stripe billing, Resend transactional email.
  • STYLE SYNC — an AI dance costume stylist service that takes context about a performer (genre, venue, vibe, body type) and generates tailored styling recommendations with shopping links.

Along the way I made every mistake you can make with an LLM-backed product. The patterns below are the ones that survived contact with real users, real latency constraints, and real billing.

This guide isn't a tutorial. It assumes you've built against the Claude API at least once and are now trying to make it reliable, cheap, and fast in production. It focuses specifically on the patterns that matter inside Next.js App Router applications deployed to Vercel.

All pricing and model references are current as of April 2026.


1. Model Selection: The Three-Tier Strategy

The single most impactful decision in a Claude-backed app is which model handles which request. Get this wrong and you'll either burn money or ship a slow product.

Current generation (April 2026):

| Model | Input | Output | When to use |
|---|---|---|---|
| Claude Opus 4.7 | $5 / MTok | $25 / MTok | Agentic reasoning, complex planning, highest-stakes decisions |
| Claude Sonnet 4.6 | $3 / MTok | $15 / MTok | Default workhorse — most production work |
| Claude Haiku 4.5 | $1 / MTok | $5 / MTok | Classification, routing, extraction, moderation, simple generation |

The naïve pattern is "use Sonnet for everything." This works, but it's typically 2–3x more expensive than it needs to be.

The pattern that actually scales is task-level routing: decide per-request which model to invoke based on what the task actually requires.

Real example from OTONAMI. The pitch matching pipeline has three distinct stages:

```ts
// lib/claude/models.ts
export const MODELS = {
  // Classify incoming pitches into genre/mood/tempo buckets
  CLASSIFY: "claude-haiku-4-5",
  // Default: generate curator-facing pitch summaries in EN/JA
  GENERATE: "claude-sonnet-4-6",
  // Only for the multi-step matching reasoning that combines
  // audio similarity + curator preferences + historical signals
  REASON: "claude-opus-4-7",
} as const;
```

The routing decision lives in one place:

```ts
// lib/claude/route.ts
import { MODELS } from "./models";

type TaskType = "classify" | "generate" | "reason";

export function modelFor(task: TaskType) {
  switch (task) {
    case "classify":
      return MODELS.CLASSIFY;
    case "generate":
      return MODELS.GENERATE;
    case "reason":
      return MODELS.REASON;
  }
}
```

Rough cost impact in OTONAMI's pipeline: moving classification from Sonnet to Haiku cut that stage's cost by ~70%. Moving the reasoning stage from Sonnet to Opus increased cost but demonstrably improved match quality — and reasoning is less than 5% of total requests, so the absolute bill barely moved.

Rule of thumb: default to Sonnet 4.6, and only route upward or downward when you've measured a reason to.


2. System Prompt Architecture

A system prompt isn't a paragraph you write once and forget. In production it's a configuration file that gets versioned, A/B tested, and reviewed like any other critical system component.

The structure that I've found survives six months of iteration without turning into spaghetti:

```
1. ROLE       — who the model is acting as
2. RULES      — what it must always/never do
3. CONTEXT    — facts about the current user/session
4. EXAMPLES   — 2–4 input/output pairs
5. OUTPUT     — exact format expected
```

Concrete example from STYLE SYNC:

```ts
// lib/claude/prompts/stylist.ts
export const STYLIST_SYSTEM_PROMPT = `
You are a professional dance costume stylist with deep knowledge of
Japanese performance scenes including K-pop covers, hip-hop battles,
jazz dance, kawaii idol, and street styles.

RULES
- Always provide 3 distinct styling options, not variations of one.
- Respect the performer's stated body type and comfort preferences.
- Never recommend items requiring specialized custom tailoring unless
  the user has explicitly asked for custom work.
- All product suggestions must be available through major Japanese
  retailers (Rakuten, ZOZOTOWN, Amazon.co.jp).

OUTPUT FORMAT
Return strictly valid JSON matching this schema:
{
  "options": [
    {
      "name": string,
      "description": string,
      "top": { "item": string, "search_keywords": string },
      "bottom": { "item": string, "search_keywords": string },
      "accessories": string[],
      "reasoning": string
    }
  ]
}
Do not include markdown fences, commentary, or preamble.
`.trim();
```

A few non-obvious things going on here:

  • The rules are explicit prohibitions, not vague guidelines. "Never recommend items requiring specialized custom tailoring" is falsifiable. "Be practical about recommendations" is not.
  • The output format specifies exactly what wraps the payload. "Do not include markdown fences" saves you from JSON.parse() failing on ```json blocks, which is the single most common structured-output bug.
  • The prompt is static. Per-user context goes in the user message, not the system prompt. This lets us cache the system prompt (see section 6).

3. Structured Outputs: Three Reliable Patterns

"Just ask for JSON" works about 95% of the time with Sonnet 4.6. The other 5% is what kills you in production.

Three patterns I rely on, in increasing order of strictness:

Pattern A: JSON-via-prompt with defensive parsing

Good for: outputs where occasional retries are fine and the schema is loose.


```ts
function parseResponse(raw: string) {
  // Strip markdown fences if the model added them despite instructions
  const cleaned = raw.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  try {
    return JSON.parse(cleaned);
  } catch {
    // Fall back: extract the first {...} block
    const match = cleaned.match(/\{[\s\S]*\}/);
    if (!match) throw new Error("No JSON object in response");
    return JSON.parse(match[0]);
  }
}
```

Pattern B: Tool use as a schema enforcer

When you absolutely need a specific shape, define a "tool" the model is forced to call. The model never actually executes the tool — you use the tool_use block as the structured output.


```ts
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  tools: [
    {
      name: "record_match_score",
      description: "Record the match score between an artist and curator.",
      input_schema: {
        type: "object",
        properties: {
          score: { type: "number", minimum: 0, maximum: 100 },
          breakdown: {
            type: "object",
            properties: {
              genre: { type: "number" },
              audio_similarity: { type: "number" },
              mood: { type: "number" },
              tempo: { type: "number" },
            },
            required: ["genre", "audio_similarity", "mood", "tempo"],
          },
          reasoning: { type: "string" },
        },
        required: ["score", "breakdown", "reasoning"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "record_match_score" },
  messages: [{ role: "user", content: prompt }],
});

const toolUse = response.content.find((b) => b.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") {
  throw new Error("Model did not return a tool_use block");
}
const result = toolUse.input; // already shaped like your schema
```

This is substantially more reliable than free-form JSON and the schema is validated by the API itself. It's my default for any output that downstream code depends on.

Pattern C: Two-pass generation

For long-form outputs where you want both narrative quality and structured metadata, run two passes:

  1. First pass: generate the long-form content with a Sonnet call.
  2. Second pass: Haiku call with the generated content as input, using Pattern B to extract structured metadata (tags, summaries, sentiment).

This is 30–40% cheaper than asking one Opus call to do both, and each pass does one thing well.


4. Tool Use in Production

Once you move beyond single-shot calls, you're in agent territory, and tool use becomes the core primitive.

A few patterns that save pain:

Validate tool inputs before executing the tool. The model occasionally fabricates inputs that satisfy the JSON schema but don't correspond to real data (invalid IDs, out-of-range dates). Validate against your database before running side effects.


```ts
if (toolName === "update_pitch") {
  const pitch = await db.pitch.findUnique({ where: { id: input.pitchId } });
  if (!pitch) {
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `No pitch found with id ${input.pitchId}`,
      is_error: true,
    };
  }
  // ... proceed
}
```

Return errors as tool_result, not as thrown exceptions. Claude can recover gracefully from a tool error if you return it as a result block with is_error: true. If you throw, you've lost the conversation.

Cap tool use loops. Every agent loop should have a hard cap on iterations. I use 8 as a default for OTONAMI's matching pipeline. If the model hasn't finished in 8 tool-use rounds, something is wrong.


```ts
const MAX_ITERATIONS = 8;

for (let i = 0; i < MAX_ITERATIONS; i++) {
  const response = await anthropic.messages.create({ /* ... */ });

  if (response.stop_reason === "end_turn") {
    return response;
  }

  if (response.stop_reason !== "tool_use") {
    throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
  }

  // ... execute tools, append tool_results, loop
}

throw new Error("Agent loop did not terminate within iteration cap");
```

5. Streaming in Next.js App Router

Users notice latency. For anything longer than ~2 seconds of generation, you need to stream.

Here's the pattern I use in Next.js App Router with a Route Handler:


```ts
// app/api/pitch/generate/route.ts
import Anthropic from "@anthropic-ai/sdk";

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const client = new Anthropic();

  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }],
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const event of stream) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            controller.enqueue(encoder.encode(event.delta.text));
          }
        }
        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "X-Content-Type-Options": "nosniff",
    },
  });
}
```

On the client:


```tsx
"use client";
import { useState } from "react";

export function PitchStream() {
  const [output, setOutput] = useState("");

  async function generate(prompt: string) {
    setOutput("");
    const res = await fetch("/api/pitch/generate", {
      method: "POST",
      body: JSON.stringify({ prompt }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      setOutput((prev) => prev + decoder.decode(value));
    }
  }

  return (
    <>
      <button onClick={() => generate("Write a pitch for a jazz fusion band")}>
        Generate
      </button>
      <pre>{output}</pre>
    </>
  );
}
```

Three production details people miss:

  1. Set X-Content-Type-Options: nosniff. Vercel's edge infrastructure occasionally tries to sniff streaming content as HTML, which breaks things.
  2. Don't use Server Actions for streaming text. Route Handlers are the right primitive. Server Actions' return value is fully serialized before transmission.
  3. Guard against client disconnects. If the user navigates away, abort the Claude call. The SDK accepts an AbortSignal.
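The third point can be wired up by forwarding the request's own abort signal. This is a sketch; it assumes the SDK's per-request options accept an AbortSignal via `signal`, and omits the stream-piping shown in the handler above:

```ts
// app/api/pitch/generate/route.ts (abort-aware variant, sketch)
import Anthropic from "@anthropic-ai/sdk";

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const client = new Anthropic();

  // req.signal fires when the client disconnects; forwarding it
  // cancels the in-flight Claude request so you stop paying for tokens.
  const stream = client.messages.stream(
    {
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      messages: [{ role: "user", content: prompt }],
    },
    { signal: req.signal },
  );

  // ... pipe the stream to the response as in the handler above
}
```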

6. Prompt Caching for Cost Control

This is the single biggest cost lever in the Claude API. If you're not using it, you're leaving money on the table.

Prompt caching reuses previously-processed portions of your prompt across requests. Cache hits bill at about 10% of the standard input rate. That's a 90% discount on repeated context.

What to cache:

  • Static system prompts (almost always)
  • Long reference documents that multiple user queries will touch (product catalogs, knowledge bases)
  • Few-shot example banks

Pattern:


```ts
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: STYLIST_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});
```

On STYLE SYNC, the system prompt plus few-shot examples is ~4,000 tokens. Without caching, every request pays the full input cost. With caching, the first request pays full input + write premium; every subsequent request within the cache TTL pays ~10%.

For a product doing 10,000 requests/day with a 4K-token system prompt on Sonnet 4.6, the cached portion alone is 40 MTok/day: roughly $120/day at the standard input rate vs. $12/day at the cached rate. Over a month that's roughly $3,200 in savings.

Common mistake: putting user-specific data in the cached prefix. The cache only hits when the prefix is byte-identical, so any per-user variability in the cached portion defeats the point. Keep the cached portion static; put dynamic content in the user message.


7. Context Window Management

Sonnet 4.6 and the Opus 4.x line support 1M-token context windows. This doesn't mean you should fill them.

Rules I follow:

Summarize before 50K tokens. Conversation quality degrades well before the context limit. When a conversation crosses ~50K tokens, I run a Haiku summarization pass that compresses the history into a "conversation so far" note, then continue with the compressed context.


```ts
async function compressHistory(messages: Message[]) {
  const summary = await anthropic.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Summarize this conversation preserving key facts, decisions, and open questions:\n\n${JSON.stringify(messages)}`,
      },
    ],
  });
  return summary.content[0].type === "text" ? summary.content[0].text : "";
}
```

Use 1M context only when the task genuinely requires it. Requests above 200K input tokens on 1M-context models switch to long-context pricing on current generation. This is the right tradeoff for legitimate long-document work, but it's expensive if you're using it as a substitute for retrieval.

Stream RAG context, don't stuff it. If you have a 500-document knowledge base, don't dump it all in. Retrieve 3–8 relevant documents per query and include those.
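What "retrieve, then include" looks like in practice. The retrieval call is hypothetical (`retrieveTopK` stands in for whatever vector search you use); the prompt assembly is the part worth keeping pure and testable:

```typescript
type Doc = { title: string; content: string };

// Assemble a bounded prompt from the top-K retrieved documents.
// Pure function, so the shape and size are easy to test and tune.
export function buildRagPrompt(docs: Doc[], question: string): string {
  const context = docs
    .map((d, i) => `[Doc ${i + 1}: ${d.title}]\n${d.content}`)
    .join("\n\n");
  return `Answer using only the context below.\n\nCONTEXT:\n${context}\n\nQUESTION: ${question}`;
}

// Usage (retrieveTopK is a placeholder for your vector search):
// const docs = await retrieveTopK(query, 5);
// const prompt = buildRagPrompt(docs, query);
```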


8. Error Handling & Retries

The Claude API is reliable, but production apps need to handle the edges.

Errors you'll see:

  • 429 rate_limit_error — retry with exponential backoff
  • 529 overloaded_error — retry with exponential backoff (this is an Anthropic-side capacity signal)
  • 400 invalid_request_error — don't retry; fix your request
  • 500 api_error — retry once, then fail loudly

Pattern:


```ts
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      lastErr = err;
      const status = err?.status;
      const retryable = status === 429 || status === 529 || status === 500;
      if (!retryable || attempt === maxAttempts - 1) throw err;
      const backoffMs = Math.min(1000 * 2 ** attempt, 8000) + Math.random() * 250;
      await new Promise((r) => setTimeout(r, backoffMs));
    }
  }
  throw lastErr;
}
```

The + Math.random() * 250 is jitter. Without it, retries from multiple simultaneous failures hit the API in lockstep and all fail together.

User-facing errors: never show API error messages to users. Map them to clean messages like "We're generating a lot right now — try again in a moment."
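A minimal mapping helper along those lines. This is a sketch; the status grouping mirrors the table above, but the copy is my own:

```typescript
// Map raw Anthropic API error statuses to user-safe messages.
// Never surface err.message from the API directly to users.
export function userFacingError(status: number | undefined): string {
  switch (status) {
    case 429:
    case 529:
      return "We're generating a lot right now. Try again in a moment.";
    case 400:
      return "Something about that request didn't work. Please adjust and retry.";
    default:
      return "Generation failed. Please try again.";
  }
}
```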


9. Observability

You cannot optimize what you don't measure. Every production Claude call in my stack logs:

  • Request ID (for matching logs to API responses)
  • Model used
  • Input tokens, output tokens, cache-read tokens
  • Latency (first token, total)
  • User ID and feature ID
  • Whether any tool calls ran, and how many

```ts
await supabase.from("claude_calls").insert({
  user_id: userId,
  feature: "pitch_generation",
  model: response.model,
  input_tokens: response.usage.input_tokens,
  output_tokens: response.usage.output_tokens,
  cache_read_tokens: response.usage.cache_read_input_tokens ?? 0,
  latency_ms: Date.now() - startTime,
  request_id: response.id,
});
```

Two queries I run weekly:

Cost by feature:


```sql
SELECT feature,
       SUM(input_tokens) / 1e6 * 3 +
       SUM(output_tokens) / 1e6 * 15 +
       SUM(cache_read_tokens) / 1e6 * 0.3 AS usd_cost,
       COUNT(*) AS calls
FROM claude_calls
WHERE created_at > now() - interval '7 days'
GROUP BY feature
ORDER BY usd_cost DESC;
```

Latency outliers by feature:


```sql
SELECT feature,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM claude_calls
WHERE created_at > now() - interval '7 days'
GROUP BY feature;
```

When a feature's P95 latency spikes overnight with no code change, it's almost always because a prompt grew unnoticed. The query tells you which one.
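To confirm that a prompt grew, a follow-up query against the same `claude_calls` table charts average input tokens per feature over time. A step change in this series with no code deploy is the smoking gun:

```sql
SELECT feature,
       date_trunc('day', created_at) AS day,
       AVG(input_tokens) AS avg_input_tokens
FROM claude_calls
WHERE created_at > now() - interval '30 days'
GROUP BY feature, day
ORDER BY feature, day;
```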


10. Closing

The common thread across everything above: treat Claude calls as production infrastructure, not as a clever add-on. That means routing, caching, validation, observability, and cost attribution from day one — not as an afterthought when the bill gets scary.

The two products I built on this stack were shipped solo, from zero to live, in a matter of weeks. That's only possible because the primitives in the Claude API and Next.js App Router are genuinely excellent. The patterns above are the ones that turn those primitives into something a real business can run on.


If you're building an AI-powered SaaS on this stack and want a senior full-stack dev to either help you ship faster or review what you've already built, I take on both project-based work and retainers.

Satoshi Yamashita
Tokyo / Yokohama, Japan
Reach me on Upwork, or via otonami.io.


This guide was written based on patterns used in production at otonami.io and stylesync.vercel.app as of April 2026. Pricing and model availability are subject to change; verify current rates at docs.claude.com.

