Tokenizer Changes Can Break AI-Agent Budget Assumptions

Claude Sonnet 5 is a good reminder that AI-agent cost control is not only about model pricing.

Anthropic says Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens through August 31, 2026, then moves to $3 per million input tokens and $15 per million output tokens.

Vercel’s AI Gateway changelog adds an important implementation detail: Sonnet 5 uses an updated tokenizer, and the same input can map to more tokens.

That matters if you build agents.

A model can have attractive pricing and still change your runtime cost assumptions.

The problem

Many agent systems estimate cost like this:

const estimatedCost =
inputTokens * inputPrice +
maxOutputTokens * outputPrice;

That is fine as a start.

But it assumes the runtime knows:

which model is being used
how the input is tokenized
the correct input price
the correct output price
how many retries are allowed
how many steps are allowed
whether the agent is making progress

If any of those assumptions drift, the estimate becomes weaker.

A tokenizer update is one way this happens.

The prompt text may be the same.

The token count may not be.

Agents amplify small estimation errors

For a single request, a token-count surprise is usually manageable.

For an agent, it can compound.

An agent may:

call the model
inspect files
add context
run tools
retry
ask a similar prompt
switch models
continue for more steps

Small cost drift per call becomes bigger across a run.

Now add parallel agents, fallback paths, or long-context workflows.

The problem is no longer “What does this model cost?”

The problem is:

Should this next provider call be allowed?

Add a pre-call decision

Before the provider call, add a guard decision.

type BeforeCallInput = {
runId: string;
model: string;
prompt: string;
estimatedInputTokens: number;
maxOutputTokens: number;
stepCount: number;
retryCount: number;
budgetRemaining: number;
previousPrompts: string[];
};

type GuardDecision =
| { allowed: true }
| {
allowed: false;
reason:
| "unknown_model_pricing"
| "budget_exceeded"
| "max_steps_exceeded"
| "retry_storm"
| "prompt_loop";
};

Then use it before execution:

const decision = guard.beforeCall({
runId,
model,
prompt,
estimatedInputTokens,
maxOutputTokens,
stepCount,
retryCount,
budgetRemaining,
previousPrompts,
});

if (!decision.allowed) {
return {
status: "stopped",
reason: decision.reason,
};
}

const result = await provider.call({
model,
prompt,
});

The important part is placement.

The check happens before the provider call.

Not after the bill.

Fail closed on unknown pricing

If the runtime does not know the model price, it cannot enforce a reliable budget.

if (!pricingCatalog.has(model)) {
return {
allowed: false,
reason: "unknown_model_pricing",
};
}

Do not guess.

A model alias, fallback, gateway rewrite, or typo can break your assumptions.

Unknown pricing should stop the run before the call executes.

Recalculate tokens after model migration

When changing models, do not assume the same prompt has the same token count.

A basic migration checklist:

run representative prompts through the new tokenizer
compare input token counts
compare output length behavior
update pricing metadata
retest max-step limits
retest retry behavior
check fallback paths
measure cost per successful task

Cost per token is useful.

Cost per successful task is more useful.

Watch retry and prompt-loop behavior

A smarter model may finish tasks in fewer steps.

It may also continue longer because it can pursue more complex plans.

Your runtime should still stop obvious waste.

if (stepCount >= maxSteps) {
return { allowed: false, reason: "max_steps_exceeded" };
}

if (retryCount >= maxRetries) {
return { allowed: false, reason: "retry_storm" };
}

if (similarToRecentPrompt(prompt, previousPrompts)) {
return { allowed: false, reason: "prompt_loop" };
}

These checks do not make the model better.

They make the system safer to operate.

Where AI CostGuard fits

AI CostGuard is the local-first TypeScript/Node.js runtime safety layer I’m building for AI-agent projects.

It focuses on pre-call safety checks for:

retry storms
prompt loops
max-step explosions
runaway execution
unknown model pricing
budget overruns
uncontrolled provider calls

It is not a billing ledger.

It is not a hard security boundary.

It does not replace provider dashboards.

The goal is to catch obviously risky calls before they execute.

Takeaway

Model pricing pages tell you the unit price.

They do not tell you whether your agent should make the next call.

Tokenizer changes, fallback models, retries, and prompt loops all affect real runtime cost.

For AI agents, cost safety belongs before the provider call.