Claude Had Elevated Errors. Your AI Agent Should Not Turn That Into a Retry Storm.

The event

On June 23, Anthropic’s Claude status page listed elevated error rates affecting Claude.ai, resolved at 18:32 UTC. It also showed elevated error rates across multiple models around the same period. On June 24, the status page listed no incidents for the day.

This kind of incident is easy to frame as a provider problem.

And it is.

But for AI-agent developers, it is also a runtime-design problem.

Because provider instability exposes what your agent does when the model layer becomes unreliable.

The hidden problem

A normal chatbot failure is visible.

The user asks a question.

The model fails.

The interface shows an error.

The user stops.

Agents are different.

An agent may be running in a CLI, background job, coding workflow, CI task, support automation, or internal tool.

It may not stop cleanly.

It may retry.

It may call tools again.

It may add context.

It may switch models.

It may repeat the same prompt with minor edits.

It may continue even though the task is not making progress.

That turns provider instability into local runtime waste.

Bad retry logic

A naive agent loop might look like this:

for (let attempt = 0; attempt < 3; attempt++) {
try {
return await provider.call({
model,
messages,
});
} catch (err) {
continue;
}
}

This is simple.

Too simple.

It does not know whether the provider is degraded.

It does not know whether the next prompt is similar to the previous one.

It does not know whether the run has exceeded budget.

It does not know whether the fallback model is more expensive.

It does not know whether the agent is making progress.

For a normal HTTP call, this may be acceptable.

For an AI agent, it is weak.

Agents need retry budgets

A retry count says:

“Try three times.”

A retry budget says:

“Retry only while the run remains safe.”

That means the runtime should check more than the attempt number.

For example:

const decision = guard.beforeCall({
runId,
model,
messages,
stepCount,
retryCount,
providerStatus,
estimatedInputTokens,
estimatedOutputTokens,
budgetRemaining,
previousPrompts,
progressState,
});

if (!decision.allowed) {
throw decision.error;
}

const result = await provider.call({
model,
messages,
});

The exact API is not the point.

The important part is placement.

The guard runs before the provider call.

That means it can stop a bad run before more spend happens.

What should the runtime check?
Budget

Every agent run should have a task-level budget.

If the next estimated call would exceed that budget, stop.

Max steps

If the agent has already used too many steps, stop.

A confused agent should not be allowed to run forever.

Retry storms

Retries are useful.

Retry storms are dangerous.

If the agent keeps hitting similar failures, the runtime should block the next attempt.

Prompt loops

If the prompt is almost the same as previous failed attempts, the agent may be stuck.

Changing a few words is not progress.

Unknown model pricing

Fallback logic can be dangerous during provider incidents.

If the runtime does not know the price of the fallback model, it should fail closed.

Do not guess inside an autonomous loop.

No progress

A run can be active and still useless.

If the agent is consuming steps and budget without moving closer to a valid result, stop and return a structured error.

A safer retry shape

A better agent retry loop treats every provider call as a decision.

while (!task.done) {
const decision = guard.beforeCall({
runId: task.id,
model: task.model,
messages: task.messages,
stepCount: task.steps.length,
retryCount: task.retryCount,
budgetRemaining: task.budgetRemaining,
previousPrompts: task.previousPrompts,
progressState: task.progress,
});

if (!decision.allowed) {
return {
status: "stopped",
reason: decision.reason,
error: decision.error,
};
}

const response = await provider.call({
model: task.model,
messages: task.messages,
});

task = updateTaskState(task, response);
}

This does not make the model smarter.

It makes the runtime less naive.

That matters when providers have elevated errors.

Where AI CostGuard fits

This is the layer I am building with AI CostGuard.

AI CostGuard is a local-first TypeScript runtime safety layer for AI agents.

It is designed to stop expensive agent failure modes before provider calls execute:

retry storms
prompt loops
max-step explosions
no-progress runs
budget overruns
unknown model pricing
runaway agent behavior

It is not a billing dashboard.

It is not a hard security boundary.

It is not a cloud control plane.

It is a pre-call runtime kill switch for agent cost and loop failures.

The takeaway

Every provider will have incidents.

Claude.

OpenAI.

Gemini.

Any model API.

That is normal infrastructure reality.

The important question is what your agent does during the incident.

If it retries blindly, it can turn provider instability into local cost waste.

If it has runtime guardrails, it can stop cleanly.

AI agents need retry budgets.

Not just retries.
https://github.com/salimassili62-afk/ai-costguard