I Cut My AI Agent's Token Bill by 62% in One Weekend. Here's the Receipts.

dev.to

My agent spent $5.40 to do what a 200-line script does for free. Then I spent a weekend fixing it, and brought the same workflow down to $2.05 per run — a 62% drop with no measurable quality regression. This is the breakdown, with the actual prompt diffs and the benchmarks that mattered.

The starting point: a single agent, three habits, $847/mo

The agent I run most is a research-and-summarize loop. It searches the web, scrapes ~20 pages, drafts a structured summary, and writes a file. Sounds harmless. The bill said otherwise.

Three things were quietly hemorrhaging tokens:

  1. Full-page scraping with no chunking. I was dumping entire 50k-character pages into the context window, then asking the model to extract the relevant 2,000 characters. That's a textbook case of "context stuffing" — paying for the whole haystack when you only need the needle.
  2. Verbose system prompts that re-stated the same instructions three different ways. The model already knows how to summarize. I was paying for it to re-read my instructions every call.
  3. Using the most expensive model for every step. Reasoning-tier models were being used for "summarize this paragraph" — the same task a smaller model handles fine.

A 2026 Stevens Institute analysis pegs unconstrained agents at $5–$8 per task. Mine was $5.40. Textbook.

What I changed (the actual diffs)

1. Chunk-then-extract, not dump-then-ask

The old pattern:

# BAD: pay for the whole page, then ask the model to find the bit you wanted
page = fetch(url)  # ~50,000 chars
response = llm(f"Summarize this page, focusing on {topic}:\n\n{page}")
Enter fullscreen mode Exit fullscreen mode

The new pattern:

# GOOD: filter first, then send only what's relevant
page = fetch(url)
chunks = chunk(page, max_chars=4000)
relevant = [c for c in chunks if keyword_score(c, topic) > 0.3]
relevant = relevant[:5]  # hard cap
response = llm(f"Summarize these excerpts for {topic}:\n\n" + "\n---\n".join(relevant))
Enter fullscreen mode Exit fullscreen mode

Token usage dropped from ~12,500 input tokens per page to ~3,200. Quality went up — fewer hallucinations, because the model wasn't drowning in noise.

2. System prompt audit: cut 740 tokens, lost nothing

Old system prompt: 1,180 tokens.
New: 440 tokens.

The win wasn't in what I added — it was in removing redundancy. Three things got deleted:

  • Re-stated tool descriptions. The model already knows what web_search does. One short line is enough.
  • "Think step-by-step" boilerplate. This is now a default behavior of the better reasoning models; you don't need to pay to remind them.
  • Style guides the model already follows. "Be concise," "use markdown," "cite sources" — most of these are baked into the model's training. Repeating them costs tokens every single call.

I ran the same 50-task eval suite before and after. Output quality was statistically indistinguishable. The 740 tokens saved per call added up to about $180/month on my volume.

3. Multi-model routing: GPT-5.4 isn't always the answer

This was the biggest single win. I split my agent's steps into three tiers:

Step Old model New model Cost per call
Extract key facts from chunks GPT-5.4 Claude 4 Sonnet $0.003 → $0.0008
Draft structured summary GPT-5.4 GPT-5.4 $0.018 (unchanged)
Quality check + rewrite GPT-5.4 Claude 4 Sonnet $0.003 → $0.0008

The reasoning-tier model only touches the synthesis step. Everything else runs on a cheaper, faster model that's still good enough for extractive work.

Routing logic, in 15 lines:

def route(step):
    if step.requires_reasoning:
        return "gpt-5.4"      # synthesis, planning, judgment calls
    if step.requires_long_context:
        return "claude-4-sonnet"  # chunk summarization, fact extraction
    return "gpt-5-mini"         # formatting, light edits
Enter fullscreen mode Exit fullscreen mode

The benchmarks that actually mattered

I didn't trust my gut on quality. I ran a 50-task eval suite with three different rubrics:

  • Fact accuracy (ground truth vs. agent output, scored by an LLM-as-judge with citations)
  • Citation coverage (% of claims that trace back to a source URL)
  • User satisfaction (binary thumbs from my own review of the last 30 days)

Numbers, before vs. after:

Metric Before After Change
Cost per task $5.40 $2.05 -62%
Median latency 41s 28s -32%
Fact accuracy 0.81 0.83 +0.02 (noise)
Citation coverage 67% 89% +22pp
User satisfaction 0.74 0.78 +0.04

Citation coverage went up because chunk-then-extract gives the model cleaner evidence to cite. Latency dropped because smaller models respond faster. Fact accuracy was a wash — which is what you want, because the whole point was to cut cost without hurting quality.

What I'd do differently if I started today

Three things, in order of ROI:

  1. Add a token budget per task, hardcoded. If your agent's burn rate exceeds $X per task, kill it and return a partial result. This single line of guardrail logic is worth more than any prompt tweak.
  2. Cache aggressively. Anything you've fetched or extracted before is free to reuse. I was re-scraping the same URLs on every run. A 24-hour cache on extraction results cut my external API spend by another 18%.
  3. Log every model's contribution separately. If you can't see which step costs what, you can't optimize it. A simple CSV log of {task_id, step, model, input_tokens, output_tokens, cost} per run is the highest-leverage observability you'll add this year.

The reflex in 2026 is to reach for a bigger model when quality dips. Most of the time, the answer is a smaller model with a tighter context.

What I learned

  • Context stuffing is the silent killer. More context isn't always better — it's almost always more expensive.
  • System prompts rot. The thing you wrote six months ago is probably three times longer than it needs to be.
  • Multi-model routing is the highest-leverage cost optimization available — and it's not even close.
  • Quality benchmarks beat intuition every time. "It feels about the same" is a coin flip. A 50-task eval suite is an answer.
  • Citation coverage is the under-rated metric. It correlates with user trust more than any other number I tracked.

The agent didn't get smarter. The pipeline got more honest about what each step actually needs.

If you're running agents in production and you haven't looked at your per-step token breakdown in the last 30 days, that's where I'd start. The $847/month I'm saving came from one weekend and three files changed.

Source: dev.to

arrow_back Back to News