I Almost Cancelled Claude: I Ran My Own Benchmarks Before Pulling the Trigger



I was reviewing a PR from my team on Tuesday afternoon when I caught the Hacker News thread. "I cancelled Claude" — 874 points, 400+ comments, the kind of conversation that explodes because it puts words to something a lot of people had been feeling but hadn't articulated. I read the whole thing. Then I closed the tab and opened my own logs.

I've had Claude Code running against the same set of test cases since March. Not an academic benchmark — these are the real scenarios I throw at it in my actual workflow: TypeScript module refactoring, SQL migration generation, code path analysis in my monorepo on Railway. If there's degradation, my logs have it. And if they don't, then the HN thread is mostly emotional noise.

Spoiler: the degradation is real. Just not where most people are complaining.

Claude Quality Degradation in 2025: What My Logs Say vs. What HN Says

My tracking setup is simple. Since the post on Claude Code quality reports, I've been running a fixed set of 23 test cases against Claude Code. The cases are split into three categories: reasoning about existing code, generating new code, and bug detection in snippets into which I deliberately injected known errors.

Every run gets logged with a timestamp, model, tokens used, and a manual score from me — 1 to 5. It's not automated. I do it by hand, once a week; it takes about 40 minutes. Boring but honest.
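To make the fields concrete, one logged run can be sketched as a small record type. This is my illustration, not a real schema — the names are made up:

```typescript
// Hypothetical shape of one logged run; field names are illustrative.
interface RunLog {
  timestamp: string; // ISO date of the run
  model: string; // e.g. "claude-sonnet"
  tokensUsed: number;
  caseId: string; // one of the 23 fixed cases
  score: 1 | 2 | 3 | 4 | 5; // manual score, assigned by hand
}

// Weekly average across a batch of runs.
function weeklyAverage(runs: RunLog[]): number {
  if (runs.length === 0) return 0;
  return runs.reduce((sum, r) => sum + r.score, 0) / runs.length;
}
```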

Here are the numbers from March through July 2025:

# Scoring summary — Claude Code (Sonnet base)
# Scale: 1-5 per case, weekly average

Week 2025-03-10:  avg=4.2  failed_cases=3/23
Week 2025-04-07:  avg=4.1  failed_cases=3/23
Week 2025-05-05:  avg=3.8  failed_cases=5/23  # First notable drop
Week 2025-06-02:  avg=3.6  failed_cases=7/23
Week 2025-06-30:  avg=3.5  failed_cases=8/23
Week 2025-07-21:  avg=3.7  failed_cases=6/23  # Slight bounce

There's degradation. Going from 4.2 to 3.5 over four months isn't statistical variation — it's a trend. But when I look at which cases failed, the story gets complicated.
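If you want to sanity-check the "trend, not variation" claim, a least-squares slope over those weekly averages is enough. A rough sketch — it treats the logged periods as evenly spaced, which mine aren't exactly:

```typescript
// Ordinary least-squares slope of y against evenly spaced x = 0..n-1.
function trendSlope(y: number[]): number {
  const n = y.length;
  const meanX = (n - 1) / 2;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  for (let i = 0; i < n; i++) {
    cov += (i - meanX) * (y[i] - meanY);
    varX += (i - meanX) ** 2;
  }
  return cov / varX;
}

// Weekly averages from the summary above, March through July 2025.
const slope = trendSlope([4.2, 4.1, 3.8, 3.6, 3.5, 3.7]);
// Comes out around -0.13 points per logged period: a consistent downward drift.
```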

Where It Got Worse, Where It Didn't, and Why That Matters More Than the Average

The 8 cases that failed the week of June 30th: six are new TypeScript code generation with complex constraints. Two are code path analysis with more than three levels of indirection. The 15 that passed: reasoning about existing code, known bug detection, refactoring of bounded modules.

My thesis before opening the logs was that degradation would show up in complex reasoning. I was wrong. It's in generation under multiple simultaneous constraints. The model performs worse when I say "generate a hook that's compatible with React 18, no local state, uses context X, doesn't break type Y, and is testable with vitest." Five constraints at once and quality drops noticeably compared to March.

What did NOT get worse — and what nobody in the HN thread mentions — is bug detection. In March it found 11 of 13 injected bugs; in July it finds 12. A small delta, but in the right direction. Reasoning about existing code didn't degrade either — which is, ironically, the most common use case in my day-to-day as Head of Development.

// Example of a case that GOT WORSE — generation with multiple constraints
// Original prompt (summarized):
// "Generate a TypeScript custom hook that:
//  - Is compatible with React 18 concurrent mode
//  - Does not use useState or useReducer (only useRef for mutable state)
//  - Consumes AuthContext without unnecessary re-renders
//  - Returns a discriminated type (Success | Loading | Error)
//  - Is testable without mocking the context"

// March response: working hook, correct types, ref used properly
// July response: working hook BUT return type poorly discriminated,
// unnecessary re-render on the Error case, comment in the code
// suggests useReducer as an alternative (ignoring the explicit constraint)

// Concrete difference: didn't collapse, but ignored one of the five constraints
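To make "return type poorly discriminated" concrete: the constraint asks for a tagged union the compiler can narrow. A sketch of the difference — the type names are my own, not the model's actual output:

```typescript
// What the prompt asks for: a discriminated union with a literal tag.
type AuthResult<T> =
  | { status: "success"; data: T }
  | { status: "loading" }
  | { status: "error"; error: Error };

// With a proper discriminant, the compiler narrows each branch safely.
function describe<T>(r: AuthResult<T>): string {
  switch (r.status) {
    case "success":
      return "ok"; // r.data is only reachable here
    case "loading":
      return "waiting";
    case "error":
      return r.error.message; // r.error only exists on this branch
  }
}

// The July failure mode looked roughly like this untagged shape, where
// every field is optional and nothing forces the caller to check state:
type PoorlyDiscriminated<T> = {
  data?: T;
  loading?: boolean;
  error?: Error;
};
```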

That pattern of "ignore one constraint when there are five or more" is consistent across the failed cases. The model hasn't regressed across the board; what has degraded is its handling of multiple simultaneous constraints.

The Gotcha Nobody's Measuring: The Long-Context Coherence Regression

Here's the part that was most uncomfortable to document, and it connects to what I'd already seen in the post on async agents and observability.

In my long context window cases — conversations over 15,000 tokens where the model has to stay coherent with decisions made early on — the degradation is more pronounced than the overall average. In March those cases had an avg of 4.0. In July, 3.1. That's nearly a full point of drop on the same test set.

The specific symptom: in turn 12, the model contradicts a decision it made in turn 3. It's not a reasoning error in the moment — it's loss of coherence across the conversation. For my agent workflows that's worse than an isolated error, because it's silent. Debugging async agents already taught me that silent failures are the ones that hurt most. This qualifies.

I also connect this to what I observed when I built the CC-Canary setup: the LLM-as-a-judge proxy I put in front of the agent started detecting coherence inconsistencies more frequently starting in May. I hadn't explicitly linked it to model degradation until now.

# CC-Canary log — coherence failures detected per month
# (extracted from alerting system, simplified format)

# -h suppresses filename prefixes, so $1 is the log timestamp
grep -h "coherence_fail" /var/log/canary/2025-*.log | \
  awk '{print substr($1,1,7)}' | sort | uniq -c

# Output:
#   12 2025-03
#   14 2025-04
#   19 2025-05
#   31 2025-06
#   28 2025-07  # Slight drop but still high

From 12 to 31 in three months. That number matters more to me than any synthetic benchmark.

Common Mistakes When Measuring LLM Degradation (Including Mine)

Mistake 1: Comparing against memory. "It used to answer better" is a trap. Human memory optimizes toward cases that impressed or frustrated you. Without logs, you're comparing against an idealized version of the past. I fell into this before I started tracking systematically.

Mistake 2: Not controlling the prompt. If you change the prompt between runs, you're not measuring the model — you're measuring your prompt. My 23 cases have fixed prompts, in plain text, saved in a file I don't touch between weeks. If I want to test a variant, I add it as a new case.
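As an illustration of what "fixed prompts" means in practice, the cases can live as a frozen list checked into the repo. This sketch is hypothetical — my real file is plain text:

```typescript
// Hypothetical frozen benchmark cases. The rule: never edit an existing
// prompt; a variant becomes a new case with a new id.
interface BenchCase {
  readonly id: string;
  readonly category: "reasoning" | "generation" | "bug_detection";
  readonly prompt: string;
}

const CASES: readonly BenchCase[] = [
  {
    id: "gen-hook-01",
    category: "generation",
    prompt: "Generate a TypeScript custom hook that ...", // fixed text
  },
  // ...the rest of the 23 cases
];
```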

Mistake 3: Conflating UX friction with quality degradation. The HN thread mixes both. Some of the most upvoted complaints are about the Claude.ai interface — shorter responses, changed UI, behavior of the "new conversation" button. That's not model degradation, it's product change. Legitimate to complain about, but different categories.

Mistake 4: Only measuring the cases that matter to you. My TypeScript generation cases got worse. My security analysis cases improved slightly (relevant after what I saw with the Bitwarden CLI supply chain attack — I started including trust surface analysis cases). If I only measured TypeScript, I'd conclude total degradation. If I only measured security analysis, I'd conclude improvement. The heterogeneous average is more honest.

Mistake 5: Not distinguishing model from temperature/sampling. A change in sampling parameters can look like capability degradation. I have no visibility into that from the outside, but it's a real confounder to keep in mind before attributing everything to the model.

FAQ: Claude Quality Degradation 2025

Is Claude's degradation in 2025 real or perception?
With my logs: real in generation under multiple constraints and in long-context coherence. Not real (or slightly positive) in bug detection and reasoning about existing code. The total degradation perceived by the HN thread mixes actual model degradation with UX changes and with the bias that people report frustrations, not satisfactions.

How reliable are my homegrown benchmarks?
More reliable than memory, less reliable than a setup with automated judges and multiple evaluators. Manual 1-5 scoring has variance. What makes it useful is consistency: same prompts, same evaluator (me), same frequency. It's not science — it's field engineering.

Does cancelling Claude have an empirical basis or is it herd behavior?
Depends on the use case. If you work primarily with code generation under multiple simultaneous constraints, the degradation I'm measuring is pronounced enough to warrant rethinking. If you work with reasoning about existing code or debugging, my numbers don't justify cancellation. The HN thread has 874 points because it captured a real frustration — but the technical reason to cancel varies by use case.

What alternatives did you try?
I ran the same case set against GPT-4o in June as a comparison point. On TypeScript generation with multiple constraints, GPT-4o scored avg=3.9 vs Claude's 3.5 — a real difference but not dramatic. On long-context coherence, GPT-4o scored avg=3.4 vs Claude's 3.1 — basically even. Neither won by enough of a margin to make the migration friction worth it, plus the cost of retraining my workflows and prompts. That could change. I keep measuring.

Did previous posts about Claude Code quality change what you measure?
Yes. After the post on LLMs generating security reports, I added specific security analysis cases to my suite. After the post on Agent Vault, I added cases for reasoning about credentials and permissions in agent contexts. The suite grows. The denominator changes. That makes historical comparisons slightly noisy — I acknowledge that.

Are you cancelling or not?
Not for now. But I have a defined threshold: if the overall average drops below 3.3 for two consecutive weeks, or if coherence inconsistencies in CC-Canary exceed 40 events per month for two months running, I reevaluate. I'm not deciding based on a viral thread — I'm deciding based on my own numbers.
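That exit rule is mechanical enough to write down. A sketch with the thresholds from above — the wiring to the real logs is left out:

```typescript
// Reevaluate if the weekly average stays below 3.3 for two consecutive
// weeks, OR monthly CC-Canary coherence failures exceed 40 for two
// months running. Arrays are ordered oldest to newest.
function shouldReevaluate(
  weeklyAverages: number[],
  monthlyCoherenceFails: number[],
): boolean {
  const weeks = weeklyAverages.slice(-2);
  const avgBreach = weeks.length === 2 && weeks.every((a) => a < 3.3);

  const months = monthlyCoherenceFails.slice(-2);
  const coherenceBreach = months.length === 2 && months.every((c) => c > 40);

  return avgBreach || coherenceBreach;
}
```

With the current numbers from the logs (weekly averages ending 3.5, 3.7 and monthly coherence counts 31, 28), this comes back false — which is why I'm staying put.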

What I'd Do Differently: Don't Cancel on Instinct, Measure Before You Move

Here's my point: the HN thread is right that something changed. It's wrong in the collective diagnosis because it mixes real signals with UX noise, confirmation bias, and the effect that frustration goes viral more than satisfaction does.

The degradation I'm measuring is specific and bounded. Generation under multiple constraints, coherence in long context. If those are the cases that dominate the work of whoever cancelled, the decision has empirical grounding. If they cancelled because "I feel like it used to be better" or because the UI changed, they're paying a migration cost for a perception they never measured.

The uncomfortable thing about this conclusion is that it gives more work to anyone trying to decide. "Is it worth cancelling?" doesn't have a global answer — it has an answer that depends on which use cases dominate your own work. And that requires measurement, not Hacker News consensus.

I'm staying with Claude because my numbers don't justify the friction of moving. But I have the threshold set, the logs running, and CC-Canary watching. If the numbers change, I move. No drama.


Are you measuring Claude response quality in production? Do you have your own regression setup? I'd love to compare methodologies — especially if you've found degradation in cases I'm not covering.


This article was originally published on juanchi.dev

Source: dev.to
