I've been building with Claude Code for two months. The first few weeks taught me something that took longer than it should have to name.
I spent weeks trying to write the most effective prompts... but Claude doesn't drift because of a bad prompt. It drifts because there's no structure beneath the session. Every conversation starts fresh. Decisions get re-made. Context you thought was stable evaporates. You end up not with a codebase shaped by clear intent, but with a layered record of every session that started over.
That's also not a model problem. It's a workflow problem.
Before I explain what I built — the part that happened two days ago
Two days before publishing this, the workflow caught two bugs in itself.
The strategic-PM agent was running an autonomous sprint and hit an ambiguous instruction in the project's CLAUDE.md: "one phase = one session." It interpreted this as requiring human approval between phases, which was exactly the opposite of what autonomous orchestration should do. Sessions are technical CLI isolation for context purity, not human gates. The agent had been operating with a subtly wrong mental model, and nothing had surfaced it until the system ran against itself.
The second bug: the /sprint-plan skill instructed Claude to enter plan mode inside claude -p sessions. In non-interactive mode, plan mode exits while waiting for a human approval that never comes. Exit code 0. Nothing written. Silent failure.
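The cheap guard against that class of failure is to verify the artifact instead of trusting the exit code. A minimal sketch, assuming Python as the glue; the expected-file convention and helper names here are my illustration, not the repo's actual mechanism:

```python
import subprocess
from pathlib import Path

def check_artifact(path: Path) -> None:
    # Exit code 0 alone proves nothing in non-interactive mode, so verify
    # the phase actually wrote a non-empty output file.
    if not path.exists() or path.stat().st_size == 0:
        raise RuntimeError(f"silent failure: {path} missing or empty")

def run_phase(prompt: str, expected: Path) -> None:
    # Hypothetical wrapper around a non-interactive `claude -p` call.
    subprocess.run(["claude", "-p", prompt], check=True)
    check_artifact(expected)
```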
Both are fixed in v3.5.1, documented in the CHANGELOG. I mention this here, at the top, because it's the most honest thing I can say about why this workflow matters: a system that catches its own inconsistencies before they reach production is the actual goal. Not a system that looks clean.
What I built
A sprint workflow, mainly inspired by the agile methods most teams already use. I built it because I kept seeing frameworks and CLAUDE.md templates being released, the kind you paste once and gradually forget. This is different: a methodology. The agent team runs the sprint, and you do two things: invoke and validate.
The team:
- 9 specialized agents in 3 groups: 3 strategic (PM, independent QA challenger, marketing strategist) · 5 technical (architect, code reviewer, security auditor, ops engineer, QA tester) · 1 ops monitor.
- Each has a defined role, persistent memory across sessions, and instructions it can't override.
- No agent reviews its own work.
The cycle:
/sprint-plan → /build → /review → /fix → (optional /red-team) → /capture-lessons
The skills:
18 skills encode every phase, so you stop prompt-engineering the same context every sprint; the skill carries it for you.
Two modes:
- Manual: you invoke each phase, validate, move on.
- Autonomous: the strategic-PM orchestrates end-to-end. You review the PR.
Everything is plain markdown and JSON. Nothing to install beyond Claude Code.
The proof
Two months. 55+ sprints on a real production SaaS — multi-tenant, AWS Lambda + ECS Fargate, Stripe billing, currently serving live traffic.
| Metric | Value |
|---|---|
| Total API cost | $1,394.38 (via ccusage) |
| Without prompt caching | ~$13,000 |
| Cache hit rate | 95.5% |
| Sprint duration (autonomous mode) | 30–45 min end-to-end |
Why the ratio is so high: cached tokens cost roughly 10x less than fresh input tokens, so a session that opens with a warm cache costs a fraction of a cold one. The ~9.3x gap between what I paid and what it would have cost without caching comes entirely from workflow discipline: CLAUDE.md stays small and stable (it's the cache anchor), skills load deferred, and every session opens warm. No API trick, just consistency.
Every number is verifiable. ccusage parses Claude Code's local usage logs with daily breakdowns by model, and the sprint history lives in the CHANGELOG with commit SHAs you can inspect.
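As a back-of-envelope check (illustrative numbers only: a ~10x cache discount and the measured hit rate, ignoring output tokens and cache-write surcharges, so it won't reproduce the exact observed ratio):

```python
# Effective input-token cost with prompt caching, back-of-envelope.
hit_rate = 0.955        # fraction of input tokens served from cache
cache_discount = 0.10   # assumed cached-token price relative to a fresh token

# Fresh tokens cost full price; cached tokens cost the discounted price.
effective_cost = (1 - hit_rate) * 1.0 + hit_rate * cache_discount
savings_factor = 1 / effective_cost

print(f"effective cost per input token: {effective_cost:.4f}x")
print(f"input-side savings factor: {savings_factor:.1f}x")
```

On these assumptions the input-side savings alone come out around 7x; the real bill also includes output tokens and actual per-model pricing, which is where the rest of the observed gap lives.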
What I'm not claiming
This isn't a claim that AI replaced a team. I'm one person operating with the discipline of a small team.
The 30–45 minutes applies to focused, well-scoped features in autonomous mode. Sprint duration scales with what you ask for; if you scope a complex MVP in a single sprint, you'll burn proportionally more tokens and time. Manual mode takes longer, but gives you finer control over each phase (useful when you want to inspect or redirect between steps). Large refactors can take hours. The workflow doesn't change that math; it makes it more predictable.
The $1,394.38 assumes a high cache hit rate and this specific usage pattern. Your numbers will differ.
If you want to go deeper — how the autonomous mode actually works
You write a direction file: what to build, what success looks like. The strategic-PM reads it, proposes a sprint plan, and launches each phase in its own isolated claude -p subprocess for context purity.
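A minimal sketch of that chaining, assuming Python as the glue and the claude -p print mode; the phase prompts below are hypothetical placeholders, not the repo's actual skill invocations:

```python
import shutil
import subprocess

# Hypothetical phase prompts; the real invocations live in the repo's skills.
PHASES = ["/sprint-plan", "/build", "/review", "/fix"]

def phase_command(phase: str) -> list[str]:
    # `claude -p` runs one prompt non-interactively and exits, so each phase
    # gets a fresh subprocess with no context bleeding in from the last one.
    return ["claude", "-p", f"Run the {phase} skill for the current sprint"]

def run_sprint() -> None:
    for phase in PHASES:
        result = subprocess.run(phase_command(phase), capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"{phase} failed:\n{result.stderr}")

if __name__ == "__main__" and shutil.which("claude"):
    run_sprint()
```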
The strategic-QA agent independently reviews each sprint. It challenges PM decisions, not just validates them. The PM defends its choices with evidence. "I agree" without explanation is explicitly forbidden. After 3 rounds without consensus, it escalates to a blockers file for you to decide.
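The challenge loop can be sketched as follows; this is a conceptual illustration under my own assumptions (the function names and the blockers-file format are hypothetical), not the repo's implementation:

```python
from pathlib import Path

MAX_ROUNDS = 3

def debate(pm_respond, qa_challenge, blockers: Path = Path("blockers.md")) -> str:
    """QA raises an objection each round; the PM must answer with evidence.
    After MAX_ROUNDS without consensus, the dispute escalates to a blockers
    file for the human to decide."""
    challenge = None
    for round_no in range(1, MAX_ROUNDS + 1):
        challenge = qa_challenge(round_no)
        if challenge is None:  # QA has no remaining objections: consensus
            return "consensus"
        defense = pm_respond(challenge)
        if not defense:        # a bare "I agree" with no evidence is forbidden
            raise ValueError("PM must defend with evidence, not just agree")
    blockers.write_text(f"Unresolved after {MAX_ROUNDS} rounds: {challenge}\n")
    return "escalated"
```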
You're not sitting there waiting. The separation between phases is technical isolation for context purity, not a human gate. The PM chains all phases in isolated subprocesses; you can close your terminal and come back. When it's done, there's a PR: plan, build output, review findings, fix log, retrospective. You decide if it ships.
The repo
Open-sourced today: github.com/rbah31/claude-code-workflow
If you've been building with Claude Code and something feels off — the drift, the inconsistency, the feeling that each session starts from zero — it's not you. It's the single-session model. There's a better one. This is mine. Clone it, run a few sprints, tell me what you'd change.
This will be part of a series on building in production with AI. That is, if I don't flop.