How I use Claude Code to refactor a legacy codebase — a complete workflow
Every developer has that codebase. The one inherited from a previous team, or written three years ago when you "didn't know better." The one where touching one thing breaks five others.
I've been using Claude Code to systematically refactor legacy codebases, and I want to share the exact workflow that's saved me hours of painful manual archaeology.
The problem with legacy refactoring
Legacy refactoring sessions are long. Really long. You're context-switching between:
- Understanding what the code actually does (vs what the docs say)
- Writing characterization tests to lock in current behavior
- Identifying the strangler fig entry points
- Actually making the changes
- Verifying nothing broke
These sessions can easily run 4-6 hours. If you're paying by the token or hitting rate limits mid-session, you lose all your context right when the AI has finally understood the messy domain model.
My workflow: the 5-phase approach
Phase 1: Archaeology (20-30 min)
Start by giving Claude Code a map of the disaster:
Read through src/legacy/ and tell me:
1. What does this module actually do? (not what the README says)
2. What are the data flows — what comes in, what goes out?
3. What are the most dangerous parts to touch?
4. What are the hidden dependencies I might miss?
This produces a "reality map" — often revealing that the README is 2 years out of date.
Phase 2: Characterization tests (45-60 min)
Before touching a line of production code:
Now write characterization tests for the UserProcessor class.
Don't test what it SHOULD do — test what it ACTUALLY does right now.
Include the weird edge cases. Include the behavior that looks like bugs
but might be intentional. I need these tests to catch regressions.
Characterization tests are your safety net. They capture current behavior exactly — even the bugs — so you know if you accidentally change something.
Example output:
```javascript
describe('UserProcessor (characterization)', () => {
  it('returns null for users with no email (not an error, just null)', async () => {
    const result = await processor.process({ id: 1, email: null });
    expect(result).toBeNull(); // Yes, this is weird. Don't change it.
  });

  it('adds 24 hours to timestamps in UTC but displays in Pacific', async () => {
    // This is almost certainly a bug but 3 other systems depend on it
    const result = await processor.process(testUser);
    expect(result.displayTime).toBe('2024-01-02T08:00:00'); // Not UTC
  });
});
```
Phase 3: Strangler fig planning (15 min)
Based on the characterization tests, design a strangler fig refactor.
Give me:
1. The new interface this module should expose
2. A migration path that lets old and new code coexist
3. The order of operations — what to refactor first, what last
4. Which tests need to change vs which are permanent safety nets
The strangler fig pattern means you never have a "big bang" refactor. Old code continues working while new code grows around it.
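A minimal sketch of what that coexistence can look like, in plain JavaScript. The class names and feature flag are illustrative (not from any real codebase), and the processor is shown synchronous for brevity even though the post's real one is async:

```javascript
// Strangler fig sketch: the legacy processor keeps its quirks, the new one
// fixes them, and a facade routes between them behind a feature flag.
class LegacyUserProcessor {
  process(user) {
    if (user.email === null) return null; // preserved quirk: null, not an error
    return { id: user.id, email: user.email.toLowerCase() };
  }
}

class NewUserProcessor {
  process(user) {
    // New, stricter contract: missing email is an explicit error
    if (!user.email) throw new Error(`user ${user.id} has no email`);
    return { id: user.id, email: user.email.trim().toLowerCase() };
  }
}

class UserProcessorFacade {
  // useNew decides, per call, whether traffic takes the new path:
  // start at "never", ramp up as characterization tests stay green.
  constructor({ useNew = () => false } = {}) {
    this.legacy = new LegacyUserProcessor();
    this.modern = new NewUserProcessor();
    this.useNew = useNew;
  }

  process(user) {
    return this.useNew(user) ? this.modern.process(user) : this.legacy.process(user);
  }
}
```

Callers only ever see the facade, so deleting `LegacyUserProcessor` at the end of the migration is a one-line change to `useNew`, not a codebase-wide hunt.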
Phase 4: Incremental refactoring (the bulk of the work)
Now you execute the plan — but incrementally:
Refactor step 1: Extract the email validation logic into EmailValidator class.
Keep the old behavior exactly. All characterization tests must still pass.
Add new unit tests for the extracted class.
Run tests after every step. Never refactor more than one thing at a time.
npm test -- --watch src/legacy/
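As a hedged illustration of what step 1 could produce: `EmailValidator` is the name from the prompt, but its methods and the regex are my guesses, and the processor is shown synchronous for brevity. The point is that the logic moves while the behavior does not:

```javascript
// Extracted validator: the validation logic gets a home of its own.
class EmailValidator {
  isMissing(email) {
    return email === null || email === undefined;
  }

  isWellFormed(email) {
    // Deliberately as loose as the hypothetical legacy inline check was
    return typeof email === 'string' && /^[^@\s]+@[^@\s]+$/.test(email);
  }
}

class UserProcessor {
  constructor(validator = new EmailValidator()) {
    this.validator = validator;
  }

  process(user) {
    // Characterization-locked quirk: missing email returns null, never throws
    if (this.validator.isMissing(user.email)) return null;
    if (!this.validator.isWellFormed(user.email)) return null;
    return { id: user.id, email: user.email.toLowerCase() };
  }
}
```

The existing characterization tests still run against `UserProcessor`; only the new unit tests target `EmailValidator` directly.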
Phase 5: Verification and documentation
Now that we've completed the refactor:
1. Update the README to reflect what the module actually does now
2. Document the intentional quirks we preserved
3. Flag the technical debt we chose NOT to fix and why
4. Write a migration guide for anyone else touching this code
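One way to keep step 2 honest is to record preserved quirks as data next to the code rather than as prose that drifts. Everything in this sketch, including the reasons, is illustrative:

```javascript
// Preserved-quirk registry: greppable, reviewable, and easy to keep in
// sync with the README. Entries and reasons below are hypothetical.
const PRESERVED_QUIRKS = [
  {
    id: 'null-email-returns-null',
    where: 'UserProcessor.process',
    behavior: 'null email resolves to null, not an error',
    reason: 'a downstream export treats null as "skip"; throwing would break it',
    fixable: false,
  },
  {
    id: 'timestamp-plus-24h-pacific',
    where: 'UserProcessor.process',
    behavior: 'displayTime is UTC shifted +24h, rendered as Pacific',
    reason: 'three downstream systems parse this exact format',
    fixable: 'only with a coordinated migration',
  },
];
```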
The rate limit problem
Here's the painful reality: a full legacy refactor session will likely exhaust your AI's context or hit rate limits. Right when Claude has internalized the messy domain model and understands why that weird timestamp behavior exists, the session ends.
I switched to SimplyLouie for exactly this reason. Flat $2/month with no token counting and no per-session rate limits. Legacy refactoring requires long, uninterrupted conversations — you can't afford to lose your AI's understanding of the codebase halfway through.
The results
Using this workflow on a 15,000-line Node.js codebase:
- Week 1: Characterization tests written, 0 regressions
- Week 2: Core business logic extracted into clean modules
- Week 3: Old code removed, test coverage went from 12% to 67%
- Total time: ~18 hours of Claude Code sessions vs estimated 60+ hours manual
The characterization tests alone saved us from 3 production incidents during the refactor.
Prompts to save
Here's my legacy refactoring prompt kit:
Archaeology prompt:
Read [file/directory] and give me a reality map: what it does, data flows,
dangers, and hidden dependencies. Ignore the documentation — tell me what
the code actually does.
Characterization test prompt:
Write characterization tests for [class/module]. Capture actual behavior,
not intended behavior. Include weird edge cases. Mark anything that looks
like a bug but might be intentional.
Strangler fig prompt:
Design a strangler fig refactor for [module]. New interface, migration path,
order of operations, which characterization tests are permanent safety nets.
Step prompt:
Execute refactor step [N]: [specific thing]. Preserve exact behavior.
All characterization tests must pass. Add unit tests for new code only.
The legacy codebase is survivable. The characterization tests give you courage to change things. The strangler fig gives you a path. And Claude Code gives you a pair programmer who can hold the whole mess in context.
Using SimplyLouie for flat-rate Claude access — $2/month, no rate limits. Long sessions for long refactors.