How I use Claude Code to write tests for untested code — a complete workflow
Every codebase has that one module. The one that runs in production, touches critical business logic, and has exactly zero tests.
You know it's there. Your team knows it's there. Nobody touches it.
Here's the workflow I use with Claude Code to finally write tests for untested code — without breaking anything.
The problem with writing tests for untested code
Writing tests for code you didn't write is hard. Writing tests for code that has no tests, no documentation, and unclear business rules is genuinely dangerous.
The wrong approach: assume you understand the code, write tests based on that assumption, refactor confidently.
The result: your tests pass but your assumptions were wrong. Production breaks.
The right approach: characterization tests first.
What are characterization tests?
Characterization tests don't test what the code should do. They test what the code actually does.
You run the code, record the output, and write tests that assert that exact output. Even if the output looks wrong.
This gives you a safety net that captures current behavior — not imagined behavior.
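A minimal, self-contained illustration of the idea (the `legacy_round_price` helper here is hypothetical, invented just to show the pattern):

```python
def legacy_round_price(price):
    # Hypothetical legacy helper. Python 3's round() uses banker's
    # rounding, which surprises many callers.
    return round(price)

def test_legacy_round_price_characterization():
    # Characterization: assert what the code DOES, not what we expect.
    # round(2.5) is 2 in Python 3, not 3 -- we pin that, bug or not.
    assert legacy_round_price(2.5) == 2
    assert legacy_round_price(3.5) == 4
```

The test looks wrong at first glance; that's the point. It records behavior so a later refactor can't silently change it.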
The workflow
Step 1: Audit the untested code
# Find files with low/no test coverage
coverage run -m pytest && coverage report --sort=cover | head -30
# Or for Node.js
npx jest --coverage 2>/dev/null | grep -E '\|\s+0\s+\|'  # rows with 0% coverage
Then ask Claude Code:
Read src/billing/invoice.py and list every function, method, and code path.
For each one, tell me:
1. What inputs it accepts
2. What it returns
3. What side effects it has (DB writes, API calls, emails)
4. What could go wrong
This audit is crucial. You need to understand what you're testing before you write a single test.
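I like to cross-check the audit with a quick AST pass of my own, so nothing gets missed. A sketch (the `src/billing/invoice.py` path is the example module from the prompt above):

```python
import ast

def list_functions(source: str):
    """Return (name, lineno, args) for every function and method in source."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args]
            found.append((node.name, node.lineno, args))
    return sorted(found, key=lambda item: item[1])

# Usage sketch:
# with open("src/billing/invoice.py") as f:
#     for name, line, args in list_functions(f.read()):
#         print(f"line {line}: {name}({', '.join(args)})")
```

If Claude's audit lists fewer functions than this does, ask it to re-read the file.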
Step 2: Write characterization tests
Now write characterization tests for InvoiceProcessor.calculate_total().
Do NOT assume what it should return — run it with real inputs and record actual outputs.
Test these edge cases:
- Empty line items
- Negative quantities
- Tax rate = 0
- Tax rate = 1.0
- Items with None price
Capture the actual behavior even if it seems wrong.
The output looks like this:
def test_calculate_total_empty_items_characterization():
    # Characterization test: current behavior as of 2024-01
    # TODO: verify this is the INTENDED behavior before refactoring
    processor = InvoiceProcessor()
    result = processor.calculate_total([])
    assert result == 0  # returns 0, not None

def test_calculate_total_negative_quantity_characterization():
    # WARNING: negative quantities currently produce negative totals
    # This may be a bug — but we're capturing current behavior
    processor = InvoiceProcessor()
    result = processor.calculate_total([{"price": 10, "qty": -1}])
    assert result == -10  # captures current (possibly buggy) behavior
Notice the comments. This is critical: characterization tests should document that you're capturing current behavior, not asserting correctness.
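Before writing the assertions, I sometimes record outputs with a small harness so I'm pasting observed values rather than guessed ones. A sketch (`record_behavior` is my own helper, not part of any library; the commented usage assumes the `InvoiceProcessor` from the example):

```python
def record_behavior(fn, cases):
    """Run fn against each input case and record the result or exception.

    Returns a dict mapping repr(input) -> observed outcome, ready to be
    transcribed into characterization test assertions.
    """
    observed = {}
    for case in cases:
        try:
            observed[repr(case)] = ("returns", fn(case))
        except Exception as exc:
            observed[repr(case)] = ("raises", type(exc).__name__)
    return observed

# Usage sketch with the edge cases from the prompt above:
# processor = InvoiceProcessor()
# for inputs, outcome in record_behavior(
#     processor.calculate_total,
#     [[], [{"price": 10, "qty": -1}], [{"price": None, "qty": 1}]],
# ).items():
#     print(inputs, "->", outcome)
```

Exceptions get recorded too: "raises TypeError on None price" is behavior worth pinning, just like a return value.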
Step 3: Run and stabilize
python -m pytest tests/characterization/ -v
Some tests will fail. That's expected — you may have made wrong assumptions. Update the assertions to match actual output.
Keep running until all characterization tests pass. Now you have a complete behavioral snapshot.
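The record-then-compare loop can be made mechanical with a golden-file helper. A sketch under my own conventions (the helper and the snapshot directory are assumptions, not a standard pytest feature):

```python
import json
from pathlib import Path

def assert_matches_snapshot(name, value, snapshot_dir="tests/characterization/snapshots"):
    """Compare value to a stored JSON snapshot; create the snapshot on first run.

    First run records current behavior; later runs fail if behavior drifts
    from what was recorded.
    """
    path = Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(value, sort_keys=True, default=str)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(serialized)
        return  # recorded; nothing to compare yet
    assert path.read_text() == serialized, f"behavior drift in {name}"
```

Commit the snapshot files. A failing comparison after a refactor is exactly the regression signal you built this suite for.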
Step 4: Gap analysis
Compare the characterization tests we just wrote against the actual code paths in invoice.py.
List every code path that has NO test coverage:
- Which branches in conditionals are untested?
- Which exception handlers are never triggered?
- Which database calls happen with no test?
Prioritize by risk (production traffic × business impact).
Claude Code will output something like:
Untested paths by risk:
1. HIGH: calculate_total() when database is unavailable — no test, no error handling
2. HIGH: send_invoice() when email service fails — silent failure, no retry
3. MEDIUM: apply_discount() with stacked discounts — logic looks wrong
4. LOW: format_address() with international postal codes — edge case
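You can ground the gap analysis in data rather than taking Claude's word for it. A sketch that parses coverage.py's JSON report (assumes you ran with branch coverage, e.g. `coverage run --branch -m pytest` then `coverage json`; the `missing_branches` key is from that report format as I understand it, so verify against your coverage.py version):

```python
import json

def untested_branches(coverage_json_path="coverage.json"):
    """List (file, source_line) pairs for branches the report marks as missed."""
    with open(coverage_json_path) as f:
        report = json.load(f)
    gaps = set()
    for filename, data in report["files"].items():
        for branch in data.get("missing_branches", []):
            # each entry is a [source_line, destination_line] pair
            gaps.add((filename, branch[0]))
    return sorted(gaps)
```

Feed the resulting file/line list back into the prompt so the risk ranking covers every missed branch, not just the ones Claude noticed.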
Step 5: Write behavior tests for gaps
Now you write real tests — not characterization tests, but behavior tests that define what the code should do:
Write proper behavior tests for the high-risk gaps:
1. Test that calculate_total() raises InvoiceError when database is unavailable
2. Test that send_invoice() retries 3 times before marking invoice as failed
These should be written as the tests we WANT to pass, not what currently passes.
These tests will fail initially — that's the point. They define the target behavior. Now you fix the code to make them pass.
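The retry test, for instance, can be written against target behavior before any of the code exists. A self-contained sketch (`send_with_retry`, `EmailDown`, and the 3-attempt policy are stand-ins for the hypothetical `send_invoice`, not code from the example module):

```python
class EmailDown(Exception):
    """Stand-in for the email service's failure mode."""

def send_with_retry(send, payload, attempts=3):
    """Target behavior for send_invoice: retry up to `attempts` times,
    then report failure instead of failing silently."""
    for _ in range(attempts):
        try:
            send(payload)
            return "sent"
        except EmailDown:
            continue
    return "failed"

def test_retries_three_times_before_marking_failed():
    calls = []
    def always_down(payload):
        calls.append(payload)
        raise EmailDown
    # This test defines what we WANT; it fails until the code implements it.
    assert send_with_retry(always_down, {"invoice_id": 1}) == "failed"
    assert len(calls) == 3
```

Writing the test first forces you to pin down the policy (how many retries? fail loudly or mark a status?) before touching the implementation.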
The rate limit moment
If your untested code is a large legacy module — 2,000 lines, 40 functions, nested conditionals everywhere — Claude Code will hit its rate limits during the audit phase.
This happens every time. Reading an entire file, analyzing every code path, generating characterization tests for 40 functions — it's a lot of context.
I switched to SimplyLouie ($2/month) for exactly this reason. Unlimited Claude sessions. When the audit runs long, I don't get cut off mid-analysis.
But the workflow works regardless — even if you have to pause and resume, the characterization tests preserve your progress.
Common pitfalls
Don't skip the audit. If you jump straight to writing tests without understanding the code, you'll write tests that miss the real behavior.
Don't refactor before characterization tests pass. This is the most common mistake. You think you understand the code, you refactor, you break something subtle.
Don't delete characterization tests after refactoring. Keep them. Add a comment: # characterization test — also asserts post-refactor behavior. They become your regression suite.
Do mark suspected bugs clearly. If a characterization test captures obviously wrong behavior, add a # BUG? comment and create a ticket. Don't silently perpetuate bugs in your test suite.
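One lightweight way to keep those suspected bugs visible is a registry decorator, so the suite can report every known-bad behavior it protects. A sketch (the registry and the ticket id are hypothetical; pytest users might prefer a custom marker instead):

```python
SUSPECTED_BUGS = []

def suspected_bug(ticket):
    """Decorator: tag a characterization test that pins probably-wrong behavior."""
    def decorate(test_fn):
        SUSPECTED_BUGS.append((ticket, test_fn.__name__))
        return test_fn
    return decorate

@suspected_bug("BILL-123")  # hypothetical ticket
def test_negative_quantity_characterization():
    # BUG? Negative qty currently yields a negative total; see ticket.
    items = [{"price": 10, "qty": -1}]
    total = sum(i["price"] * i["qty"] for i in items)
    assert total == -10
```

Printing `SUSPECTED_BUGS` in CI gives you a standing list of behaviors that are pinned but not endorsed.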
The result
After this workflow:
- You have characterization tests locking in the module's current behavior
- You have behavior tests defining correct behavior for risky gaps
- You can refactor confidently — the characterization tests will catch regressions
- You've documented the business rules in test code, not just memory
The module that nobody touched becomes the module that's finally safe to ship.
I run long Claude Code sessions for legacy codebase work. When rate limits interrupt, I use SimplyLouie — $2/month for unlimited access.