Your AI Writes Tests That Can Never Fail

go dev.to

You ask the AI for tests. It hands you twelve, all green. CI passes. You merge. Three days later a bug ships, on a function those tests were supposed to cover. You reopen the test file and it clicks: it ran, it passed, and it tested nothing.

A green test isn't a proof. It's a hypothesis. And an AI, left to its own devices, is very good at writing hypotheses that can never be disproved.

The phantom test

Take a dead-simple function, a discount above 100 euros:

func Discount(total int) int {
    if total > 100 {
        return total - 10
    }
    return total
}
Enter fullscreen mode Exit fullscreen mode

Here's the kind of test an AI produces when you ask "write me a test for this" with no further framing:

func TestDiscount(t *testing.T) {
    got := Discount(150)
    if got < 0 {
        t.Errorf("result should not be negative")
    }
}
Enter fullscreen mode Exit fullscreen mode

This test is green. It does run the discount branch (so your coverage climbs). But look at the assertion: got < 0 is never true, whatever Discount does. Replace total - 10 with total + 10, with total * 2, with 42: the test stays green. It doesn't check behavior, it checks that the lights are on.

Coverage doesn't measure what you think

The trap is that this phantom test inflates your coverage. Coverage counts lines executed, not assertions that bite. A line crossed by a test that asserts nothing useful counts as much as a line genuinely verified. So a 90% coverage report can hide half a suite of tests that will never fall, even if you break the code on purpose.

That's exactly an LLM's playground. Its reward signal is "the tests pass". Not "the tests catch a bug". With no external oracle to stop it, it drifts toward the shortest path to green: soft assertions, mocks that test themselves, cases that never exercise the risky branch.

The red-check: break the code, demand the red

The counter is one move, and it's as old as TDD: before trusting a test, check that it knows how to fail. Mutate the line it's meant to protect, rerun, and expect to see it go red. If it stays green, it's vacant.

On our function, I change the discount for one second:

// temporary mutation: - becomes +
return total + 10
Enter fullscreen mode Exit fullscreen mode

The phantom test stays green. Verdict: bin it. Here's the one that earns your trust:

func TestDiscount(t *testing.T) {
    if got := Discount(150); got != 140 {
        t.Errorf("Discount(150) = %d, want 140", got)
    }
}
Enter fullscreen mode Exit fullscreen mode

With the same mutation, Discount(150) returns 160, the test goes red instantly. It bites. That's a test: not one that passes, one that knows why it might not.

Automating the red-check: mutation testing

Doing this by hand on every test doesn't scale. That's precisely what mutation testing automates: the tool applies hundreds of small mutations to your code (a > that becomes >=, a + that becomes -, a gutted return) and reruns your suite after each one. Every mutation that makes no test go red is a surviving mutant: a hole your tests can't see.

In Go, gremlins does the job:

go install github.com/go-gremlins/gremlins/cmd/gremlins@latest
gremlins unleash ./...
Enter fullscreen mode Exit fullscreen mode

It gives you a mutation score: the percentage of mutants killed. Where coverage tells you "this line is crossed", the mutation score tells you "this line is actually tested". The two numbers have nothing to do with each other, and it's the second that counts.

How I wire it into an AI loop

When I let an agent write code and its tests, I don't let it declare itself done. Before any review, an objective gate runs: build, lint, test suite, then a red-check on the critical tests. The agent mutates the target line itself, checks the test goes red, restores it. A test still green after mutation gets rewritten, not negotiated. The LLM doesn't get a vote on "does this actually test something": the mutation decides, it only observes.

The rule that falls out is simple: no generated test enters the suite without proving it can fail. The cost is tiny, the payoff huge, because a vacant test is worse than no test. The absence, you see it. The vacant one lulls you.

Conclusion

We've learned to distrust AI-written code, so we review it. We still extend blind trust to the tests it writes, because they're green. But green doesn't prove itself: a test is only worth the red it's able to produce. Until you've watched a test fail at least once, you don't have a test, you have a decoration.

Source: dev.to

arrow_back Back to Tutorials