I ran my own dead-code tool on itself and found 8 duplicate code blocks


A few weeks ago I shipped Archaeologist, a codebase intelligence toolkit.
This week I finally pointed it at its own source code.

It found 8 duplicate groups in scanner.py — the file responsible for
parsing 9 different languages. Three of them were 84-89% identical:

walk() in the JS scanner — 89% similar to walk() in the Go scanner
walk() in the Rust scanner — 87% similar
walk() in the Java scanner — 86% similar
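
How those percentages are computed isn't covered here. As a rough stand-in (an assumption, not Archaeologist's actual metric), a line-based similarity score between two function bodies can be sketched with the standard library's difflib. The similarity() helper below is hypothetical:

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two source snippets.

    Hypothetical helper: compares non-blank lines with surrounding
    whitespace stripped, so indentation changes do not count.
    """
    a_lines = [line.strip() for line in a.splitlines() if line.strip()]
    b_lines = [line.strip() for line in b.splitlines() if line.strip()]
    # ratio() is 2 * matching_lines / total_lines_in_both, in [0, 1].
    return 100 * difflib.SequenceMatcher(None, a_lines, b_lines).ratio()
```

Two four-line bodies that differ in a single line would score 75 under this measure.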

Turns out every one of my 9 language scanners (including Python, JS,
Ruby, Go, Java, Rust, Kotlin, and Swift) had copy-pasted the same ~40
line walk/recurse/build-FunctionDef boilerplate. Only a few lines
actually differed per language: which node types matter, how to pull a
name out of one, and the exact is_test/is_entry_point rules.

The refactor

I pulled the shared walking logic into one generic engine and made each
language define only what's genuinely different via small config objects:

def scan_python_file(filepath):
    return _scan_with_spec(filepath, PYTHON_SPEC)

def scan_go_file(filepath):
    return _scan_with_spec(filepath, GO_SPEC)

Each LangSpec just says: here's how to extract a name from a definition
node, here's how to extract a call target, here's the is_test rule.
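
To make the shape of that idea concrete, here is a minimal sketch of a spec-driven walker. It is not Archaeologist's actual code: the Node class is a toy stand-in for a real parser's AST node, and the LangSpec fields, walk_with_spec(), first_identifier(), and TOY_PYTHON_SPEC are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional


@dataclass
class Node:
    """Toy stand-in for a parser's AST node (hypothetical)."""
    type: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class FunctionDef:
    name: str
    is_test: bool


@dataclass
class LangSpec:
    """Only the parts that differ per language (assumed field names)."""
    def_node_types: FrozenSet[str]             # which node types matter
    get_name: Callable[[Node], Optional[str]]  # how to pull a name out
    is_test: Callable[[str, str], bool]        # (name, node_type) -> bool


def walk_with_spec(root: Node, spec: LangSpec) -> List[FunctionDef]:
    """Shared traversal; every language-specific decision comes from spec."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type in spec.def_node_types:
            name = spec.get_name(node)
            if name:
                found.append(FunctionDef(name, spec.is_test(name, node.type)))
        # Push children reversed so they are visited in source order.
        stack.extend(reversed(node.children))
    return found


def first_identifier(node: Node) -> Optional[str]:
    """Return the text of the first direct 'identifier' child, if any."""
    for child in node.children:
        if child.type == "identifier":
            return child.text
    return None


TOY_PYTHON_SPEC = LangSpec(
    def_node_types=frozenset({"function_definition"}),
    get_name=first_identifier,
    is_test=lambda name, node_type: name.startswith("test_"),
)
```

Passing the node type into is_test is a deliberate choice in this sketch: it leaves room for rules that depend on which kind of definition matched.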

The two bugs I found doing this

The annoying part of refactoring "obviously duplicated" code is discovering
it wasn't quite as duplicated as it looked. Two real differences I almost
flattened by accident:

JS and Go both have asymmetric is_test rules that depend on which AST
node type matched. In JS, for example, a top-level function named
testFoo() is flagged as a test, but the same name on a variable-assigned
arrow function isn't. Easy to miss if you're refactoring by skimming
rather than diffing.
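
A sketch of what such an asymmetric rule might look like, next to the "flattened" version that would silently change behavior. The node type strings and the startswith("test") check are assumptions for illustration, not the tool's real rules:

```python
def js_is_test(name: str, node_type: str) -> bool:
    """Asymmetric rule: only function declarations are matched by name."""
    # Assumed node type name for a top-level `function testFoo() {}`.
    if node_type == "function_declaration":
        return name.startswith("test")
    # e.g. `const testFoo = () => {}` is never flagged in this sketch.
    return False


def js_is_test_flattened(name: str, node_type: str) -> bool:
    """Tempting simplification that ignores the node type (the bug)."""
    return name.startswith("test")
```

The two functions agree on function declarations and disagree only on the arrow-function case, which is exactly the kind of difference a skim would not surface.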

I verified the refactor was behavior-identical by running both versions
side by side against 9 real open source projects across all 9 languages
and diffing every single function definition and call extracted — about
21,000 functions total. Caught both bugs that way before shipping.
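
The side-by-side check can be sketched roughly as follows. This is an illustrative harness under stated assumptions, not the script actually used: diff_scanners() and the (name, line, is_test) record shape are hypothetical, and each scanner is assumed to be a callable that takes a file path and returns hashable records.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical record shape: (function name, line number, is_test flag).
Record = Tuple[str, int, bool]
Scanner = Callable[[str], Iterable[Record]]


def diff_scanners(
    old_scan: Scanner, new_scan: Scanner, paths: Iterable[str]
) -> List[Tuple[str, Set[Record], Set[Record]]]:
    """Run both scanners on every path and report where the output differs."""
    mismatches = []
    for path in paths:
        old, new = set(old_scan(path)), set(new_scan(path))
        if old != new:
            # Keep what only the old version produced and what only the
            # new version produced, so the regression is easy to read.
            mismatches.append((path, old - new, new - old))
    return mismatches
```

An empty result means the two versions extracted identical records for every file. A changed is_test flag shows up as one record on each side of the mismatch.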

Result

8 duplicate groups → 3, and the 3 remaining are tiny call-extraction
helpers under 10 lines each, not 40-80 line near-duplicate blocks.

pip3 install archaeologist
github.com/prathik-arun/archaeologist

If you maintain a multi-language static analysis tool, I'd genuinely
recommend running it on itself occasionally. Mine had been quietly
duplicating logic for months before I looked.

Source: dev.to
