Why I Let a Machine Judge My Code

dev.to

There's a moment in every growing codebase where you realize you can no longer hold all of it in your head. For me, it was when I opened a file I'd written three weeks earlier and couldn't immediately follow the control flow. Not because the code was wrong. Because it had grown past the point where any single person could casually review it all with the same attention.

I run 5 Python services, a React frontend, and a growing collection of utility scripts. No team. No pull request reviewers besides myself. The codebase grows weekly. And I decided that trying to maintain consistency through discipline alone is a losing strategy at this scale. The machine needs to enforce the standards I care about, so my attention goes to the decisions that actually need a human brain.

The Pipeline

Three layers:

```
Guidelines file    →    Linter config    →    Pre-commit hook
(sets expectations)     (checks the code)     (blocks the commit)
```

The guidelines file defines the standards: naming conventions, SQL patterns, pitfalls specific to this codebase. The linter config translates those standards into automated checks. The pre-commit hook makes them non-negotiable. Code that doesn't pass doesn't enter the repository. Not warned. Blocked.

This isn't about catching catastrophic bugs. It's about preventing the slow accumulation of inconsistency that turns a clean codebase into one you dread opening. Individually, an unused import is harmless. A hundred of them spread across fifty files is how a project starts to feel unmaintained.

Documentation vs Enforcement

There's an interesting insight from the recent Claude Code source leak. Anthropic's internal tooling moved rules out of documentation files and into enforced hooks. Their reasoning: documentation competes for attention. The more guidelines you write, the more they blend into the background. Hooks are executed by the system. They don't care whether anyone read the documentation.

My experience matches. The guidelines file says "use $1, $2, $3 for SQL placeholders." That convention gets followed most of the time. The linter catches it every time. "Most" vs "every" is the difference between a suggestion and an enforcement. Both have a place, but I know which one I trust when I'm not actively watching.

The Interesting Rules

I run 23 Ruff rule prefixes. The basics (formatting, import ordering, dead imports, stray print() calls) are table stakes. They keep the codebase clean but they're not worth writing about individually. Here's what actually gets interesting.
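For orientation, a hedged sketch of what that selection could look like in pyproject.toml. The prefixes named here are real Ruff rule families matching the checks discussed in this post, but the full 23-prefix list isn't reproduced, so treat this as illustrative:

```toml
# pyproject.toml -- illustrative subset, not the full 23-prefix selection
[tool.ruff.lint]
select = [
    "E",      # pycodestyle: formatting errors
    "F",      # pyflakes: dead imports, undefined names
    "I",      # isort: import ordering
    "T20",    # flake8-print: stray print() calls
    "S",      # flake8-bandit: security patterns (includes S608)
    "C90",    # mccabe: complexity scoring
    "ASYNC",  # flake8-async: sequential-await and blocking-call patterns
    "DTZ",    # flake8-datetimez: naive datetime usage
]
```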

SQL Injection Detection

My API layer uses raw asyncpg with parameterized queries. No ORM. That means every database query is a hand-written SQL string. The safe way looks like this:

```python
query = "SELECT * FROM transactions WHERE category = $1"
rows = await fetch_all(query, category)
```

The $1 is a placeholder. The database receives the query and the variable separately, so it always treats category as data, never as executable SQL. Even if someone passes '; DROP TABLE transactions; -- as the value, nothing bad happens. It's just a weird category name that matches zero rows.
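The same mechanics can be demonstrated without a Postgres server. This sketch uses sqlite3 from the standard library as a stand-in; it uses `?` where asyncpg uses `$1`, but the driver-side separation of query and data is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, category TEXT)")
conn.execute("INSERT INTO transactions VALUES (1, 'groceries')")

payload = "'; DROP TABLE transactions; --"

# Parameterized: the value travels separately from the SQL text,
# so the driver always treats it as data, never as a statement.
rows = conn.execute(
    "SELECT * FROM transactions WHERE category = ?", (payload,)
).fetchall()

print(rows)  # [] -- the payload is just a category name that matches nothing
# The table survives: the payload was never executed as SQL.
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone())  # (1,)
```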

The dangerous version:

```python
query = f"SELECT * FROM transactions WHERE category = '{category}'"
```

This pastes the variable directly into the SQL string. If the input is malicious, the database executes it as code. Classic SQL injection. The linter flags this pattern and blocks the commit.

The complication: because I write raw SQL strings (not an ORM building them), the linter also flags my safe queries as "hardcoded SQL." I have to explicitly tell it that SQL strings are intentional in this codebase:

```toml
ignore = [
    "S608",   # asyncpg uses $1 params, raw SQL strings are intentional
]
```

Every ignore has a documented reason. When I'm reviewing this config months from now, I need to know whether each exception was a deliberate decision or a shortcut.

Complexity Scoring

McCabe complexity, max of 10. Roughly 10 independent code paths through a function. When I introduced the linter to the existing codebase, some older functions exceeded this threshold. That's normal when you retrofit standards onto code that was already running. The policy: existing code that exceeds the limit gets an exemption. New code doesn't. The exemption list is meant to shrink over time as I refactor, not grow.

```toml
[tool.ruff.lint.mccabe]
max-complexity = 10
```

The complexity check is the one that's saved me the most time long-term. A function with a score of 12+ is a function where fixing a bug in one branch risks breaking three others. Catching that before it lands means refactoring when the context is fresh, not six months later when I've forgotten why the branches exist.
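One way to encode that exemption policy is Ruff's `per-file-ignores` table (a real Ruff feature; the file names here are hypothetical):

```toml
[tool.ruff.lint.per-file-ignores]
# Legacy code over the limit. Entries should only ever be deleted, never added.
"services/reporting.py" = ["C901"]
"scripts/migrate_v1.py" = ["C901"]
```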

The Async Concurrency Trap

This pattern looks concurrent:

```python
results = [await process(item) for item in items]
```

It's not. It awaits each call one after another. If process() takes 100ms and you have 20 items, that's 2 seconds. The concurrent version:

```python
results = await asyncio.gather(*[process(item) for item in items])
```

Same 20 items, but they all run at the same time. Total time: ~100ms instead of 2 seconds.

The ASYNC rule set flags the sequential pattern. The linter doesn't auto-fix this one since the fix changes execution behavior, not just formatting. It blocks the commit with an explanation, and you decide how to restructure.

In a codebase with 70+ async API routes, this pattern showing up undetected in a few hot paths would quietly degrade response times without any obvious cause.
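The difference is easy to see in a self-contained sketch, with `asyncio.sleep` standing in for real I/O:

```python
import asyncio
import time

async def process(item: int) -> int:
    await asyncio.sleep(0.05)  # stand-in for ~50 ms of I/O
    return item * 2

async def sequential(items):
    # Awaits inside a comprehension run one after another.
    return [await process(i) for i in items]

async def concurrent(items):
    # gather() schedules every coroutine before awaiting any of them.
    return await asyncio.gather(*[process(i) for i in items])

items = list(range(10))

start = time.perf_counter()
seq_results = asyncio.run(sequential(items))
seq_time = time.perf_counter() - start   # ~0.5 s: 10 x 50 ms, back to back

start = time.perf_counter()
con_results = asyncio.run(concurrent(items))
con_time = time.perf_counter() - start   # ~0.05 s: all sleeps overlap

print(seq_results == con_results, round(seq_time, 2), round(con_time, 2))
```

Same results either way; only the wall-clock time changes.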

Timezone Awareness

Every datetime.now() call without timezone info is a future debugging session. In a system that handles Dutch business hours, UTC timestamps from external APIs, and scheduled tasks that need to run at specific local times, a naive datetime silently produces the wrong result in any calculation that crosses a timezone boundary.

The DTZ rules force datetime.now(UTC) everywhere. It's the kind of rule that feels pedantic until you spend an afternoon figuring out why a scheduled task ran an hour early because of a daylight saving time transition.
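Sketched with the standard library (zoneinfo is Python 3.9+ and assumes system tzdata is available), this is the shape the DTZ rules push you toward:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

naive = datetime.now()              # DTZ005 flags this: no tzinfo attached
aware = datetime.now(timezone.utc)  # passes: explicitly UTC

# An aware timestamp converts unambiguously, DST transitions included.
local = aware.astimezone(ZoneInfo("Europe/Amsterdam"))

print(naive.tzinfo)        # None -- any cross-timezone math on this is a guess
print(local.utcoffset())   # +1:00 (CET) or +2:00 (CEST), depending on the date
```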

The Pre-Commit Hook

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.10
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
```

Every commit runs Ruff. The --fix flag auto-resolves what it can: import sorting, formatting, simple simplifications. Anything it can't auto-fix blocks the commit until you handle it manually. The linter tells you exactly what's wrong and where.

What This Doesn't Catch

Linters understand structure, not intent. They won't tell you a function returns the wrong result. They won't tell you a query is technically valid SQL but returns stale data. They won't tell you an API endpoint works but exposes data it shouldn't.

For that you need tests, thoughtful design, and the kind of review that requires understanding what the code is supposed to do. The linter handles the mechanical layer: formatting, dead code, complexity, security anti-patterns, timezone correctness. That's maybe 80% of the issues that make a codebase worse over time. Automating that 80% means your limited review time goes to the 20% that actually needs a human.

Where This Is Going

Right now, the gap in the industry isn't AI code generation. That works well enough. The gap is the control layer around it. Experienced developers set up linters, pre-commit hooks, and CI gates because they've learned from years of debugging what happens when standards aren't enforced. They know why parameterized queries matter because they've seen an injection. They know why complexity limits matter because they've maintained a 300-line function.

A newer generation of developers is building with powerful tools from day one, but without the scar tissue that makes you instinctively set up guardrails. The industry is going to need better defaults, better tooling, and better education around code quality enforcement. Not because new developers are worse. Because the tools are fast enough that the consequences of shipping without guardrails arrive faster too.

Mechanical enforcement isn't a substitute for skill. It's what lets skill scale.


This is part of "1 Dev, 22 Containers," a series about building an AI office management system on consumer hardware.

Find me on GitHub.
