Why Regex Sucks in a Hot Loop

A while back I ripped a regex out of my SQL parser and replaced it with twenty lines of hand-written string scanning. Then I got nervous. Hand-rolling a scanner because you assume regex is slow is exactly the kind of premature optimization I make fun of other people for. So I filed an issue against myself: prove the hand-rolled version is actually faster, or delete it and go back to the regex.

Months later I sat down to settle it. The honest regex came back about ninety times slower. And the reason had nothing to do with raw matching speed.

What the code does

The function answers one small question: does a specific table alias appear in this piece of SQL, followed by a dot? That's how the parser notices a subquery reaching out to an outer table, like o.id pointing at an orders o defined outside it.

func containsWordDot(text, word string) bool {
    if word == "" {
        return false
    }
    needle := word + "."
    idx := 0
    for {
        pos := strings.Index(text[idx:], needle)
        if pos < 0 {
            return false
        }
        absPos := idx + pos
        // the char before the alias must not be part of a longer identifier,
        // so alias "a" doesn't match "data."
        if absPos > 0 {
            prev := rune(text[absPos-1])
            if unicode.IsLetter(prev) || unicode.IsDigit(prev) || prev == '_' {
                idx = absPos + 1
                continue
            }
        }
        return true
    }
}

It's not pretty. The boundary check is the only reason it exists, because a plain strings.Contains would happily match a. inside data. and invent a correlation that was never there.

The case for going back to regex

My own argument against the code was simple. Go's regex engine is RE2: deterministic, no catastrophic backtracking, genuinely fast. Compile one pattern once at startup and reuse it everywhere:

var wordDotPattern = regexp.MustCompile(`\b\w+\.\w+`)

Less code. No manual boundary handling to get wrong. If it benchmarked even close to the hand-rolled version, deleting twenty fiddly lines was the right move. I expected to delete them.

Where the plan fell apart

That regex matches any word-dot-word. My function answers a narrower question: does this specific alias appear? And the alias is different on every call, because the parser walks every table in the query and asks about each one in turn.

So the precompiled pattern can't replace the function. It's answering a question nobody asked. The only regex that does the same job has to bake the specific alias into the pattern, which means compiling a fresh regex every single call:

regexp.MustCompile(`\b` + regexp.QuoteMeta(word) + `\.`).MatchString(text)

That line is the whole story. You can't compile a pattern once when the pattern changes every time you run it. "Compile once, reuse forever" quietly dies the moment you notice the thing it depends on isn't a constant. And compiling a regex is not cheap. Doing it in a loop, once per table per query, is a bill you pay on every parse for the rest of the program's life.

The numbers

I ran the matrix I'd written into the issue: short input (~20 bytes), medium (~200), long (~2000), each as a match, a miss, and a near-miss. Ten runs each, medians:

input        scanner      regex (compiled per call)
short/miss     21 ns          1,972 ns
short/hit      24 ns          1,990 ns
medium/hit    195 ns          6,252 ns
long/hit    1,566 ns         46,600 ns

Six to ninety times slower depending on size, and the scanner did it with zero allocations while the regex allocated on every call.

For a sanity check I also benchmarked the precompiled regex, the one that can't actually do the job. The scanner still won at every size, because strings.Index is SIMD-accelerated and allocates nothing, while even a warm MatchString carries per-call overhead. There was no input length where regex caught up. No crossover at all.

So does regex suck?

Not really.
The title is a little unfair, and I'll own that, I needed to catch your eye a bit :)

Regex is the right tool sometimes, and a state machine I wrote by hand is often the thing that deserves to be deleted

What actually sucks is reaching for regex without noticing that your pattern isn't constant.
The cost of a regex is split in two: compiling it, and matching with it. Everyone remembers matching is fast.

People forget compiling is the expensive half, and that you only get to skip it when the pattern is fixed. Put a per-call variable in the pattern and you've moved the expensive half into your hot path without realizing it.

The benchmark earned its place, but not the way I thought it would. It didn't just tell me which option was faster. Forcing myself to write the genuinely equivalent regex is what revealed there was no equivalent precompiled regex in the first place. Making the comparison fair was where the real answer was hiding.

I kept the twenty lines and left the benchmark numbers in a comment on top, so the next person who thinks "this should just be a regex" can read the receipts instead of arguing with me about it. Which is exactly what past-me wanted when I filed the issue.

Measure the thing you'd actually ship. Not the thing that's easy to type into a benchmark.

This came out of issue #59 on postgresparser, a pure-Go PostgreSQL parser I work on. The full benchmark and verdict live in the issue.