The ESCAPE Byte Problem: How We Beat Brotli by Separating Token Streams


Part 4 of the Glasik Notation series.

Previous articles covered the sliding window tokenizer, Aho-Corasick O(n) matching, and GN's first verified benchmarks against gzip.

The Waste Was Hidden in Plain Sight
After implementing Aho-Corasick O(n) matching, GN was fast. Sub-millisecond per chunk, competitive with brotli on latency. But the ratio numbers kept coming back flat:

gzip-6: 2.18x
GN AC: 2.20x (+0.9% vs gzip)
brotli-6: 2.47x

We were barely beating gzip. Brotli was 12% ahead. The vocabulary was real — 31,248 tokens matched across 200 chunks, 190 tokens per chunk on average. The matches were happening. So where were the bits going?
We ran a token stream entropy analysis:
from collections import Counter
import math

token_ids = []
for c in sample:
    raw = slider.encode_ac_raw(c)
    i = 0
    while i < len(raw):
        # ESCAPE introduces a 2-byte token: [ESCAPE][id]
        if raw[i] == ESCAPE and i + 1 < len(raw):
            token_ids.append(raw[i + 1])
            i += 2
        else:
            i += 1

counter = Counter(token_ids)
total = sum(counter.values())
entropy = -sum(c / total * math.log2(c / total) for c in counter.values())
print(f"Token entropy: {entropy:.3f} bits/token")

Result: 7.758 bits/token.

We were encoding each token as 2 bytes: ESCAPE + id. That's 16 bits per token. The theoretical minimum was 7.758 bits. We were wasting 51.5% of every token encoding. That's where the bits were going.
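Spelled out, the waste figure is just one minus the measured entropy over the fixed encoding cost:

```python
entropy = 7.758   # measured bits/token (from the analysis above)
cost = 16.0       # ESCAPE byte + id byte = 2 bytes/token
waste = 1 - entropy / cost
print(f"waste: {waste:.1%}")   # → waste: 51.5%
```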

Why the Mixed Stream Was Hurting Us

Our tokenized output looked like this:

[ESCAPE][id][ESCAPE][id][lit][lit][lit][ESCAPE][id][lit][ESCAPE][id]...

Every token costs 2 bytes: an ESCAPE byte (0x01) followed by the ID. We fed this into deflate expecting it to compress well. But deflate uses LZ77 — it looks for repeated byte sequences in a sliding window. The ESCAPE bytes were fragmenting every pattern.

Where deflate might have seen:

" the " " the " " the " ← repeating 5-byte sequence, compresses well

it was instead seeing:

[01][04] " t" "he" [01][04] " t" ... ← ESCAPE bytes breaking the pattern
The ESCAPE byte was acting like static on a radio signal. Present in every token, making the mixed stream look noisier than it actually was.
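The effect is easy to reproduce synthetically. The sketch below (escape value and IDs are made up, not GN's actual ones) compresses the same repeated text with and without interleaved escape pairs:

```python
import zlib

clean = b" the " * 400
# Same text, but each repeat is preceded by an escape byte and a varying id,
# so deflate can no longer form long matches over the text.
mixed = b"".join(b"\x01" + bytes([i % 251]) + b" the " for i in range(400))

clean_c = zlib.compress(clean, 6)
mixed_c = zlib.compress(mixed, 6)
print(len(clean_c), len(mixed_c))
```

The escape-polluted stream compresses to several times the size of the clean one, even though the underlying text is identical.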

The Insight: Separate the Streams
What if we just... didn't mix them?
Instead of one interleaved stream, emit two:

Token stream: just the IDs — [04][04][38][20][04][07]...
Literal stream: just the literal bytes — "t" "h" "e" " " "a" ...

Then compress each independently with raw deflate.
The token stream is pure symbols. Token ID 4 (" the") fires 483 times in 200 chunks. That's a highly skewed distribution — deflate loves it. The literal stream is clean text with no ESCAPE pollution. It compresses the way text is supposed to compress.

toks, lits = slider.encode_ac_split(chunk)
dt = zlib.compress(toks, 6)[2:-4]  # strip 2-byte zlib header + 4-byte adler32: raw deflate
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('>H', len(dt)) + dt + dl

This is the same insight behind why PNG separates prediction from entropy coding, why video codecs separate motion vectors from residual — when you have structurally different data, compress the structures separately.

The Numbers

We ran this across 4 corpora, 3 seeds each — 12 independent measurements. Standard protocol: warm 500 chunks, test next 300.
Batch size matters. Each chunk has ~37 token IDs. Deflate header overhead (~18 bytes) dominates a tiny stream. Batching solves this — concatenate N chunks before compressing the token stream:

GN split b=1: 2.226x 0.043ms -6.6% vs brotli ← header overhead dominates
GN split b=4: 2.385x 0.036ms +0.1% vs brotli ← already matching brotli
GN split b=8: 2.456x 0.036ms +3.1% vs brotli ← production sweet spot
GN split b=16: 2.542x 0.037ms +6.7% vs brotli ← diminishing returns

b=8 is the production choice. Beyond b=16 the marginal gain flattens and you're accumulating more latency budget than the ratio improvement justifies.
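The overhead amortization can be seen directly with zlib (the stream contents below are invented to mimic ~37 skewed token IDs per chunk):

```python
import zlib

# Eight small token-ID streams, skewed toward one hot token (id 4),
# roughly the shape of per-chunk token streams described in the article.
chunks = [bytes([4] * 30 + [i, 7, 9, 4, 4, 4, 4]) for i in range(8)]

# Compressing each tiny stream separately pays deflate's fixed overhead
# eight times; compressing the concatenation pays it once.
separate = sum(len(zlib.compress(c, 6)[2:-4]) for c in chunks)
batched = len(zlib.compress(b"".join(chunks), 6)[2:-4])
print(f"separate: {separate}B, batched: {batched}B")
```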

Full 12-measurement verification at b=8:
Corpus     | GN split b=8 | vs gzip | vs brotli | p50     | p99
ShareGPT   | 2.49–2.52x   | +15%    | +2%       | 0.043ms | 0.061ms
WildChat   | 2.48–2.51x   | +15%    | +2%       | 0.042ms | 0.073ms
LMSYS      | 2.50–2.56x   | +14%    | +2%       | 0.044ms | 0.079ms
Ubuntu-IRC | 2.06–2.09x   | +49%    | +28%      | 0.008ms | 0.013ms

Every single measurement beats both gzip and brotli.
And on tail latency: GN split b=8 p99 never exceeds 0.123ms. Brotli-6 p99 reaches 0.226ms. GN has 2–4x better tail latency than brotli while achieving better compression ratio.

Why This Works (The Information Theory)
The mixed tokenized stream had:

Token entropy: 7.758 bits/token
Encoding cost: 16 bits/token
Waste: 51.5%

The split stream:

Token stream: pure symbols, deflate compresses ~2–3x on its own
Literal stream: clean text, no structural noise, deflate compresses ~1.9x

Combined result: 2.49–2.56x on the original input

The separation lets each compressor do what it was designed to do. This isn't a trick — it's giving deflate the data structure it can actually exploit.

The Frame Format

Simple and self-contained:

[2B tok_deflated_len][tok_deflated][lit_deflated]
Two bytes of length prefix for the token stream, then the two compressed streams concatenated. Given the vocabulary, you can decode it without any other external state.
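A minimal decode sketch for this layout, assuming the same raw-deflate trick from the encode side (`wbits=-15` makes zlib accept deflate data without header or checksum):

```python
import struct
import zlib

def decode_frame(frame: bytes) -> tuple[bytes, bytes]:
    # First two bytes: big-endian length of the deflated token stream.
    (tok_len,) = struct.unpack('>H', frame[:2])
    toks = zlib.decompress(frame[2:2 + tok_len], -15)
    lits = zlib.decompress(frame[2 + tok_len:], -15)
    return toks, lits

# Round-trip against the encode-side framing:
toks, lits = bytes([4, 4, 38, 20, 4, 7]), b"the literals"
dt = zlib.compress(toks, 6)[2:-4]
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('>H', len(dt)) + dt + dl
assert decode_frame(frame) == (toks, lits)
```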
The Rust implementation in codon.rs:
pub fn encode_ac_split(buf: &[u8], ac: &AhoCorasick) -> (Vec<u8>, Vec<u8>) {
    let mut tok_ids: Vec<u8> = Vec::new();
    let mut literals: Vec<u8> = Vec::new();
    let mut pos = 0usize;

    for m in ac.find_iter(buf) {
        // Bytes between matches go to the literal stream.
        for &b in &buf[pos..m.start()] {
            literals.push(b);
        }
        let pat_idx = m.pattern().as_usize();
        if pat_idx < 254 {
            tok_ids.push((pat_idx + 1) as u8);
        } else {
            // Pattern index doesn't fit in one byte: fall back to literals.
            for &b in &buf[m.start()..m.end()] {
                literals.push(b);
            }
        }
        pos = m.end();
    }
    for &b in &buf[pos..] { literals.push(b); }
    (tok_ids, literals)
}
O(n) scan, single pass, clean split.

Lossless Round-Trip

The split-stream is lossless when encoder and decoder share the same vocabulary. Token IDs are indices — to decode, you need to know what pattern each ID maps to.
GN uses a stateful model in production. Encoder and decoder share a synchronized sliding window; each frame carries a 2-byte dict_version. If they diverge, the decoder requests a resync. This keeps frames small while guaranteeing correctness.
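A hypothetical sketch of the version guard (the header layout here is an assumption — the article only specifies that each frame carries a 2-byte dict_version):

```python
import struct

class ResyncNeeded(Exception):
    pass

def strip_version(frame: bytes, local_version: int) -> bytes:
    # Frame assumed to start with a 2-byte big-endian dict_version.
    (remote,) = struct.unpack('>H', frame[:2])
    if remote != local_version:
        # Decoding token IDs against a diverged vocabulary would silently
        # corrupt output, so fail loudly and let the caller request a resync.
        raise ResyncNeeded(f"frame v{remote}, local v{local_version}")
    return frame[2:]
```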
Round-trip verified: 5/5 test cases pass including empty buffers, raw ESCAPE bytes in input, and 10,000-byte repetitive inputs.

What's Next: Fractal Dictionary Sharding

The split-stream insight revealed something deeper: token and literal streams have fundamentally different statistical structure. Taking that further — different types of content have different vocabulary entirely.
Code blocks repeat function, return, const. System messages repeat role definitions. User messages repeat question structures. Compressing them with a single shared vocabulary leaves ratio on the table.
We're implementing fractal dictionary sharding: four vocabulary tiers (L0 universal, L1 domain, L2 session, L3 chunk) with per-shard-type routing and deterministic crystal identity per shard — same content always produces the same compressed shape. The FractalCompressor is implemented, wired into the napi production path, and passing roundtrip verification across all shard types.
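As a purely illustrative sketch of deterministic shard identity (the tier names come from the article; the hashing scheme is an assumption, not GN's actual one):

```python
import hashlib

TIERS = ("L0", "L1", "L2", "L3")  # universal, domain, session, chunk

def crystal_id(tier: str, content: bytes) -> str:
    # Same (tier, content) always yields the same identity, which is what
    # "same content always produces the same compressed shape" requires.
    assert tier in TIERS
    return f"{tier}-{hashlib.sha256(content).hexdigest()[:16]}"
```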
More on that in Article 5.

Code and Paper

GitHub: github.com/atomsrkuul/glasik-core (MIT)
npm: gni-compression@1.0.0
arXiv: pending cs.IR endorsement — if you're a qualified author (3+ cs papers): code 7HWUBA

Robert Rider is an independent researcher building Glasik, an open-source compression and context management system for LLM deployments.

Source: dev.to
