tar+zstd is very hard to beat.
So I stopped trying to beat it as a byte compressor.
Instead, I tried something else:
compress the structure first, then compress the bytes.
That is the idea behind Metarc, a small experimental archiver written in Go.
It explores what I call metacompression: reducing structural and semantic redundancy across a source tree before applying a standard compressor such as zstd.
And on my current source-code benchmark corpus, Metarc is now smaller than tar+zstd on every tested repository.
The problem: tar sees a stream, not a project
Traditional archive pipelines are usually built around a simple idea:
```text
directory tree → tar stream → compressor
```
This is robust, portable, simple, and battle-tested.
But it also means that by the time compression starts, a rich source tree has already been flattened into a byte stream.
A source-code repository is not just bytes.
It contains structure:
- repeated files
- duplicated licenses
- common boilerplate
- generated content
- repeated JSON structures
- logs with predictable patterns
- similar files across directories
- semantic redundancy that is easier to see before everything becomes a stream
A byte-level compressor can still find many patterns, of course.
zstd is excellent.
But some redundancy is easier to detect when the input is still a file tree.
That is the core idea of Metarc.
What is metacompression?
Metacompression means compressing information above the byte-stream level.
Instead of asking only:
how can we compress this sequence of bytes?
Metarc also asks:
what does this directory tree contain, and what redundancy exists before we turn it into a stream?
The current approach is roughly:
```text
source tree
  → scan files
  → analyze content
  → detect redundancy
  → apply structural / semantic transforms
  → store catalog + blobs
  → apply zstd
  → .marc archive
```
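To make that concrete, here is a minimal sketch of the first stages in Go. This is illustrative code, not Metarc's internals, and it uses stdlib SHA-256 where Metarc uses BLAKE3: walk the tree, hash every regular file, and group paths by content before anything becomes a stream.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// entry is one scanned file; later stages reason about the hash.
type entry struct {
	path string
	hash [sha256.Size]byte
}

// scan walks the tree and hashes every regular file.
func scan(root string) ([]entry, error) {
	var out []entry
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil || !d.Type().IsRegular() {
			return err
		}
		data, rerr := os.ReadFile(p)
		if rerr != nil {
			return rerr
		}
		out = append(out, entry{path: p, hash: sha256.Sum256(data)})
		return nil
	})
	return out, err
}

// analyze groups paths by content hash: structural redundancy that a
// tar stream would hand to the compressor as repeated raw bytes.
func analyze(entries []entry) map[[sha256.Size]byte][]string {
	groups := make(map[[sha256.Size]byte][]string, len(entries))
	for _, e := range entries {
		groups[e.hash] = append(groups[e.hash], e.path)
	}
	return groups
}

func main() {
	entries, err := scan(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, paths := range analyze(entries) {
		if len(paths) > 1 {
			fmt.Println("duplicate content:", paths)
		}
	}
}
```

Any group with more than one path is redundancy the archiver can record structurally instead of handing the same bytes to zstd twice.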
Metarc does not try to replace zstd.
It tries to give zstd a better input.
Current result
On the current benchmark corpus, Metarc produces smaller archives than tar+zstd on every tested source-code repository.
A few examples:
| Repository | tar+zstd | marc | Gain |
|---|---|---|---|
| Kubernetes | 81.1M | 75.3M | 7.2% smaller |
| React | 18.5M | 17.3M | 6.4% smaller |
| Redis | 8.9M | 8.4M | 5.6% smaller |
| NumPy | 18.4M | 17.7M | 3.8% smaller |
Across the tested repositories, the gain is currently around 3–7% compared to tar+zstd.
That may sound modest.
But beating tar+zstd at all on real source-code repositories is already interesting, because the final compression step still uses a standard compressor.
The difference comes from what happens before that final compression step.
Why this matters
A 3–7% improvement is not revolutionary by itself.
The interesting part is not the number alone.
The interesting part is that the improvement comes from changing the layer where compression starts.
Most compression pipelines treat the file tree as something to serialize before compression.
Metarc treats the file tree as something to analyze before compression.
That opens the door to transforms that byte-stream compression does not naturally model:
- global file deduplication
- semantic normalization
- repeated structure detection
- content-aware storage decisions (sketched after this list)
- corpus-aware compression strategies
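To illustrate the content-aware idea, here is a hedged sketch of one generic heuristic, not necessarily Metarc's: probe a sample of each file with a fast compressor, and store content that barely shrinks (media, encrypted blobs) without recompressing it.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// storeRaw reports whether a blob looks incompressible. It compresses a
// small sample at the fastest flate level; content that barely shrinks
// (already compressed or encrypted) is stored as-is so the final zstd
// pass does not waste effort on it.
func storeRaw(blob []byte) bool {
	sample := blob
	if len(sample) > 4096 {
		sample = sample[:4096]
	}
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestSpeed)
	w.Write(sample)
	w.Close()
	// Require at least ~5% savings on the sample to bother compressing.
	return buf.Len() >= len(sample)*95/100
}

func main() {
	source := bytes.Repeat([]byte("package main\n"), 300)
	fmt.Println("store source raw?", storeRaw(source)) // false: compresses well
}
```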
This is especially relevant for source-code repositories, where repetition often exists at a higher level than raw bytes.
Speed results
Metarc is also faster than tar+zstd in my current cold-cache benchmark runs.
But I want to be precise here:
the speedup is not the core metacompression claim.
The speed difference mostly comes from implementation choices:
- parallel scanning
- parallel hashing (sketched after this list)
- BLAKE3
- lightweight transforms
- avoiding some traditional archive pipeline costs
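None of this requires exotic machinery. Parallel hashing, for example, is the classic Go worker-pool pattern; a minimal stdlib sketch (again with SHA-256 standing in for BLAKE3) looks roughly like this:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"runtime"
	"sync"
)

// hashAll hashes many files concurrently with one worker per CPU.
func hashAll(paths []string) map[string][sha256.Size]byte {
	jobs := make(chan string)
	var mu sync.Mutex
	hashes := make(map[string][sha256.Size]byte, len(paths))

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				data, err := os.ReadFile(p)
				if err != nil {
					continue // a real tool would report this
				}
				sum := sha256.Sum256(data)
				mu.Lock()
				hashes[p] = sum
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return hashes
}

func main() {
	for p, h := range hashAll([]string{"go.mod", "main.go"}) {
		fmt.Printf("%s %x\n", p, h[:8])
	}
}
```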
The main claim of metacompression is about archive size.
The speed results are interesting, but they should not be confused with the compression idea itself.
What these benchmarks do not prove
These results do not prove that Metarc is universally better than tar+zstd.
They do not prove that this approach wins on every kind of data.
They do not prove that .marc should replace .tar.zst.
They show something narrower and, I think, more interesting:
on the tested source-code repositories, there is enough structural redundancy outside the byte stream to make metacompression measurable.
Different datasets may behave very differently.
For example, I would not expect the same kind of gains on:
- already compressed media
- random binary blobs
- encrypted data
- image-heavy repositories
- small directories with little repetition
- corpora where zstd already captures most of the redundancy
This project is still experimental.
The goal is not to declare victory over tar.
The goal is to explore what happens when compression starts from a richer model of the input.
Benchmark methodology
The benchmark is designed to compare Metarc against a realistic baseline:
```text
tar + zstd
```
The current benchmark methodology uses:
- real open-source repositories
- pinned repository revisions
- the same machine for all runs
- median of multiple runs
- cold cache by default for timing
- direct comparison against tar+zstd
- round-trip verification mode
The benchmark scripts are included in the repository.
For example:
```bash
./scripts/run_bench.sh --type size
./scripts/run_bench.sh --type time
./scripts/compare_on_repo.sh --mode test
```
The full benchmark page is here:
https://github.com/arhuman/metarc-go/blob/main/docs/benchmarks.md
Why cold cache?
Benchmarking archive tools is tricky.
A hot-cache run mostly measures CPU and compression speed.
A cold-cache run includes more of the real end-to-end cost of archiving a source tree from disk.
Both are useful, but they answer different questions.
For now, the default benchmark uses cold cache because it better represents the practical operation of archiving a repository.
That said, the distinction matters.
One thing I learned while working on the benchmark is that methodology can change the interpretation dramatically.
So if you look at the numbers, please look at the methodology too.
Benchmarks without context are just numerology with better fonts.
What Metarc currently does
Metarc is still experimental, but the current architecture already includes the core pieces needed to explore this idea:
```text
scan
  → analyze
  → plan
  → store
  → compress
```
At a high level:
- Metarc scans the source tree.
- It analyzes files and metadata.
- It detects redundancy and chooses transforms.
- It stores a catalog and content blobs.
- It compresses the final archive.
The important part is that the archive format is not only a serialized tar stream.
It contains a representation of the structure Metarc discovered.
That gives the format room to evolve.
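To make that less abstract, here is one way to picture such a catalog. Every type and field name below is invented for illustration; the real .marc layout is defined in the repository and may differ entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobRef identifies deduplicated content stored once in the blob section.
type BlobRef struct {
	Hash string `json:"hash"` // content hash, e.g. BLAKE3
	Size int64  `json:"size"`
}

// CatalogEntry maps a path in the original tree to stored content,
// optionally via a named reversible transform.
type CatalogEntry struct {
	Path      string `json:"path"`
	Blob      string `json:"blob"`                // hash of the backing blob
	Transform string `json:"transform,omitempty"` // e.g. "license-factor"
	Mode      uint32 `json:"mode"`
}

// Catalog is the structural layer: paths, blobs, transforms.
// The blob payloads follow it in the archive and go through zstd.
type Catalog struct {
	Blobs   []BlobRef      `json:"blobs"`
	Entries []CatalogEntry `json:"entries"`
}

func main() {
	c := Catalog{
		Blobs: []BlobRef{{Hash: "ab12cd", Size: 1042}}, // fake hash for illustration
		Entries: []CatalogEntry{
			{Path: "LICENSE", Blob: "ab12cd", Mode: 0644},
			{Path: "vendor/dep/LICENSE", Blob: "ab12cd", Mode: 0644}, // dedup: same blob
		},
	}
	out, _ := json.MarshalIndent(c, "", "  ")
	fmt.Println(string(out))
}
```

The point of the example: two paths can reference the same blob, and a transform can be named per entry, so the structure discovered during analysis survives into the archive itself.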
What could come next
The current results are encouraging, but this is still early.
Some possible next steps:
- stronger file-level deduplication
- repeated boilerplate detection
- license block factoring
- JSON normalization
- log normalization
- generated-file detection
- language-aware transforms
- better corpus analysis
- more benchmark repositories
- benchmark history per release
- stronger archive verification tooling
I am especially interested in transforms that are simple, explainable, and reversible.
The goal is not to invent a magical compressor.
The goal is to find practical cases where knowing more about the input lets us compress it better.
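License block factoring is a good example of that shape: replace a known block with a short token on the way in, substitute it back on the way out. Here is a toy sketch; a real transform would also need to guard against the token occurring in the input, which this one does not.

```go
package main

import (
	"bytes"
	"fmt"
)

// A toy reversible transform: factor a known license block behind a
// short token. Assumes the token never occurs in real input.
var (
	block = []byte("Licensed under the Apache License, Version 2.0\n")
	token = []byte("\x00MARC:LIC0\x00")
)

func factor(data []byte) []byte  { return bytes.ReplaceAll(data, block, token) }
func restore(data []byte) []byte { return bytes.ReplaceAll(data, token, block) }

func main() {
	in := append([]byte("// some file header\n"), block...)
	packed := factor(in)
	fmt.Println("round-trips:", bytes.Equal(restore(packed), in)) // true
	fmt.Println("bytes saved:", len(in)-len(packed))
}
```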
Why Go?
Metarc is written in Go because the problem fits Go well:
- file tree walking
- concurrency
- streaming I/O
- simple binaries
- good standard library
- easy distribution
- predictable performance
It is also a good language for building experimental infrastructure tools without turning the codebase into a research prototype that only runs on the author’s laptop during a full moon.
Go keeps the project boring in the right places.
That is useful when the idea itself is already unusual.
Is this production-ready?
No.
Metarc is usable, but it should still be treated as experimental.
The format may evolve.
The transforms may change.
The benchmark corpus will expand.
The implementation will likely be rewritten in parts as the design becomes clearer.
For now, I see it as a playground for exploring compression strategies above the byte-stream level.
If it becomes useful as a real archiver, great.
But the first goal is to test the idea properly.
The core idea
The shortest way to summarize Metarc is this:
```text
tar compresses a byte stream.
Metarc compresses a source tree.
```
That difference is now measurable on my benchmark corpus.
And that is what makes the project interesting to me.
Not because tar+zstd is bad.
It is not.
But because source-code repositories contain structure, and maybe compression tools should exploit more of it.
Feedback wanted
I am especially looking for:
- criticism of the benchmark methodology
- source-code repositories with unusual redundancy
- ideas for reversible semantic transforms
- references to similar projects or papers
- cases where this approach should fail
The repository is here:
https://github.com/arhuman/metarc-go
The benchmark page is here:
https://github.com/arhuman/metarc-go/blob/main/docs/benchmarks.md
If you know a repository that would be a good stress test, I would love to try it.