tar+zstd is very hard to beat.
So I stopped trying to beat it as a byte compressor.
Instead, I tried something else:
compress the structure first, then compress the bytes.
That is the idea behind Metarc, a small experimental archiver written in Go.
It explores what I call metacompression: reducing structural and semantic redundancy across a source tree before applying a standard compressor such as zstd.
And on my current source-code benchmark corpus, Metarc is now smaller than tar+zstd on every tested repository.
The problem: tar sees a stream, not a project
Traditional archive pipelines are usually built around a simple idea:
```text
directory tree → tar stream → compressor
```
This is robust, portable, simple, and battle-tested.
But it also means that by the time compression starts, a rich source tree has already been flattened into a byte stream.
A source-code repository is not just bytes.
It contains structure:
- repeated files
- duplicated licenses
- common boilerplate
- generated content
- repeated JSON structures
- logs with predictable patterns
- similar files across directories
- semantic redundancy that is easier to see before everything becomes a stream
A byte-level compressor can still find many patterns, of course.
zstd is excellent.
But some redundancy is easier to detect when the input is still a file tree.
That is the core idea of Metarc.
What is metacompression?
Metacompression means compressing information above the byte-stream level.
Instead of asking only:
how can we compress this sequence of bytes?
Metarc also asks:
what does this directory tree contain, and what redundancy exists before we turn it into a stream?
The current approach is roughly:
```text
source tree
  → scan files
  → analyze content
  → detect redundancy
  → apply structural / semantic transforms
  → store catalog + blobs
  → apply zstd
  → .marc archive
```
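To make that concrete, here is a minimal sketch of the first stages in Go. This is illustrative code, not Metarc's internals, and it uses stdlib SHA-256 where Metarc uses BLAKE3: walk the tree, hash every regular file, and group paths by content before anything becomes a stream.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// entry is one scanned file; later stages reason about the hash.
type entry struct {
	path string
	hash [sha256.Size]byte
}

// scan walks the tree and hashes every regular file.
func scan(root string) ([]entry, error) {
	var out []entry
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil || !d.Type().IsRegular() {
			return err
		}
		data, rerr := os.ReadFile(p)
		if rerr != nil {
			return rerr
		}
		out = append(out, entry{path: p, hash: sha256.Sum256(data)})
		return nil
	})
	return out, err
}

// analyze groups paths by content hash: structural redundancy that a
// tar stream would hand to the compressor as repeated raw bytes.
func analyze(entries []entry) map[[sha256.Size]byte][]string {
	groups := make(map[[sha256.Size]byte][]string, len(entries))
	for _, e := range entries {
		groups[e.hash] = append(groups[e.hash], e.path)
	}
	return groups
}

func main() {
	entries, err := scan(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, paths := range analyze(entries) {
		if len(paths) > 1 {
			fmt.Println("duplicate content:", paths)
		}
	}
}
```

Any group with more than one path is redundancy the archiver can record structurally instead of handing the same bytes to zstd twice.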
Metarc does not try to replace zstd.
It tries to give zstd a better input.
Current result
On the current benchmark corpus, Metarc produces smaller archives than tar+zstd on every tested source-code repository.
A few examples:
| Repository | tar+zstd | marc | Gain |
|---|---|---|---|
| Kubernetes | 81.1M | 75.3M | 7.2% smaller |
| React | 18.5M | 17.3M | 6.4% smaller |
| Redis | 8.9M | 8.4M | 5.6% smaller |
| NumPy | 18.4M | 17.7M | 3.8% smaller |
Across the tested repositories, the gain is currently around 3–7% compared to tar+zstd.
That may sound modest.
But beating tar+zstd at all on real source-code repositories is already interesting, because the final compression step still uses a standard compressor.
The difference comes from what happens before that final compression step.
Why this matters
A 3–7% improvement is not revolutionary by itself.
The interesting part is not the number alone.
The interesting part is that the improvement comes from changing the layer where compression starts.
Most compression pipelines treat the file tree as something to serialize before compression.
Metarc treats the file tree as something to analyze before compression.
That opens the door to transforms that byte-stream compression does not naturally model:
- global file deduplication
- semantic normalization
- repeated structure detection
- content-aware storage decisions (sketched after this list)
- corpus-aware compression strategies
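To illustrate the content-aware idea, here is a hedged sketch of one generic heuristic, not necessarily Metarc's: probe a sample of each file with a fast compressor, and store content that barely shrinks (media, encrypted blobs) without recompressing it.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// storeRaw reports whether a blob looks incompressible. It compresses a
// small sample at the fastest flate level; content that barely shrinks
// (already compressed or encrypted) is stored as-is so the final zstd
// pass does not waste effort on it.
func storeRaw(blob []byte) bool {
	sample := blob
	if len(sample) > 4096 {
		sample = sample[:4096]
	}
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestSpeed)
	w.Write(sample)
	w.Close()
	// Require at least ~5% savings on the sample to bother compressing.
	return buf.Len() >= len(sample)*95/100
}

func main() {
	source := bytes.Repeat([]byte("package main\n"), 300)
	fmt.Println("store source raw?", storeRaw(source)) // false: compresses well
}
```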
This is especially relevant for source-code repositories, where repetition often exists at a higher level than raw bytes.
Speed results
Metarc is also faster than tar+zstd in my current cold-cache benchmark runs.
But I want to be precise here:
the speedup is not the core metacompression claim.
The speed difference mostly comes from implementation choices:
- parallel scanning
- parallel hashing (sketched after this list)
- BLAKE3
- lightweight transforms
- avoiding some traditional archive pipeline costs
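None of this requires exotic machinery. Parallel hashing, for example, is the classic Go worker-pool pattern; a minimal stdlib sketch (again with SHA-256 standing in for BLAKE3) looks roughly like this:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"runtime"
	"sync"
)

// hashAll hashes many files concurrently with one worker per CPU.
func hashAll(paths []string) map[string][sha256.Size]byte {
	jobs := make(chan string)
	var mu sync.Mutex
	hashes := make(map[string][sha256.Size]byte, len(paths))

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				data, err := os.ReadFile(p)
				if err != nil {
					continue // a real tool would report this
				}
				sum := sha256.Sum256(data)
				mu.Lock()
				hashes[p] = sum
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return hashes
}

func main() {
	for p, h := range hashAll([]string{"go.mod", "main.go"}) {
		fmt.Printf("%s %x\n", p, h[:8])
	}
}
```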
The main claim of metacompression is about archive size.
The speed results are interesting, but they should not be confused with the compression idea itself.
What these benchmarks do not prove
These results do not prove that Metarc is universally better than tar+zstd.
They do not prove that this approach wins on every kind of data.
They do not prove that .marc should replace .tar.zst.
They show something narrower and, I think, more interesting:
on the tested source-code repositories, there is enough structural redundancy outside the byte stream to make metacompression measurable.
Different datasets may behave very differently.
For example, I would not expect the same kind of gains on:
- already compressed media
- random binary blobs
- encrypted data
- image-heavy repositories
- small directories with little repetition
- corpora where zstd already captures most of the redundancy
This project is still experimental.
The goal is not to declare victory over tar.
The goal is to explore what happens when compression starts from a richer model of the input.
Benchmark methodology
The benchmark is designed to compare Metarc against a realistic baseline:
```text
tar + zstd
```
The current benchmark methodology uses:
- real open-source repositories
- pinned repository revisions
- the same machine for all runs
- median of multiple runs
- cold cache by default for timing
- direct comparison against tar+zstd
- round-trip verification mode
The benchmark scripts are included in the repository.
For example:
```bash
./scripts/run_bench.sh --type size
./scripts/run_bench.sh --type time
./scripts/compare_on_repo.sh --mode test
```
The full benchmark page is here:
https://github.com/arhuman/metarc-go/blob/main/docs/benchmarks.md
Why cold cache?
Benchmarking archive tools is tricky.
A hot-cache run mostly measures CPU and compression speed.
A cold-cache run includes more of the real end-to-end cost of archiving a source tree from disk.
Both are useful, but they answer different questions.
For now, the default benchmark uses cold cache because it better represents the practical operation of archiving a repository.
That said, the distinction matters.
One thing I learned while working on the benchmark is that methodology can change the interpretation dramatically.
So if you look at the numbers, please look at the methodology too.
Benchmarks without context are just numerology with better fonts.
What Metarc currently does
Metarc is still experimental, but the current architecture already includes the core pieces needed to explore this idea:
```text
scan
  → analyze
  → plan
  → store
  → compress
```
At a high level:
- Metarc scans the source tree.
- It analyzes files and metadata.
- It detects redundancy and chooses transforms.
- It stores a catalog and content blobs.
- It compresses the final archive.
The important part is that the archive format is not only a serialized tar stream.
It contains a representation of the structure Metarc discovered.
That gives the format room to evolve.
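To make that less abstract, here is one way to picture such a catalog. Every type and field name below is invented for illustration; the real .marc layout is defined in the repository and may differ entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobRef identifies deduplicated content stored once in the blob section.
type BlobRef struct {
	Hash string `json:"hash"` // content hash, e.g. BLAKE3
	Size int64  `json:"size"`
}

// CatalogEntry maps a path in the original tree to stored content,
// optionally via a named reversible transform.
type CatalogEntry struct {
	Path      string `json:"path"`
	Blob      string `json:"blob"`                // hash of the backing blob
	Transform string `json:"transform,omitempty"` // e.g. "license-factor"
	Mode      uint32 `json:"mode"`
}

// Catalog is the structural layer: paths, blobs, transforms.
// The blob payloads follow it in the archive and go through zstd.
type Catalog struct {
	Blobs   []BlobRef      `json:"blobs"`
	Entries []CatalogEntry `json:"entries"`
}

func main() {
	c := Catalog{
		Blobs: []BlobRef{{Hash: "ab12cd", Size: 1042}}, // fake hash for illustration
		Entries: []CatalogEntry{
			{Path: "LICENSE", Blob: "ab12cd", Mode: 0644},
			{Path: "vendor/dep/LICENSE", Blob: "ab12cd", Mode: 0644}, // dedup: same blob
		},
	}
	out, _ := json.MarshalIndent(c, "", "  ")
	fmt.Println(string(out))
}
```

The point of the example: two paths can reference the same blob, and a transform can be named per entry, so the structure discovered during analysis survives into the archive itself.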
What could come next
The current results are encouraging, but this is still early.
Some possible next steps:
- stronger file-level deduplication
- repeated boilerplate detection
- license block factoring
- JSON normalization
- log normalization
- generated-file detection
- language-aware transforms
- better corpus analysis
- more benchmark repositories
- benchmark history per release
- stronger archive verification tooling
I am especially interested in transforms that are simple, explainable, and reversible.
The goal is not to invent a magical compressor.
The goal is to find practical cases where knowing more about the input lets us compress it better.
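License block factoring is a good example of that shape: replace a known block with a short token on the way in, substitute it back on the way out. Here is a toy sketch; a real transform would also need to guard against the token occurring in the input, which this one does not.

```go
package main

import (
	"bytes"
	"fmt"
)

// A toy reversible transform: factor a known license block behind a
// short token. Assumes the token never occurs in real input.
var (
	block = []byte("Licensed under the Apache License, Version 2.0\n")
	token = []byte("\x00MARC:LIC0\x00")
)

func factor(data []byte) []byte  { return bytes.ReplaceAll(data, block, token) }
func restore(data []byte) []byte { return bytes.ReplaceAll(data, token, block) }

func main() {
	in := append([]byte("// some file header\n"), block...)
	packed := factor(in)
	fmt.Println("round-trips:", bytes.Equal(restore(packed), in)) // true
	fmt.Println("bytes saved:", len(in)-len(packed))
}
```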
Why Go?
Metarc is written in Go because the problem fits Go well:
- file tree walking
- concurrency
- streaming I/O
- simple binaries
- good standard library
- easy distribution
- predictable performance
It is also a good language for building experimental infrastructure tools without turning the codebase into a research prototype that only runs on the author’s laptop during a full moon.
Go keeps the project boring in the right places.
That is useful when the idea itself is already unusual.
Is this production-ready?
No.
Metarc is usable, but it should still be treated as experimental.
The format may evolve.
The transforms may change.
The benchmark corpus will expand.
The implementation will likely be rewritten in parts as the design becomes clearer.
For now, I see it as a playground for exploring compression strategies above the byte-stream level.
If it becomes useful as a real archiver, great.
But the first goal is to test the idea properly.
The core idea
The shortest way to summarize Metarc is this:
```text
tar compresses a byte stream.
Metarc compresses a source tree.
```
That difference is now measurable on my benchmark corpus.
And that is what makes the project interesting to me.
Not because tar+zstd is bad.
It is not.
But because source-code repositories contain structure, and maybe compression tools should exploit more of it.
Feedback wanted
I am especially looking for:
- criticism of the benchmark methodology
- source-code repositories with unusual redundancy
- ideas for reversible semantic transforms
- references to similar projects or papers
- cases where this approach should fail
The repository is here:
https://github.com/arhuman/metarc-go
The benchmark page is here:
https://github.com/arhuman/metarc-go/blob/main/docs/benchmarks.md
If you know a repository that would be a good stress test, I would love to try it.