The Designer Changed 3 Numbers. The Diff Said "Binary Files". So I Built SmartDiff

If you've ever worked at a game studio, you know the drill: half the game isn't in code. It's in config tables — items, skills, drop rates, level curves — maintained as Excel spreadsheets and checked into SVN next to the source.

Here's what reviewing a config change looked like for us:

A designer tweaks 3 numbers in a 5,000-row item table. They commit. You open the diff and get one of two things:

"Cannot display: file marked as a binary type." Thanks, SVN, very helpful. (If you're on git, you know this one as "Binary files differ".)
Or, if the file happens to be XML-based: two thousand changed lines of <Cell>, <Style>, column widths, window positions, and the cursor location the file was saved with. Somewhere in there are the 3 numbers that actually matter.

So the actual review process was: open the old version in Excel, open the new version in Excel, put them side by side, and eyeball it. For every commit. Forever.

At some point I snapped and built SmartDiff — an open-source, local-first diff and merge tool that understands spreadsheets as tables of data, not as lines of text.

Why line-based diff will never work for spreadsheets

It's not that diff tools are bad. It's that they're answering the wrong question. A spreadsheet file is a serialization of a table, and the serialization is full of things that aren't data:

Format noise. Styles, column widths, pane positions, selection state — Excel rewrites these constantly. They produce diffs that look huge but mean nothing.
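Positional matching. Text diff aligns lines by their text and order; it has no idea which row is which. Insert one row in the middle of a table and, wherever the format stores row positions in the markup (as .xlsx does with its row and cell references), every row below it is "changed". The one real edit drowns in a cascade of false positives.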
Merging is all-or-nothing. Two people edit the same file? With binary spreadsheets your version control offers you a coin flip: keep yours or take theirs. Someone's afternoon gets discarded.

Every one of these is a semantic problem. You can't fix it with a better text algorithm — the tool has to know it's looking at a table.
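
To see the cascade in miniature, here is a small Python sketch. It assumes a simplified one-line-per-row serialization in which every row carries its own index, similar to the r attribute in .xlsx sheet XML. The serialize helper and the sample data are invented for illustration.

```python
import difflib

def serialize(rows):
    """One line per row, each line carrying its 1-based row index (hypothetical format)."""
    return [
        f'<row r="{i}"><c>{item_id}</c><c>{price}</c></row>'
        for i, (item_id, price) in enumerate(rows, start=1)
    ]

old_rows = [(1000 + n, 10 * n) for n in range(1, 11)]   # 10 items
new_rows = old_rows[:3] + [(9999, 5)] + old_rows[3:]    # one row inserted after the 3rd

diff = list(difflib.unified_diff(serialize(old_rows), serialize(new_rows), lineterm="", n=0))
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
added = [line for line in diff if line.startswith("+") and not line.startswith("+++")]

# One logical insert, but every row below it was renumbered and therefore "changed".
print(len(removed), "lines removed,", len(added), "lines added")  # 7 lines removed, 8 lines added
```

One inserted row out of ten already touches fifteen lines of diff. Scale that to a 5,000-row table and the real change is unfindable.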

What SmartDiff does instead

SmartDiff parses the spreadsheet (.xml SpreadsheetML, .xlsx, .xls) into a table model, throws away everything that isn't data, and diffs that.
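
As a rough sketch of what "parse into a table model and throw away the rest" can mean for the SpreadsheetML (.xml) case, the snippet below uses Python's standard xml.etree.ElementTree. It is not SmartDiff's actual parser: parse_table and SAMPLE are names made up here, it reads only the first worksheet, and it handles ss:Index on cells but not on rows.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"  # SpreadsheetML 2003 namespace
NS = {"ss": SS}

def parse_table(xml_text):
    """Return the first worksheet as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    sheet = root.find("ss:Worksheet", NS)
    table = []
    for row in sheet.iterfind("ss:Table/ss:Row", NS):
        values = []
        for cell in row.iterfind("ss:Cell", NS):
            # ss:Index (1-based) lets a cell skip over empty columns; pad the gap.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                values.extend([""] * (int(index) - 1 - len(values)))
            data = cell.find("ss:Data", NS)
            values.append((data.text or "") if data is not None else "")
        table.append(values)
    return table

SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Styles><Style ss:ID="s1"/></Styles>
  <Worksheet ss:Name="Items">
    <Table>
      <Column ss:Width="120"/>
      <Row>
        <Cell><Data ss:Type="String">id</Data></Cell>
        <Cell><Data ss:Type="String">note</Data></Cell>
        <Cell><Data ss:Type="String">price</Data></Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="s1"><Data ss:Type="Number">1001</Data></Cell>
        <Cell ss:Index="3"><Data ss:Type="Number">50</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

print(parse_table(SAMPLE))  # [['id', 'note', 'price'], ['1001', '', '50']]
```

Styles, column widths, and style IDs never make it into the returned lists, so nothing downstream can diff them.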

Here is how the same "3 numbers changed in a 5,000-row table" commit looks now: green rows are added, red rows are deleted, and yellow cells are modified, with old → new shown right in the cell and token-level highlighting so you can see which number in a long formula-ish value actually moved.
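
Token-level highlighting can be approximated with the standard library's difflib.SequenceMatcher run over tokens instead of characters. This is a sketch under assumptions: the tokenizer regex and the [-old-] / {+new+} markers are choices made for this example, not SmartDiff's actual rendering.

```python
import difflib
import re

def tokenize(value):
    """Split a cell value into numbers, words, and single other characters."""
    return re.findall(r"\d+(?:\.\d+)?|\w+|.", value, flags=re.S)

def token_diff(old, new):
    """Return (old_marked, new_marked) with changed tokens wrapped in markers."""
    a, b = tokenize(old), tokenize(new)
    old_out, new_out = [], []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            old_out.append("".join(a[i1:i2]))
            new_out.append("".join(b[j1:j2]))
            continue
        if i2 > i1:  # tokens that exist only in the old value
            old_out.append("[-" + "".join(a[i1:i2]) + "-]")
        if j2 > j1:  # tokens that exist only in the new value
            new_out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(old_out), "".join(new_out)

print(token_diff("atk*1.5+120;crit=0.25", "atk*1.5+135;crit=0.25"))
# ('atk*1.5+[-120-];crit=0.25', 'atk*1.5+{+135+};crit=0.25')
```

Treating 1.5 and 0.25 as single tokens is what keeps the highlight on the one number that moved rather than on individual digits.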

Need to review a whole revision range instead of one file? There's a GitHub-style "Files changed" overview.

And when two people really do edit the same table, instead of the coin flip you get a cell-level three-way merge: non-conflicting edits from both sides are merged automatically, and only genuine conflicts (the same cell changed to two different values) ask for a human decision.

The merge result is written back to the original XML, preserving comments, processing instructions, and namespaces, so the file stays diff-able and Excel-friendly.

The interesting technical bits

A few problems turned out to be more fun than expected:

Row matching without trusting row numbers. SmartDiff auto-detects the ID column (most config tables have one) and matches rows by identity across versions. No usable ID column? It falls back to content hashing, then to row position as a last resort. The result: inserting or deleting a row produces exactly one diff entry, not five thousand.
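
The fallback chain can be sketched in a few lines of Python. The heuristic here (first column whose values are non-empty and unique in both versions) is an assumption made for the example, as are all the function names; the real detection logic may well be smarter.

```python
import hashlib

def unique_column(rows, col):
    """True if every row has a non-empty value in `col` and no value repeats."""
    values = [row[col] if col < len(row) else "" for row in rows]
    return all(values) and len(set(values)) == len(values)

def row_digest(row):
    """Content hash of a whole row; the separator keeps ['ab', 'c'] != ['a', 'bc']."""
    return hashlib.sha1("\x1f".join(row).encode("utf-8")).hexdigest()

def choose_key(header, old_rows, new_rows):
    """Pick one identity strategy for BOTH versions: id, then hash, then position."""
    for col in range(len(header)):
        if unique_column(old_rows, col) and unique_column(new_rows, col):
            return "id", lambda i, row, c=col: row[c]
    if all(len({row_digest(r) for r in rows}) == len(rows) for rows in (old_rows, new_rows)):
        return "hash", lambda i, row: row_digest(row)
    return "position", lambda i, row: i

def diff_rows(header, old_rows, new_rows):
    strategy, key = choose_key(header, old_rows, new_rows)
    old = {key(i, r): r for i, r in enumerate(old_rows)}
    new = {key(i, r): r for i, r in enumerate(new_rows)}
    added = [k for k in new if k not in old]
    deleted = [k for k in old if k not in new]
    modified = [k for k in old if k in new and old[k] != new[k]]
    return strategy, added, deleted, modified

header = ["id", "name", "price"]
old_rows = [["1001", "Sword", "50"], ["1002", "Shield", "40"], ["1003", "Potion", "5"]]
new_rows = [["1001", "Sword", "50"], ["1500", "Bow", "45"],
            ["1002", "Shield", "42"], ["1003", "Potion", "5"]]

print(diff_rows(header, old_rows, new_rows))  # ('id', ['1500'], [], ['1002'])
```

Note the trade-off in the hash fallback: a row whose content changed gets a new hash, so it shows up as one delete plus one add rather than a modification. That is still far better than position, where everything below an insert shifts.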

Knowing what to ignore. Columns without a header are treated as designer annotations and excluded from the diff, since they never ship to the game anyway. Styles, pane state, and column widths are dropped at parse time, so by construction they can't generate noise.
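
The headerless-column rule is simple enough to show directly. A minimal sketch, assuming the first row is the header and that a blank or whitespace-only header cell means "annotation column"; drop_unnamed_columns is a hypothetical helper name.

```python
def drop_unnamed_columns(header, rows):
    """Keep only columns whose header cell is non-blank; pad short rows with ''."""
    keep = [i for i, name in enumerate(header) if name and name.strip()]
    kept_header = [header[i] for i in keep]
    kept_rows = [[row[i] if i < len(row) else "" for i in keep] for row in rows]
    return kept_header, kept_rows

header = ["id", "", "price", "  "]
rows = [
    ["1001", "TODO: rebalance after playtest", "50", "ask Kim"],
    ["1002", "", "40"],
]

print(drop_unnamed_columns(header, rows))
# (['id', 'price'], [['1001', '50'], ['1002', '40']])
```

Because the filter runs before any comparison, a designer can rewrite their notes all day without producing a single diff entry.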

Three-way merge at the cell level. Classic merge thinks in lines; SmartDiff thinks in (row ID, column) coordinates. With BASE, MINE, and THEIRS parsed into tables, most "conflicts" dissolve: you edited the price column, your colleague edited the description column of the same row — both survive, automatically. Only true same-cell disagreements need resolution.
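
Here is a minimal sketch of that rule, assuming each version has already been flattened into a dict keyed by (row ID, column). It is an illustration rather than SmartDiff's merge engine: merge_cells is a made-up name, a missing key stands in for a deleted cell, and conflicting cells keep the BASE value until someone resolves them.

```python
def merge_cells(base, mine, theirs):
    """Three-way merge of {(row_id, column): value} dicts. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(key), mine.get(key), theirs.get(key)
        if m == t:        # both sides agree (unchanged, same edit, or both deleted)
            value = m
        elif m == b:      # only THEIRS touched this cell
            value = t
        elif t == b:      # only MINE touched this cell
            value = m
        else:             # same cell, two different values: needs a human
            conflicts.append((key, b, m, t))
            value = b
        if value is not None:
            merged[key] = value
    return merged, conflicts

base = {("1001", "price"): "50", ("1001", "desc"): "A sword", ("1002", "price"): "40"}
mine = {("1001", "price"): "55", ("1001", "desc"): "A sword", ("1002", "price"): "45"}
theirs = {("1001", "price"): "50", ("1001", "desc"): "A sharp sword", ("1002", "price"): "42"}

merged, conflicts = merge_cells(base, mine, theirs)
print(merged[("1001", "price")], "|", merged[("1001", "desc")])  # 55 | A sharp sword
print(conflicts)  # [(('1002', 'price'), '40', '45', '42')]
```

Row 1001 was edited by both people and merges cleanly because the edits landed in different columns. Only the price of 1002, which both sides changed to different values, is reported.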

SVN integration, because that's where the pain was. The tool polls the repo, shows a banner when new revisions land, and when an update would conflict it offers per-file choices — keep mine, take theirs, or drop into the semantic merge view. (No SVN installed? Browse mode still works as a clean table viewer, but the diff and merge features are built around SVN working copies.)
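
One plausible way to do the polling from Python is to compare the working copy's revision with the repository's HEAD via the SVN CLI. This is a guess at the approach, not SmartDiff's code: the function names are invented, and it relies on svn info --show-item, which needs Subversion 1.9 or newer. The run parameter exists only so the command runner can be swapped out in tests.

```python
import subprocess

def svn_revision(path, head=False, run=subprocess.run):
    """Return the working-copy revision of `path`, or the repository HEAD if head=True."""
    cmd = ["svn", "info", "--show-item", "revision"]
    if head:
        cmd += ["-r", "HEAD"]  # asks the server instead of reading local metadata
    cmd.append(path)
    result = run(cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.strip())

def new_revisions_available(path, run=subprocess.run):
    """True when the repository has revisions the working copy has not pulled yet."""
    return svn_revision(path, head=True, run=run) > svn_revision(path, run=run)
```

A background thread or timer could call new_revisions_available on the working copy root every so often and raise the banner when it flips to True.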

The stack, briefly

Backend: Python + Flask, talking to the SVN CLI
Frontend: zero-dependency vanilla JS single-page app — no build step, no node_modules
Distribution: PyInstaller single .exe with in-app auto-update, for the "I am not installing Python for this" crowd
76 automated tests across the diff engine, merge engine, and API

Nothing exotic — the value is in the table model, not the framework.

Try it

The project is open source: github.com/noahsarkcc/smartdiff

Grab the standalone SmartDiff.exe from Releases (no Python needed), or
run pip install -r requirements.txt && python server.py from a checkout of the repo, and it opens in your browser.

Point it at an SVN working copy full of spreadsheet configs and see what your last "binary" commit actually changed.

I'd genuinely love to hear: how does your team review spreadsheet or config-table changes? Side-by-side eyeballing? Export to CSV and pray? A commercial tool that actually works? Tell me in the comments — especially if you've solved the .xlsx-merge problem differently.