Spec-Driven, AI-Assisted, Test-Validated — A Practitioner's Account


What made a two-week typesetting library possible, and what the methodology still lacks


§0 — Hook

Most accounts of AI-assisted development describe tools and workflows. Very few show primary sources: the actual specification documents, the actual errors caught, the actual decisions revised under implementation pressure. Without those, the account is not evaluable and not replicable. You cannot tell whether the methodology produced the result or whether the result happened despite the methodology.

This article shows the sources. paragraf — an open source typesetting library built in two weeks — is the case study. 12 packages, Rust/WASM and TypeScript, covering a complete pipeline from font shaping to PDF output. The methodology is the subject. Every claim below is either demonstrated by an artifact or flagged explicitly as opinion.

The paragraf demo — live in browser.


§1 — The Prerequisite Nobody Mentions

Every AI-assisted development article eventually says some version of "give the AI precise specifications and you get better output." Almost none of them explain where precise specifications come from.

They do not come from knowing the algorithms. paragraf implements the Knuth-Plass line-breaking algorithm, optical margin alignment, rustybuzz OpenType shaping, and Unicode BiDi. None of those were known in detail before the project started. They were researched, trusted to AI implementation, and verified through tests.

What domain knowledge actually contributes is different and harder to acquire: knowledge of failure modes in the target environment.

The onMissing design in @paragraf/compile — skip, fallback, placeholder, each with defined behavior — does not appear in the typesetting literature. It comes from having seen a real product information management export fail a batch job at 2am because three records out of ten thousand had a missing field. The normalize() hook that maps raw data to template bindings comes from knowing that every enterprise data source has a different field naming convention and no library adapter ever covers a specific customer's exact schema. The strict layer dependency rules — each package imports only from layers below, no exceptions — come from having debugged circular dependency failures in InDesign automation pipelines.
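The onMissing and normalize() contracts described above can be sketched in a few lines. This is an illustrative shape only, assuming a discriminated union for the policy; the mode names (skip, fallback, placeholder) come from the article, but the exact types, field names, and the `resolveField` helper are hypothetical, not the real @paragraf/compile API.

```typescript
// Illustrative sketch of an onMissing/normalize contract. Type and field
// names are assumptions, not the actual @paragraf/compile API.
type MissingPolicy =
  | { mode: "skip" }                          // drop the record from the batch
  | { mode: "fallback"; value: string }       // substitute a defined default
  | { mode: "placeholder"; marker: string };  // render a visible marker

interface CompileOptions {
  onMissing: MissingPolicy;
  // Map a raw record to the template's bindings, absorbing
  // customer-specific field naming conventions.
  normalize?: (raw: Record<string, unknown>) => Record<string, string>;
}

function resolveField(
  raw: Record<string, unknown>,
  field: string,
  opts: CompileOptions
): string | null {
  const data = opts.normalize ? opts.normalize(raw) : (raw as Record<string, string>);
  const value = data[field];
  if (value !== undefined && value !== null && value !== "") return String(value);
  switch (opts.onMissing.mode) {
    case "skip":
      return null; // caller drops this record instead of failing the whole batch
    case "fallback":
      return opts.onMissing.value;
    case "placeholder":
      return opts.onMissing.marker;
  }
}
```

The point of the union type is that every missing-field outcome is defined in advance, so three bad records out of ten thousand degrade gracefully instead of killing the 2am batch job.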

An AI given the same algorithm knowledge and no production context would have built something technically correct that fails the first time it touches real data. The specifications were precise because the failure modes were known in advance. That precision is the prerequisite.

This maps closely to what practitioners in the AI coding literature describe as spec inputs: markdown documents, conversations, diagrams, domain models, and existing code feeding into requirements that the AI can act on reliably. The taxonomy is accurate. What the literature underemphasises is that the quality of those inputs depends almost entirely on what the human already knows — not about the AI, but about the problem domain.

The specification document that governed package structure and testing strategy


§2 — The Two-Loop Process

The development process followed two nested loops. Understanding the distinction between them is the core of the methodology.

The outer loop covers the project. It produces: a problem definition, scope constraints, a high-level layer architecture, an architecture diagram, and a versioned roadmap. Outer loop documents are updated when implementation reality forces a revision. They are not fixed contracts — they are living records of current understanding. This is what Fowler calls design-first collaboration: the human owns the architecture, the AI executes within it.

The inner loop covers each package. It produces: a scope definition, input/output schemas, a step-by-step implementation plan with defined subtasks and edge cases, unit tests written before implementation, then implementation against that specification, closing with integration and end-to-end tests validating every contract.
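The input/output schemas the inner loop starts from might look like the following sketch for a line-breaking package. The type names are assumptions written for illustration, not the real @paragraf/linebreak contract; the point is that the shape of the data crossing the package boundary is fixed before any implementation exists, along with a check that any implementation's output must satisfy.

```typescript
// Hypothetical I/O schema for a line-breaking package, in the spirit of
// the inner-loop artifacts described above. Not the real paragraf types.
type Item =
  | { kind: "box"; width: number }                                  // unbreakable content
  | { kind: "glue"; width: number; stretch: number; shrink: number } // adjustable space
  | { kind: "penalty"; width: number; cost: number };               // break opportunity

interface LineBreakInput {
  items: Item[];
  lineWidth: number; // target measure, same units as item widths
}

interface LineBreakOutput {
  breakpoints: number[];      // indices into `items` where lines end
  adjustmentRatios: number[]; // per-line glue adjustment, kept for validation
}

// A contract check usable in tests before the algorithm is written:
// breakpoints must be strictly increasing and in range.
function isWellFormed(out: LineBreakOutput, input: LineBreakInput): boolean {
  return out.breakpoints.every(
    (b, i) => b >= 0 && b < input.items.length && (i === 0 || b > out.breakpoints[i - 1])
  );
}
```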

Here is what the inner loop looks like in practice, from the @paragraf/linebreak package plan:

An inner loop plan sample.

The two loops are not independent. An issue discovered mid-package feeds back into the outer loop and may revise the project scope, the architecture, or the roadmap. A concrete example: the layer numbering in the original plan had 1c-font-engine and 1b-shaping-wasm. During extraction, the numbering was revised to 1b-font-engine and 2a-shaping-wasm to better reflect the actual dependency structure. Small decision, visible consequence: the outer loop was updated in response to implementation reality rather than preserved as a false contract. The architecture documents show that evolution directly:

Architecture at the start of the project (left) and after several outer loop iterations (right).


§3 — Trust, But Verify

Trust, but verify — I first heard it from my director, Bob Bair, when we met in Greece nearly ten years ago. I assumed it was professional jargon. It turned out to be a Russian proverb, and the most accurate description of what makes AI-assisted development work at this level.

The phrase "AI-assisted" covers an enormous range of practices, from fully autonomous code generation to a human using AI as an implementation engine operating inside precisely defined contracts. The methodology here is firmly the second. No agentic framework was used — existing frameworks are built around task delegation and autonomy, which is the opposite of what this methodology requires. Every step involved a human decision. The AI tools were implementation engines and discussion partners, not architects.

The tooling split was deliberate. VS Code Copilot and Claude Sonnet/Haiku handled code generation inside the inner loop — writing implementations against pre-defined schemas and tests. Claude Opus and Gemini handled architecture discussions, document synthesis, and the outer loop decisions where the question was "what should this be, why, and how" rather than "implement this." A third role — code review and audit — ran across both: Claude connected to GitHub, Copilot code review, and manual review at integration boundaries. The key practice was using multiple models as a quality control mechanism: when two models diverge on an assessment of the same code or decision, that disagreement is a signal worth investigating. The human resolves it. This is what the emerging harness engineering literature is beginning to formalise — the scaffolding around AI tools matters as much as the tools themselves.

The control mechanism that makes this work at the code level is test-first development — but not in the conventional sense of "write tests alongside your code." The distinction matters: tests written before implementation define what correct means before the AI is asked to produce anything. They are specifications expressed as assertions. The AI implements against them. Errors are caught at the unit boundary, not at integration time.
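A specification-as-assertion can be made concrete with a toy example. The `breakLines` function below is a deliberately simple greedy stand-in written for this sketch, not the paragraf implementation; what matters is `specHolds`, which encodes the correctness properties (every line fits, no content lost or reordered) that any implementation must satisfy, and which can exist and fail before a single line of implementation is written.

```typescript
// Minimal greedy stand-in so the spec below is executable. Illustrative only.
function breakLines(words: number[], measure: number): number[][] {
  const lines: number[][] = [[]];
  let used = 0;
  for (const w of words) {
    if (used + w > measure && lines[lines.length - 1].length > 0) {
      lines.push([]);
      used = 0;
    }
    lines[lines.length - 1].push(w);
    used += w;
  }
  return lines;
}

// The specification, written first: every line fits the measure (an
// overfull single word is tolerated), and words survive in order.
function specHolds(words: number[], measure: number): boolean {
  const lines = breakLines(words, measure);
  const fits = lines.every(
    (l) => l.reduce((a, b) => a + b, 0) <= measure || l.length === 1
  );
  const flat = lines.flat();
  const preserved =
    flat.length === words.length && flat.every((w, i) => w === words[i]);
  return fits && preserved;
}
```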

During the first week, unit tests were passing across all packages and end-to-end tests were failing across all packages. The response was to stop, run a full audit, classify every issue by severity, and fix in order before continuing. The audit document was named excuse-me-kemal-I-forked-up.md. The name is the emotional record of the moment. The contents are the professional response to it.

Six critical issues. Two high. Three medium. All code files, tests, and documents fixed before shipping. The table is not evidence that the process prevents errors — it is evidence that the review step is real and not ceremonial. Errors that compound into the next stage are significantly more expensive than errors caught at their origin. The audit caught them at their origin.


§4 — What It Produced and What It Lacks

What it produced: 12 published packages covering a complete typesetting pipeline. 906 unit tests across all packages, 70 end-to-end tests, and 23 manual test scripts producing real PDF and SVG output. A live demo running the full WASM shaping pipeline in the browser. A complete compile API that takes a template and a data record and returns a PDF buffer in a single function call. The visible output of a correctly specified system:
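The single-call shape of that compile API can be suggested with a toy stand-in. This is emphatically not the paragraf API — the binding syntax, function name, and return handling are invented for illustration, and the real pipeline shapes, breaks, and typesets rather than substituting strings — but it shows the contract: template plus record in, byte buffer out.

```typescript
// Toy stand-in for a template + record → buffer compile call.
// NOT the paragraf API; a shape sketch only.
function compile(template: string, record: Record<string, string>): Uint8Array {
  // Substitute {{field}} bindings, then return bytes. A real compile would
  // hand the bound content to the shaping, layout, and PDF layers.
  const bound = template.replace(/\{\{(\w+)\}\}/g, (_, key) => record[key] ?? "");
  return new TextEncoder().encode(bound);
}
```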

Knuth-Plass (left) distributes spacing evenly across all lines simultaneously. Greedy (right) fills each line independently, producing uneven spacing and a stretched final line.

The left column was produced by Knuth-Plass with real OpenType metrics. The right column was produced by the greedy algorithm used by every JavaScript PDF library. The difference is the consequence of specification precision applied at every layer of the stack.
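The evenness difference has a simple arithmetic root. Knuth-Plass scores every candidate line by its glue adjustment ratio and a badness that grows with the cube of that ratio (badness = 100·|r|³ in the original paper), then minimizes the total over the whole paragraph; greedy never scores anything. A sketch of that scoring, not the paragraf code:

```typescript
// Knuth-Plass line scoring sketch. r is the glue adjustment ratio:
// how far the line's glue must stretch (r > 0) or shrink (r < 0) to fit.
function adjustmentRatio(
  natural: number,  // natural width of the line's content
  stretch: number,  // total stretchability of its glue
  shrink: number,   // total shrinkability of its glue
  measure: number   // target line width
): number {
  const diff = measure - natural;
  if (diff > 0) return stretch > 0 ? diff / stretch : Infinity;
  if (diff < 0) return shrink > 0 ? diff / shrink : -Infinity;
  return 0;
}

// badness = 100 * |r|^3; a line shrunk past its limit is infeasible.
// The cubic growth is why one very loose line costs more than several
// slightly loose ones — the source of the even spacing in the left column.
function badness(r: number): number {
  if (r < -1) return Infinity;
  return Math.round(100 * Math.abs(r) ** 3);
}
```

Greedy commits to each line locally and eats the cost wherever it lands, typically on a stretched final line; Knuth-Plass sums demerits derived from these badness values across all lines and picks the global minimum.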

What it lacks: three gaps worth naming honestly.

Documentation was not versioned alongside code. Inner loop documents were overwritten as understanding evolved. The dependency reference document changed significantly during the project, but only its current state is preserved. Git tracked every code change with line-level precision while documentation changes were invisible. The discipline was there. The infrastructure to preserve the evidence of that discipline was not. These are not the same failure.

Session continuity works for one person. The session handoff document — a structured snapshot of current state, architectural decisions with reasoning, known bugs classified by severity, and next steps — is what allows AI-assisted development to resume coherently across sessions. Fowler's context-anchoring describes this problem precisely: without deliberate anchoring, each session starts from a degraded understanding of the project's current state. The handoff document is a manual solution to that problem. It works. It does not yet scale to a team without additional tooling design.
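The handoff document's structure can be made explicit as a schema. The field names below are illustrative, derived from the categories listed above (current state, decisions with reasoning, bugs by severity, next steps), not the actual document format; the small triage helper mirrors the fix-in-severity-order response described in §3.

```typescript
// Hypothetical schema for a session handoff snapshot. Field names are
// assumptions based on the categories the article lists.
type Severity = "critical" | "high" | "medium" | "low";

interface Handoff {
  date: string; // ISO date; written once at session end, never overwritten
  currentState: string;
  decisions: { decision: string; reasoning: string }[];
  knownBugs: { summary: string; severity: Severity }[];
  nextSteps: string[];
}

// Fix in severity order before continuing, as the audit in §3 did.
const order: Severity[] = ["critical", "high", "medium", "low"];
function triage(bugs: Handoff["knownBugs"]): Handoff["knownBugs"] {
  return [...bugs].sort(
    (a, b) => order.indexOf(a.severity) - order.indexOf(b.severity)
  );
}
```

Making the schema explicit is also the first step toward the team-scale tooling the article says is still missing: a typed snapshot can be validated, diffed, and archived mechanically.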

The process is disciplined but not yet systematic. Disciplined means: there was a defined structure, it was followed consistently, it produced results. Systematic means: the process itself is observable and reproducible from its own records. The audit document was produced because an audit was run manually at a moment of failure. A systematic process would have tooling that made that audit continuous rather than reactive.


§5 — What the Next Version Looks Like

Three concrete improvements follow directly from the gaps above.

Documentation commits alongside code commits. When a specification changes in response to an implementation discovery, that change should be recorded at the same moment as the code change that triggered it, with the reasoning attached. Not a git diff of a prose document — a dated entry written by the person who made the decision, capturing intent rather than just state.

Inner loop documents versioned rather than overwritten. Each package plan should have a version history showing what changed between drafts and why. The delta between versions is where the feedback loop between inner and outer is visible. Without it you have outcomes but not reasoning.

Session handoff documents dated and immutable. The handoff document that allows a session to resume coherently should be treated as a changelog entry, not a mutable working document. Write it at the end of a session, date it, do not overwrite it. The history of those documents is the history of how the project's understanding evolved.

The quality of AI-assisted output is determined by the precision of the specification, the discipline of the review, and the honesty of the record. All three are learnable. None of them require a particular tool. The methodology described here is not finished — it is a working version that produced a working result and has visible room to improve. That is a more useful account than a polished success story, and it is the only kind worth writing.


The next article in this series covers the problem space in more depth — the typographic and algorithmic reasons why existing JavaScript document libraries fall short of publication quality, and what it takes to close that gap.

paragraf is open source. The repository, the live demo, and the article series are at github.com/kadetr/paragraf.

Source: dev.to
