I built a deep learning framework in Rust from scratch — Part 3: the road to crates.io


In Part 1 I argued why a graph-based DL framework in pure Rust was a project worth doing.

In Part 2 I wrote the GPU backend on wgpu and figured out how to make TransformerBlock train on it. Both posts ended with the same honest admission: the code was fine, but the project wasn't ready for other humans.

This post is about closing that gap. Six phases of work, a v0.2.0 → v0.3.1
bump, and a crate that now looks like something you'd actually reach for.

Here's the plan I committed to at the start:

  • Phase 1: cleanup and consistency — get to 0 warnings.
  • Phase 2: API reliability — a declarative layer API so users don't hand-manage HashMap<String, Shape>.
  • Phase 3: GPU completeness — every CPU op should have a WGSL twin.
  • Phase 5: ecosystem — RoPE done properly, Slice/Concat primitives, CI.
  • Phase 6: pre-release polish — fmt, clippy, docs.rs, CNN example.

(Phase 4 — performance — is intentionally deferred to v0.5. I'll explain.)

Phase 1 — 121 warnings to zero

The first thing I did was run cargo build --release --all-targets. The
output scrolled for a full screen:

warning: `rustyasg` (bin "rustyasg") generated 121 warnings (14 duplicates)
warning: `rustyasg` (lib) generated 21 warnings

142 warnings feels like somebody stopped caring somewhere, but what I
actually found was amusing: src/main.rs started with

mod analysis;
mod asg;
mod autograd;
mod gui_viewer;
// ... seven more lines

That's the binary recompiling the entire library as a separate crate. So
every pub struct in nn/ that main.rs didn't touch became "never used" —
and there are a lot of them. Fix: replace those mod declarations with
use rustyasg::*. ~100 false-positive warnings gone in one diff.

The rest was actual cleanup: the deprecated rand::thread_rng → rand::rng rename, unused imports, a let minus_one = lit_scalar(-1.0) that was never read, and the rotate_half function in RoPE, which turned out to be a stub that just added cos as a bias. I couldn't fix it at this point (it needed the Slice/Concat ops), but I added a big doc-comment calling it out so nobody would accidentally ship it to production:

/// **STUB IMPLEMENTATION.** Full RoPE requires Slice/Concat operations
/// to split head_dim into pairs. Current code just adds `cos` as a bias —
/// this is **not** mathematically correct RoPE and must be fixed before
/// production use.
fn rotate_half(&self, x: &Tensor, _seq_offset: usize) -> Tensor { ... }

At the end of Phase 1 cargo build --release --all-targets was clean and I
had one open wound in RoPE to close later.

Phase 2 — the declarative layer API

This was the change I was most worried about, because it was a breaking change for every v0.2 user. Look at the main.rs I inherited:

let layer1 = Linear::new(&ctx, "layer1");
// ...
let mut initial_shapes = HashMap::new();
initial_shapes.insert("x".to_string(), (vec![4, 2], DType::F32));
initial_shapes.insert("layer1.weights".to_string(), (vec![2, 8], DType::F32));
initial_shapes.insert("layer1.bias".to_string(),    (vec![1, 8], DType::F32));
// ...
if name.contains("w_q.weights") || name.contains("w_k.weights") || ... {
    shape = vec![embed_dim, embed_dim];
} else if name.contains("linear1.weights") {
    shape = vec![embed_dim, ff_hidden_dim];
}

That's string-matching to figure out parameter shapes. When I saw
name.contains("w_q") in the binary I knew what Phase 2 had to be.

The fix: layers should own their shape information. Every nn::*
constructor takes dimensions, and the layer self-registers with
GraphContext:

pub fn new(ctx: &Rc<RefCell<GraphContext>>, name: &str, in_f: usize, out_f: usize) -> Self {
    let weights = Tensor::new_parameter_with_shape(
        ctx,
        &format!("{name}.weights"),
        vec![in_f, out_f],
        Initializer::XavierUniform,
    );
    // bias too...
}

Behind the scenes, GraphContext got a new parameter_meta: HashMap<String, ParameterMeta>, where ParameterMeta carries shape, dtype, and an Initializer. Then two helper methods close the loop:

ctx.build_shape_map(&input_shapes)     // → feeds ShapeInference
ctx.init_parameters(&mut runtime_data) // → samples weights
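To make that concrete, here's a minimal sketch of what such a registry could look like. All the names below (ParameterMeta's fields, register_parameter, the enum variants) are my assumptions for illustration, not the crate's actual definitions:

```rust
use std::collections::HashMap;

// Hypothetical initializer enum; the real crate has nine variants.
#[derive(Clone, Debug)]
pub enum Initializer {
    Zeros,
    XavierUniform,
    Normal { mean: f64, std: f64 },
}

// What each parameter registers about itself: shape plus how to sample it.
#[derive(Clone, Debug)]
pub struct ParameterMeta {
    pub shape: Vec<usize>,
    pub init: Initializer,
}

#[derive(Default)]
pub struct GraphContext {
    pub parameter_meta: HashMap<String, ParameterMeta>,
}

impl GraphContext {
    // Called by each layer constructor to self-register its parameters,
    // so no user code ever builds a shape map by hand.
    pub fn register_parameter(&mut self, name: &str, shape: Vec<usize>, init: Initializer) {
        self.parameter_meta
            .insert(name.to_string(), ParameterMeta { shape, init });
    }
}
```

A layer constructor like Linear::new would call register_parameter once per tensor, which is exactly the self-registration described above: build_shape_map then reads the registry instead of a user-supplied HashMap.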

I also wrote nn::init with nine standard initializers — Zeros, Ones,
Constant, Uniform, Normal, Xavier (uniform/normal), Kaiming (uniform/normal) —
because if layers are going to pick initializers on the user's behalf, the
defaults should be good defaults. Xavier-uniform for Linear, Kaiming-uniform
for Conv2d, Ones/Zeros for LayerNorm gamma/beta, Normal(0, 0.02) for
embeddings (GPT-2 conventions).
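For reference, the standard bounds behind the two uniform defaults (these are the textbook formulas; the crate's exact code may differ):

```rust
// Xavier/Glorot uniform: sample from U(-b, b) with b = sqrt(6 / (fan_in + fan_out)).
fn xavier_uniform_bound(fan_in: usize, fan_out: usize) -> f64 {
    (6.0 / (fan_in + fan_out) as f64).sqrt()
}

// Kaiming/He uniform for ReLU: b = gain * sqrt(3 / fan_in) with gain = sqrt(2),
// i.e. b = sqrt(6 / fan_in).
fn kaiming_uniform_bound(fan_in: usize) -> f64 {
    (6.0 / fan_in as f64).sqrt()
}
```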

The user-facing difference is shocking. main.rs used to be 270 lines with
a hand-rolled 50-line shape-dispatch block. After Phase 2 it's 225 lines,
and the shape dispatch is one line:

ShapeInference::run_with_context(&mut graph, &ctx.borrow(), &input_shapes)?;

The XOR example dropped from 275 lines to 190 lines with no loss of clarity.
The string-matching in the binary is gone entirely.

Obviously this broke every single example, every single test. I rewrote them
all. That's what v0.3.0 is: one coherent breaking change, one SemVer bump.

Phase 3 — GPU completeness

At the start of Phase 3, if you ran cargo run --release -- --gpu on the
TransformerBlock demo, you got:

thread 'main' panicked:
  UnimplementedOperation("node type not supported on GPU:
                          LayerNorm { input: 0, gamma: 10, beta: 11, eps: 1e-5 }")

The README had been saying "✅ GPU backend" for months. It was lying by
omission. LayerNorm is not a composite op on GPU — it's a specialized
NodeType — and nobody had written the WGSL shaders.

So I wrote them. Four shaders for LayerNorm alone:

  • LayerNorm (forward): one worker per row.
  • LayerNormBackward (∂L/∂x): the formula is dx = inv_std * (dy·γ - mean(dy·γ) - x_norm·mean(dy·γ·x_norm)) — two reduction passes per row before the final output.
  • LayerNormGradGamma: parallelize over columns, each worker scans all rows.
  • LayerNormGradBeta: simple column-wise sum.

I added a helper dispatch_rowwise next to the existing dispatch_shader,
because LayerNorm's "one thread = one row, iterate the reduction axis
internally" pattern keeps showing up (and it did — in EmbeddingGrad later).
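As a sanity check on that backward formula, here's a plain-Rust reference for the input gradient of a single row. This is my own illustration of the math, not the crate's CPU path:

```rust
// Reference LayerNorm input gradient for one row:
// dx = inv_std * (dy*g - mean(dy*g) - x_norm * mean(dy*g*x_norm))
fn layer_norm_dx(x: &[f32], dy: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    let x_norm: Vec<f32> = x.iter().map(|v| (v - mean) * inv_std).collect();
    let dyg: Vec<f32> = dy.iter().zip(gamma).map(|(d, g)| d * g).collect();
    // The two reduction passes mentioned above:
    let m1 = dyg.iter().sum::<f32>() / n;
    let m2 = dyg.iter().zip(&x_norm).map(|(d, xn)| d * xn).sum::<f32>() / n;
    x_norm
        .iter()
        .zip(&dyg)
        .map(|(xn, d)| inv_std * (d - m1 - xn * m2))
        .collect()
}
```

One quick property this formula has: with constant dy and gamma = 1, the gradient collapses to zero, because LayerNorm's output is invariant to a uniform shift of its input.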

Then came the avalanche. Each new op followed roughly the same pattern:

| Operation | Shaders added |
| --- | --- |
| Conv2dBackwardInput, Conv2dBackwardWeight | 2 |
| MaxPool2d, MaxUnpool2d (backward) | 2 |
| AvgPool2d, AvgUnpool2d, AdaptiveAvgPool2d | 3 |
| Embedding, EmbeddingGrad | 2 |
| ConvTranspose2d | 1 (+ bias shader) |

Each with a matching CPU-vs-GPU parity test. The trickiest was
MaxUnpool2d: naïvely you'd scatter from grad_output back to the positions
of max values, but WGSL doesn't have atomic f32, so concurrent scatter is a
data race. I worked around it by parallelizing over input positions — for
each (n, c, ih, iw), find which windows cover it, recompute argmax in each
of those windows, and accumulate only if we're the argmax. O(kH²·kW²) per
element, which is fine for typical 2×2 / 3×3 kernels.
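Here's the same gather strategy reduced to one dimension. This is my simplification for illustration (the real kernel is 2-D WGSL, and the function name is made up):

```rust
// Gradient of max-pooling w.r.t. its input, gather-style: instead of
// scattering grad_out to argmax positions (a data race when parallel),
// each *input* position checks every window that covers it, recomputes
// the argmax, and accumulates only if it wins.
fn max_pool1d_backward(input: &[f32], grad_out: &[f32], k: usize, stride: usize) -> Vec<f32> {
    (0..input.len())
        .map(|i| {
            let mut acc = 0.0;
            for w in 0..grad_out.len() {
                let start = w * stride;
                if i < start || i >= start + k {
                    continue; // this window doesn't cover position i
                }
                // Recompute the argmax inside the window.
                let argmax = (start..(start + k).min(input.len()))
                    .max_by(|&a, &b| input[a].partial_cmp(&input[b]).unwrap())
                    .unwrap();
                if argmax == i {
                    acc += grad_out[w]; // we're the max: take the gradient
                }
            }
            acc
        })
        .collect()
}
```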

After Phase 3:

Epoch  1, Loss: 9.347217
Epoch  2, Loss: 1.233081
...
Epoch 15, Loss: 0.000002
--- TRAINING COMPLETE in 576ms ---

TransformerBlock trains on GPU end-to-end. 42 parity tests (now 46) verify
every GPU op matches the CPU reference to 1e-5.
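A parity test of that kind ultimately reduces to an elementwise tolerance check along these lines (helper name and shape assumed, not the crate's actual test code):

```rust
// Compare a GPU result against the CPU reference, elementwise.
fn assert_close(a: &[f32], b: &[f32], tol: f32) {
    assert_eq!(a.len(), b.len(), "length mismatch");
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        assert!((x - y).abs() <= tol, "mismatch at {i}: {x} vs {y}");
    }
}
```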

Phase 5 — closing the RoPE wound, and ecosystem

The stub RoPE from Phase 1 had been eating at me. Fixing it required adding
three new primitives:

  • NodeType::Slice { input, axis, start, end } — the obvious building block.
  • NodeType::Concat { inputs, axis } — the dual.
  • NodeType::SliceBackward { grad_output, axis, start, full_size } — for the gradient of Slice. Concretely: zero-pad grad_output back to the original shape.
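Reduced to one dimension, SliceBackward is just this (an illustrative sketch, not the crate's implementation):

```rust
// Gradient of Slice: the grad for the sliced region goes back to its
// original offset; everything outside the slice gets zero.
fn slice_backward_1d(grad: &[f32], start: usize, full_size: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; full_size];
    out[start..start + grad.len()].copy_from_slice(grad);
    out
}
```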

Full coverage means: NodeType + shape inference + CPU impl + GPU WGSL +
autograd. For Concat I punted on a pure-GPU implementation (would need a
multi-input kernel with dynamic strides) — the GPU path reads the inputs
back to CPU, concatenates via ndarray, and re-uploads. It's slow, but
correct, and RoPE only concatenates a few tensors of moderate size.

With Slice/Concat in place, rotate_half becomes the textbook split-half:

let x1 = x.slice(3, 0, half_dim);
let x2 = x.slice(3, half_dim, self.head_dim);

let rot1 = &(&x1 * &cos_tensor) - &(&x2 * &sin_tensor);
let rot2 = &(&x1 * &sin_tensor) + &(&x2 * &cos_tensor);

rot1.concat(&[&rot2], 3)

Mathematically correct. End-to-end differentiable. Zero stubs.
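One way to see that the split-half form is a genuine rotation: each (x1, x2) pair gets rotated by an angle theta, which preserves the pair's norm. A scalar sketch of that identity (illustration only, not the tensor code above):

```rust
// Rotate one (x1, x2) pair by theta, as RoPE does per frequency.
fn rope_rotate_pair(x1: f32, x2: f32, theta: f32) -> (f32, f32) {
    let (s, c) = theta.sin_cos();
    (x1 * c - x2 * s, x1 * s + x2 * c)
}
```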

Also in Phase 5: GitHub Actions CI, CHANGELOG.md in Keep-a-Changelog
format, and CONTRIBUTING.md documenting every place you have to edit when
adding a new NodeType (six, it turns out).

Phase 6 — the polish that makes or breaks a release

This is the unglamorous phase, but it's the one that separates "published a
crate" from "published a crate that people use."

cargo fmt --all -- --check: 28 files reformatted.

cargo clippy --all-targets -- -D warnings: 33 warnings in the lib alone
when I started. Most were mechanical (div_ceil, assign_op), and
cargo clippy --fix --allow-dirty chewed through them. For three of them —
too_many_arguments, type_complexity, should_implement_trait — the
clippy suggestion was worse than the original, so I allow'd them at crate
level with a comment explaining why. The library is now clippy-clean under
-D warnings, and so are the tests, binary, and examples (modulo one
#![allow(clippy::if_same_then_else)] in the MNIST example where the
"identical blocks" are part of a 0–9 pattern-generation lookup).

Strict rustdoc: RUSTDOCFLAGS="-D rustdoc::broken_intra_doc_links" caught
ten broken links, all of the form [N, C_in, H, W] in doc-comments where
I'd written tensor shape notation inside markdown link syntax. Fix: wrap
shape notations in backticks: `[N, C_in, H, W]`.

Cargo.toml for docs.rs:

[package]
# ...
exclude = ["logo.png", "target/*", ".github/*", "*.log"]

[package.metadata.docs.rs]
all-features = true
rustdoc-args = ["--cfg", "docsrs"]

[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
strip = "debuginfo"

The exclude is important — the published crate is 120 KB instead of 1.3 MB.
Users don't need the logo to compile.

A real CNN example. Up to this point every example was an MLP, which made
all the Conv2d work feel theoretical. I wrote examples/cnn_classifier.rs:

Input [N, 1, 8, 8]
  → Conv2d(1→8, 3×3, pad=1) → ReLU → AvgPool2d(2×2)    → [N, 8, 4, 4]
  → Conv2d(8→16, 3×3, pad=1) → ReLU → AdaptiveAvgPool2d → [N, 16, 1, 1]
  → reshape → Linear(16→8) → ReLU → Linear(8→3)         → [N, 3]

Trained with Adam on a 3-class synthetic dataset, it converges to 100% test accuracy in under a second. It's the first real exercise of Conv2dBackwardInput, Conv2dBackwardWeight, AvgUnpool2d, and AdaptiveAvgPool2d in an actual training loop.
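The shapes in that pipeline follow from the usual conv/pool arithmetic. With stride-1 convolutions and stride-equals-kernel pooling, the bookkeeping is just:

```rust
// Output height of a stride-1 Conv2d: H' = H + 2*pad - k + 1.
fn conv2d_out(h: usize, k: usize, pad: usize) -> usize {
    h + 2 * pad - k + 1
}

// Output height of pooling with stride == kernel size.
fn pool2d_out(h: usize, k: usize) -> usize {
    h / k
}
```

So 8 → conv(3, pad 1) → 8 → pool(2) → 4 → conv(3, pad 1) → 4 → adaptive pool → 1, matching the diagram above.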

The last boss: CI on three platforms

After all of that, I pushed to GitHub and got this from Actions:

  • cargo fmt (Ubuntu)
  • cargo clippy (-D warnings) (Ubuntu) — exit code 101
  • cargo doc (Ubuntu)
  • test (ubuntu-latest)
  • test (macos-latest)
  • test (windows-latest) — exit code 1

The clippy one was predictable in retrospect: my local rustc was 1.89.0,
but dtolnay/rust-toolchain@stable on CI was grabbing whatever stable
pointed at on the morning of the build. New rustc, new clippy lints, new
-D warnings failures. Fix: pin the toolchain via env var:

env:
  RUST_TOOLCHAIN: "1.89.0"

Then every job references dtolnay/rust-toolchain@master with that pinned toolchain. Now CI is deterministic.
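Concretely, a pinned job ends up looking something like this. The step layout is my assumption; only the env-var pinning is from the post:

```yaml
env:
  RUST_TOOLCHAIN: "1.89.0"

jobs:
  clippy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@master
        with:
          toolchain: ${{ env.RUST_TOOLCHAIN }}
      - run: cargo clippy --all-targets -- -D warnings
```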

The Windows failure was more interesting. Two tests passed on Ubuntu and
macOS but failed on Windows:

#[test]
fn test_save_load_checkpoint() {
    let path = "test_checkpoint_dir";
    save_checkpoint(path, &checkpoint).unwrap();
    // ...
    fs::remove_dir_all(path).ok();
}

cargo test runs tests in parallel by default, and two tests opening the same relative path in the same process's working directory is a race.
Windows' filesystem is stricter than Linux about concurrent deletion — Linux
will usually let you remove a directory while another handle is open, Windows
tells you to go away. The test passed on Ubuntu because the race was benign;
it failed on Windows because it wasn't.

Fix: temp directory with a unique suffix.

let path = std::env::temp_dir().join(format!(
    "rustyasg_ckpt_{}_{}",
    std::process::id(),
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_nanos()
));

Plus --test-threads=1 in CI as defence-in-depth. Now all three platforms
pass.

Going international: two READMEs

Until last week the README was in Russian, because I'm Russian and I never
expected anyone else to read it. That's the kind of default that quietly
keeps a project from getting discovered.

Fix: make the primary README.md English — docs.rs, crates.io, GitHub's
front page. Then mirror it as README.ru.md for Russian readers. Both
reference each other in the first lines.

Same treatment for every file in src/. I delegated the code translation
to a subagent with explicit instructions — preserve formatting, preserve
identifiers, only rewrite strings and comments — and verified with

grep -rP "[\x{0400}-\x{04FF}]" --include="*.rs" .

The only Cyrillic left in the repo is in README.ru.md (intentional) and
a stale target/package/0.2.0/ artifact (not published, not tracked).

docs.rs docs are now professional English. That matters more than I
expected it would.

The numbers

| Metric | Before | After |
| --- | --- | --- |
| Warnings | 142 | 0 |
| Tests | 85 + 26 + 8 = 119 | 87 + 46 + 8 = 141 |
| GPU ops supported | ~25 basic + Conv2d fwd | +11 (LayerNorm, Conv2d bwd, pool, embedding, ConvTranspose2d, Slice/Concat/SliceBackward) |
| Lines in main.rs | 270 | 225 |
| String-matching in binary | yes | zero |
| Layer constructors | Linear::new(ctx, name) | Linear::new(ctx, name, in, out) |
| RoPE | stub (+ cos as bias) | correct split-half |
| CI | none | 4 jobs, 3 OSes, strict fmt/clippy/doc |
| README | Russian only | English + Russian mirror |
| Published crate size | 1.3 MB (with logo) | ~120 KB |
| TransformerBlock loss (15 epochs) | — | 0.000002 on GPU |
| CNN example accuracy | (none existed) | 100% on 3-class synthetic |

Publishing

I'm tagging this one v0.3.1 and pushing to crates.io. The command list is
underwhelming given how much work led up to it:

# Ensure everything is clean.
cargo fmt --all -- --check
cargo clippy --release --all-targets -- -D warnings
cargo test --release --lib --tests --test grad_check -- --test-threads=1
RUSTDOCFLAGS="-D rustdoc::broken_intra_doc_links" cargo doc --lib --no-deps

# Commit + tag.
git add -A
git commit -m "Release v0.3.1"
git tag -a v0.3.1 -m "v0.3.1"
git push origin master --tags

# Dry-run — builds the crate archive and compiles it in isolation.
# Catches anything that works on your machine but would fail from scratch.
cargo publish --dry-run

# Ship.
cargo publish

Then docs.rs picks it up automatically and rebuilds documentation at
https://docs.rs/rustyasg/0.3.1.

What's next (v0.5 — performance)

I deliberately deferred Phase 4. The reason is boring and correct: you can't
optimise what you haven't measured, and you shouldn't measure what isn't
correct yet. v0.3.1 is correct. v0.5 will be the performance release:

  • GPU buffer pool. Currently every training step allocates fresh buffers for every intermediate tensor. An arena that reuses allocations should be a big win — but I want to benchmark before and after with criterion so I can quote real numbers instead of vibes.
  • Kernel fusion. MatMul + Bias + Activation is a three-kernel dance that should be one WGSL shader. Detection + code generation for the fusion pass is a chunk of compiler work.
  • Mixed precision (f16) with loss scaling.
  • Inference-only mode. Skip autograd graph construction when the user just wants predictions. The graph-to-graph design makes this almost free — just don't build the gradient graph.
  • Tiny GPT example. Needs causal masking + proper multi-batch, which needs a working inference path first.

That's v0.5. And then v1.0 adds ONNX export, WebAssembly, a real model zoo,
and a much-improved visualiser.

Reflection

The thing I keep relearning: the distance between "code that works" and
"code someone else can use" is enormous and almost entirely about things
nobody celebrates. Fixing warnings. Writing CONTRIBUTING.md. Choosing a
temp path instead of a hardcoded one. Pinning a toolchain version.
Translating a README.

If you're thinking about open-sourcing a project, my honest advice: do Phase 6 first.
Get the strict CI green, get docs.rs building, get the README
to be the first thing you actually want someone to read. Then write the
code. Your future self will thank you — and so will everybody who finds the
crate on a search.

Code is at https://github.com/Xzdes/RustyAsg. Crate is at
https://crates.io/crates/rustyasg. Issues and PRs welcome.

Part 4 will be the performance push. See you there.

Source: dev.to
