Getting ONNX Runtime CUDA Working on NVIDIA Blackwell (GX10/DGX Spark)
Or: How I spent 12 hours discovering that nobody ships GPU inference binaries for NVIDIA's own hardware
TL;DR
NVIDIA's DGX Spark (GX10) ships with a Grace CPU (ARM64) and GB10 GPU (Blackwell, sm_121). As of April 2026, no prebuilt ONNX Runtime GPU binary exists for this platform — not from Microsoft, not from PyPI, not from the Rust ecosystem. Here's how I built one from source and got CUDA-accelerated embedding inference running. The prebuilt binaries and full build instructions are published at https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell.
The Setup
I'm building a Rust application that uses snowflake-arctic-embed-m-v2.0 for semantic search — a 768-dimension embedding model running via ONNX inference. On my laptop (ARM64, CPU-only via tract-onnx), each embedding took ~3,400ms. The GX10's Blackwell GPU should crush this.
Step 1: Swap tract-onnx for the ort crate (Easy)
The ort Rust crate wraps ONNX Runtime with CUDA support. I extracted an EmbeddingProvider trait, implemented TractProvider and OrtProvider, wired a --features ort-cuda flag. Clean refactor, 26 files, all tests passing.
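The refactor above can be sketched as follows. The trait and provider names come from my code; the method signature and the stub body are illustrative assumptions (the real implementations wrap tract-onnx and ort sessions):

```rust
// Sketch of the provider abstraction described above. `EmbeddingProvider`,
// `TractProvider`, and `OrtProvider` are the names from the refactor; the
// signature and stub body are illustrative, not the real implementation.
pub trait EmbeddingProvider {
    /// Embed a text into a 768-dimension vector (snowflake-arctic-embed-m-v2.0).
    fn embed(&self, text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>>;
}

/// Stub standing in for the real tract-onnx-backed implementation.
pub struct TractProvider;

impl EmbeddingProvider for TractProvider {
    fn embed(&self, _text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
        // Real code runs the ONNX graph; here we just return a zero vector
        // of the right dimensionality.
        Ok(vec![0.0; 768])
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Call sites depend only on the trait, so the backend can be swapped
    // behind a cargo feature flag like `--features ort-cuda`.
    let provider: Box<dyn EmbeddingProvider> = Box::new(TractProvider);
    let v = provider.embed("hello")?;
    println!("{}", v.len());
    Ok(())
}
```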
Immediate result: 3,400ms → 135ms — even on CPU. ORT's CPU backend is far better optimised than tract's pure-Rust implementation. 25x improvement before even touching the GPU.
Step 2: Where Are the GPU Binaries? (The Wall)
The ort crate auto-downloads prebuilt ONNX Runtime binaries. On aarch64 + CUDA 13:
[ort-sys] [WARN] no prebuilt binaries available on this platform
for combination of features 'cu13'
Checked every source:
- pyke.io (ort crate's CDN): No aarch64 + CUDA builds at all
- Microsoft GitHub releases: No aarch64 GPU tarballs
- PyPI (onnxruntime-gpu): "No matching distribution found" for aarch64
- NVIDIA apt repos: Has cuDNN and CUDA, but no onnxruntime package
- NVIDIA AI Workbench: Not bundled
NVIDIA sells hardware that their own ML ecosystem doesn't have prebuilt inference binaries for.
Step 3: Build from Source (The Gauntlet)
Attempt 1: ORT v1.20.1
- Eigen hash mismatch: GitLab changed the zip archive, breaking the SHA1 check. Fix: pre-clone Eigen via git and set FETCHCONTENT_SOURCE_DIR_EIGEN.
- thrust::unary_function removed: CUDA 13's CCCL/Thrust removed this class. ORT v1.20.1 uses it. Dead end: we need a newer ORT.
Attempt 2: ORT v1.24.4
- compute_53 unsupported: CUDA 13 dropped old GPU architectures. Fix: CMAKE_CUDA_ARCHITECTURES=121.
- sm_120 vs sm_121: I initially guessed sm_120 for Blackwell. Wrong: the GB10 is compute capability 12.1, discovered via nvidia-smi --query-gpu=compute_cap --format=csv,noheader. Cost: one full rebuild (~40 minutes).
- Success: ORT v1.24.4 built clean with sm_121. Total build time: ~40 minutes on 20 cores.
Step 4: Dynamic Loading vs Static Linking (The Trap)
With the built .so, I tried two approaches:
Static linking (ORT_LIB_LOCATION): The ort-sys build script ignored it and downloaded the CPU-only binary anyway. The env var didn't propagate through Cargo's build process reliably.
Dynamic loading (load-dynamic feature + ORT_DYLIB_PATH): Loaded the correct library. CUDA provider plugin loaded. All symbols resolved. EP registered without error. But I couldn't tell whether CUDA was actually active.
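For reference, the dynamic-loading setup amounts to a feature flag plus an environment variable. A sketch of the Cargo.toml fragment, assuming an ort 2.x release (feature names and the version pin may differ for the release you use):

```toml
# Cargo.toml: opt into dynamic loading instead of ort-sys's auto-download.
# Feature names are from ort 2.x; pin the exact version/rc you build against.
[dependencies]
ort = { version = "2", features = ["load-dynamic", "cuda"] }
```

At runtime, point ORT_DYLIB_PATH at the libonnxruntime.so you built from source before launching the binary.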
Step 5: The Debugging Trap
At this point I spent hours convinced CUDA wasn't working. The ort crate's EP registration appeared to silently fail — inference timing looked the same with and without CUDA. I even wrote a 20-line unsafe FFI workaround to call the legacy OrtSessionOptionsAppendExecutionProvider_CUDA function directly.
The actual problem? I didn't have tracing-subscriber initialised. The ort crate logs all EP registration events via the tracing crate. Without a subscriber, RUST_LOG does nothing — zero output, zero feedback on whether CUDA is active.
Once I added tracing-subscriber:
INFO ort::ep: Successfully registered `CUDAExecutionProvider`
TRACE: Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 455
INFO: Creating BFCArena for Cuda ...
INFO: cuDNN version: 92000
CUDA was working the whole time — through the crate's native ort::ep::CUDA registration. The FFI workaround was unnecessary.
The definitive proof:
- With GPU: 148ms
- With CUDA_VISIBLE_DEVICES="" (GPU hidden): 3,279ms
Lesson learned: If you're using the ort crate and can't tell whether your EP is active, add tracing-subscriber before anything else. It's not optional for debugging.
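Concretely, the fix is a few lines early in main. This uses tracing-subscriber's standard fmt API; the env-filter feature must be enabled for RUST_LOG to be honoured:

```rust
fn main() {
    // Install a global subscriber so ort's `tracing` events (EP registration,
    // node placement, arena allocation) actually reach stderr. Requires the
    // `tracing-subscriber` crate with its "env-filter" feature; then
    // RUST_LOG=ort=debug works as expected.
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    // ... rest of the application: ort session creation, inference, etc.
}
```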
Verification
With RUST_LOG=ort=debug, ORT confirms CUDA activation:
INFO ort::logging: Creating BFCArena for Cuda ...
INFO ort::logging: Extending BFCArena for Cuda. bin_num:20 num_bytes: 768147456
768MB allocated on GPU for model weights. Only lightweight shape ops fall back to CPU — standard ORT optimisation behaviour.
Step 6: INT8 vs FP32
The quantized (INT8) model doesn't have CUDA kernels for sm_121. All ops fall back to CPU, making it slower than ORT's CPU path with the FP32 model. Solution: use the FP32 model for CUDA, keep the quantized model for the CPU-only tract backend.
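In code, the split described above is just a match on the backend. A minimal sketch (the enum and file names are my illustration, not the post's actual code):

```rust
// Pick the model artifact per backend: FP32 for CUDA (the INT8 model has no
// sm_121 kernels and would fall back to CPU), INT8 for the CPU-only tract
// path. Enum and file names are illustrative.
#[derive(Clone, Copy)]
enum Backend {
    TractCpu,
    OrtCuda,
}

fn model_file(backend: Backend) -> &'static str {
    match backend {
        // INT8 ops would all run on CPU under CUDA, so use FP32 there.
        Backend::OrtCuda => "model_fp32.onnx",
        // The quantized model is smaller and fine for the pure-CPU backend.
        Backend::TractCpu => "model_int8.onnx",
    }
}

fn main() {
    println!("{}", model_file(Backend::OrtCuda));
    // prints: model_fp32.onnx
}
```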
Summary
| Configuration | Time per embedding |
|---|---|
| tract-onnx CPU (before) | 3,400ms |
| ORT CPU (ort crate, no GPU) | 135ms |
| ORT CUDA (GB10) | 148ms cold |
| ORT CPU (CUDA disabled) | 3,279ms |
The 135ms → 148ms difference on cold start is misleading — model loading dominates. In a long-running server with warm sessions, CUDA inference should be significantly faster than CPU.
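To separate load cost from steady-state inference, time the first call and subsequent calls independently. A self-contained sketch of the measurement pattern (the simulated load/infer delays stand in for real session creation and embedding):

```rust
use std::time::{Duration, Instant};

// Simulated provider: an expensive one-time "load" followed by cheap calls,
// mimicking ORT session creation vs warm inference. Delays are stand-ins,
// not real measurements.
struct Provider {
    loaded: bool,
}

impl Provider {
    fn new() -> Self {
        Provider { loaded: false }
    }

    fn embed(&mut self) {
        if !self.loaded {
            std::thread::sleep(Duration::from_millis(100)); // model load
            self.loaded = true;
        }
        std::thread::sleep(Duration::from_millis(5)); // inference proper
    }
}

fn main() {
    let mut p = Provider::new();

    let t0 = Instant::now();
    p.embed();
    let cold = t0.elapsed(); // includes load time

    let t1 = Instant::now();
    p.embed();
    let warm = t1.elapsed(); // inference alone

    println!("cold: {:?}, warm: {:?}", cold, warm);
}
```

Benchmarking warm calls this way is what shows the real CPU-vs-CUDA gap that the cold-start numbers in the table obscure.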
What Needs to Happen
- Microsoft: Publish aarch64 Linux GPU release artifacts for ONNX Runtime
- pyke.io: Add aarch64 + CUDA prebuilts to the ort crate's download cache
- ort crate: Document that tracing-subscriber is required to see EP registration status. Without it, debugging execution providers is nearly impossible.
- NVIDIA: Your flagship developer workstation doesn't have prebuilt ML inference binaries. Fix that.
The prebuilt binaries and complete build instructions are available at https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell.