Getting ONNX Runtime CUDA Working on NVIDIA Blackwell (GX10/DGX Spark)
Or: How I spent 12 hours discovering that nobody ships GPU inference binaries for NVIDIA's own hardware
TL;DR
NVIDIA's DGX Spark (GX10) ships with a Grace CPU (ARM64) and GB10 GPU (Blackwell, sm_121). As of April 2026, no prebuilt ONNX Runtime GPU binary exists for this platform — not from Microsoft, not from PyPI, not from the Rust ecosystem. Here's how I built one from source and got CUDA-accelerated embedding inference running. The prebuilt binaries and full build instructions are published at https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell.
The Setup
I'm building a Rust application that uses snowflake-arctic-embed-m-v2.0 for semantic search — a 768-dimension embedding model running via ONNX inference. On my laptop (ARM64, CPU-only via tract-onnx), each embedding took ~3,400ms. The GX10's Blackwell GPU should crush this.
Step 1: Swap tract-onnx for the ort crate (Easy)
The ort Rust crate wraps ONNX Runtime with CUDA support. I extracted an EmbeddingProvider trait, implemented TractProvider and OrtProvider, wired a --features ort-cuda flag. Clean refactor, 26 files, all tests passing.
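The refactor above can be sketched as follows. The trait and provider names come from my code; the method signature and the stub body are illustrative assumptions (the real implementations wrap tract-onnx and ort sessions):

```rust
// Sketch of the provider abstraction described above. `EmbeddingProvider`,
// `TractProvider`, and `OrtProvider` are the names from the refactor; the
// signature and stub body are illustrative, not the real implementation.
pub trait EmbeddingProvider {
    /// Embed a text into a 768-dimension vector (snowflake-arctic-embed-m-v2.0).
    fn embed(&self, text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>>;
}

/// Stub standing in for the real tract-onnx-backed implementation.
pub struct TractProvider;

impl EmbeddingProvider for TractProvider {
    fn embed(&self, _text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
        // Real code runs the ONNX graph; here we just return a zero vector
        // of the right dimensionality.
        Ok(vec![0.0; 768])
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Call sites depend only on the trait, so the backend can be swapped
    // behind a cargo feature flag like `--features ort-cuda`.
    let provider: Box<dyn EmbeddingProvider> = Box::new(TractProvider);
    let v = provider.embed("hello")?;
    println!("{}", v.len());
    Ok(())
}
```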
Immediate result: 3,400ms → 135ms — even on CPU. ORT's CPU backend is far better optimised than tract's pure-Rust implementation. 25x improvement before even touching the GPU.
Step 2: Where Are the GPU Binaries? (The Wall)
The ort crate auto-downloads prebuilt ONNX Runtime binaries. On aarch64 + CUDA 13:
[ort-sys] [WARN] no prebuilt binaries available on this platform
for combination of features 'cu13'
Checked every source:
- pyke.io (ort crate's CDN): No aarch64 + CUDA builds at all
- Microsoft GitHub releases: No aarch64 GPU tarballs
- PyPI (onnxruntime-gpu): "No matching distribution found" for aarch64
- NVIDIA apt repos: Has cuDNN and CUDA, but no onnxruntime package
- NVIDIA AI Workbench: Not bundled
NVIDIA sells hardware that their own ML ecosystem doesn't have prebuilt inference binaries for.
Step 3: Build from Source (The Gauntlet)
Attempt 1: ORT v1.20.1
- Eigen hash mismatch: GitLab changed the zip archive, breaking the SHA1 check. Fix: pre-clone Eigen via git and set FETCHCONTENT_SOURCE_DIR_EIGEN.
- thrust::unary_function removed: CUDA 13's CCCL/Thrust removed this class. ORT v1.20.1 uses it. Dead end: we need a newer ORT.
Attempt 2: ORT v1.24.4
- compute_53 unsupported: CUDA 13 dropped old GPU architectures. Fix: CMAKE_CUDA_ARCHITECTURES=121.
- sm_120 vs sm_121: I initially guessed sm_120 for Blackwell. Wrong: the GB10 is compute capability 12.1, discovered via nvidia-smi --query-gpu=compute_cap --format=csv,noheader. Cost: one full rebuild (~40 minutes).
- Success: ORT v1.24.4 built clean with sm_121. Total build time: ~40 minutes on 20 cores.
Step 4: Dynamic Loading vs Static Linking (The Trap)
With the built .so, I tried two approaches:
Static linking (ORT_LIB_LOCATION): The ort-sys build script ignored it and downloaded the CPU-only binary anyway. The env var didn't propagate through Cargo's build process reliably.
Dynamic loading (load-dynamic feature + ORT_DYLIB_PATH): Loaded the correct library. CUDA provider plugin loaded. All symbols resolved. EP registered without error. But I couldn't tell whether CUDA was actually active.
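For reference, the dynamic-loading setup amounts to a feature flag plus an environment variable. A sketch of the Cargo.toml fragment, assuming an ort 2.x release (feature names and the version pin may differ for the release you use):

```toml
# Cargo.toml: opt into dynamic loading instead of ort-sys's auto-download.
# Feature names are from ort 2.x; pin the exact version/rc you build against.
[dependencies]
ort = { version = "2", features = ["load-dynamic", "cuda"] }
```

At runtime, point ORT_DYLIB_PATH at the libonnxruntime.so you built from source before launching the binary.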
Step 5: The Debugging Trap
At this point I spent hours convinced CUDA wasn't working. The ort crate's EP registration appeared to silently fail — inference timing looked the same with and without CUDA. I even wrote a 20-line unsafe FFI workaround to call the legacy OrtSessionOptionsAppendExecutionProvider_CUDA function directly.
The actual problem? I didn't have tracing-subscriber initialised. The ort crate logs all EP registration events via the tracing crate. Without a subscriber, RUST_LOG does nothing — zero output, zero feedback on whether CUDA is active.
Once I added tracing-subscriber:
INFO ort::ep: Successfully registered `CUDAExecutionProvider`
TRACE: Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 455
INFO: Creating BFCArena for Cuda ...
INFO: cuDNN version: 92000
CUDA was working the whole time — through the crate's native ort::ep::CUDA registration. The FFI workaround was unnecessary.
The definitive proof:
- With GPU: 148ms
- With CUDA_VISIBLE_DEVICES="" (GPU hidden): 3,279ms
Lesson learned: If you're using the ort crate and can't tell whether your EP is active, add tracing-subscriber before anything else. It's not optional for debugging.
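Concretely, the fix is a few lines early in main. This uses tracing-subscriber's standard fmt API; the env-filter feature must be enabled for RUST_LOG to be honoured:

```rust
fn main() {
    // Install a global subscriber so ort's `tracing` events (EP registration,
    // node placement, arena allocation) actually reach stderr. Requires the
    // `tracing-subscriber` crate with its "env-filter" feature; then
    // RUST_LOG=ort=debug works as expected.
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    // ... rest of the application: ort session creation, inference, etc.
}
```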
Verification
With RUST_LOG=ort=debug, ORT confirms CUDA activation:
INFO ort::logging: Creating BFCArena for Cuda ...
INFO ort::logging: Extending BFCArena for Cuda. bin_num:20 num_bytes: 768147456
768MB allocated on GPU for model weights. Only lightweight shape ops fall back to CPU — standard ORT optimisation behaviour.
Step 6: INT8 vs FP32
The quantized (INT8) model doesn't have CUDA kernels for sm_121. All ops fall back to CPU, making it slower than ORT's CPU path with the FP32 model. Solution: use the FP32 model for CUDA, keep the quantized model for the CPU-only tract backend.
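In code, the split described above is just a match on the backend. A minimal sketch (the enum and file names are my illustration, not the post's actual code):

```rust
// Pick the model artifact per backend: FP32 for CUDA (the INT8 model has no
// sm_121 kernels and would fall back to CPU), INT8 for the CPU-only tract
// path. Enum and file names are illustrative.
#[derive(Clone, Copy)]
enum Backend {
    TractCpu,
    OrtCuda,
}

fn model_file(backend: Backend) -> &'static str {
    match backend {
        // INT8 ops would all run on CPU under CUDA, so use FP32 there.
        Backend::OrtCuda => "model_fp32.onnx",
        // The quantized model is smaller and fine for the pure-CPU backend.
        Backend::TractCpu => "model_int8.onnx",
    }
}

fn main() {
    println!("{}", model_file(Backend::OrtCuda));
    // prints: model_fp32.onnx
}
```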
Summary
| Configuration | Time per embedding |
|---|---|
| tract-onnx CPU (before) | 3,400ms |
| ORT CPU (ort crate, no GPU) | 135ms |
| ORT CUDA (GB10) | 148ms cold |
| ORT CPU (CUDA disabled) | 3,279ms |
The 135ms → 148ms difference on cold start is misleading — model loading dominates. In a long-running server with warm sessions, CUDA inference should be significantly faster than CPU.
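To separate load cost from steady-state inference, time the first call and subsequent calls independently. A self-contained sketch of the measurement pattern (the simulated load/infer delays stand in for real session creation and embedding):

```rust
use std::time::{Duration, Instant};

// Simulated provider: an expensive one-time "load" followed by cheap calls,
// mimicking ORT session creation vs warm inference. Delays are stand-ins,
// not real measurements.
struct Provider {
    loaded: bool,
}

impl Provider {
    fn new() -> Self {
        Provider { loaded: false }
    }

    fn embed(&mut self) {
        if !self.loaded {
            std::thread::sleep(Duration::from_millis(100)); // model load
            self.loaded = true;
        }
        std::thread::sleep(Duration::from_millis(5)); // inference proper
    }
}

fn main() {
    let mut p = Provider::new();

    let t0 = Instant::now();
    p.embed();
    let cold = t0.elapsed(); // includes load time

    let t1 = Instant::now();
    p.embed();
    let warm = t1.elapsed(); // inference alone

    println!("cold: {:?}, warm: {:?}", cold, warm);
}
```

Benchmarking warm calls this way is what shows the real CPU-vs-CUDA gap that the cold-start numbers in the table obscure.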
What Needs to Happen
- Microsoft: Publish aarch64 Linux GPU release artifacts for ONNX Runtime
- pyke.io: Add aarch64 + CUDA prebuilts to the ort crate's download cache
- ort crate: Document that tracing-subscriber is required to see EP registration status. Without it, debugging execution providers is nearly impossible.
- NVIDIA: Your flagship developer workstation doesn't have prebuilt ML inference binaries. Fix that.
The prebuilt binaries and complete build instructions are available at https://github.com/Albatross1382/onnxruntime-aarch64-cuda-blackwell.