We built CleitonForge — a neutral benchmarking layer for quantum simulation written in Rust. The idea is simple: take the same quantum circuit, run it through multiple simulators using an identical canonical IR, and compare. No favorites. No built-in assumptions about which framework is correct.
We plugged in four backends: our own native statevector, quantrs2, roqoqo, and q1tsim. On Bell states, Grover search, QFT, and Bernstein-Vazirani — all four agreed to machine precision. Then we ran QAOA.
Native vs quantrs2 fidelity: 0.00000000.
Not a rounding error. Not a bug in our code. A fundamental sign disagreement in how one framework defines the Rz gate — invisible in standard benchmarks, catastrophic in parameterized quantum algorithms.
The finding: two definitions of the same gate
The Rz(λ) gate is one of the most common in quantum computing. IBM, Qiskit, OpenQASM, and virtually every textbook define it as:
IBM / Qiskit convention — used by: native · roqoqo · q1tsim
$$R_z(\lambda) = \begin{pmatrix} e^{-i\lambda/2} & 0 \\ 0 & e^{+i\lambda/2} \end{pmatrix}$$
Opposite sign convention — used by: quantrs2-core
$$R_z(\lambda) = \begin{pmatrix} e^{+i\lambda/2} & 0 \\ 0 & e^{-i\lambda/2} \end{pmatrix}$$
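The two definitions differ by the sign of the angle: one framework's Rz(λ) is the other's Rz(−λ). In general this is not a global phase, so it changes measurable results whenever the gate acts on a superposition. For λ = 0, both give the identity and you can't tell them apart. At the angles used in QAOA's cost layer, the two conventions drive the circuit to orthogonal quantum states. Fidelity: exactly zero.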
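A short worked example makes the effect concrete. This is our own illustration on a single qubit, not a calculation taken from either framework. Write $R_z^{A}$ for the IBM / Qiskit form and $R_z^{B}$ for the opposite one, so that $R_z^{B}(\lambda) = R_z^{A}(-\lambda)$, and apply both to $|+\rangle$:

$$
\langle + |\, R_z^{A}(\lambda)^{\dagger} R_z^{B}(\lambda) \,| + \rangle
= \langle + |
\begin{pmatrix} e^{+i\lambda} & 0 \\ 0 & e^{-i\lambda} \end{pmatrix}
| + \rangle
= \tfrac{1}{2}\left(e^{+i\lambda} + e^{-i\lambda}\right)
= \cos\lambda
$$

The fidelity between the two outputs is therefore $\cos^2\lambda$: equal to 1 at $\lambda = 0$ and 0 at $\lambda = \pi/2$. On a computational basis state the same overlap is a pure phase, so the fidelity stays at 1, which is why circuits that never rotate a superposition by a non-trivial angle cannot see the difference. In a larger circuit the relevant angle is whatever value the circuit actually passes to Rz, which may be a multiple of the algorithm's own parameter.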
The evidence
We ran every backend through the same QAOA MaxCut circuit (2 qubits, γ = −3π/4, β = −π/8) and measured cross-backend fidelity:
| Pair | Fidelity | Verdict |
|---|---|---|
| native ↔ roqoqo | 1.00000000 | ✅ agree |
| native ↔ q1tsim | 1.00000000 | ✅ agree |
| roqoqo ↔ q1tsim | 1.00000000 | ✅ agree |
| native ↔ quantrs2 | 0.00000000 | ❌ differ |
| roqoqo ↔ quantrs2 | 0.00000000 | ❌ differ |
| q1tsim ↔ quantrs2 | 0.00000000 | ❌ differ |
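The zero in the table can be reproduced in a few lines of plain Python. The sketch below is an assumption-laden toy, not CleitonForge's benchmark: the exact gate sequence of the QAOA circuit is not given here, so we assume the common single-edge MaxCut layout (Hadamards, a cost layer built as CX · Rz(2γ) · CX, then Rx(2β) mixers). The helper names (`apply_1q`, `apply_cx`, `qaoa_state`) are ours.

```python
import cmath
import math

def rz(lam, sign):
    """Rz with a selectable convention: sign=-1 gives diag(e^{-i lam/2}, e^{+i lam/2})."""
    return [[cmath.exp(sign * 1j * lam / 2), 0],
            [0, cmath.exp(-sign * 1j * lam / 2)]]

def rx(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]

R = 1 / math.sqrt(2)
H = [[R, R], [R, -R]]

def apply_1q(state, gate, qubit):
    """Apply a 2x2 gate to one qubit (qubit 0 is the least significant bit)."""
    out = list(state)
    bit = 1 << qubit
    for i in range(len(state)):
        if i & bit == 0:
            a, b = state[i], state[i | bit]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[i | bit] = gate[1][0] * a + gate[1][1] * b
    return out

def apply_cx(state, control, target):
    """Swap amplitudes of basis states that differ in `target` when `control` is set."""
    out = list(state)
    for i in range(len(state)):
        if i & (1 << control):
            out[i] = state[i ^ (1 << target)]
    return out

def qaoa_state(gamma, beta, rz_sign):
    """Assumed 2-qubit, single-edge QAOA circuit (one layer)."""
    state = [1, 0, 0, 0]
    for q in (0, 1):
        state = apply_1q(state, H, q)
    state = apply_cx(state, 0, 1)
    state = apply_1q(state, rz(2 * gamma, rz_sign), 1)
    state = apply_cx(state, 0, 1)
    for q in (0, 1):
        state = apply_1q(state, rx(2 * beta), q)
    return state

def fidelity(a, b):
    overlap = sum(x.conjugate() * y for x, y in zip(a, b))
    return abs(overlap) ** 2

gamma, beta = -3 * math.pi / 4, -math.pi / 8
ibm = qaoa_state(gamma, beta, rz_sign=-1)
opposite = qaoa_state(gamma, beta, rz_sign=+1)
print(f"{fidelity(ibm, ibm):.8f}")       # 1.00000000
print(f"{fidelity(ibm, opposite):.8f}")  # 0.00000000
```

Under this assumed layout the overlap between the two conventions is cos(2γ), which vanishes at γ = −3π/4. A different decomposition of the cost layer would shift the angles at which the fidelity hits zero, but not the mechanism.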
Why Quantum Volume doesn't catch it
We also ran the IBM Quantum Volume benchmark: Haar-random SU(4) circuits, 100 trials per width. All four backends returned identical heavy-output generation (HOG) fractions to 4 decimal places (a width passes when the fraction exceeds 2/3):
| Width n | QV | native | quantrs2 | roqoqo | q1tsim |
|---|---|---|---|---|---|
| n = 2 | — | 0.5355 | 0.5355 | 0.5355 | 0.5355 |
| n = 3 | 8 | 0.7340 ✅ | 0.7340 ✅ | 0.7340 ✅ | 0.7340 ✅ |
| n = 4 | 16 | 0.7991 ✅ | 0.7991 ✅ | 0.7991 ✅ | 0.7991 ✅ |
| n = 5 | 32 | 0.8459 ✅ | 0.8459 ✅ | 0.8459 ✅ | 0.8459 ✅ |
This is not a contradiction; it's the key insight. Quantum Volume uses random angles. When λ is drawn from a uniform distribution over [0, 2π), −λ (mod 2π) follows the same distribution, so the sign flip averages out across the ensemble. The divergence only surfaces with specific, purposeful angles, which are exactly the kind used in QAOA, VQE, and other variational algorithms.
The implication: Standard Clifford benchmarks and random circuit tests are blind to this class of convention disagreement. You need a benchmark that exercises parameterized gates at specific angles — which is exactly what QAOA does, and exactly what CleitonForge's benchmark suite runs automatically across all backends.
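The averaging argument can be illustrated with a one-qubit toy. This is not a Quantum Volume run: it is a hypothetical probe circuit of our own, Rx(π/2) · Rz(λ) · H applied to |0⟩, chosen because its outcome probability is (1 + sin λ)/2 under one convention and (1 − sin λ)/2 under the other.

```python
import cmath
import math
import random

def p0_after_probe(lam, rz_sign):
    """P(measure 0) for Rx(pi/2) . Rz(lam) . H |0>, with a selectable Rz sign.

    rz_sign=-1 is diag(e^{-i lam/2}, e^{+i lam/2}); rz_sign=+1 is the opposite.
    """
    a = cmath.exp(rz_sign * 1j * lam / 2) / math.sqrt(2)   # |0> amplitude after Rz
    b = cmath.exp(-rz_sign * 1j * lam / 2) / math.sqrt(2)  # |1> amplitude after Rz
    amp0 = (a - 1j * b) / math.sqrt(2)                     # first row of Rx(pi/2)
    return abs(amp0) ** 2

# A specific, purposeful angle: the conventions disagree completely.
lam = math.pi / 2
print(f"{p0_after_probe(lam, -1):.3f} {p0_after_probe(lam, +1):.3f}")  # 1.000 0.000

# Uniformly random angles: the ensemble means are statistically indistinguishable.
random.seed(0)
angles = [random.uniform(0, 2 * math.pi) for _ in range(20000)]
mean_ibm = sum(p0_after_probe(a, -1) for a in angles) / len(angles)
mean_opposite = sum(p0_after_probe(a, +1) for a in angles) / len(angles)
print(f"{mean_ibm:.2f} {mean_opposite:.2f}")  # both land near 0.5
```

Any statistic that only looks at ensemble averages over symmetric angle distributions behaves like the second print: the per-circuit disagreement is real, but it washes out in the aggregate.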
The architecture that found it
CleitonForge is not a simulator. It's a neutral benchmarking layer that sits between circuits and simulators. Every backend receives the same canonical intermediate representation — a flat list of typed operations with no framework-specific encoding — and returns a statevector.
// Same circuit, four backends, zero code duplication
let circuit = parse_qasm2(qaoa_source)?;
let backends: &[(&str, &dyn SimulationBackend)] = &[
    ("native", &NativeStateVectorBackend),
    ("quantrs2", &QuantRS2Backend),
    ("roqoqo", &RoqoqoBackend),
    ("q1tsim", &Q1tSimBackend),
];
for (name, backend) in backends {
    let result = backend.run(&circuit, /*shots=*/0, seed)?;
    measure_fidelity(&reference_sv, &result.statevector);
}
The canonical IR means backends never see each other's types. roqoqo's gate structs, q1tsim's matrix API, and quantrs2's internal representation are all adapted at the backend boundary. This is also why the Rz divergence is detectable: the input angle λ is the same floating-point value for all backends. The difference is entirely in how each framework's gate definition applies it.
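To make the idea of a flat, typed IR concrete, here is a minimal Python sketch. It is not CleitonForge's IR or trait: `Op`, `ToyBackend`, and the two backend labels are hypothetical names, and the toy only simulates a single qubit. What it shows is the shape of the argument: both backends receive the same `Op` list with the same floating-point angle, and the only thing that differs is the gate definition inside each backend.

```python
import cmath
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    """One entry of a flat, framework-neutral IR (hypothetical shape)."""
    name: str
    qubit: int
    params: tuple = ()

class ToyBackend:
    """Single-qubit statevector backend; rz_sign selects the Rz convention."""

    def __init__(self, rz_sign):
        self.rz_sign = rz_sign

    def run(self, ops):
        state = [1 + 0j, 0j]
        for op in ops:
            g = self._matrix(op)
            state = [g[0][0] * state[0] + g[0][1] * state[1],
                     g[1][0] * state[0] + g[1][1] * state[1]]
        return state

    def _matrix(self, op):
        # The adaptation from IR to "framework" gates happens only here.
        if op.name == "h":
            r = 1 / math.sqrt(2)
            return [[r, r], [r, -r]]
        if op.name == "rz":
            (lam,) = op.params
            s = self.rz_sign
            return [[cmath.exp(s * 1j * lam / 2), 0],
                    [0, cmath.exp(-s * 1j * lam / 2)]]
        raise ValueError(f"unsupported op: {op.name}")

def fidelity(a, b):
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2

circuit = [Op("h", 0), Op("rz", 0, (math.pi / 2,)), Op("h", 0)]
backends = {
    "ibm_convention": ToyBackend(rz_sign=-1),
    "opposite_convention": ToyBackend(rz_sign=+1),
}
reference = backends["ibm_convention"].run(circuit)
for name, backend in backends.items():
    print(name, f"{fidelity(reference, backend.run(circuit)):.8f}")
# ibm_convention 1.00000000
# opposite_convention 0.00000000
```

Because the IR carries only a gate name and raw parameters, a disagreement in the output can be attributed to the backend's interpretation rather than to differences in how the circuit was encoded.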
Beyond the finding: what we built along the way
Exact noisy simulation
We implemented a density matrix backend: exact noisy simulation on the full 2ⁿ × 2ⁿ density matrix ρ (4ⁿ complex entries), with no Monte Carlo variance. Four Kraus noise channels (depolarizing, amplitude damping, bit-flip, phase-flip) are applied after each gate. At 12 qubits you get the exact probability distribution from a single simulation run.
import cforge

c = cforge.Circuit(2)
c.h(0)
c.cx(0, 1)
# IBM Nairobi calibration: avg SX err 0.03%, avg CX err 0.64%
r = cforge.run(c, backend="density-matrix",
               depolarizing_1q=0.000315,
               depolarizing_2q=0.00638)
print(r.top_states(4))
# [('00', 0.487), ('11', 0.487), ('01', 0.013), ('10', 0.013)]
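For readers unfamiliar with Kraus channels, the sketch below applies a single-qubit depolarizing channel to a density matrix in plain Python. It assumes the textbook parameterization with Kraus operators √(1−p)·I and √(p/3)·{X, Y, Z}; whether cforge's `depolarizing_1q` parameter uses this exact parameterization is not stated here, and the helper names are ours.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def scale(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def depolarizing_kraus(p):
    """Kraus operators for the assumed single-qubit depolarizing channel."""
    return [scale(math.sqrt(1 - p), I2)] + [
        scale(math.sqrt(p / 3), pauli) for pauli in (X, Y, Z)
    ]

def apply_channel(rho, kraus_ops):
    """rho -> sum_k K rho K^dagger, computed exactly (no sampling)."""
    out = [[0j, 0j], [0j, 0j]]
    for k in kraus_ops:
        term = matmul(matmul(k, rho), dagger(k))
        out = [[out[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return out

rho_plus = [[0.5, 0.5], [0.5, 0.5]]  # |+><+|, e.g. the state right after H
p = 0.000315
rho_noisy = apply_channel(rho_plus, depolarizing_kraus(p))
trace = rho_noisy[0][0] + rho_noisy[1][1]
print(abs(trace))            # stays at 1 up to rounding
print(abs(rho_noisy[0][1]))  # coherence shrinks by a factor (1 - 4p/3)
```

The update is deterministic, which is the sense in which a density matrix simulation has no Monte Carlo variance: one pass yields the exact mixed state rather than an average over sampled noise trajectories.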
Real hardware calibration
We ship a parser for IBM's public calibration JSON format — T1/T2 times, per-qubit SX error rates, per-pair CX error rates — and convert it directly to a NoisyConfig. IBM Nairobi numbers are included as a reference snapshot.
Python bindings
Everything above is available from Python via pip install cleitonforge. The bindings expose all backends, noise channels, and the QASM parser. Built with PyO3 and maturin; Linux/macOS/Windows wheels on PyPI.
Performance (release build)
| Benchmark | Qubits | Gates | Time (ms) |
|---|---|---|---|
| Bell state | 2 | 2 | 0.6 |
| GHZ state | 3 | 3 | 0.4 |
| QFT | 3 | 8 | 0.4 |
| Bernstein-Vazirani | 3 | 8 | 0.3 |
| Grover search | 3 | 43 | 0.3 |
What we are — and what we're not
The temptation when you find a convention divergence is to build your own simulator and do it "right." We're not doing that.
Our moat is neutrality. The moment CleitonForge ships its own simulator, it becomes a competitor to the projects it benchmarks — and loses the only thing that makes its findings credible: no skin in the game. roqoqo cannot tell you "our Rz convention matches IBM and quantrs2's doesn't," because they're a party to the comparison. We can.
What does make sense: a convention normalization transpiler — a layer that detects which Rz convention a circuit assumes and corrects it for the target backend. That's an extension of the benchmarking mission, not a departure from it.
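As a rough idea of what such a layer could look like, here is a minimal sketch of the correction step only. It is not the planned cforge-parser implementation: the tuple-based op format, the `RZ_ANGLE_FACTOR` table, and `normalize_rz` are hypothetical, and the sketch assumes the incoming circuit is written in the IBM / Qiskit convention rather than detecting it. The table entries simply restate the finding reported above.

```python
# Hypothetical table: +1 if the backend's Rz matches diag(e^{-i lam/2}, e^{+i lam/2}),
# -1 if it uses the opposite sign (values taken from the finding in this post).
RZ_ANGLE_FACTOR = {"native": 1, "roqoqo": 1, "q1tsim": 1, "quantrs2": -1}

def normalize_rz(ops, backend):
    """Return a copy of `ops` with Rz angles adjusted for the target backend.

    Each op is a (name, qubits, params) tuple; only plain "rz" ops are touched.
    """
    factor = RZ_ANGLE_FACTOR[backend]
    normalized = []
    for name, qubits, params in ops:
        if name == "rz":
            params = tuple(factor * p for p in params)
        normalized.append((name, qubits, params))
    return normalized

circuit = [("cx", (0, 1), ()), ("rz", (1,), (-1.5,)), ("cx", (0, 1), ())]
print(normalize_rz(circuit, "roqoqo")[1])    # ('rz', (1,), (-1.5,))
print(normalize_rz(circuit, "quantrs2")[1])  # ('rz', (1,), (1.5,))
```

A real pass would also need to cover gates defined in terms of Rz (controlled rotations, phase gates, composite gates), each of which would have to be checked against the target framework's definition rather than assumed.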
What's next
Open source: Convention normalization transpiler — detect and correct Rz sign on any circuit. Planned as part of cforge-parser.
arXiv: Preprint: "Neutral cross-framework benchmarking reveals Rz sign convention divergence in quantum simulation backends." Formally citable.
Enterprise SaaS: cforge-enterprise: per-device calibration tables, Zero-Noise Extrapolation, convention-aware circuit routing. API for teams running multi-backend quantum workloads.
Tooling: Jupyter notebook companion, QASM linter that flags convention-sensitive gates, dashboard for multi-hardware fidelity tracking.
The SaaS angle is straightforward: quantum teams running the same algorithm on different hardware providers need consistency guarantees that no single provider can offer. CleitonForge is the neutral layer that can.
Try it yourself
pip install cleitonforge
# or from source — reproduce the exact findings:
cargo run --release --example benchmark_suite -p cforge-cli
cargo run --release --example quantum_volume -p cforge-cli