Building a Replayable Decision Kernel in Rust

I built Calybris Core because I kept running into the same uncomfortable question in decision-heavy systems:

After the system says "yes", "no", or "use this instead", what exactly can we prove later?

Not prove in the formal-methods sense. I mean the practical engineering version:

Which policy was active?
What was the input?
What decision was returned?
Can the decision be replayed?
Did the budget/exposure invariant still hold?
Can an audit log detect tampering?

Calybris Core is my attempt to make that boundary small, deterministic, and boring.

It is not an LLM framework.

It is not an exchange.

It is not a strategy engine.

It is not a web service.

It is a Rust core primitive:

candidate + policy constraints -> decision + digests + optional WAL + budget proof

The first reference examples are LLM routing and pre-trade admission guards, but the crate itself is domain-neutral.

Repo: github.com/emirhuseynrmx/calybris-core

Crate: crates.io/crates/calybris-core

Docs: docs.rs/calybris-core

The boundary I wanted

A lot of systems have a hidden decision point that looks simple from the outside:

request comes in
system checks constraints
system returns allow / substitute / reject

But when something goes wrong, that simple decision becomes hard to reconstruct.

Maybe the model was changed.

Maybe a budget was exceeded.

Maybe a cheaper fallback was selected.

Maybe an operator needs to explain why an action was rejected.

Maybe an audit log was modified after the fact.

The typical response is to add more logs.

That helps, but logs alone are not the same as replayable decisions. I wanted the core decision result to carry enough structure that an independent verifier can ask:

If I replay the same input against the same policy snapshot, do I get the same decision?

That became the central design constraint.

What Calybris decides

The kernel module evaluates a KernelInput against a validated PolicySnapshot.

The result is a KernelDecision:

ExecuteRequested
Substitute
Reject

The decision contains the selected candidate, reason, estimated cost, utility, counterfactual fields, evaluated/eligible counts, and policy/catalog epochs.

The important part is not the specific domain. The important part is that the decision is deterministic and replayable.

In code, the shape is intentionally direct:

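use calybris_core::kernel::*;
use calybris_core::verify::{verify_decision, VerifyResult};

// `snapshot` is a validated PolicySnapshot and `input` is a KernelInput,
// both constructed elsewhere.
let decision = snapshot.prescribe(input);

// Replaying the same input against the same snapshot must reproduce the decision.
assert_eq!(
    verify_decision(&snapshot, input, &decision),
    VerifyResult::Valid
);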

The hot path deliberately avoids:

floating point
JSON
clocks
network calls
hidden I/O
unsafe Rust

The crate root uses:

#![forbid(unsafe_code)]

That is not magic, but it is a useful line in the sand.

Why I avoided floating point

The reference use cases both involve costs, budgets, confidence, risk, and utility.

It would be easy to reach for f64. I avoided it.

Calybris uses integer amounts and basis points. Financial amounts are fixed-point microcents. Quality, risk, confidence, and policy thresholds are represented as integer basis points.

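That keeps replay behavior less surprising: integer arithmetic is exact, so a replay on another machine or compiler produces the same bits, with no rounding or operation-order effects.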

For audit-oriented code, "close enough" is a dangerous phrase. If a decision depends on a threshold, I want the arithmetic to be explicit and repeatable.
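As a rough illustration of the integer-only approach, here is a minimal sketch. The names `Microcents`, `BasisPoints`, `scale`, and `meets_threshold` are hypothetical and are not taken from the crate's API; the sketch only assumes the conventions described above (microcent amounts, basis-point ratios).

```rust
/// Hypothetical fixed-point types for illustration; not the calybris-core API.
/// One cent = 1_000_000 microcents; 10_000 basis points = 100%.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Microcents(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BasisPoints(pub u32);

pub const BPS_SCALE: u64 = 10_000;

/// Scale an amount by a basis-point factor, rounding down.
/// Returns None on overflow instead of wrapping or saturating.
pub fn scale(amount: Microcents, factor: BasisPoints) -> Option<Microcents> {
    // Widen to u128 so the intermediate product cannot overflow.
    let product = (amount.0 as u128).checked_mul(factor.0 as u128)?;
    u64::try_from(product / BPS_SCALE as u128).ok().map(Microcents)
}

/// A threshold check is an exact integer comparison, so it gives the same
/// answer on every platform and on every replay.
pub fn meets_threshold(quality: BasisPoints, min_quality: BasisPoints) -> bool {
    quality >= min_quality
}
```

The rounding direction (down) and the overflow behavior (explicit `None`) are visible in the code rather than being left to the floating-point unit, which is the property that matters for replay.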

Canonical digests, not "whatever serde emitted"

Replay alone is not enough. You also need stable fingerprints.

Calybris computes canonical SHA-256 digests for:

policy snapshots
decision inputs
decision outputs
budget ledger snapshots

The digests are computed over version-tagged byte layouts, not over arbitrary JSON.

That distinction matters. JSON is great for transport and inspection, but its field order, whitespace, and number formatting depend on the serializer, which makes it a poor audit boundary.

The digest tags are explicit:

calypol1
calyinp1
calydcn1
calyldg1

Policy models are sorted before hashing. Ledger tenants are sorted before hashing. A logically equivalent snapshot should not get a different fingerprint because a map happened to iterate differently.
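To make the idea concrete, here is a small sketch of a canonical encoding for a ledger snapshot. The field layout is an assumption made for illustration; the crate's actual `calyldg1` layout may differ. The function only produces the canonical bytes, and a SHA-256 digest would then be computed over them (the hashing step is omitted to keep the sketch std-only).

```rust
use std::collections::HashMap;

/// Hypothetical canonical encoding for a ledger snapshot; the real
/// `calyldg1` layout may differ. The returned bytes are what a SHA-256
/// digest would be computed over.
pub fn canonical_ledger_bytes(balances: &HashMap<String, u64>) -> Vec<u8> {
    let mut out = Vec::new();
    // Version tag first, so a layout change yields a different digest.
    out.extend_from_slice(b"calyldg1");

    // Sort tenants so HashMap iteration order cannot affect the bytes.
    let mut entries: Vec<(&String, &u64)> = balances.iter().collect();
    entries.sort();

    out.extend_from_slice(&(entries.len() as u64).to_be_bytes());
    for (tenant, amount) in entries {
        // Length-prefix variable-width fields to keep the encoding unambiguous.
        out.extend_from_slice(&(tenant.len() as u64).to_be_bytes());
        out.extend_from_slice(tenant.as_bytes());
        // Fixed-width big-endian integers: one byte layout on every platform.
        out.extend_from_slice(&amount.to_be_bytes());
    }
    out
}
```

Two maps with the same contents produce identical bytes regardless of insertion order, while any change to a tenant name or amount changes the bytes and therefore the digest.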

The audit bundle

A decision can be wrapped in an audit bundle:

policy digest
input digest
decision digest
replay_valid

The verifier compares the structured decision, not just a string representation of it.

If you change the input, replay fails.

If you change the decision, replay fails.

If you use the wrong policy, replay fails.

If the digest fields do not match canonical recomputation, replay fails.

That is the reason I have been using the phrase "proof-carrying decision core", although I am still looking for feedback on whether that wording is too strong.

To be clear: this is not a formal proof system. It is a replayable evidence bundle.
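The core of replay verification can be sketched in a few lines: re-run a pure decision function and compare the result with what was recorded. Everything below (`replay_matches`, `MaxCost`, `admit`) is a hypothetical stand-in and not the crate's `verify_decision`, which additionally checks digests.

```rust
/// Hypothetical replay check: re-run a pure decision function and compare.
/// `decide` must be deterministic: no clocks, randomness, or I/O.
pub fn replay_matches<P, I, D>(
    policy: &P,
    input: &I,
    recorded: &D,
    decide: fn(&P, &I) -> D,
) -> bool
where
    D: PartialEq,
{
    decide(policy, input) == *recorded
}

/// Toy policy used only to exercise the check.
pub struct MaxCost(pub u64);

/// Toy decision function: admit when the cost is within the policy limit.
pub fn admit(policy: &MaxCost, cost: &u64) -> bool {
    *cost <= policy.0
}
```

With a recorded decision of `admit(&MaxCost(100), &80)`, the check passes for the original policy and input, and fails if the input, the policy, or the recorded decision is altered.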

Optional WAL

The crate also includes an optional write-ahead log.

Each WAL entry contains:

sequence number
previous hash
entry hash
record data

The unkeyed mode is useful for corruption detection and basic tamper evidence. The keyed mode uses HMAC-SHA256; it is the one to use if an attacker might rewrite entries and recompute the hashes, because valid tags cannot be recomputed without the key.

The audited WAL path looks like this:

prescribe
  -> audit_bundle
  -> append_audited
  -> replay_audited_wal

Replay fails closed if the chain is broken or if any policy/input/decision digest does not match.
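A hash-chained log of this shape can be sketched as follows. To stay std-only, the sketch swaps in a 64-bit FNV-1a hash, which is not cryptographic; a real log would use SHA-256 or HMAC-SHA256 in its place. The `Entry` struct and the `append` and `verify_chain` functions are hypothetical names, not the crate's WAL API.

```rust
/// Stand-in 64-bit FNV-1a hash so the sketch needs only std.
/// NOT cryptographic; a real log would use SHA-256 or HMAC-SHA256 here.
fn toy_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical entry shape mirroring the four fields listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub seq: u64,
    pub prev_hash: u64,
    pub entry_hash: u64,
    pub record: Vec<u8>,
}

/// Hash the sequence number, previous hash, and record together.
fn hash_entry(seq: u64, prev_hash: u64, record: &[u8]) -> u64 {
    let mut buf = Vec::with_capacity(16 + record.len());
    buf.extend_from_slice(&seq.to_be_bytes());
    buf.extend_from_slice(&prev_hash.to_be_bytes());
    buf.extend_from_slice(record);
    toy_hash(&buf)
}

/// Append a record, linking it to the hash of the previous entry.
pub fn append(log: &mut Vec<Entry>, record: &[u8]) {
    let seq = log.len() as u64;
    let prev_hash = log.last().map_or(0, |e| e.entry_hash);
    let entry_hash = hash_entry(seq, prev_hash, record);
    log.push(Entry { seq, prev_hash, entry_hash, record: record.to_vec() });
}

/// Walk the chain and fail closed: return the index of the first bad entry.
pub fn verify_chain(log: &[Entry]) -> Result<(), usize> {
    let mut expected_prev = 0u64;
    for (i, e) in log.iter().enumerate() {
        let ok = e.seq == i as u64
            && e.prev_hash == expected_prev
            && e.entry_hash == hash_entry(e.seq, e.prev_hash, &e.record);
        if !ok {
            return Err(i);
        }
        expected_prev = e.entry_hash;
    }
    Ok(())
}
```

This is the unkeyed shape: it catches edited records, duplicated entries, and removed middle entries, but anyone who can rewrite the file can also recompute every hash, which is the gap the keyed mode closes.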

I intentionally did not put secret storage, key rotation, file locking, or multi-process coordination inside this crate. Those are deployment concerns and should be owned by the embedding system.

Budget conservation

The budget engine is another small core primitive.

The invariant is:

remaining + reserved + committed_lifetime == initial

A reservation removes spendable balance.

A commit turns a reservation into lifetime committed spend.

A release returns the hold.

A top-up extends initial and remaining budget.
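The four operations and the invariant can be sketched with a single-threaded ledger. The `Ledger` type and its method names are hypothetical; the crate's budget engine is concurrent and its API may look different. The sketch only shows why each operation preserves the equation.

```rust
/// Hypothetical single-threaded ledger for illustration; not the crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ledger {
    pub initial: u64,
    pub remaining: u64,
    pub reserved: u64,
    pub committed_lifetime: u64,
}

impl Ledger {
    pub fn new(initial: u64) -> Self {
        Ledger { initial, remaining: initial, reserved: 0, committed_lifetime: 0 }
    }

    /// remaining + reserved + committed_lifetime == initial,
    /// with arithmetic overflow treated as a violation.
    pub fn conserved(&self) -> bool {
        self.remaining
            .checked_add(self.reserved)
            .and_then(|s| s.checked_add(self.committed_lifetime))
            == Some(self.initial)
    }

    /// Move spendable balance into a hold.
    pub fn reserve(&mut self, amount: u64) -> Result<(), &'static str> {
        // Compute both new values before mutating, so a failure changes nothing.
        let remaining = self.remaining.checked_sub(amount).ok_or("insufficient budget")?;
        let reserved = self.reserved.checked_add(amount).ok_or("overflow")?;
        self.remaining = remaining;
        self.reserved = reserved;
        Ok(())
    }

    /// Turn part of a hold into lifetime committed spend.
    pub fn commit(&mut self, amount: u64) -> Result<(), &'static str> {
        let reserved = self.reserved.checked_sub(amount).ok_or("not reserved")?;
        let committed = self.committed_lifetime.checked_add(amount).ok_or("overflow")?;
        self.reserved = reserved;
        self.committed_lifetime = committed;
        Ok(())
    }

    /// Return part of a hold to the spendable balance.
    pub fn release(&mut self, amount: u64) -> Result<(), &'static str> {
        let reserved = self.reserved.checked_sub(amount).ok_or("not reserved")?;
        let remaining = self.remaining.checked_add(amount).ok_or("overflow")?;
        self.reserved = reserved;
        self.remaining = remaining;
        Ok(())
    }

    /// Extend both the initial and the remaining budget.
    pub fn top_up(&mut self, amount: u64) -> Result<(), &'static str> {
        let initial = self.initial.checked_add(amount).ok_or("overflow")?;
        let remaining = self.remaining.checked_add(amount).ok_or("overflow")?;
        self.initial = initial;
        self.remaining = remaining;
        Ok(())
    }
}
```

Every operation moves the same amount between two terms of the equation (or adds it to both sides, in the case of a top-up), so `conserved()` holds after any successful or failed call.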

The budget engine uses compare-and-swap (CAS) for the hot balance updates and mutex-protected metadata maps for the surrounding state.
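For readers less familiar with the pattern, a CAS-based balance deduction looks roughly like this. The function name `try_reserve` and the single-atomic layout are assumptions for illustration; the crate's internals are not shown in this post.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical CAS-based reservation on a single atomic balance.
/// Returns true if `amount` was deducted, false if funds were insufficient.
pub fn try_reserve(remaining: &AtomicU64, amount: u64) -> bool {
    let mut current = remaining.load(Ordering::Acquire);
    loop {
        // Refuse rather than underflow when the balance is too small.
        let next = match current.checked_sub(amount) {
            Some(n) => n,
            None => return false,
        };
        // Only write if nobody changed the balance since we read it;
        // otherwise retry with the freshly observed value.
        match remaining.compare_exchange_weak(
            current,
            next,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return true,
            Err(observed) => current = observed,
        }
    }
}
```

The loop never holds a lock, and concurrent callers can never collectively deduct more than the balance, because each successful write is conditioned on the exact value that was checked.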

The invariant is checked on frozen snapshots. Multi-step operations may have transient internal states, so the docs are careful not to claim every mid-operation snapshot is linearizable.

That distinction matters. Audit docs should say what is guaranteed, not what sounds good.

Why not a general rules engine?

Calybris is narrower than a rules engine.

It does not try to provide a policy language. It does not parse arbitrary user rules. It does not evaluate scripts.

The current kernel is closer to:

rank candidates under hard constraints
return the best positive-utility candidate
otherwise reject

That narrowness is intentional. I wanted the core to be small enough to reason about, test, replay, and document.
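The three-line description above can be sketched as a pure function. The types here (`Candidate`, `Constraints`, `Decision`) are simplified, hypothetical stand-ins for the crate's KernelInput, PolicySnapshot, and KernelDecision, and the tie-breaking rule (lowest id wins) is an assumption added so the result does not depend on input order.

```rust
/// Hypothetical, simplified types for illustration; not the crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub id: u32,
    pub cost_microcents: u64,
    pub quality_bps: u32,
    pub utility: i64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Constraints {
    pub max_cost_microcents: u64,
    pub min_quality_bps: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    ExecuteRequested(u32),
    Substitute(u32),
    Reject,
}

/// Rank candidates under hard constraints, return the best
/// positive-utility candidate, otherwise reject. Assumes unique ids.
pub fn decide(requested: u32, candidates: &[Candidate], c: Constraints) -> Decision {
    let best = candidates
        .iter()
        // Hard constraints: a candidate that violates one is never eligible.
        .filter(|x| {
            x.cost_microcents <= c.max_cost_microcents && x.quality_bps >= c.min_quality_bps
        })
        .filter(|x| x.utility > 0)
        // Highest utility wins; ties go to the lowest id, so the result
        // does not depend on the order of `candidates`.
        .max_by(|a, b| a.utility.cmp(&b.utility).then(b.id.cmp(&a.id)));

    match best {
        Some(x) if x.id == requested => Decision::ExecuteRequested(x.id),
        Some(x) => Decision::Substitute(x.id),
        None => Decision::Reject,
    }
}
```

Because the function has no clock, randomness, or I/O, calling it twice with the same arguments gives the same `Decision`, which is the property replay depends on.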

A larger product can put a policy language above this layer. Calybris is the deterministic bottom layer.

Testing the uncomfortable parts

The project has tests for the parts I would worry about first:

optimized kernel output vs reference implementation
digest stability and sensitivity
replay mismatch detection
WAL tampering, duplicate sequence numbers, truncation, malformed JSON
keyed WAL verification
budget conservation under mixed operations
overflow paths
concurrent reserve/commit/release behavior
Loom interleavings
Miri on the library and audit pipeline

The CI runs MSRV and stable jobs, clippy with warnings denied, docs, examples, proptest-heavy jobs, Loom, Miri, cargo-audit, and cargo-deny.

That does not make it "audited". It does make it less hand-wavy.

Try it locally

git clone https://github.com/emirhuseynrmx/calybris-core
cd calybris-core
cargo run --example quickstart
cargo run --example llm_routing
cargo run --example replay_audit

Use it as a dependency:

cargo add calybris-core

Kernel-only, without WAL:

cargo add calybris-core --no-default-features

Current status

The current release is v0.3.10.

Release notes:

github.com/emirhuseynrmx/calybris-core/releases/tag/v0.3.10

The crate is licensed under Apache-2.0 and is usable, but I would not describe it as a complete production platform.

It is a core primitive. If you embed it in a production system, you still own:

key management
WAL storage policy
deployment controls
external audit
monitoring
operational runbooks
integration-level failure handling

Feedback I want

I would especially like feedback from Rust, security, infra, and systems people on:

Is the API boundary clear?
Is "proof-carrying decision core" misleading?
Should this remain a narrow primitive, or grow a small policy language?
Are the WAL responsibilities split correctly between crate and caller?
What replay/audit guarantees would you expect before trusting something like this?

The repo is here:

github.com/emirhuseynrmx/calybris-core