In my first post, I laid out the broad architecture of Rada: local-first AI coding, Behavioral Routing, Sentinel, the Autorouter. Think of that post as the map. This one is the terrain.
Today I'm going deep on the co-determination matrix: the system that lets a single resident model produce nine distinct behavioral profiles by crossing developer intent with real-time hardware state. And on Sentinel, the Rust module that measures the hardware side of that equation on every single request.
Quick context if you missed Post 1
Rada keeps one local GGUF model loaded in RAM. No hot-swapping. The model adapts its behavior based on what you're doing (intent) and what your machine can handle (memory band). Sentinel monitors RAM. The Autorouter handles cloud when local isn't enough.
That's the 30-second version. Now let's get into the actual implementation.
The co-determination matrix
The core idea is that the model's behavior isn't determined by a single variable. It's the product of two axes:
Intent (what the developer is doing):
- Refactor/Debug: tighten existing code, fix bugs, improve structure
- Build from Scratch: generate new code, scaffold features
- Explain: walk through logic, teach concepts
Memory Band (what the hardware can handle right now):
- Normal (< 70% RAM): full capability
- Elevated (70-84% RAM): constrained but functional
- Critical (≥ 85% RAM): local generation unsafe, escalate
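Here's a minimal sketch of those two axes as Rust enums. MemoryBand shows up in the real code later in this post; the Intent variant names are my shorthand for the three modes:

```rust
// The two input axes. MemoryBand matches the classifier shown later in the post;
// the Intent variant names are illustrative shorthand for the three modes.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Intent {
    RefactorDebug, // tighten existing code, fix bugs, improve structure
    Build,         // generate new code, scaffold features
    Explain,       // walk through logic, teach concepts
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum MemoryBand {
    Normal,   // < 70% RAM
    Elevated, // 70-84% RAM
    Critical, // >= 85% RAM
}
```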
Cross those two axes and you get a 3×3 matrix. Each cell is a BehaviorProfile in Rust:
```rust
struct BehaviorProfile {
    label: &'static str,         // human-readable cell name, e.g. "Refactor / Elevated"
    memory_band: MemoryBand,     // the band this profile was resolved under
    prompt_suffix: &'static str, // behavioral instruction injected into the system prompt
    temperature: f32,            // sampling temperature
    max_output_tokens: u32,      // output token budget
    max_retrieval_chars: usize,  // cap on retrieved context size
    local_token_cap: usize,      // overall token cap for the local path
}
```
Every profile controls four main knobs: temperature, output token budget, retrieval context size, and a behavioral instruction injected into the system prompt.
Here's the actual matrix from the codebase:
| | Normal (< 70%) | Elevated (70-84%) | Critical (≥ 85%) |
|---|---|---|---|
| Refactor/Debug | temp 0.1, 1200 tokens | temp 0.05, 900 tokens | temp 0.0, 700 tokens |
| Build | temp 0.45, 12000 tokens | temp 0.3, 8000 tokens | temp 0.2, 5000 tokens |
| Explain | temp 0.35, 1800 tokens | temp 0.25, 1200 tokens | temp 0.15, 800 tokens |
Look at the spread. A Build task at normal memory gets temperature 0.45 and a 12,000-token budget. The same Build task under critical memory pressure? Temperature drops to 0.2 and the budget cuts to 5,000 tokens. The model becomes conservative precisely when your machine needs it to.
Refactor at normal memory runs at 0.1 temperature. Already deterministic, already tight. Under critical pressure it drops to 0.0 and the token budget shrinks from 1,200 to 700. At that point the model is producing the smallest safe diff it can manage.
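In code, resolving a cell is just a match over the two axes. Here's a sketch of resolve_behavior_profile (the real call appears in the gating code later; classify_memory_band is shown in the Sentinel section). The temperatures and output budgets come straight from the table; the labels, suffixes, and retrieval/local caps here are placeholders:

```rust
// Sketch: resolving one cell of the 3x3 matrix. Temperatures and output
// budgets are the values from the table above; prompt_suffix, retrieval
// caps, and local caps are placeholder values.
fn resolve_behavior_profile(intent: Intent, memory_usage_percent: Option<u8>) -> BehaviorProfile {
    let band = classify_memory_band(memory_usage_percent);
    match (intent, band) {
        (Intent::RefactorDebug, MemoryBand::Normal) => BehaviorProfile {
            label: "Refactor / Normal",
            memory_band: band,
            prompt_suffix: "Prefer the smallest safe diff.",
            temperature: 0.1,
            max_output_tokens: 1200,
            max_retrieval_chars: 24_000, // placeholder
            local_token_cap: 4_096,      // placeholder
        },
        (Intent::RefactorDebug, MemoryBand::Critical) => BehaviorProfile {
            label: "Refactor / Critical",
            memory_band: band,
            prompt_suffix: "Emit only the minimal safe diff. Memory is critical.",
            temperature: 0.0,
            max_output_tokens: 700,
            max_retrieval_chars: 8_000, // placeholder
            local_token_cap: 2_048,     // placeholder
        },
        // ...the remaining seven cells follow the table the same way.
        _ => todo!("fill in from the matrix"),
    }
}
```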
The prompt suffix: behavioral steering without fine-tuning
Each cell in the matrix also carries a prompt_suffix that gets injected into the system prompt at generation time. This is how a 7B model "knows" it's supposed to refactor conservatively vs. build expansively vs. explain clearly.
The prompt composition pipeline builds five layers in order:
- Role preamble (language expertise based on the active file)
- Sentinel status ("Current Sentinel profile: Refactor / Elevated. Memory band: elevated.")
- Intent persona (operational instructions for the selected mode)
- Behavioral routing instruction (the prompt_suffix from the matrix cell)
- Output format spec (how to structure the response with @@file tags)
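A sketch of that composition, where the literal strings stand in for the real layer content (only the layer order and the Sentinel status wording come from the actual pipeline):

```rust
// Sketch of the five-layer composition. The string literals are placeholders
// for the real layer content, which lives elsewhere in the codebase.
fn compose_system_prompt(profile: &BehaviorProfile) -> String {
    [
        // 1. Role preamble (language expertise from the active file)
        "You are an expert Rust developer.".to_string(),
        // 2. Sentinel status
        format!(
            "Current Sentinel profile: {}. Memory band: {:?}.",
            profile.label, profile.memory_band
        ),
        // 3. Intent persona (operational instructions for the selected mode)
        "Mode: Refactor. Improve structure without changing behavior.".to_string(),
        // 4. Behavioral routing instruction from the matrix cell
        profile.prompt_suffix.to_string(),
        // 5. Output format spec
        "Structure responses with @@file tags.".to_string(),
    ]
    .join("\n\n")
}
```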
The model sees its own constraints. It knows its memory band. It knows its token budget. That transparency turns out to matter. A model that's told "you have 900 tokens and memory is elevated" produces tighter output than one that's just given a smaller max_tokens parameter silently. The behavioral instruction gives the model permission to be concise.
Sentinel: measuring reality on every request
Here's where the memory axis of the matrix becomes real.
Sentinel is a Rust module that checks system RAM before every local generation. Not a polling loop. Not a timer. A synchronous check at the decision point.
The implementation is platform-specific:
- macOS: calls memory_pressure -Q first. If that fails (older macOS versions), falls back to parsing vm_stat output, calculating used memory from free pages, speculative pages, and purgeable pages.
- Linux: reads /proc/meminfo and computes usage from MemTotal minus MemAvailable.
- Windows: runs Get-CimInstance Win32_OperatingSystem via PowerShell and calculates from total vs. free physical memory.
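For the Linux path, that computation is only a few lines. A sketch (the parsing details are illustrative; the function name matches the call shown in the gating code below):

```rust
// Sketch of the Linux path described above: usage = (MemTotal - MemAvailable) / MemTotal.
// Returns None if the file can't be read or the expected fields are missing.
fn detect_system_memory_usage_percent() -> Option<u8> {
    let meminfo = std::fs::read_to_string("/proc/meminfo").ok()?;
    let mut total_kb: Option<u64> = None;
    let mut available_kb: Option<u64> = None;
    for line in meminfo.lines() {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("MemTotal:"), Some(v)) => total_kb = v.parse().ok(),
            (Some("MemAvailable:"), Some(v)) => available_kb = v.parse().ok(),
            _ => {}
        }
    }
    let total = total_kb?;
    let used = total.saturating_sub(available_kb?);
    Some(((used * 100) / total.max(1)) as u8)
}
```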
All three paths produce the same output: a single u8 representing percent of RAM in use. That number feeds into this function:
```rust
fn classify_memory_band(memory_usage_percent: Option<u8>) -> MemoryBand {
    match memory_usage_percent {
        Some(usage) if usage >= 85 => MemoryBand::Critical,
        Some(usage) if usage >= 70 => MemoryBand::Elevated,
        _ => MemoryBand::Normal,
    }
}
```
Notice the Option<u8>. If platform detection fails entirely (permissions issues, unexpected output format), the system defaults to Normal. Fail open, not closed. A developer who can't get RAM readings still gets full local capability rather than being locked out.
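Concretely (assuming MemoryBand derives PartialEq and Debug, as in the enum sketch above):

```rust
assert_eq!(classify_memory_band(Some(60)), MemoryBand::Normal);
assert_eq!(classify_memory_band(Some(72)), MemoryBand::Elevated);
assert_eq!(classify_memory_band(Some(91)), MemoryBand::Critical);
assert_eq!(classify_memory_band(None), MemoryBand::Normal); // detection failed: fail open
```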
Two enforcement points
Sentinel doesn't just label the memory band. It enforces it.
Gate 1: Pre-generation check. Before any local inference starts, Sentinel reads memory and resolves the behavior profile:
```rust
let memory_usage_percent = detect_system_memory_usage_percent();
let behavior_profile = resolve_behavior_profile(intent, memory_usage_percent);
```
If the result is a Critical-band profile, the request never reaches the local model. It gets rerouted to the Autorouter for cloud handling, with a message explaining why: "Sentinel escalated this request because local memory pressure is critical."
Gate 2: Budget enforcement. Even within Normal and Elevated bands, the resolved profile's token budgets and retrieval caps constrain the generation. A Build request that would happily consume 12,000 tokens at 60% RAM gets capped at 8,000 tokens at 75% RAM. The model adapts its output to fit.
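Put together, the two gates look roughly like this. The resolution lines are the real calls from above; request, local_model, route_to_autorouter, and the generate signature are hypothetical stand-ins:

```rust
// Sketch of both gates around a local generation. request, local_model,
// route_to_autorouter, and generate are hypothetical stand-ins.
let memory_usage_percent = detect_system_memory_usage_percent();
let behavior_profile = resolve_behavior_profile(intent, memory_usage_percent);

// Gate 1: a Critical-band profile never reaches the local model.
if behavior_profile.memory_band == MemoryBand::Critical {
    return route_to_autorouter(
        request,
        "Sentinel escalated this request because local memory pressure is critical.",
    );
}

// Gate 2: the resolved profile's budgets bound the local call.
let output = local_model.generate(
    &request,
    behavior_profile.temperature,        // e.g. 0.3 for Build / Elevated
    behavior_profile.max_output_tokens,  // e.g. 8,000 at 75% RAM
    behavior_profile.max_retrieval_chars,
);
```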
Why per-request, not periodic
Your RAM state changes constantly. You open a browser tab. Docker pulls an image. Spotlight re-indexes. The memory band from 30 seconds ago might not be the one you're in now.
A periodic check (say, every 5 seconds) creates a window where the system's picture of memory is stale. A per-request check means the behavioral profile is always current. The memory axis of the co-determination matrix isn't a configuration setting. It's measured. Every time.
The overhead is negligible. Reading /proc/meminfo on Linux takes microseconds. The macOS memory_pressure call is similarly lightweight. The cost of one lightweight check per request is invisible compared to the inference time that follows.
Why this is the patentable piece
System prompts aren't novel. Temperature scaling isn't novel. RAM monitoring isn't novel.
What's novel is the co-determination: using real-time hardware state as a first-class input to model behavioral configuration, intersected with developer intent, to produce adaptive profiles from a single resident model. The model doesn't just respond to what you asked. It responds to what you asked given what your machine can handle right now.
We filed a US provisional patent on this mechanism. The claim isn't any individual piece. It's the intersection.
What this looks like in practice
Developer on a 16 GB MacBook. Running VS Code, Chrome with 15 tabs, Docker, Slack. RAM sitting at 72%. They select Refactor intent and ask the model to clean up a function.
Sentinel reads 72%, classifies Elevated band. The co-determination matrix resolves to: temperature 0.05, 900-token output budget, tightened retrieval context. The prompt suffix tells the model to emit the smallest possible safe diff.
The model produces a focused, conservative refactor. No rambling. No unnecessary changes. The developer's machine stays stable.
Same developer, same session, but they close Docker and Chrome. RAM drops to 55%. They switch to Build intent and ask for a new feature module. Sentinel reads 55%, classifies Normal band. Temperature jumps to 0.45, output budget goes to 12,000 tokens, full retrieval context. The model produces expansive, complete code.
Same model. Same weights. Different behavior. Zero downtime between the two requests.
What's next
The Autorouter is the other half of this story. When Sentinel escalates to cloud, the Autorouter decides which cloud tier to hit (and the routing is a pure Rust function, not another LLM call). That's the next post.
Rada is in closed beta. If you want to see how the co-determination matrix behaves on your specific hardware, the waitlist is at userada.dev.
Eli Hadam Zucker is the founder of Rada. Previously at Wise. Building local-first AI tooling in Rust.