Your drift detector fires. The session looks clean. You roll back anyway.
That's the false positive problem — and it's not a threshold tuning issue. It's architectural. Embedding-based detectors measure geometric displacement in vector space. They have no model of semantic trajectory, logical flow, or whether a session that drifted away has returned. Once the threshold trips, it stays tripped.
This post documents a working fix: a dual-signal governor implemented in drift_orchestrator that introduces a second orthogonal signal and uses disagreement between the two as an arbitration metric. The implementation is live at tag v0.13.0-dual-signal-governor. The data behind it is real.
The problem in concrete terms
In a previous post I documented Semantic Gradient Evasion — how embedding-based drift detectors can be bypassed through small, consistent semantic shifts. The control set for that benchmark revealed something I underreported: 2 of 3 legitimate sessions triggered detection when they shouldn't have.
Both false positives came from Fix 1, the anchor distance threshold at τ=0.4. Not Fix 3, the monotonic window detector that catches gradient chains. Fix 1.
Here's why. Take a firewall review session — five steps, all on-topic, coherent throughout:
- Review the firewall configuration for the production environment
- Check the ingress rules on the production firewall
- Verify that port 443 is open and port 22 is restricted to known IPs
- Confirm the firewall logs are being forwarded to the SIEM
- Document the current firewall rule set for the security review
At step 2 (counting from 0, so the port check), the session expands from "firewall configuration" to specific port rules. That expansion moves the embedding away from the anchor. Anchor distance hits 0.4479, above τ=0.4. Fix 1 fires.
By step 3, the session is returning. Final anchor distance: 0.2729. The session came back. Fix 1 doesn't know that. The alert persists through the end of the session.
This isn't a threshold problem. Setting τ higher just moves the false positive boundary — it doesn't fix the underlying issue, which is that a stateless geometric threshold has no awareness of return, trajectory, or meaning.
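To make the latching behavior concrete, here is a minimal Python sketch. It is not the drift_orchestrator code: it replays the five anchor distances from the stable_session probe table later in this post through a threshold that never resets, next to a purely instantaneous comparison. The function names are hypothetical, and the instantaneous version is only there for contrast, not as the proposed fix.

```python
# Anchor distances for the firewall session, steps 0-4, from the probe table.
ANCHOR_DISTANCES = [0.000, 0.213, 0.448, 0.287, 0.273]
TAU = 0.4  # Fix 1 anchor distance threshold


def latched_alert(distances, tau=TAU):
    """Per-step alert state for a threshold that stays tripped once exceeded."""
    tripped = False
    states = []
    for dist in distances:
        tripped = tripped or dist > tau
        states.append(tripped)
    return states


def instantaneous_alert(distances, tau=TAU):
    """Per-step alert state if each step were judged on its own distance."""
    return [dist > tau for dist in distances]


latched = latched_alert(ANCHOR_DISTANCES)
instant = instantaneous_alert(ANCHOR_DISTANCES)

for step, dist in enumerate(ANCHOR_DISTANCES):
    print(f"step {step}: dist={dist:.3f} latched={latched[step]} instant={instant[step]}")
```

The latched state stays in alert through steps 3 and 4 even though the distance itself is back under τ, which is the behavior described above.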
Three signals, not two
The fix requires stepping back from the single-signal framing. What you actually need:
Signal A: Geometric displacement. Cosine distance in embedding space. Fast, deterministic, stateless. Good trigger, bad arbiter. Serves as the detection layer.
Signal B: Semantic continuity. LLM coherence assessment over accumulated window text. Slower, probabilistic, context-aware. Approximates logical flow rather than embedding proximity. Serves as the interpretation layer.
Signal C: Divergence between internal and external assessments, meaning disagreement between what the geometric and semantic signals report. In the current implementation the governor approximates divergence from the external categorical verdict rather than from a fully calibrated numeric metric, although the external evaluator already computes a numeric proxy, divergence = abs(alpha - external_score), in external_evaluator.py. The architecture was implicitly three-signal before this post named it.
The key insight: disagreement between A and B is only meaningful when Signal A has entered an alert condition. Below the alert threshold, divergence between signals is expected and not actionable. It is specifically the combination of geometric alert and semantic stability that indicates false positive risk. Additionally, divergence is only considered valid when both signals are operating above their minimum context thresholds — Signal B requires sufficient accumulated window text to produce reliable coherence assessments.
Signal C disagreement under alert conditions is a high-confidence indicator of ambiguity — not of false positive certainty. The external signal is not a replacement for geometric detection. It is a disambiguation layer engaged when the geometric signal is in a high-variance alert state and the semantic evidence says something different.
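The gating described above can be sketched as a small function. This is an illustration under assumptions, not the code in external_evaluator.py: only the formula abs(alpha - external_score) comes from the post, while the function name, the geometric_alert flag, and the use of a turn count as the context check are hypothetical. It also assumes alpha and the external score sit on comparable scales, which the post says is not yet calibrated.

```python
from typing import Optional

MIN_WINDOW_TURNS = 2  # minimum context guard used by the governor


def gated_divergence(
    alpha: float,
    external_score: float,
    geometric_alert: bool,
    turn_index: int,
) -> Optional[float]:
    """Return the divergence proxy only when it is considered actionable.

    Divergence is ignored (None) unless Signal A is in an alert condition
    and the window holds enough turns for Signal B to be trusted.
    """
    if not geometric_alert:
        return None  # below alert, disagreement is expected and not actionable
    if turn_index < MIN_WINDOW_TURNS:
        return None  # Signal B is context-starved in early turns
    return abs(alpha - external_score)


# Illustrative inputs only: alpha=0.58 and an external score of 0.35.
print(gated_divergence(0.58, 0.35, geometric_alert=False, turn_index=3))
print(gated_divergence(0.58, 0.35, geometric_alert=True, turn_index=1))
print(gated_divergence(0.58, 0.35, geometric_alert=True, turn_index=2))
```

Returning None rather than 0.0 keeps "not evaluated" distinct from "evaluated and in agreement".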
The probe data
I ran both false positive sequences through the live orchestrator with qwen2.5:3b as the external evaluator via a local gateway.
stable_session (firewall review — should not trigger):
| Step | anchor_dist | Fix1 | qwen verdict | qwen drift |
|---|---|---|---|---|
| 0 | 0.000 | — | DEGRADED | 0.75 |
| 1 | 0.213 | — | DEGRADED | 0.70 |
| 2 | 0.448 | FIRES | STABLE | 0.35 |
| 3 | 0.287 | fired | STABLE | 0.25 |
| 4 | 0.273 | fired | STABLE | 0.25 |
Fix1 fires at step 2. qwen says STABLE at step 2. Simultaneous disagreement — the signals directly contradict at the moment of triggering.
moderate_drift (auth architecture review — should not trigger):
| Step | anchor_dist | Fix1 | qwen verdict | qwen drift |
|---|---|---|---|---|
| 0 | 0.000 | — | DEGRADED | 0.80 |
| 1 | 0.512 | FIRES | DEGRADED | 0.70 |
| 2 | 0.484 | fired | STABLE | 0.25 |
| 3 | 0.410 | fired | STABLE | 0.25 |
| 4 | 0.450 | fired | STABLE | 0.25 |
Fix1 fires at step 1. qwen stabilizes at step 2 and stays STABLE. Retrospective disagreement — the alert outlived the condition that caused it.
Two distinct patterns. Two distinct governor responses.
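The two patterns can be told apart mechanically from the table columns. The sketch below is a hypothetical helper, not part of the probe script: it finds the step where Fix 1 first fires (anchor distance above 0.4) and checks whether the external verdict was already STABLE at that step or only became STABLE afterwards.

```python
TAU = 0.4  # Fix 1 anchor distance threshold

# (anchor_dist, qwen verdict) per step, copied from the two probe tables.
STABLE_SESSION = [
    (0.000, "DEGRADED"), (0.213, "DEGRADED"), (0.448, "STABLE"),
    (0.287, "STABLE"), (0.273, "STABLE"),
]
MODERATE_DRIFT = [
    (0.000, "DEGRADED"), (0.512, "DEGRADED"), (0.484, "STABLE"),
    (0.410, "STABLE"), (0.450, "STABLE"),
]


def classify_disagreement(steps, tau=TAU):
    """Label a session by how the external verdict relates to the Fix 1 firing."""
    fire_step = next((i for i, (dist, _) in enumerate(steps) if dist > tau), None)
    if fire_step is None:
        return "no_alert"
    if steps[fire_step][1] == "STABLE":
        return "simultaneous"  # signals contradict at the moment of triggering
    if any(verdict == "STABLE" for _, verdict in steps[fire_step + 1:]):
        return "retrospective"  # alert outlived the condition that caused it
    return "agreement"


print(classify_disagreement(STABLE_SESSION))
print(classify_disagreement(MODERATE_DRIFT))
```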
The implementation
The governor lives in PolicyEngine.evaluate() in policy.py (tag v0.13.0-dual-signal-governor). It adds two parameters: external_verdict and external_drift. When neither is provided, behavior is identical to the original — full backward compatibility.
Hold is a temporary suppression under uncertainty; veto is a sustained override under confirmed coherence. Both are governed by the same external signal, operating at different evidence thresholds.
Hold mode handles simultaneous disagreement. When the geometric signal has escalated to ROLLBACK and the external signal returns STABLE with an external drift score below 0.40 (a separate cutoff from the τ=0.4 anchor threshold), the action is held at INJECT rather than executing the rollback. This is temporary suppression while evidence accumulates: the system is not confident enough to roll back, but not confident enough to clear the alert either.
Veto mode handles retrospective disagreement. After STABLE_STREAK_VETO=2 consecutive STABLE verdicts, the governor downgrades action to CONTINUE and marks governor_active=True. Sustained override after sufficient coherence evidence.
Both modes require turn_index >= MIN_WINDOW_TURNS=2. This guard exists because qwen rates single-sentence windows as DEGRADED regardless of content — context starvation in early turns produces unreliable signal.
Empirical behavior with alpha=0.58, external STABLE from turn 2:
| Turn | Geometric | External | Action | Mode |
|---|---|---|---|---|
| 0 | INJECT | DEGRADED | INJECT | — |
| 1 | REGENERATE | DEGRADED | REGENERATE | — |
| 2 | ROLLBACK | STABLE | INJECT | Hold |
| 3 | INJECT | STABLE | CONTINUE | Veto |
| 4 | INJECT | STABLE | CONTINUE | Veto |
The gradient chain detection from the SGE paper is unaffected. Monotonic anchor drift ROLLBACKs bypass the governor entirely, as do RED finding overrides and divergence-based ROLLBACKs.
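The hold and veto rules, the turn guard, and the bypass can be put together in a short sketch. This is a simplified reconstruction from the description above, not the body of PolicyEngine.evaluate(): the govern function, the GovernorState class, and the bypass flag are hypothetical names, and it assumes the veto applies to whatever geometric action is pending. The external drift values are borrowed from the stable_session probe table, whose verdict sequence matches the External column above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STABLE_DRIFT_MAX = 0.40   # external drift cutoff for hold mode
STABLE_STREAK_VETO = 2    # consecutive STABLE verdicts needed for a veto
MIN_WINDOW_TURNS = 2      # earlier turns are too context-starved to trust


@dataclass
class GovernorState:
    stable_streak: int = 0


def govern(
    geometric_action: str,
    external_verdict: Optional[str],
    external_drift: Optional[float],
    turn_index: int,
    state: GovernorState,
    bypass: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (final_action, mode), where mode is 'hold', 'veto', or None."""
    # Track consecutive STABLE verdicts; anything else resets the streak.
    state.stable_streak = state.stable_streak + 1 if external_verdict == "STABLE" else 0

    # No external signal, early turn, or a bypassing ROLLBACK: original behavior.
    if bypass or external_verdict is None or turn_index < MIN_WINDOW_TURNS:
        return geometric_action, None
    if state.stable_streak >= STABLE_STREAK_VETO:
        return "CONTINUE", "veto"
    if (
        geometric_action == "ROLLBACK"
        and external_verdict == "STABLE"
        and external_drift is not None
        and external_drift < STABLE_DRIFT_MAX
    ):
        return "INJECT", "hold"
    return geometric_action, None


TURNS = [
    ("INJECT", "DEGRADED", 0.75),
    ("REGENERATE", "DEGRADED", 0.70),
    ("ROLLBACK", "STABLE", 0.35),
    ("INJECT", "STABLE", 0.25),
    ("INJECT", "STABLE", 0.25),
]

state = GovernorState()
results = [
    govern(action, verdict, drift, turn, state)
    for turn, (action, verdict, drift) in enumerate(TURNS)
]
for turn, (action, mode) in enumerate(results):
    print(turn, action, mode)
```

Passing bypass=True stands in for the monotonic-drift, RED finding, and divergence-based ROLLBACKs, which return the geometric action untouched.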
What doesn't work yet
Inference latency — qwen2.5:3b on CPU-only hardware takes 20-60 seconds per window. Fine for research instrumentation, not for real-time use. Async evaluation with delayed governor engagement is the path forward for production.
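One way to picture async evaluation with delayed governor engagement is sketched below with asyncio. Nothing here exists in the repo as far as this post states: the LLM call is replaced by a short sleep, and slow_external_eval and run_session are hypothetical names. The idea is that each turn proceeds immediately with whatever verdict has already finished, so the governor can only engage once a slow evaluation has completed.

```python
import asyncio
from typing import List, Optional


async def slow_external_eval(window_text: str, delay: float) -> str:
    """Stand-in for the LLM call: sleeps instead of running inference."""
    await asyncio.sleep(delay)
    return "STABLE"


async def run_session(
    turns: List[str], eval_delay: float = 0.05, turn_interval: float = 0.02
) -> List[Optional[str]]:
    """Return the verdict that was available to the governor at each turn."""
    latest_verdict: Optional[str] = None
    pending: Optional[asyncio.Task] = None
    window: List[str] = []
    seen: List[Optional[str]] = []

    for text in turns:
        window.append(text)
        # Harvest a finished evaluation without blocking the turn.
        if pending is not None and pending.done():
            latest_verdict = pending.result()
            pending = None
        # Keep at most one evaluation in flight.
        if pending is None:
            pending = asyncio.create_task(
                slow_external_eval(" ".join(window), eval_delay)
            )
        seen.append(latest_verdict)  # None means the governor stays disengaged
        await asyncio.sleep(turn_interval)  # simulated gap before the next turn

    if pending is not None:
        await pending  # let the last evaluation finish before returning
    return seen


TURNS = [
    "Review the firewall configuration for the production environment",
    "Check the ingress rules on the production firewall",
    "Verify that port 443 is open and port 22 is restricted to known IPs",
    "Confirm the firewall logs are being forwarded to the SIEM",
    "Document the current firewall rule set for the security review",
]

seen = asyncio.run(run_session(TURNS))
print(seen)
```

The trade-off is that verdicts arrive stale: a verdict applied at one turn describes an earlier window, so hold and veto decisions lag the text they judge.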
External signal manipulation — an adversary aware of the governor could craft sessions designed to return STABLE from the LLM while executing semantic drift. Practical vectors include prompt shaping, coherence spoofing via structured technical language, and delayed drift masking after establishing coherence. The streak and window requirements provide partial resistance. Crucially, the external signal cannot be treated as authoritative — it must remain one input into a stateful arbitration process, not a veto with independent authority.
The general pattern
The three-signal architecture applies to any system where:
- A fast, cheap signal drives alert generation (trigger)
- A slow, expensive signal provides semantic validation (interpreter)
- Divergence between them under alert conditions is measurable (arbitrator)
- Irreversible actions should be gated on temporal consistency of evidence
Security alerting pipelines. Anomaly detection. Autonomous agent control loops. The specific thresholds change. The architecture doesn't.
Sensitivity and precision aren't fundamentally in tension if you can delay irreversible actions until slower evidence has accumulated. Keep the geometric detector maximally sensitive. Add the governor as a precision layer without modifying the detector at all.
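Outside drift detection, the same gating idea can be sketched generically. The version below is a hypothetical illustration, not the drift_orchestrator governor: it assumes a fast trigger and a slow validator that each report a boolean per observation, and it releases the irreversible action only after the validator has confirmed the alert k times in a row. The action labels and k=2 are placeholders.

```python
from typing import Iterable, List, Tuple


def gated_actions(
    observations: Iterable[Tuple[bool, bool]], k: int = 2
) -> List[str]:
    """Map (trigger_alert, validator_confirms) pairs to actions.

    'irreversible' is emitted only after k consecutive observations in which
    the trigger is alerting and the validator confirms it.
    """
    confirm_streak = 0
    actions = []
    for trigger_alert, validator_confirms in observations:
        if trigger_alert and validator_confirms:
            confirm_streak += 1
        else:
            confirm_streak = 0  # any disagreement restarts the evidence count
        if not trigger_alert:
            actions.append("none")
        elif confirm_streak >= k:
            actions.append("irreversible")
        else:
            actions.append("reversible")
    return actions


example = [(False, False), (True, False), (True, True), (True, True)]
print(gated_actions(example))
```

The fast trigger still reacts on the first alert with a reversible action, so sensitivity is kept while the irreversible step waits for consistent evidence.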
Where this goes next
The current Signal C is approximated using the external categorical verdict rather than a calibrated numeric metric. The next step is formalizing it: normalizing both signals to a common scale, characterizing divergence distributions under known conditions (legitimate sessions, gradient attacks, jitter attacks), and deriving a proper decision threshold. That moves the governor from qualitative arbitration to a quantitative decision surface, and makes the architecture formally analyzable rather than only empirically validated.
That's the step from project feature to publishable pattern. Separate research track, not a continuation of this work.
Code and data
Everything is at github.com/GnomeMan4201/drift_orchestrator (tag v0.13.0-dual-signal-governor).
- policy.py: governor implementation
- firewall/gateway.py: /embed and /route endpoints
- scripts/dual_signal_control_probe.py: the probe that generated the table data
- results/dual_signal_probe.json: raw probe output
- evasion_test_suite.py: the SGE benchmark suite
- papers/dual_signal_governor.md: full paper version of this post
The probe script is runnable if you have Ollama with qwen2.5:3b and nomic-embed-text pulled and the gateway running on port 8765. Running bash scripts/start_gateway.sh handles the gateway startup.