The Elephant in the Room: "Isn't This Just max_iterations?"
Let me address this up front.
If you're building a ReAct loop with a single LLM and 10 tool calls, you do not need a physics-inspired monitoring library. Set max_iterations=10, add a budget cap, and move on. LangGraph, CrewAI, and every modern agent framework already support this natively.
I built state-harness because I ran into a problem that max_iterations doesn't solve. And after benchmarking it across 2,367 runs, I also learned what it can't do — which I'll be equally transparent about.
The Problem That max_iterations Doesn't Solve
There are two specific scenarios where simple iteration caps fall short:
1. Search-Tree Agents (MCTS, Beam Search)
Advanced coding agents — the kind that solve SWE-bench tasks, or the architecture behind tools like Devin — don't run a flat loop. They explore a search tree. Each node branches into multiple candidate solutions. A node that spirals doesn't just waste one turn; it inflates every downstream branch.
In a 50-node search tree, you can't set max_iterations=50 and call it a day. The agent isn't iterating — it's branching. Token usage grows quadratically. A single stuck branch can burn thousands of tokens before the tree-level budget cap even notices, because the per-branch cost looks normal in isolation.
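A back-of-the-envelope sketch makes the quadratic growth concrete. It assumes (my simplification, not a measurement) that every node re-sends the full path from the root, so a node at depth d costs about d times a fixed step size. The function name and the 500-token step size are illustrative.

```python
def chain_cost(depth: int, tokens_per_step: int = 500) -> int:
    """Total prompt tokens for one branch explored down to `depth`.

    Assumption: the node at depth d re-sends all d steps on its path,
    so it costs d * tokens_per_step prompt tokens.
    """
    return sum(d * tokens_per_step for d in range(1, depth + 1))


# Doubling the depth of one branch roughly quadruples its cumulative cost.
for depth in (5, 10, 20):
    print(depth, chain_cost(depth))
```

Under these assumptions the deepest node at depth 20 costs 10,000 tokens, which looks unremarkable on its own, while the branch as a whole has already consumed more than 100,000.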
2. Failure Pattern Aggregation at Scale
If you run 100 agent tasks a day, you open LangSmith, look at the traces of the 5 that failed, and debug them manually. That works.
If you run 10,000+ tasks a day, manual trace inspection is impossible. Your observability bill alone (storing and indexing millions of multi-turn traces) becomes significant. What you actually need is: classify the failure pattern at the edge, at zero cost, and export it as a structured attribute to your metrics pipeline. Then your Grafana dashboard shows: "This week, 40% of failures are retry storms on the SQL tool → add exponential backoff."
That's not something max_iterations gives you. It's not something LangSmith gives you (at least not without paying for indexing every trace). It's what state-harness was designed for.
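Once each failed run carries a small structured label, the dashboard query is plain counting. Here's a minimal sketch of that aggregation step; the pattern names, tool names, and the `top_failure_share` helper are hypothetical and the sample labels are made up for illustration.

```python
from collections import Counter

# Hypothetical (pattern, tool) labels, one per failed run, as they might
# arrive from an edge classifier. The names are illustrative only.
failures = [
    ("retry_storm", "sql"),
    ("retry_storm", "sql"),
    ("context_spiral", "search"),
    ("retry_storm", "sql"),
    ("tool_loop", "browser"),
]


def top_failure_share(labels):
    """Return the most common (pattern, tool) label and its share of failures."""
    if not labels:
        raise ValueError("need at least one failure label")
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


label, share = top_failure_share(failures)
print(f"{share:.0%} of failures: {label[0]} on the {label[1]} tool")
```

The point of the design is that only these short labels need to be stored and indexed, not the full multi-turn traces.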
The Core Insight: Growth-Ratio Normalization
In physics and control theory, Lyapunov stability analysis asks whether a dynamical system returns to equilibrium or diverges.
I modeled LLM agent token consumption as a discrete-time dynamical system in which the "energy" V(k) at turn k is a function of cumulative token growth. The stability criterion is straightforward: if the energy change ΔV (the difference in V from one turn to the next) is ≥ 0 for several consecutive steps, the system is treated as diverging.
The problem: In any multi-turn conversation, token usage grows monotonically because the context window accumulates history. A naive Lyapunov monitor would trip on every healthy conversation — you'd get 100% false positives.
The solution: Instead of monitoring raw token counts, normalize each turn against a running baseline to compute a growth ratio:
- Growth ratio ≈ 1.0 → the agent is consuming tokens at its expected rate (stable)
- Growth ratio > 2.0× for 3+ consecutive turns → the agent is consuming disproportionately more each turn (diverging)
This normalization is the key insight. It's analogous to the distinction between intensive and extensive quantities in thermodynamics — monitoring density (ratio) rather than mass (absolute count).
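To make the idea concrete, here's a minimal pure-Python sketch of a growth-ratio check. It is not the state-harness implementation: the `RatioMonitor` class is hypothetical, and the baseline definition (the running mean of past turn-over-turn token increases) is my assumption, since the baseline isn't specified above. The 2.0× threshold and 3-turn patience come from the rule just described.

```python
class RatioMonitor:
    """Illustrative growth-ratio monitor; not the state-harness implementation."""

    def __init__(self, threshold: float = 2.0, patience: int = 3):
        self.threshold = threshold   # ratio above which a turn counts as "hot"
        self.patience = patience     # consecutive hot turns needed to trip
        self.prev_tokens = None      # total tokens of the previous turn
        self.increments = []         # past turn-over-turn increases
        self.hot_streak = 0
        self.tripped = False

    def record(self, tokens_used: int) -> float:
        """Record one turn's total tokens and return its growth ratio."""
        if self.prev_tokens is None:
            self.prev_tokens = tokens_used
            return 1.0
        # Floor at 1 token so a flat or shrinking turn never divides by zero.
        increment = max(tokens_used - self.prev_tokens, 1)
        self.prev_tokens = tokens_used
        if not self.increments:
            self.increments.append(increment)
            return 1.0
        baseline = sum(self.increments) / len(self.increments)
        ratio = increment / baseline
        self.increments.append(increment)
        self.hot_streak = self.hot_streak + 1 if ratio > self.threshold else 0
        if self.hot_streak >= self.patience:
            self.tripped = True
        return ratio


healthy = RatioMonitor()
for t in range(1000, 6001, 500):  # context grows by a steady 500 tokens per turn
    healthy.record(t)

spiral = RatioMonitor()
for t in (1000, 1500, 2000, 2500, 4000, 8500, 22000):  # increases triple each turn
    spiral.record(t)

print(healthy.tripped, spiral.tripped)
```

Raw token counts rise monotonically in both runs. Only the ratio separates the steady conversation from the one whose per-turn growth keeps accelerating.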
Integration: 5 Lines of Code
from state_harness import GrowthRatioGuard, FailureReport

guard = GrowthRatioGuard(token_budget=50_000)
with guard:
    for turn in agent_loop:  # agent_loop and llm come from your own agent code
        result = llm.invoke(turn.prompt)
        guard.record_step(tokens_used=result.usage.total_tokens)

report = FailureReport.from_guard(guard)
print(report)
When the guard trips, the diagnostic report classifies the failure pattern — at zero cost, with no LLM calls:
⚠️ STABILITY TRIPPED at turn 12
Pattern: Context Accumulation Spiral (confidence: 92%)
• Last 5 turns all exceeded 1.5× baseline (4/4 were accelerating).
• Peak growth ratio: 5.2× baseline.
• Without intervention, projected cost was $0.0396 (actual: $0.0039).
Suggested actions:
🔴 1. Enable history compression in your agent loop.
🟡 2. Lower the growth ratio threshold to 1.8×.
🟢 3. Add a sliding-window context strategy.
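Evidence lines like the ones in this report can be computed with plain arithmetic over the growth-ratio history, which is why no LLM call is needed. The sketch below is a hypothetical illustration, not the library's classifier: the `summarize_tail` name, the 5-turn window, and the 1.5× threshold are assumptions read off the sample output, and the ratios are made up.

```python
def summarize_tail(ratios, window=5, threshold=1.5):
    """Summarize the last `window` growth ratios (assumes a non-empty list)."""
    tail = ratios[-window:]
    steps = list(zip(tail, tail[1:]))  # consecutive (previous, next) pairs
    return {
        "all_above": all(r > threshold for r in tail),
        "accelerating": sum(nxt > prev for prev, nxt in steps),
        "transitions": len(steps),
        "peak": max(tail),
    }


# Made-up ratios for illustration; not taken from a real run.
ratios = [1.0, 1.1, 0.9, 1.6, 2.3, 3.1, 4.4, 5.2]
summary = summarize_tail(ratios)
print(summary)
```

Five turns give four turn-to-turn transitions, which is where a "4/4 were accelerating" style of evidence comes from.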
The classified pattern and suggested actions export cleanly to OpenTelemetry:
from opentelemetry import trace
span = trace.get_current_span()
span.set_attributes(report.to_otel_attributes())
# Adds: state_harness.pattern, state_harness.confidence, etc.
Framework Integrations
LangGraph:
from langgraph.prebuilt import create_react_agent
from state_harness.adapters import monitor_graph

# model, search, and calculate are your own chat model and tools
agent = create_react_agent(model, tools=[search, calculate])
safe = monitor_graph(agent, token_budget=100_000)
result = safe.invoke({"messages": [("user", "Fix the login bug")]})
print(safe.report)
CrewAI:
from crewai import Crew
from state_harness.adapters import CrewAICallback

callback = CrewAICallback(token_budget=200_000)
crew = Crew(agents=[...], tasks=[...], step_callback=callback.step_callback)
result = crew.kickoff()
print(callback.report)
Benchmarks: What Worked and What Didn't
We evaluated state-harness across 2,367 total runs with a 5-condition ablation study on three benchmarks.
What worked: Zero false positives on stable tasks
Across 1,136 MINT runs (short-loop reasoning) and 750 τ³-bench runs (medium-loop customer service), state-harness never tripped once. The growth-ratio normalization correctly identified these as stable conversations and introduced <2% token overhead.
This is the most important result. A monitoring tool that interferes with healthy agents is worse than no monitoring at all.
What worked: Compute savings on search trees
On SWE-bench Verified (37 Django instances, Moatless-tools SearchTree agent, Gemini 2.5 Flash):
| Condition | Compute (nodes) | Reduction |
|---|---|---|
| A. Baseline (no monitoring) | 945 | — |
| B. + Lyapunov monitor only | 620 | 34.4% |
| D. Full-stack (Lyapunov+RG+VSA) | 580 | 38.6% |
The monitor eliminated all max-budget burnout events (7 tasks hitting the 50-node ceiling → 0) and reduced wall time by 30%.
What didn't work: Improving resolve rates
This is the honest part that most open-source projects would hide.
We ran 3 independent trials per condition (333 total runs) to measure nondeterminism:
| Condition | Resolve rate (mean ± σ) |
|---|---|
| A. Baseline | 44.1% ± 4.1% |
| D. Full-stack | 40.5% ± 2.7% |
| E. Naive Cap | 45.9% ± 5.4% |
A naive budget cap achieves comparable resolve rates. The cross-condition spread (2.9%) is smaller than the within-condition nondeterminism (4.1%), so the differences between conditions are within noise. state-harness doesn't make agents smarter; it makes failures diagnosable.
Bonus finding: The nondeterminism floor
Both τ³-bench and SWE-bench converged on a ~4–5% intrinsic nondeterminism floor for Gemini 2.5 Flash on these agent tasks. Since 8% is roughly two standard deviations at that noise level, any single-run benchmark comparison reporting performance deltas under 8% is statistically unreliable. If you see a paper claiming "our agent is 6% better," ask them how many trials they ran.
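For a rough sense of how many trials a claim needs, here's a small sketch. It assumes independent trials, the same per-run standard deviation in both conditions, and roughly normal noise. The function name and the z = 2 cutoff are my choices for illustration, not part of state-harness.

```python
import math


def min_trials_per_arm(sigma: float, delta: float, z: float = 2.0) -> int:
    """Smallest n per condition so that `delta` exceeds z standard errors.

    The standard error of the difference between two n-trial means is
    sigma * sqrt(2 / n), so we need n >= 2 * (z * sigma / delta) ** 2.
    """
    return math.ceil(2 * (z * sigma / delta) ** 2)


# Trials per condition needed to back a 6-point claim at sigma = 4 points.
print(min_trials_per_arm(sigma=4.0, delta=6.0))
```

With a 4-point standard deviation, this approximation asks for 4 trials per condition before a 6-point difference clears two standard errors.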
The Three Mechanisms (and an Honest Ablation)
state-harness has three components, all written in Rust and exposed to Python via PyO3:
- Lyapunov Monitor (~1μs/step): The growth-ratio energy function described above.
- RG Decimator (~100μs/compress): TF-IDF-based history compression inspired by Renormalization Group theory.
- Holographic Engine (~10μs/check): VSA-based semantic drift detection using 10,000-dimensional bipolar vectors.
The honest ablation result: Lyapunov alone delivers ~90% of the total benefit (34.4% out of 38.6%). RG and VSA add incremental value. If you want maximum simplicity, just use the GrowthRatioGuard with default settings and ignore the rest.
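If the VSA component sounds exotic, the underlying idea fits in a few lines. This is a stdlib-only sketch of bipolar hypervector bundling and similarity, not the Holographic Engine itself: the bag-of-words encoding, the function names, and the example sentences are my assumptions. Only the 10,000-dimensional bipolar vectors come from the description above.

```python
import random

DIM = 10_000  # dimensionality quoted in the text


def token_vector(token: str) -> list[int]:
    """Pseudo-random bipolar (+1/-1) vector, deterministic per token."""
    rng = random.Random(token)  # string seeds are hashed deterministically
    return [1 if rng.random() < 0.5 else -1 for _ in range(DIM)]


def encode(text: str) -> list[int]:
    """Bundle token vectors by element-wise sum, then take the sign."""
    sums = [0] * DIM
    for token in text.lower().split():
        for i, v in enumerate(token_vector(token)):
            sums[i] += v
    return [1 if s >= 0 else -1 for s in sums]


def similarity(a: list[int], b: list[int]) -> float:
    """Cosine similarity; for bipolar vectors this is the mean of a[i] * b[i]."""
    return sum(x * y for x, y in zip(a, b)) / DIM


goal = encode("fix the login bug in the auth module")
on_task = encode("the login bug is in the auth module session check")
drifted = encode("compare hotel prices for a weekend trip to lisbon")

print(similarity(goal, on_task) > similarity(goal, drifted))
```

Text that shares vocabulary with the goal stays well above zero similarity, while unrelated text lands near zero because random high-dimensional vectors are almost orthogonal. A drift check would compare that similarity against a threshold, and the right threshold is something you'd have to tune.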
Who Should (and Shouldn't) Use This
| If you're... | Use state-harness? |
|---|---|
| Building a chatbot or RAG pipeline | ❌ No. These don't spiral. |
| Running a simple ReAct agent (<10 turns) | ❌ No. max_iterations is enough. |
| Running coding/DevOps agents with search trees | ✅ Yes. Branch explosion is real. |
| Running 1000+ agent tasks/day in production | ✅ Yes. Edge-classified failure patterns at zero cost. |
| Benchmarking agents and publishing results | ✅ Yes. The nondeterminism floor matters. |
Try It
- GitHub: github.com/vishal-dehurdle/state-harness
- PyPI: pip install state-harness
- Research Paper: vishalvermalabs.com/papers/empirical-lyapunov-stability-agent-failure
I built this as a research project exploring whether control theory can provide useful runtime guarantees for stochastic software. If you're running agents at scale and want zero-cost failure diagnostics, or if you're just curious about applying physics to AI systems, I'd love your feedback.