I Let an AI Agent Supervisor Run Unattended for 19 Days. Here's What the Telemetry Says.


I shipped 13 releases of my AI agent supervisor in 14 days while the supervisor was running it.

For 19 days — March 22 to April 10 — Batty managed our own development without a human in the dispatch loop. An operator was on Discord for escalations, but the planning, the test gating, the merges, the worktree management, and the self-healing all ran themselves.

This isn't a "fully autonomous" pitch. A human was on Discord. Things broke. Stalls happened. The interesting question isn't "can AI run itself?" — it's "what kind of telemetry do you need before you trust an agent fleet to run itself for 19 days, and what does that telemetry say at the end?"

We have those numbers. Here they are.

What Batty is, in one paragraph

Batty is supervised agent execution for software teams. It runs agents (Claude Code, Codex, Aider) inside tmux panes, tracks work on a markdown kanban board, gates merges behind a real test run, and surfaces escalations to a human via Discord. The supervisor itself is a Rust daemon. It's open source: github.com/battysh/batty.

19 days, by the numbers

This window is 2026-03-22 → 2026-04-10, pulled from ~/batty/.batty/telemetry.db and cross-checked against git log.

| Throughput | Value |
| --- | --- |
| Tasks completed end-to-end | 102 |
| Per-seat agent completions | 150 |
| Tasks auto-merged | 20 |
| Task assignments dispatched | 411 |
| Peak daily completions (Apr 6) | 55 |
| Commits landed in window | 456 |

55 tasks in a single working day is the throughput point. 102 end-to-end completions with a human only handling escalations is the autonomy point. 13 releases in 14 days is the meta point: the supervisor was shipping itself, faster than I would have shipped it manually.

But throughput isn't the story I want to tell. Self-healing is.

| Stability | Value |
| --- | --- |
| Verification evidence collections | 1,304 |
| Verification phase transitions | 790 |
| Auto-doctor self-healing actions | 258 |
| Task escalations handled | 195 |
| Merge confidence scores written | 250 |
| Worktree reconciliations | 140 |
| State reconciliations | 1,052 |
| Daemon heartbeats persisted | 989 |
| Agent pane respawns | 167 |
| Disk-hygiene cleanups | 146 |

258 self-healing actions is what unattended actually means in practice. The daemon caught and fixed its own stalls 258 times across 19 days without paging me. 195 escalations did get paged — about ten a day — well under the rate where a single operator gives up. 1,304 verification evidence bundles were collected before tasks were allowed near a merge. Every one of those numbers is a moment where the supervisor either kept itself alive or refused to ship something it couldn't justify.
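The per-day framing above is just arithmetic over the table, but it's worth making explicit since it's the load a single operator actually feels. A trivial stdlib-only sketch, using only the counts already quoted in this post:

```rust
// Back-of-envelope daily rates from the article's own 19-day counts.
fn per_day(count: u32, days: u32) -> f64 {
    f64::from(count) / f64::from(days)
}

fn main() {
    let days = 19;
    println!("escalations/day: {:.1}", per_day(195, days)); // ~10.3, paged to a human
    println!("self-heals/day:  {:.1}", per_day(258, days)); // ~13.6, never paged
    println!("completions/day: {:.1}", per_day(102, days)); // ~5.4 end-to-end tasks
}
```

The ratio is the point: the daemon fixed itself slightly more often per day than it paged anyone.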

How we keep ourselves honest

A telemetry table is fine. The harder claim is "and we catch the regressions before they ship." That's the part I want to talk about.

v0.11.0 ships a new test surface called the scenario framework. It runs the real TeamDaemon against in-process fake shims (FakeShim + ShimBehavior) on per-test tempdirs. Zero subprocess spawn. Zero tmux. Fully deterministic. 58 scenario tests run on every PR in about 60 seconds.
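The in-process pattern is easy to sketch. FakeShim and ShimBehavior are real names from the release, but their shapes below are invented for illustration; this shows only the general idea of scripting an agent shim in-process so a daemon test is deterministic, not Batty's actual API:

```rust
use std::collections::VecDeque;

// Hypothetical event type: a scripted stand-in for what a real agent
// shim would emit over its control channel.
#[derive(Debug, Clone)]
enum ShimEvent {
    StateChanged(&'static str),
    TaskDone,
    Stalled,
}

// The fake shim pops pre-scripted events: no subprocess, no tmux.
struct FakeShim {
    script: VecDeque<ShimEvent>,
}

impl FakeShim {
    fn with_script(script: Vec<ShimEvent>) -> Self {
        Self { script: script.into_iter().collect() }
    }
    fn poll(&mut self) -> Option<ShimEvent> {
        self.script.pop_front()
    }
}

// A toy "daemon loop": the task completes only if the shim reports TaskDone.
fn run_scenario(shim: &mut FakeShim) -> bool {
    while let Some(ev) = shim.poll() {
        match ev {
            ShimEvent::TaskDone => return true,
            ShimEvent::Stalled => return false, // would trigger auto-doctor
            ShimEvent::StateChanged(_) => continue,
        }
    }
    false
}

fn main() {
    let mut happy = FakeShim::with_script(vec![
        ShimEvent::StateChanged("working"),
        ShimEvent::TaskDone,
    ]);
    assert!(run_scenario(&mut happy)); // happy-path scenario passes
}
```

Because every input is scripted and every tempdir is per-test, a failing scenario replays identically on every run, which is what makes the 60-second PR gate viable.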

Twenty-two of those scenarios are prescribed:

  • 1 happy-path scenario
  • 7 regression scenarios — one per recent release bug
  • 14 cross-feature scenarios (worktree corruption, merge conflicts, scope fence violations, ack loops, context exhaustion, silent death, multi-engineer races, disk pressure, stale merge locks, …)

On top of that, a proptest-state-machine fuzz harness runs three targets — fuzz_workflow_happy, fuzz_workflow_with_faults, fuzz_restart_resilience — against ten cross-subsystem invariants on every randomized case.
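The real harness uses the proptest crate's state-machine support; as a hand-rolled, stdlib-only illustration of the same idea, here is a hypothetical toy workflow driven by a deterministic LCG in place of proptest's generators, with one invariant checked after every step:

```rust
// Toy task lifecycle: Todo -> InProgress -> Done.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Task {
    Todo,
    InProgress,
    Done,
}

// Apply one pseudo-random operation to the state machine.
fn step(state: Task, op: u64) -> Task {
    match (state, op % 3) {
        (Task::Todo, 0) => Task::InProgress,
        (Task::InProgress, 1) => Task::Done,
        (s, _) => s, // everything else is a no-op; Done is terminal
    }
}

// Invariant: once Done, always Done (no silent resurrection of tasks).
// A deterministic LCG stands in for proptest's randomized case generation.
fn holds(seed: u64, steps: usize) -> bool {
    let mut rng = seed;
    let mut state = Task::Todo;
    for _ in 0..steps {
        rng = rng
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let next = step(state, rng >> 33);
        if state == Task::Done && next != Task::Done {
            return false; // invariant violated
        }
        state = next;
    }
    true
}

fn main() {
    // Every seed is a randomized case; the invariant must hold on all of them.
    assert!((0..100).all(|seed| holds(seed, 200)));
}
```

Scale this shape up to ten cross-subsystem invariants and three operation mixes (happy, faulted, restart-heavy) and you have the fuzz harness described above.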

And cargo test --lib went from 3,369 at v0.10.10 to 3,410 at v0.11.0 (+41 library tests on top of the new 58-test scenario surface). Total: 3,468 tests gated against every PR. Zero warnings on release builds, locked in since 0.10.9.

The test count itself isn't impressive — plenty of projects have more. The interesting part is what each scenario locks in. Every regression scenario started life as a real bug from production telemetry. Each one is now a deterministic replay you can run in 60ms.

Three stories behind three numbers

258 auto-doctor actions → #634 shim restart cooldown

Most of those 258 self-healing actions are boring: a shim looks unresponsive, the daemon respawns it, work continues.

A handful aren't boring. Bug #634: when handle_supervisory_stall fired in src/team/daemon/health/poll_shim.rs, it could re-trigger a second respawn if a stall check happened right after the previous restart. The result was a respawn loop that degraded into repeated orchestrator disconnected / Broken pipe control-plane disconnects. The fix is a stall-restart::{name} cooldown that holds the respawned member as Idle until its freshly-started shim emits its first StateChanged event, with a regression test pinning the cooldown.
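The shape of that fix fits in a few lines. This is a hypothetical reconstruction from the description above, not Batty's actual code; the member name and the cooldown bookkeeping are assumptions:

```rust
use std::collections::HashSet;

// Tracks members whose stall-triggered respawn is still in flight,
// keyed by something like "stall-restart::{name}".
#[derive(Default)]
struct StallDoctor {
    cooling: HashSet<String>,
}

impl StallDoctor {
    /// Called when a stall check fires. Returns true if a respawn
    /// should actually happen.
    fn on_stall(&mut self, member: &str) -> bool {
        if self.cooling.contains(member) {
            return false; // respawn already in flight: don't loop
        }
        self.cooling.insert(member.to_string());
        true
    }

    /// The fresh shim's first StateChanged event clears the cooldown,
    /// so later genuine stalls are handled normally.
    fn on_state_changed(&mut self, member: &str) {
        self.cooling.remove(member);
    }
}

fn main() {
    let mut doctor = StallDoctor::default();
    assert!(doctor.on_stall("alice"));  // first stall: respawn
    assert!(!doctor.on_stall("alice")); // stall check right after: suppressed
    doctor.on_state_changed("alice");   // new shim proved it is alive
    assert!(doctor.on_stall("alice"));  // future stalls handled normally
}
```

The design choice worth copying: the cooldown is cleared by positive evidence from the new shim, not by a timer, so the suppression can't itself mask a respawn that never came up.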

You only catch a bug like that if your telemetry tells you the supervisor is fixing the same thing twice.

195 escalations → #612 stale escalation storms

Escalations are how the supervisor pages a human when it can't make forward progress on its own. 195 over 19 days is roughly ten a day — well under the rate where a solo operator burns out.

But early in the window, escalations were much noisier. Bug #612: the inbox digest kept top-billing escalations whose underlying tasks had already moved to done or archived. Stale escalation storms occupied actionable slots that should have been pointing me at real problems. src/team/inbox.rs defines two new helpers — extract_task_ids_from_body and demote_stale_escalations; src/team/messaging.rs wires them into the digest assembly. They demote stale Escalation/Blocker entries whose referenced tasks are all done or archived back to Status.
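A minimal sketch of the demotion rule, with hypothetical types standing in for Batty's actual inbox entries and board lookup (only the helper name comes from the release notes):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Escalation,
    Status,
}

// Hypothetical digest entry: a kind plus the task IDs its body references.
struct Entry {
    kind: Kind,
    task_ids: Vec<String>,
}

// Demote any Escalation whose referenced tasks are ALL done or archived:
// it no longer deserves an actionable slot in the digest.
fn demote_stale_escalations(entries: &mut [Entry], board: &HashMap<String, &str>) {
    for e in entries.iter_mut() {
        let all_settled = !e.task_ids.is_empty()
            && e.task_ids.iter().all(|id| {
                matches!(board.get(id).copied(), Some("done") | Some("archived"))
            });
        if e.kind == Kind::Escalation && all_settled {
            e.kind = Kind::Status;
        }
    }
}

fn main() {
    let board: HashMap<_, _> =
        [("T-41".to_string(), "done"), ("T-42".to_string(), "doing")].into();
    let mut entries = vec![
        Entry { kind: Kind::Escalation, task_ids: vec!["T-41".into()] },
        Entry { kind: Kind::Escalation, task_ids: vec!["T-42".into()] },
    ];
    demote_stale_escalations(&mut entries, &board);
    assert_eq!(entries[0].kind, Kind::Status);     // stale: demoted
    assert_eq!(entries[1].kind, Kind::Escalation); // live: kept
}
```

Note the conservative edge case: an escalation that references no tasks at all is never demoted, because there is no board evidence that it's stale.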

Telemetry surfaced the rate. Watching it over time is what told me the rate was wrong.

20 auto-merges → #592 auto-merge gate

Twenty tasks merged to main without me clicking anything. Each one cleared test gating, evidence collection, and merge confidence scoring before landing.

The gate is the interesting part. The fix for bug #592 adds merge_request_skip_reason plus an AutoMergeSkipReason enum with WrongStatus / MissingPacket / NoBranch categories and a full unit-test catalog. Every refusal to auto-merge is categorized and logged; every acceptance writes a confidence score I can audit later. 250 merge confidence scores were written across the window, about thirteen a day, because every auto-merge candidate is a deliberate decision, not a heuristic shrug.
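The decision shape of such a gate is simple to sketch. The enum variants below are the ones named in the post; the Candidate struct, the "review" status string, and the field names are my assumptions:

```rust
// Skip-reason categories named in the release; everything else here is
// an illustrative reconstruction, not Batty's actual types.
#[derive(Debug, PartialEq)]
enum AutoMergeSkipReason {
    WrongStatus,
    MissingPacket,
    NoBranch,
}

struct Candidate {
    status: &'static str,
    has_packet: bool,
    branch: Option<&'static str>,
}

// None means "eligible: proceed to confidence scoring"; every Some is a
// categorized, loggable refusal rather than a silent skip.
fn merge_request_skip_reason(c: &Candidate) -> Option<AutoMergeSkipReason> {
    if c.status != "review" {
        return Some(AutoMergeSkipReason::WrongStatus);
    }
    if !c.has_packet {
        return Some(AutoMergeSkipReason::MissingPacket);
    }
    if c.branch.is_none() {
        return Some(AutoMergeSkipReason::NoBranch);
    }
    None
}

fn main() {
    let ok = Candidate { status: "review", has_packet: true, branch: Some("task/612") };
    assert_eq!(merge_request_skip_reason(&ok), None);

    let wrong = Candidate { status: "doing", has_packet: true, branch: None };
    assert_eq!(merge_request_skip_reason(&wrong), Some(AutoMergeSkipReason::WrongStatus));
}
```

Returning an enum instead of a bool is the whole trick: "why didn't this merge?" becomes a query over logged categories instead of a debugging session.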

The "20 auto-merges" stat is the outcome. The 250 confidence scores are the process. I trust the outcome because I can audit the process.

What this run didn't fix

Three honest gaps.

Bot-token rotation can't be solved from code (#598 archived). Discord and Telegram tokens have to be rotated through a provider console. The supervisor will never roll its own credentials. This stays in operator runbooks; no auto-doctor can solve it.

Context exhaustion is still a real shape of failure. The scenario framework has explicit context-exhaustion scenarios because we kept hitting the wall on long-running engineers. We have recovery, but recovery isn't prevention. Better task decomposition is the answer; the supervisor can't decompose its way out of bad framing.

The 10–15 minute productive window. Throughout this 19-day run, I was intermittently having to restart the daemon, at the worst stretches every 10–15 minutes, to clear an event-loop freeze. We hadn't root-caused it by the end of the window. We finally did on the morning of release day, and the fix is in v0.11.2, shipped the same afternoon as v0.11.0. Every stat above reflects the previous shape of stability; the current shape is meaningfully better. That's a separate post.

I'm including this as the answer to the obvious question: "how unattended is unattended, really?" Roughly: unattended in the sense that I didn't have to plan tasks, dispatch them, gate them, or merge them. Attended in the sense that I came back and restarted the daemon every so often when the productive window timer ran down. The 19 days are real. So is the asterisk.

Try it

```shell
cargo install batty-cli
```

The 0.11.x Easter release train is on crates.io as of 2026-04-11. v0.11.0 ships the scenario framework. v0.11.1 patches the auto-merge dropped-task bug. v0.11.2 closes the write-timeout pattern that kept producing the 10–15 minute windows.

If your test suite is weak, Batty makes that worse, not better — the merge gate is only as strong as what it gates against. If your tests are real, Batty turns them into a discipline that keeps an agent fleet honest.

It's on GitHub: github.com/battysh/batty. Open source. Reproduce the numbers; the queries are in CHANGELOG.md.

Source: dev.to
