I let my AI agents rewrite their own prompts. The hard part was stopping them from getting worse.

Most "self-evolving agent" demos die the moment you think about shipping them. Not because the idea is bad, but because an agent that can rewrite its own prompt can also quietly rewrite itself into something worse. It drifts. A critic starts rewarding the wrong thing. A regression slips in and nobody notices for a week because the output still looks fine at a glance.

I spent the better part of three months building a TypeScript framework around that exact failure mode, and I want to walk through the part nobody demos: not the clever loop, but the gate that keeps the loop honest.

What "self-evolving" actually means here

The idea is simple. A normal agent has a static prompt. You write it once, and it never gets better on its own. You are the optimizer, forever, by hand.

Darwin flips that. The agent runs, something measures how good the run was, and over time the system learns where the prompt is weak and proposes a better version. Then, and this is the important bit, it does not just trust the new version. It earns its place.

You run an agent
       |
Darwin measures quality (a critic scores the output)
       |
Patterns emerge over time ("weak on technical topics")
       |
A new prompt variant is generated
       |
A/B tested against the current default
       |
The winner becomes the default
       |
Your agent got better. You did nothing.

That last line is the marketing version. The honest version has a lot more machinery under it, because "the winner becomes the default" is where everything can go wrong.

Why most of these stay demos

Here is the failure you do not see in a five-minute video. You wire up a loop, an LLM critiques its own output, it rewrites its own prompt, and for the first ten runs it genuinely looks like it is improving. Then one of these happens:

The critic optimizes for the wrong signal. It starts rewarding longer answers, or more confident ones, and quality quietly drops while the score goes up.

A tool the agent depends on has a bad hour. The outputs get worse for reasons that have nothing to do with the prompt, the system reads that as "the current prompt is bad," and it evolves away from a prompt that was actually fine.

A rewrite erodes a constraint. The old prompt said "never invent a source." The new, higher-scoring variant is more fluent and slightly more willing to make things up. Your score went up. Your safety went down.

Checking after every single run inflates false positives, so the system declares winners that are just noise.

None of these are exotic. They are the default outcome if you build the loop and not the gate.

The gate is the actual product

So the loop is maybe a third of the work. The rest is the set of guards that decide whether a mutation is allowed to survive. Four of them matter most.

Regression rollback to last-known-good. Every promoted prompt has a recorded baseline. If a newly promoted variant underperforms its predecessor past a threshold, it rolls back automatically. Evolution is allowed to try things. It is not allowed to keep things that made the agent worse.

Data-quality guards that pause evolution. If the signal feeding the critic looks broken, a tool timing out, empty responses, a spike of errors, evolution pauses instead of learning from garbage. You do not want your agent drawing conclusions during an outage.

An alignment check on every mutation. Before any rewrite is even eligible, it is checked against the constraints the agent is supposed to hold. A more fluent prompt that quietly drops a safety rule does not get to compete on score, because it never enters the ring.

Statistically honest A/B. Because the tempting thing is to peek after every run, the gate uses always-valid sequential tests (mSPRT and Hoeffding-style bounds) so that continuous checking does not manufacture significance. A variant wins when it actually won, not when you looked at the right moment.

If you have ever watched an agent get subtly worse after a prompt change and had no principled way to catch it, that stack is the whole reason this exists.

Show me the code

Running an agent is one command:

npm install darwin-agents better-sqlite3
export ANTHROPIC_API_KEY=sk-ant-...   # or OPENAI_API_KEY, or use the Claude CLI

npx darwin run writer "Explain the CAP theorem in simple terms"

Turning on evolution is opt-in, per agent. Nothing rewrites itself unless you say so:

npx darwin evolve writer --enable
npx darwin status writer

Defining your own agent is about a dozen lines:

import { defineAgent } from 'darwin-agents';

export const writer = defineAgent({
  name: 'writer',
  role: 'You explain technical topics clearly and simply.',
  // evolution is off until you enable it; the safety gate is always on
  evolution: {
    enabled: false,
    // opt in to the reflective optimizer when you want it
    useGepa: true,
  },
});

State is stored as a single JSON blob per backend (SQLite or Postgres), which turned out to matter a lot for keeping the thing backward-compatible. Adding a new optional field to an agent's evolution state does not break older rows. They just lack the key, and you read defensively. Boring, but it means upgrades do not eat your history.

The part I am most proud of, briefly

The mutation itself can be driven by a GEPA reflective optimizer running online, inside the gate, instead of as an offline batch job you run once a week. The agent reflects on its own recent trajectories, proposes a targeted rewrite, and that rewrite still has to clear every guard above before it ships. Reflection proposes. The gate disposes. That separation is the whole trick.

Where this actually is, honestly

I am not going to dress up the numbers, because the honest version is more interesting than the pitch.

For most of the last three months this sat at single-digit stars. The quiet stretch where you publish version after version into the void and wonder if anyone runs them. Then, in the last two weeks, without a launch, something turned. Twelve stars in a single day after months of zero-to-one. The core package went from roughly six installs a day to around eighteen. A LangGraph adapter I shipped five weeks ago went from a trickle to a few hundred downloads a week.

The absolute numbers are still small. Eight stars is not a movement and I will not pretend otherwise. But there is a real difference between a number that is small and a number that is small and accelerating, and the curve stopped being flat.

The repo being small is not because it is new, by the way. Some of this code has been running our own agent fleet for months. It is small because I only recently decided to share it, and because I am genuinely bad at growth tactics. The code is real, it gets used, issues get answered. That is the whole offer.

From a small studio in Palma de Mallorca.

If you want to poke at it

It is MIT, TypeScript, on npm as darwin-agents, with a darwin-langgraph adapter if you already live in LangGraph. Source and docs are on GitHub under studiomeyer-io.

The one thing I would genuinely like to hear from this community: what do you do to keep a self-improving agent from drifting? The gate above is my answer, but I am still learning this part, and the failure modes are sneakier than they look. If you have been burned by one, I want to know which one.