Two Brains for a Self-Evolving Agent

Build a support agent today and you’ll meet the same ghost everyone meets. On Monday a customer asks how to return an order; you walk the agent through it — look up the phone number, match the order, call the refund API — and it handles the case beautifully. On Tuesday an almost identical email arrives, and the agent asks for the order number all over again, as if Monday never happened. Today’s agents are one-off problem solvers: they solve, and they forget.

The reflex is to hand the agent more memory: stuff yesterday’s chat logs into the context window, or bolt on a RAG database. But that’s the wrong kind of memory. Retrieval is good at facts (what year the company was founded) and useless at procedure (which workflow to follow when a refund comes in). Pour ten thousand words of yesterday’s trial-and-error into the prompt and the model drowns; the one lesson worth keeping is buried in noise. What an agent needs isn’t a thicker diary. It’s the ability to distill, to turn a chaotic day into a reusable skill.

A doer and a note-taker

SkillOS, a new recipe from Google Cloud AI Research, UIUC, and MIT, makes one clean move: it splits the agent in two. A frozen executor does the work — it retrieves relevant skills from a library and acts on them, and is never trained. A separate, trainable skill curator watches each finished task and decides what to write down: it can insert a new skill, update an existing one, or delete one that’s doing harm. The skills themselves are plain Markdown — a name, a line on when to use this, a workflow, and, tellingly, a line on when not to. Doing and remembering are two different jobs, and the paper’s wager is that you should stop asking one mind to do both.
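To make the shape of such a library concrete, here is a minimal Python sketch of a skill record and the three edits the curator can make. The class names, field names, and Markdown headings are my own assumptions for illustration; the paper's actual file layout may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    # Hypothetical fields mirroring the four parts described above.
    name: str
    when_to_use: str
    workflow: list[str]
    when_not_to_use: str

    def to_markdown(self) -> str:
        # Render the skill as plain Markdown (heading names are assumed).
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(self.workflow, 1))
        return (
            f"# {self.name}\n\n"
            f"## When to use\n{self.when_to_use}\n\n"
            f"## Workflow\n{steps}\n\n"
            f"## When not to use\n{self.when_not_to_use}\n"
        )


@dataclass
class SkillLibrary:
    skills: dict[str, Skill] = field(default_factory=dict)

    def insert(self, skill: Skill) -> None:
        # Add a new skill; refuse to silently overwrite an existing one.
        if skill.name in self.skills:
            raise ValueError(f"skill already exists: {skill.name}")
        self.skills[skill.name] = skill

    def update(self, skill: Skill) -> None:
        # Replace an existing skill with a revised version.
        if skill.name not in self.skills:
            raise KeyError(f"no such skill: {skill.name}")
        self.skills[skill.name] = skill

    def delete(self, name: str) -> None:
        # Remove a skill that is doing harm.
        del self.skills[name]


refund = Skill(
    name="handle-refund",
    when_to_use="A customer asks to return an order.",
    workflow=["Look up the phone number", "Match the order", "Call the refund API"],
    when_not_to_use="The order cannot be matched to the customer.",
)
library = SkillLibrary()
library.insert(refund)
print(refund.to_markdown())
```

The point of the sketch is only that every lesson is a small, human-readable file with an explicit "when not to use" clause, and that the curator's whole vocabulary is three operations on those files.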

Two brains, and where the cut falls

This is the part I keep turning over. Making the agent’s acting self and its remembering self into separate subsystems is the real idea here, and it’s hard not to reach for the brain as a metaphor — not the tired left-versus-right story, which is mostly myth, but something closer to how we actually consolidate. What the curator does after each task is roughly what sleep does after a day: it replays the trajectory and settles the useful part into durable skill, the way the hippocampus hands the day’s episodes to the cortex overnight. The executor lives in the moment, fast and reactive; the curator is slow, retrospective, deliberate — a System 2 to the executor’s System 1.

The bold, faintly unnatural part is that SkillOS freezes the doer and trains only the note-taker. In a real brain the two are tangled, co-evolving. Splitting them cleanly and freezing one half is an engineering compromise — but it’s exactly what makes the memory portable: the same trained curator improves a stronger executor it has never seen, lifting a Gemini-2.5-Pro agent from 66.4% to 80.2% on one benchmark.
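The division of labor can be sketched as a loop in which the executor only reads the library and the curator only writes to it. Everything below is a toy under stated assumptions: `run_episode`, `apply_edit`, and both stand-in functions are hypothetical names, and nothing here is trained, whereas the paper trains the curator with reinforcement learning.

```python
from typing import Callable

Library = dict[str, str]      # skill name -> Markdown body (assumed layout)
Edit = tuple[str, str, str]   # (operation, skill name, new body)

Executor = Callable[[str, Library], str]
Curator = Callable[[str, str, Library], list[Edit]]


def apply_edit(library: Library, edit: Edit) -> None:
    # Apply one curator decision to the shared library.
    op, name, body = edit
    if op in ("insert", "update"):
        library[name] = body
    elif op == "delete":
        library.pop(name, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def run_episode(task: str, library: Library, executor: Executor, curator: Curator) -> str:
    # The executor gets a copy, so it can read skills but never edit them.
    trajectory = executor(task, dict(library))
    # Only after the task is finished does the curator revise the library.
    for edit in curator(task, trajectory, library):
        apply_edit(library, edit)
    return trajectory


def toy_executor(task: str, library: Library) -> str:
    # Stand-in for a frozen model: report which skill names matched the task.
    used = [name for name in library if name in task]
    return f"task={task}; used={used}"


def toy_curator(task: str, trajectory: str, library: Library) -> list[Edit]:
    # Stand-in for the trained curator: write down a refund skill once.
    if "refund" in task and "refund" not in library:
        return [("insert", "refund", "Look up the order, then call the refund API.")]
    return []


library: Library = {}
print(run_episode("refund request", library, toy_executor, toy_curator))
print(run_episode("refund request", library, toy_executor, toy_curator))
```

Because the curator depends only on the task, the trajectory, and the library, swapping in a different `executor` requires no change to it, which is the portability the paragraph above describes.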

What it does well

Three things stand out. First, the memory is a white box. Most systems compress experience into vectors no human can read; when the agent learns the wrong lesson, you can’t even find it. Here the lessons are Markdown — when the agent misbehaves you open its library, read what it wrote like a colleague’s code, and fix a crooked line by hand. In production that legibility is worth a great deal.

Second, it learns to throw things away. Early in training the curator is a hoarder, inserting compulsively into an empty library. Then it discovers that clutter misleads the next task and costs it reward, so inserts fall off, edits and merges take over, and deletions slowly climb. It learns to thin the notebook as it goes. Anyone who’s lived in a large codebase knows the truth here: adding code is easy; daring to delete is the mark of someone who actually understands the system.

Third, the skills climb toward meta-skills. Early notes are platitudes (“pay attention to your surroundings”); later ones are transferable strategy. Told to find a CD “under the lamp,” the naive agent opens drawers at random, like a headless chicken, and times out; the trained one calls up a skill that never mentions CDs: when something is under an object you can’t see, find the light source, walk to it, search nearby. That isn’t a memorized path. It’s a small mental model that ports to rooms it has never entered.

And the punchline: the curator here is an 8-billion-parameter open model, and it beats using Gemini-2.5-Pro as the curator — 61.2% against 50.7% on ALFWorld. Gemini improvises from vast general knowledge; the small model has been drilled, through reinforcement, on what this system actually needs. On a narrow job, a trained specialist outworks a brilliant generalist.

What’s still crude

The paper is honest about where it’s thin, and the gaps are the interesting part. The first is poisoned memory. Because the system trusts its own library, a lesson learned by accident — “just deny every refund,” inferred from a few customers who happened to give up — can ossify into a skill and quietly steer every future agent into the same ditch. The external quality judge is there to catch this, but in a messy business with no answer key, a wrong lesson that compounds is a slow-burning fuse.

The second is deeper: the agent finds its notes by keyword. After a year and ten thousand skills, a few literal keywords will not reliably surface the one file that solves a deep, unfamiliar problem. A real memory system will have to make retrieval itself an action — search, look, realize it’s wrong, search again — and its skills will likely look less like prose than like composable, runnable code. That’s the horizon the paper points at without reaching.
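The weakness is easy to reproduce with a toy. The sketch below scores skills by literal word overlap with the query; it is a stand-in I wrote for illustration, not the paper's retriever, and the function names and the two sample skills are assumptions.

```python
import re


def tokenize(text: str) -> set[str]:
    # Lowercase and split into alphanumeric word tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, library: dict[str, str], top_k: int = 3) -> list[str]:
    # Rank skills by how many query words appear in their name or body.
    query_tokens = tokenize(query)
    scored = []
    for name, body in library.items():
        overlap = len(query_tokens & tokenize(name + " " + body))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


library = {
    "refund-order": "Look up the order by phone number, then call the refund API.",
    "find-hidden-object": "When the target is under an object you cannot see, "
                          "find the light source and search nearby.",
}

# Literal wording matches the stored skill.
print(keyword_search("customer wants refund on order", library))   # ['refund-order']

# Same intent, different words: nothing is retrieved.
print(keyword_search("money back for a returned purchase", library))  # []
```

The second query means the same thing as the first and retrieves nothing, because no word in it appears in either file. With two skills that is a curiosity; with ten thousand it is the failure the paragraph above describes.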

The takeaway

Step back and this is what progress in AI actually looks like now — not a single leap but a stack of concrete, load-bearing steps, arriving faster than anyone can quite track. Memory is one of those steps, and it matters more than its modest framing suggests: an agent that learns to keep the right notes and discard the wrong ones is inching toward one that improves itself — recursive self-improvement, one of the few roads that could bend the curve sharply upward. SkillOS doesn’t get there. It freezes half the brain, files its memories by keyword, and can still be poisoned by its own success. But it teaches a machine to do something we’ve always assumed took a mind: to look back on a day’s work, keep what mattered, and let the rest go. The blank spaces on the map are the ones worth watching.

Sources

Siru Ouyang, Jun Yan, Chen-Yu Lee, et al. SkillOS: Learning Skill Curation for Self-Evolving Agents. arXiv:2605.06614, 2026.
The refund-agent framing and several turns of phrase I owe to a video walkthrough of the paper.
Numbers and the “lamp / CD” example are from the paper (§4 and the case study); any misreading is mine.