Self-Improving Agents

What does it take to build an agent that collects its own data, learns from its actions, and gets better through self-play and deliberation? A survey of ~700 papers across 19 research axes.

The core loop
Act in the world. Observe outcomes. Model what happened. Evaluate whether it was good. Improve the policy. Repeat. Every self-improving system implements some variant of this.
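
A minimal sketch of that loop in Python; every name here (agent, env, and the agent's methods) is an illustrative placeholder, not an interface from any system surveyed below.

  def self_improvement_loop(agent, env, iterations=1000):
      """One hedged rendering of act -> observe -> model -> evaluate -> improve."""
      for _ in range(iterations):
          trajectory = []
          obs, done = env.reset(), False
          while not done:
              action = agent.policy(obs)                    # act in the world
              next_obs, reward, done = env.step(action)     # observe outcomes
              trajectory.append((obs, action, reward, next_obs))
              obs = next_obs
          agent.update_world_model(trajectory)              # model what happened
          scores = agent.evaluate(trajectory)               # was it good?
          agent.improve_policy(trajectory, scores)          # improve the policy
      return agent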

Five Components

A fully self-improving agent requires five interlocking capabilities. Each has deep precedent in the literature.

  WORLD MODEL          VALUE FUNCTION         POLICY
  predicts what        estimates long-term     makes decisions
  happens next         worth of states         (explicit or search)
       \                    |                    /
        \                   |                   /
         v                  v                  v
      +-----------------------------------------+
      |        DATA COLLECTION LOOP             |
      |  self-play, exploration, curiosity      |
      +-------------------+---------------------+
                          |
                          v
      +-----------------------------------------+
      |       DELIBERATION / PLANNING           |
      |  MCTS, tree search, chain-of-thought    |
      +-----------------------------------------+
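
Read as interfaces, the five components might look like the sketch below. The class and method names are assumptions made for illustration; no surveyed system exposes exactly this API.

  from abc import ABC, abstractmethod

  class WorldModel(ABC):
      @abstractmethod
      def predict(self, state, action):
          """Predict the next state (and, optionally, reward)."""

  class ValueFunction(ABC):
      @abstractmethod
      def value(self, state):
          """Estimate the long-term worth of a state."""

  class Policy(ABC):
      @abstractmethod
      def act(self, state):
          """Choose an action, directly or by invoking search."""

  class DataCollector(ABC):
      @abstractmethod
      def collect(self, policy, env):
          """Gather trajectories via self-play, exploration, or curiosity."""

  class Planner(ABC):
      @abstractmethod
      def deliberate(self, state, world_model, value_fn, policy):
          """Spend extra compute (search, chain-of-thought) to refine one decision."""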

19 Research Axes

The full survey covers 19 overlapping fields. Each solves a different piece of the self-improvement puzzle.

01 Self-Play
AlphaZero Lineage
Self-play creates an unbounded, self-generating curriculum. Your previous self is always an appropriately challenging opponent. From TD-Gammon through AlphaZero to SPIRAL for LLMs.
02 World Models
Imagination & Simulation
Plan by simulating outcomes instead of trial-and-error. Dyna, Dreamer series, MuZero. New: Sora, Genie, Cosmos as foundation world models.
03 LLM Self-Improvement
Generate, Filter, Retrain
STaR, ReST, self-rewarding LMs, DeepSeek-R1. Works for surfacing latent capabilities but can't bootstrap genuinely new abilities from nothing. (A minimal sketch of this loop follows the axis list.)
04 Curiosity
Exploration & Open-Endedness
ICM, RND, Go-Explore, POET, OMNI-EPIC. Intrinsic motivation prevents the agent from getting stuck in known territory. QD archives as stepping-stone libraries.
05 Agent Architectures
ReAct, Reflexion, SAGE
Autonomous LLM agents that reason, act, and self-correct. Tool use, memory, multi-step planning. Agent-R1 combines RL with agent reasoning.
06 Deliberation
Inference-Time Compute
MCTS, chain-of-thought, Tree of Thoughts, o1/R1. More thinking time on harder problems. The System 1/System 2 paradigm.
07 Neuroscience
Brain-Inspired Learning
Complementary Learning Systems, predictive coding, dopamine as TD error. The brain solves every problem self-improving agents face.
08 Reward Modeling
Value Learning & RLHF
RLHF, DPO, process reward models. How agents learn what's good. Every proxy reward can be hacked -- but scaling makes it manageable.
09 Data Flywheels
Synthetic Data & Self-Collection
Phi models, Orca, AgentInstruct. Deployment generates data; data improves the model. But recursive self-training causes model collapse (Shumailov et al., 2024).
10 Meta-Learning
Learning to Learn
MAML, in-context learning as meta-learning, learned optimizers (VeLO), AlphaEvolve. Six layers from learning initializations to metacognitive self-improvement.
11 Evolution
Neuroevolution & QD
NEAT, MAP-Elites, novelty search, POET. Darwin Gödel Machine rewrites its own code. AlphaEvolve improved the LLMs underlying itself.
12 Hierarchical RL
Temporal Abstraction
Options framework, HER, DIAYN, SayCan, Director. Decompose long-horizon tasks. LLMs for semantic decomposition, RL for execution.
13 Continual Learning
Fighting Forgetting
EWC, progressive networks, replay. CLS theory from neuroscience. Strong pre-trained representations may solve most of the stability problem.
14 Causal Reasoning
Beyond Correlation
Pearl's SCMs, the causal hierarchy theorem, NOTEARS. Robustness requires causal models (Richens & Everitt, 2024). LLMs don't truly reason causally yet.
15 Program Synthesis
Code as Knowledge
DreamCoder, ARC benchmark, FunSearch, Voyager's skill library. Programs are compact, verifiable, composable. Library learning is cumulative self-improvement.
16 Multi-Agent
Societies of Models
QMIX, emergent communication, Theory of Mind, debate. Constitutional AI as multi-agent self-play. Next frontier: structured model societies, not bigger individuals.
17 Offline RL
Learning from Fixed Data
Decision Transformer, CQL, Diffuser, Algorithm Distillation. RL as sequence modeling. Offline pre-train, online fine-tune -- the LLM training paradigm for decisions.
18 Sim-to-Real
Embodied Learning
Domain randomization, Isaac Gym, RT-2, Diffusion Policy, Helix. Reality is just another variation when you randomize enough. VLAs transfer web knowledge to robots.
19 Safety
Alignment Under Self-Improvement
Mesa-optimizers, sleeper agents, alignment faking, reward hacking. A self-improving agent is the hardest alignment target. The capability problem and safety problem are entangled.
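
Axis 03 names a concrete procedure, so a sketch is possible. The loop below is a hedged rendering of the STaR/ReST-style generate-filter-retrain cycle; model.generate, model.finetune, and is_correct are placeholder names, not the APIs of those papers.

  def generate_filter_retrain(model, prompts, is_correct, rounds=3, samples=8):
      """Hedged sketch of the axis-03 loop: surface latent capability by
      training on the model's own verified outputs."""
      for _ in range(rounds):
          kept = []
          for prompt in prompts:
              for _ in range(samples):
                  answer = model.generate(prompt)      # generate candidate reasoning
                  if is_correct(prompt, answer):       # filter with a verifier or ground truth
                      kept.append((prompt, answer))
          if not kept:
              break                                    # nothing survived the filter this round
          model = model.finetune(kept)                 # retrain on the filtered set
      return model

The filter is where the caveat from axis 03 bites: verification can only keep what the base model was already able to produce.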

What Works Today

System                 Self-Play   World Model      Value Learning     Deliberation   Curiosity
AlphaZero              +           Rules            MCTS + value net   MCTS           -
MuZero                 +           Learned          MCTS + value net   MCTS           -
DreamerV3              -           Learned (RSSM)   Actor-critic       Imagination    -
Voyager                -           LLM (implicit)   Skill success      Curriculum     Auto
DeepSeek-R1            -           -                RL reward          Reasoning      -
Darwin Gödel Machine   -           -                Empirical          Evolution      -

No existing system integrates all five components. MuZero comes closest but lacks curiosity and operates only in games. The gap between game-playing and open-world agents is the central open problem.

Hard Limits

Things that don't work (yet)

The Self-Improving Agent Stack

Meta-learning research reveals six layers of self-improvement, each building on the last:

Layer 6  METACOGNITIVE      Agent evaluates its own learning         [open problem]
Layer 5  COMPOSITIONAL      Model merging, skill libraries           [AlphaEvolve, STOP]
Layer 4  ARCHITECTURAL      NAS, AutoML                              [DARTS, EfficientNet]
Layer 3  IN-CONTEXT         ICL, induction heads, task vectors       [transformers]
Layer 2  ALGORITHM          Learned optimizers, RL²                  [VeLO, LPG]
Layer 1  INITIALIZATION     MAML, metric learning                    [few-shot adaptation]

The most powerful emerging systems stack multiple layers simultaneously. AlphaEvolve (DeepMind, 2025) operates at layers 2, 4, and 5 -- and achieved genuine recursive self-improvement by improving the training of its own underlying LLMs.
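
To make the lowest layer concrete, here is a toy MAML-style inner/outer loop on quadratic tasks, chosen because the post-adaptation gradient has a closed form and needs no autodiff library. The task family and step sizes are assumptions for illustration, not the original MAML setup.

  import numpy as np

  # Layer 1 toy: learn an initialization theta that adapts quickly to tasks
  # of the form  loss_t(theta) = ||theta - target_t||^2.
  rng = np.random.default_rng(0)
  theta = rng.normal(size=2)        # meta-learned initialization
  alpha, beta = 0.1, 0.05           # inner (adaptation) and outer (meta) step sizes

  for step in range(2000):
      targets = rng.normal(size=(8, 2))          # sample a batch of tasks
      meta_grad = np.zeros_like(theta)
      for t in targets:
          # Inner step: theta' = theta - 2*alpha*(theta - t).
          # Post-adaptation loss: (1 - 2*alpha)^2 * ||theta - t||^2,
          # whose gradient with respect to the initialization is:
          meta_grad += (1 - 2 * alpha) ** 2 * 2 * (theta - t)
      theta -= beta * meta_grad / len(targets)   # outer step: move the initialization

  # theta drifts toward the mean of the task targets: the point from which
  # a single inner gradient step lands closest to any sampled task.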

Design Principles

Distilled from ~700 papers across the full survey:

  1. Dual systems -- Fast episodic memory + slow parametric learning. Neither alone suffices. (CLS theory)
  2. Value-equivalent models -- World models should predict what matters for decisions, not reconstruct observations. (MuZero)
  3. Self-play as curriculum -- Your own capability level creates appropriately challenging tasks. (AlphaZero)
  4. Curiosity for exploration -- Intrinsic motivation keeps the agent from getting stuck exploiting known territory. (ICM/RND)
  5. Process supervision -- Reward intermediate steps, not just outcomes. (PRM)
  6. Expert iteration -- Slow deliberation generates targets for the fast policy; see the sketch after this list. (ExIt)
  7. Always mix real data -- Never train purely on self-generated outputs. (Anti-collapse)
  8. Skill accumulation -- Store reusable procedures as programs, not weights. (Voyager/DreamCoder)
  9. Quality-diversity over single optima -- Maintain diverse repertoires. Stepping stones solve problems. (MAP-Elites)
  10. Robustness requires causality -- Agents that learn only correlations fail under distribution shift. (Richens & Everitt)

Essential Reading

If you read 15 papers from this entire survey, read these: