What does it take to build an agent that collects its own data, learns from its actions, and gets better through self-play and deliberation? A survey of ~700 papers across 19 research axes.
A fully self-improving agent requires five interlocking capabilities. Each has deep precedent in the literature.
```
WORLD MODEL              VALUE FUNCTION            POLICY
predicts what            estimates long-term       makes decisions
happens next             worth of states           (explicit or search)
      \                         |                        /
       \                        |                       /
        v                       v                      v
+-----------------------------------------+
|           DATA COLLECTION LOOP          |
|    self-play, exploration, curiosity    |
+-------------------+---------------------+
                    |
                    v
+-----------------------------------------+
|         DELIBERATION / PLANNING         |
|   MCTS, tree search, chain-of-thought   |
+-----------------------------------------+
```
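The way these five capabilities interlock can be sketched as a single agent loop. This is a toy illustration under assumed names (the `Agent` class and its methods are hypothetical, not drawn from any system in the survey); the world model, value function, and deliberation are stubbed with trivial arithmetic so the control flow is visible:

```python
class Agent:
    """Toy sketch of the five-capability loop; all names are hypothetical."""

    def world_model(self, state, action):
        # World model: predicts what happens next (toy deterministic model).
        return (state + action) % 10

    def value(self, state):
        # Value function: estimates long-term worth of a state (toy heuristic).
        return -abs(state - 5)

    def policy(self, state, actions, depth=2):
        # Deliberation/planning: shallow lookahead search over the world model,
        # standing in for MCTS or tree search.
        def lookahead(s, d):
            if d == 0:
                return self.value(s)
            return max(lookahead(self.world_model(s, a), d - 1) for a in actions)
        return max(actions, key=lambda a: lookahead(self.world_model(state, a), depth - 1))

    def collect(self, start, actions, steps=20):
        # Data-collection loop: act, observe, and store transitions
        # that a real system would learn from.
        data, state = [], start
        for _ in range(steps):
            action = self.policy(state, actions)
            nxt = self.world_model(state, action)
            data.append((state, action, nxt))
            state = nxt
        return data

agent = Agent()
trajectory = agent.collect(start=0, actions=[1, 2, 3])
```

In a real system each stub would be a learned network and the collected transitions would feed back into training; here the point is only how the pieces call one another.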
The full survey covers 19 overlapping research fields, each addressing a different piece of the self-improvement puzzle.
| System | Self-Play | World Model | Value Learning | Deliberation | Curiosity |
|---|---|---|---|---|---|
| AlphaZero | ++ | Rules | MCTS + value net | MCTS | - |
| MuZero | ++ | Learned | MCTS + value net | MCTS | - |
| DreamerV3 | - | Learned (RSSM) | Actor-critic | Imagination | - |
| Voyager | - | LLM (implicit) | Skill success | Curriculum | Auto |
| DeepSeek-R1 | - | - | RL reward | Reasoning | - |
| Darwin Gödel Machine | - | - | Empirical | Evolution | - |
No existing system integrates all five components. MuZero comes closest but lacks curiosity and operates only in games. The gap between game-playing and open-world agents is the central open problem.
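Curiosity, the column nearly every system in the table leaves blank, is commonly implemented as an intrinsic reward for novel or poorly-predicted states. A minimal count-based sketch (an illustration of the general idea, not the mechanism of any system listed above):

```python
from collections import Counter

def intrinsic_reward(visit_counts: Counter, state) -> float:
    """Count-based curiosity bonus: rarely visited states earn more reward.

    A common simple form is 1/sqrt(n). Prediction-error bonuses (ICM, RND)
    follow the same principle, substituting a learned novelty signal
    for the raw visit count.
    """
    visit_counts[state] += 1
    return visit_counts[state] ** -0.5

counts = Counter()
first = intrinsic_reward(counts, "s0")   # novel state -> bonus 1.0
for _ in range(3):
    later = intrinsic_reward(counts, "s0")
# later == 4 ** -0.5 == 0.5: familiar states pay less
```

Added to the environment reward, a bonus like this pushes the data-collection loop toward unexplored regions of the state space.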
Meta-learning research reveals six layers of self-improvement, each building on the last:
| Layer | Focus | Mechanism | Exemplars |
|---|---|---|---|
| 6 | METACOGNITIVE | Agent evaluates its own learning | open problem |
| 5 | COMPOSITIONAL | Model merging, skill libraries | AlphaEvolve, STOP |
| 4 | ARCHITECTURAL | NAS, AutoML | DARTS, EfficientNet |
| 3 | IN-CONTEXT | ICL, induction heads, task vectors | transformers |
| 2 | ALGORITHM | Learned optimizers, RL^2 | VeLO, LPG |
| 1 | INITIALIZATION | MAML, metric learning | few-shot adaptation |
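Layer 1 is the easiest to make concrete. MAML learns an initialization from which every task is a few gradient steps away. A first-order sketch on a one-parameter model (illustrative only; the quadratic "fit target t" tasks are an assumption for the example, not from the paper):

```python
# Layer 1 (initialization) sketch: first-order MAML on a 1-parameter model.
# Each task is "fit target t" with loss (theta - t)^2.

def inner_adapt(theta, target, lr=0.1):
    grad = 2 * (theta - target)          # d/dtheta of (theta - target)^2
    return theta - lr * grad             # one gradient step on the task

def maml_step(theta, tasks, meta_lr=0.05, lr=0.1):
    # Meta-update: average the post-adaptation gradients across tasks,
    # moving the initialization toward points that adapt well everywhere.
    meta_grad = 0.0
    for t in tasks:
        adapted = inner_adapt(theta, t, lr)
        meta_grad += 2 * (adapted - t)   # gradient of post-adaptation loss
    return theta - meta_lr * meta_grad / len(tasks)

theta = 0.0
for _ in range(100):
    theta = maml_step(theta, tasks=[-1.0, 1.0, 3.0])
# theta converges toward 1.0, the initialization from which
# all three tasks are reachable fastest
```

The higher layers replace pieces of this loop with learned components: Layer 2 learns the update rule itself, Layer 4 the architecture it runs on.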
The most powerful emerging systems stack multiple layers simultaneously. AlphaEvolve (DeepMind, 2025) operates at layers 2, 4, and 5, and achieved a degree of recursive self-improvement by improving the training of its own underlying LLMs.
Distilled from ~700 papers across the full survey:
If you read only 15 papers from this survey, read these: