Self-Improving Agents

What does it take to build an agent that collects its own data, learns from its actions, and gets better through self-play and deliberation? A survey of ~700 papers across 19 research axes.

The core loop
Act in the world. Observe outcomes. Model what happened. Evaluate whether it was good. Improve the policy. Repeat. Every self-improving system implements some variant of this.
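
A minimal sketch of that loop in Python; every name here (agent, env, and the agent's methods) is an illustrative placeholder, not an interface from any system surveyed below.

  def self_improvement_loop(agent, env, iterations=1000):
      """One hedged rendering of act -> observe -> model -> evaluate -> improve."""
      for _ in range(iterations):
          trajectory = []
          obs, done = env.reset(), False
          while not done:
              action = agent.policy(obs)                    # act in the world
              next_obs, reward, done = env.step(action)     # observe outcomes
              trajectory.append((obs, action, reward, next_obs))
              obs = next_obs
          agent.update_world_model(trajectory)              # model what happened
          scores = agent.evaluate(trajectory)               # was it good?
          agent.improve_policy(trajectory, scores)          # improve the policy
      return agent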

Five Components

A fully self-improving agent requires five interlocking capabilities. Each has deep precedent in the literature.

  WORLD MODEL          VALUE FUNCTION         POLICY
  predicts what        estimates long-term     makes decisions
  happens next         worth of states         (explicit or search)
       \                    |                    /
        \                   |                   /
         v                  v                  v
      +-----------------------------------------+
      |        DATA COLLECTION LOOP             |
      |  self-play, exploration, curiosity      |
      +-------------------+---------------------+
                          |
                          v
      +-----------------------------------------+
      |       DELIBERATION / PLANNING           |
      |  MCTS, tree search, chain-of-thought    |
      +-----------------------------------------+
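
Read as interfaces, the five components might look like the sketch below. The class and method names are assumptions made for illustration; no surveyed system exposes exactly this API.

  from abc import ABC, abstractmethod

  class WorldModel(ABC):
      @abstractmethod
      def predict(self, state, action):
          """Predict the next state (and, optionally, reward)."""

  class ValueFunction(ABC):
      @abstractmethod
      def value(self, state):
          """Estimate the long-term worth of a state."""

  class Policy(ABC):
      @abstractmethod
      def act(self, state):
          """Choose an action, directly or by invoking search."""

  class DataCollector(ABC):
      @abstractmethod
      def collect(self, policy, env):
          """Gather trajectories via self-play, exploration, or curiosity."""

  class Planner(ABC):
      @abstractmethod
      def deliberate(self, state, world_model, value_fn, policy):
          """Spend extra compute (search, chain-of-thought) to refine one decision."""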

19 Research Axes

The full survey covers 19 overlapping fields. Each solves a different piece of the self-improvement puzzle.

01 Self-Play
AlphaZero Lineage
Self-play creates an unbounded, self-generating curriculum. Your previous self is always an appropriately challenging opponent. From TD-Gammon through AlphaZero to SPIRAL for LLMs.
02 World Models
Imagination & Simulation
Plan by simulating outcomes instead of trial-and-error. Dyna, Dreamer series, MuZero. New: Sora, Genie, Cosmos as foundation world models.
03 LLM Self-Improvement
Generate, Filter, Retrain
STaR, ReST, self-rewarding LMs, DeepSeek-R1. Works for surfacing latent capabilities but can't bootstrap genuinely new abilities from nothing. (A minimal sketch of this loop follows the axis list.)
04 Curiosity
Exploration & Open-Endedness
ICM, RND, Go-Explore, POET, OMNI-EPIC. Intrinsic motivation prevents the agent from getting stuck in known territory. QD archives as stepping-stone libraries.
05 Agent Architectures
ReAct, Reflexion, SAGE
Autonomous LLM agents that reason, act, and self-correct. Tool use, memory, multi-step planning. Agent-R1 combines RL with agent reasoning.
06 Deliberation
Inference-Time Compute
MCTS, chain-of-thought, Tree of Thoughts, o1/R1. More thinking time on harder problems. The System 1/System 2 paradigm.
07 Neuroscience
Brain-Inspired Learning
Complementary Learning Systems, predictive coding, dopamine as TD error. The brain solves every problem self-improving agents face.
08 Reward Modeling
Value Learning & RLHF
RLHF, DPO, process reward models. How agents learn what's good. Every proxy reward can be hacked -- but scaling makes it manageable.
09 Data Flywheels
Synthetic Data & Self-Collection
Phi models, Orca, AgentInstruct. Deployment generates data; data improves the model. But recursive self-training causes model collapse (Shumailov et al., 2024).
10 Meta-Learning
Learning to Learn
MAML, in-context learning as meta-learning, learned optimizers (VeLO), AlphaEvolve. Six layers from learning initializations to metacognitive self-improvement.
11 Evolution
Neuroevolution & QD
NEAT, MAP-Elites, novelty search, POET. Darwin Gödel Machine rewrites its own code. AlphaEvolve improved the LLMs underlying itself.
12 Hierarchical RL
Temporal Abstraction
Options framework, HER, DIAYN, SayCan, Director. Decompose long-horizon tasks. LLMs for semantic decomposition, RL for execution.
13 Continual Learning
Fighting Forgetting
EWC, progressive networks, replay. CLS theory from neuroscience. Strong pre-trained representations may solve most of the stability problem.
14 Causal Reasoning
Beyond Correlation
Pearl's SCMs, the causal hierarchy theorem, NOTEARS. Robustness requires causal models (Richens & Everitt, 2024). LLMs don't truly reason causally yet.
15 Program Synthesis
Code as Knowledge
DreamCoder, ARC benchmark, FunSearch, Voyager's skill library. Programs are compact, verifiable, composable. Library learning is cumulative self-improvement.
16 Multi-Agent
Societies of Models
QMIX, emergent communication, Theory of Mind, debate. Constitutional AI as multi-agent self-play. Next frontier: structured model societies, not bigger individuals.
17 Offline RL
Learning from Fixed Data
Decision Transformer, CQL, Diffuser, Algorithm Distillation. RL as sequence modeling. Offline pre-train, online fine-tune -- the LLM training paradigm for decisions.
18 Sim-to-Real
Embodied Learning
Domain randomization, Isaac Gym, RT-2, Diffusion Policy, Helix. Reality is just another variation when you randomize enough. VLAs transfer web knowledge to robots.
19 Safety
Alignment Under Self-Improvement
Mesa-optimizers, sleeper agents, alignment faking, reward hacking. A self-improving agent is the hardest alignment target. The capability problem and safety problem are entangled.
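
Axis 03 names a concrete procedure, so a sketch is possible. The loop below is a hedged rendering of the STaR/ReST-style generate-filter-retrain cycle; model.generate, model.finetune, and is_correct are placeholder names, not the APIs of those papers.

  def generate_filter_retrain(model, prompts, is_correct, rounds=3, samples=8):
      """Hedged sketch of the axis-03 loop: surface latent capability by
      training on the model's own verified outputs."""
      for _ in range(rounds):
          kept = []
          for prompt in prompts:
              for _ in range(samples):
                  answer = model.generate(prompt)      # generate candidate reasoning
                  if is_correct(prompt, answer):       # filter with a verifier or ground truth
                      kept.append((prompt, answer))
          if not kept:
              break                                    # nothing survived the filter this round
          model = model.finetune(kept)                 # retrain on the filtered set
      return model

The filter is where the caveat from axis 03 bites: verification can only keep what the base model was already able to produce.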

What Works Today

System                 Self-Play   World Model      Value Learning     Deliberation   Curiosity
AlphaZero              +           Rules            MCTS + value net   MCTS           -
MuZero                 +           Learned          MCTS + value net   MCTS           -
DreamerV3              -           Learned (RSSM)   Actor-critic       Imagination    -
Voyager                -           LLM (implicit)   Skill success      Curriculum     Auto
DeepSeek-R1            -           -                RL reward          Reasoning      -
Darwin Gödel Machine   -           -                Empirical          Evolution      -

No existing system integrates all five components. MuZero comes closest but lacks curiosity and operates only in games. The gap between game-playing and open-world agents is the central open problem.

Hard Limits

Things that don't work (yet)

The Self-Improving Agent Stack

Meta-learning research reveals six layers of self-improvement, each building on the last:

Layer 6  METACOGNITIVE      Agent evaluates its own learning         [open problem]
Layer 5  COMPOSITIONAL      Model merging, skill libraries           [AlphaEvolve, STOP]
Layer 4  ARCHITECTURAL      NAS, AutoML                              [DARTS, EfficientNet]
Layer 3  IN-CONTEXT         ICL, induction heads, task vectors       [transformers]
Layer 2  ALGORITHM          Learned optimizers, RL²                  [VeLO, LPG]
Layer 1  INITIALIZATION     MAML, metric learning                    [few-shot adaptation]

The most powerful emerging systems stack multiple layers simultaneously. AlphaEvolve (DeepMind, 2025) operates at layers 2, 4, and 5 -- and achieved genuine recursive self-improvement by improving the training of its own underlying LLMs.
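
To make the lowest layer concrete, here is a toy MAML-style inner/outer loop on quadratic tasks, chosen because the post-adaptation gradient has a closed form and needs no autodiff library. The task family and step sizes are assumptions for illustration, not the original MAML setup.

  import numpy as np

  # Layer 1 toy: learn an initialization theta that adapts quickly to tasks
  # of the form  loss_t(theta) = ||theta - target_t||^2.
  rng = np.random.default_rng(0)
  theta = rng.normal(size=2)        # meta-learned initialization
  alpha, beta = 0.1, 0.05           # inner (adaptation) and outer (meta) step sizes

  for step in range(2000):
      targets = rng.normal(size=(8, 2))          # sample a batch of tasks
      meta_grad = np.zeros_like(theta)
      for t in targets:
          # Inner step: theta' = theta - 2*alpha*(theta - t).
          # Post-adaptation loss: (1 - 2*alpha)^2 * ||theta - t||^2,
          # whose gradient with respect to the initialization is:
          meta_grad += (1 - 2 * alpha) ** 2 * 2 * (theta - t)
      theta -= beta * meta_grad / len(targets)   # outer step: move the initialization

  # theta drifts toward the mean of the task targets: the point from which
  # a single inner gradient step lands closest to any sampled task.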

Design Principles

Distilled from ~700 papers across the full survey:

  1. Dual systems -- Fast episodic memory + slow parametric learning. Neither alone suffices. (CLS theory)
  2. Value-equivalent models -- World models should predict what matters for decisions, not reconstruct observations. (MuZero)
  3. Self-play as curriculum -- Your own capability level creates appropriately challenging tasks. (AlphaZero)
  4. Curiosity for exploration -- Intrinsic motivation keeps the agent from getting stuck exploiting known territory. (ICM/RND)
  5. Process supervision -- Reward intermediate steps, not just outcomes. (PRM)
  6. Expert iteration -- Slow deliberation generates targets for the fast policy; see the sketch after this list. (ExIt)
  7. Always mix real data -- Never train purely on self-generated outputs. (Anti-collapse)
  8. Skill accumulation -- Store reusable procedures as programs, not weights. (Voyager/DreamCoder)
  9. Quality-diversity over single optima -- Maintain diverse repertoires. Stepping stones solve problems. (MAP-Elites)
  10. Robustness requires causality -- Agents that learn only correlations fail under distribution shift. (Richens & Everitt)

Essential Reading

If you read 15 papers from this entire survey, read these: