
ARC-AGI

The Abstraction and Reasoning Corpus — distilled from 40+ papers
Last updated: 2026-02-15

One sentence

ARC-AGI tests whether AI can learn new abstractions from a few examples — the kind of reasoning humans do effortlessly but current AI struggles with, making it the benchmark for measuring progress toward general intelligence.

Understand

ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, researcher at Google, who designed ARC to measure intelligence as skill-acquisition efficiency; fchollet.com) to measure machine intelligence in ways that current AI cannot easily game.

What is an ARC task?

Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
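A toy sketch of this setup (not an official ARC task; the grids and the hidden rule are invented for illustration, with colors encoded as integers 0-9):

```python
# Toy ARC-style task: grids are 2D lists of color indices. The hidden rule
# here recolors every 1-cell to 2; we infer it from the training pairs.
train_pairs = [
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]

def apply_rule(grid, mapping):
    return [[mapping.get(c, c) for c in row] for row in grid]

# Infer a per-color mapping consistent with all training pairs.
mapping = {}
for inp, out in train_pairs:
    for ri, row in enumerate(inp):
        for ci, c in enumerate(row):
            mapping[c] = out[ri][ci]

test_input = [[0, 1], [1, 1]]
prediction = apply_rule(test_input, mapping)   # [[0, 2], [2, 2]]
```

Real ARC transformations are far richer (object movement, symmetry, counting), which is why simple rule enumeration like this breaks down quickly.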

Why ARC matters

ARC is designed to be: easy for humans, hard for current AI, and resistant to memorization — every task requires inferring a rule the solver has never seen.

The key insight

ARC measures skill-acquisition efficiency — how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.

3 numbers that matter

85% — Average human performance on ARC-AGI public eval[1]
55.5% — Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2]
$1M — ARC Prize for 85%+ on private eval with open-source code[1]

Maturity map

working · promising · experimental · research
APPROACHES
working — TTT (Test-Time Training) + LLM (test-time compute). Training the model on each specific task at inference time; the key technique in top ARC solvers.
promising — Program synthesis + search
experimental — Neurosymbolic (LLM + DSL)
research — Active inference, object-centric

Terms & Glossary

Hover over highlighted terms anywhere on this page for definitions.

Core concepts
ARC — Abstraction and Reasoning Corpus.
TTT — Test-Time Training. Train on each task.
DSL — Domain-Specific Language for transforms.
Core Knowledge — Human priors (objects, counting).
Approaches
Program synthesis — Generate code that solves tasks.
Neurosymbolic — Combine neural + symbolic reasoning.
MCTS — Monte Carlo Tree Search for DSL programs.
Active inference — Model world to minimize surprise.
Benchmark versions
ARC-AGI-1 — Original 800 tasks. 2019-2024.
ARC-AGI-2 — Harder. 120 eval tasks. 2025.
ARC-AGI-3 — Interactive mini-games. March 2026.
RE-ARC — Synthetic training variants.
Key systems
o3 — OpenAI's reasoning model. 55.5% on ARC.
NVARC — NVIDIA's TTT approach. 1st place, 2025.
TRM — 7M params, beats large LLMs.
ARChitects — 1st place 2024, 53.5% private.
Training tools
nanochat — Full LLM pipeline for $100. Karpathy.
Muon — Orthogonalized optimizer for matrices.
TRL — HuggingFace RL training library.
OpenRLHF — Distributed RLHF framework.

ARC-AGI-2 (2025)

Released early 2025, ARC-AGI-2 is significantly harder for AI while remaining easy for humans. Pure LLMs score 0%. Even frontier reasoning systems achieve only single-digit percentages.

Key difference

ARC-AGI-1 tasks could often be solved instantly by humans. In ARC-AGI-2, every task requires deliberate thinking — average human completion time is 2.7 minutes. But 100% of tasks have been solved by at least 2 humans in under 2 attempts.

What changed from ARC-AGI-1

Aspect | ARC-AGI-1 | ARC-AGI-2
Eval set size | 100 tasks | 120 tasks
Human avg score | ~85% | ~60%
Human avg time | < 1 min | 2.7 min
Pure LLM score | ~5-15% | 0%
Best AI (open) | 53.5% | ~24%
Best AI (any) | 55.5% | 54%

Design changes

Why efficiency matters now

Log-linear scaling is insufficient for ARC-AGI-2. New test-time adaptation algorithms or novel AI architectures are needed. You can't just throw more compute at it.

ARC Prize 2025 Competition

1,455 teams submitted 15,154 entries. More than $125,000 awarded.

TOP SCORE WINNERS (Kaggle)
1st: NVARC — 24% @ $0.20/task (Ivan Sorokin, Jean-François Puget)
2nd: ARChitects โ€” 16.5% (masked diffusion LLM)
3rd: MindsAI โ€” 15.4% (TTT pipeline)
PAPER AWARD WINNERS
1st: TRM — 7M params, 45% ARC-1, 8% ARC-2
2nd: SOAR — Self-improving evolutionary synthesis, 52% ARC-1
3rd: CompressARC — MDL-based, 76K params, no pretraining

2025 Winning Solutions Deep Dive

NVARC 1st Place Score

NVIDIA Kaggle Grandmasters' winning approach combines Qwen3 4B with Tiny Recursive Models.

  • Core insight: Move heavy reasoning offline into synthetic data pipeline
  • Synthetic data: 103K puzzles + 3.2M augmented examples
  • Training: All done on single H100; batch size 2 works on 24GB GPUs
  • Inference: Qwen-4B alone achieved 24% — ensemble added marginal gains
  • Cost: Only $0.20 per task (vs $10K+ for o3)
Key strategy

Heavyweight LLM reasoning (CoT, tool use, agents) couldn't fit Kaggle's runtime. Instead: train smaller models offline that run fast during evaluation.

TRM (Tiny Recursive Model) 1st Place Paper
GitHub · Paper

Samsung researcher Alexia Jolicoeur-Martineau shows that 7M parameters can beat 671B parameter models.

  • Architecture: Single transformer block that iterates, not stacks
  • Mechanism: Recursively improves predicted answer + latent state
  • Results: 45% ARC-1, 8% ARC-2 — beats DeepSeek-R1 (671B), o3-mini, Gemini 2.5 Pro
  • Also improved: Sudoku-Extreme 87.4%, Maze-Hard 85.3%
Core principle

Recursion substitutes for depth and size. By iteratively reasoning over its own output, TRM simulates a much deeper architecture without the memory cost.
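The iterate-not-stack idea can be sketched in a few lines of numpy. This is a hypothetical toy, not the actual TRM architecture: the shapes and update rules are invented purely to show one small block refining an answer and a latent state in a loop.

```python
import numpy as np

# Toy sketch of recursive refinement (hypothetical, not real TRM): a single
# small update step is applied repeatedly to a latent state z and an answer
# estimate y, simulating a deep network without stacking layers.
rng = np.random.default_rng(0)
W_z = rng.normal(size=(8, 8)) * 0.1   # toy latent-update weights
W_y = rng.normal(size=(8, 8)) * 0.1   # toy answer-update weights

def step(x, y, z):
    z = np.tanh(x + y @ W_z + z @ W_z)   # refine latent from input + answer
    y = np.tanh(y + z @ W_y)             # refine answer from latent
    return y, z

x = rng.normal(size=8)   # embedded task input
y = np.zeros(8)          # initial answer guess
z = np.zeros(8)          # initial latent state
for _ in range(16):      # recursion substitutes for depth
    y, z = step(x, y, z)
```

Sixteen iterations of one block behave like a sixteen-layer stack, but the parameter and memory cost stays that of a single block.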

SOAR 2nd Place Paper
GitHub · Paper · Models

Self-improving evolutionary program synthesis by Julien Pourcel et al. (ICML 2025).

  • Method: LLM generates programs → test → refine promising ones → fine-tune LLM on attempts
  • Key insight: Failed programs are correct programs for different tasks — use them as training data
  • Results: 52% ARC-1 test set — SOTA for open-source LLMs without hand-crafted DSLs
  • Dataset released: 5 million ARC solutions
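The hindsight trick can be sketched like this (illustrative stubs, not SOAR's actual pipeline; the tasks and candidate programs are invented):

```python
# SOAR-style hindsight relabeling in miniature: a program that fails its
# target task may exactly solve another task, so keep it as a training
# example for that task instead of discarding it.
tasks = {
    "double": [(1, 2), (3, 6)],
    "increment": [(1, 2), (3, 4)],
}
candidates = {
    "prog_a": lambda x: x + 1,   # generated while attempting "double"; fails it
    "prog_b": lambda x: x * 2,
}

training_data = []
for name, prog in candidates.items():
    for task, pairs in tasks.items():
        if all(prog(x) == y for x, y in pairs):
            training_data.append((task, name))   # relabel: prog solves this task
```

Here `prog_a` fails "double" but is a perfect solution for "increment", so it still becomes useful fine-tuning data.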
CompressARC 3rd Place Paper
GitHub · Paper

Isaac Liao (CMU) proves lossless compression alone can produce intelligent behavior.

  • Approach: Minimum Description Length (MDL) — find shortest program that outputs puzzle + solution
  • No pretraining: Models randomly initialized, trained only at inference time
  • No dataset: One model trains on one task, outputs one answer
  • Size: Only 76K parameters, ~20 min per puzzle on RTX 4070
  • Results: 20-34% ARC-1, 4% ARC-2
ARChitects 2025 2nd Place Score

Major evolution from their 2024 autoregressive approach to masked diffusion.

  • Architecture: LLaDA-8B masked diffusion LLM with 2D positional encoding
  • Method: Soft-masking turns grid positions into partially noisy states, iteratively refined
  • Key change: Autoregressive failed on tasks requiring global structure understanding
  • Results: 16.5% on ARC-AGI-2
Poetiq SOTA Verified (54%)
GitHub · Blog

First to break 50% on ARC-AGI-2 through refinement loops on Gemini 3 Pro.

  • Core method: Generate solution → receive feedback → analyze → refine (iterate)
  • Self-auditing: System monitors progress, decides when to stop
  • Cost: $30/task (vs Gemini 3 Deep Think at $77/task for 45%)
  • Model-agnostic: Works with OpenAI, Anthropic, xAI models too
2025's defining theme

The emergence of the refinement loop — per-task iterative optimization guided by feedback. This is how the 50% barrier was finally broken.
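The pattern reduces to a generate-score-refine loop with a self-audited stop condition. The sketch below uses invented stubs; systems like Poetiq drive `propose` with an LLM and derive `score` from task feedback.

```python
# Refinement-loop skeleton (stub components for illustration).
def propose(task, state):
    # Stub proposer: move the last guess toward the target using feedback.
    return state["guess"] + (state["error"] + 1) // 2

def score(task, guess):
    return task["target"] - guess      # signed residual; 0 means solved

task = {"target": 100}
state = {"guess": 0, "error": 100}
for _ in range(20):
    guess = propose(task, state)
    state = {"guess": guess, "error": score(task, guess)}
    if state["error"] == 0:            # self-auditing: stop when solved
        break
```

The structure, not the stub arithmetic, is the point: each iteration conditions on the previous attempt's feedback, and the system decides for itself when to stop spending compute.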

Commercial Model Performance

Model | ARC-AGI-2 | Cost/task | Notes
Poetiq + Gemini 3 Pro | 54% | $30 | SOTA verified, open source
Gemini 3 Deep Think | 45% | $77 | Previous best
Opus 4.5 (Thinking, 64k) | 37.6% | $2.20 | Best single-model commercial
Gemini 3 Pro (baseline) | 31% | $0.81 | No refinement
NVARC | 24% | $0.20 | Competition winner

ARC-AGI-3 (2026)

Launching March 25, 2026. The first major format change since ARC was introduced in 2019. ARC-AGI-3 moves from static reasoning to interactive reasoning.

The shift

ARC-1 and ARC-2 test static reasoning on grids. ARC-3 tests interactive reasoning in mini-games — agents must explore, plan, remember, and adapt across multiple steps.

Current AI performance on ARC-3 preview
  • Humans: Solve games within minutes
  • AI agents: Score zero points — fail even basic preview games

What's new in ARC-AGI-3

FORMAT CHANGES
Interactive grid-world mini-games (not static puzzles)
1,000+ levels across 150+ environments
Agents must perceive → plan → act across multiple steps
Long-horizon goals requiring memory

Required AI capabilities

Why the format change?

ARC-AGI-1 and ARC-AGI-2 are being "overfit" — not through direct memorization, but because public train and private test sets share enough similarity that models trained on public data perform better than they should. ARC-3 makes this impossible.

Preview results (July 2025)

The ARC-AGI-3 preview released video-game-like environments where agents must achieve long-horizon goals.

Humans
✓ Solve games within minutes
✓ Quickly infer rules through exploration
✓ Adapt strategies based on feedback
Current AI
✗ Score zero points
✗ Fail even the developer preview (3 games)
✗ Cannot maintain coherent multi-step plans

Efficiency comparison

ARC-AGI-3 will enable formal comparison of human and AI action efficiency (learning efficiency) for the first time. Not just "did you solve it?" but "how efficiently did you learn the rules?"

ARC-AGI-3 Preview

Play the preview games and see how you compare to current AI systems.

Timeline

July 2025 — Preview release (3 games)
March 25, 2026 — Full ARC-AGI-3 launch
2026 — ARC Prize 2026 competition

Reality Check

The uncomfortable truths about current approaches.

The o3 asterisk
  • 55.5% at $10K+ per task — not practical[2]
  • Low-compute config: only 25% (worse than many open approaches)
  • Still far from human-level efficiency
  • Does not qualify for ARC Prize (not open-source)

Current leaderboard (2025)

System | ARC-AGI-1 | ARC-AGI-2 | Approach
o3-high | 55.5% | — | Massive compute ($10K+/task)
NVARC (2025 winner) | — | ~24% | TTT + TRM ensemble
ARChitects (2024 winner) | 53.5% | — | TTT + augmentation
marc | 61.9%* | — | TTT + program synthesis
TRM (7M params) | 45% | 8% | Recursive reasoning
Humans | ~85% | ~85% | —

*marc ensemble with program synthesis. Scores as of Feb 2025.

The efficiency gap

Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't accuracy — it's efficiency.

What works
✓ Test-time training (TTT)
✓ Data augmentation
✓ Ensemble methods
✓ Chain-of-thought prompting
What doesn't (yet)
✗ Pure LLMs without TTT
✗ Memorization
✗ More parameters alone
✗ Pure symbolic search

Build

Practical approaches that actually work.

Which approach should I try first?
  • TTT + LLM: Best results right now. Fine-tune per task.
  • DSL + search: More interpretable, harder to scale.
  • Hybrid: LLM generates DSL, search refines.
WINNING RECIPE (2024-2025)
1. Generate synthetic variants of each task (augmentation)
2. Fine-tune a small model on those variants (TTT)
3. Sample multiple solutions
4. Verify against training examples
5. Ensemble multiple models
Result: 40-55% on private eval
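The five steps can be sketched as a runnable skeleton. Everything below is a stub (a real solver puts a fine-tuned LLM behind `Model`, and real augmentations are rotations, reflections, and color permutations rather than the toy inversion used here):

```python
import random

# Skeleton of the winning recipe: augment -> fine-tune -> sample -> verify
# -> ensemble, with stub components for illustration.
random.seed(0)

def augment(pairs):                          # step 1: synthetic variants
    return pairs + [(y, x) for x, y in pairs]   # toy stand-in for real augs

class Model:
    def __init__(self):
        self.memory = {}
    def fine_tune(self, pairs):              # step 2: test-time training
        self.memory.update(pairs)
    def sample(self, x):                     # step 3: sample a candidate
        return self.memory.get(x, random.randint(0, 9))

def verify(model, pairs):                    # step 4: check training examples
    return all(model.sample(x) == y for x, y in pairs)

pairs = [(1, 2), (3, 4)]
models = []
for _ in range(3):                           # step 5: ensemble
    m = Model()
    m.fine_tune(augment(pairs))
    if verify(m, pairs):
        models.append(m)

votes = [m.sample(5) for m in models]
prediction = max(set(votes), key=votes.count)   # majority vote over samples
```

The verification step is what makes the recipe robust: candidates that cannot even reproduce the demonstration pairs are filtered out before voting.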
NVARC approach 2025 Winner
GitHub · Paper
  • Qwen3 4B + Tiny Recursive Models ensemble
  • Synthetic data generation pipeline (SDG)
  • 103K synthetic puzzles + 3.2M augmented
  • ~24% on ARC-AGI-2 under contest constraints
ARChitects approach 2024 Winner
  • Mistral-NeMo-Minitron-8B base model
  • Test-time training with augmented data
  • Novel augmentations + stability criterion
  • 53.5% on private eval

Code resources

Competition Winners (Open Source)
NVARC — 1st place ARC Prize 2025. TTT + TRM ensemble. ~24% on ARC-AGI-2.
ARChitects — 1st place ARC Prize 2024. 53.5% private eval. TTT approach.
Omni-ARC — 3rd place 2024. Multi-task learning + TTT. Excellent docs.
Icecuber — 1st place Kaggle 2020. 142 hand-crafted DSL ops. Foundational.
Core DSL & Data Generation
arc-dsl — Domain-specific language. Solutions for all 400 training tasks.
Re-ARC — 1000 synthetic variants per training task. Programmatic generators.
BARC — 400K synthetic ARC tasks. LLM-based program remixing.
arcmentations — Augmentation library. Torchvision-compatible transforms.
2025 Paper Award Winners
TRM — 1st. 7M params beats 671B models. 45% ARC-1, 8% ARC-2.
SOAR — 2nd. Self-improving evolution. 52% ARC-1. 5M solutions released.
CompressARC — 3rd. MDL-based, 76K params, no pretraining needed.
Poetiq — SOTA verified. Refinement loops on Gemini 3. 54% ARC-2.
Research Implementations
marc (TTT) — "Surprising Effectiveness of TTT". 61.9% with ensemble.
GridCoder — Neurally-guided program synthesis. Runner-up paper 2024.
CodeIt — Self-improving LLMs. 15% eval, hindsight replay.
arc-mdl — Earlier MDL approach by Sebastien Ferre.
Benchmarks & Community
ARC-AGI — Official dataset. 800 tasks by Chollet.
ARC-AGI-2 — Harder 2025 variant. 120 eval tasks.
ConceptARC — 160 tasks organized by concepts. Tests abstraction.
arc-dsl-llm — LLM-friendly DSL. Cleaner function names.

Small vs Frontier Models

Do you need GPT-4 or can a fine-tuned 4B model do the job? The evidence is clear: for specialized tasks, small fine-tuned models often match or beat frontier models at 1/100th the cost.

The key insight

A 7M parameter TRM beats 671B DeepSeek-R1 on ARC reasoning (45% vs 15.8%). A fine-tuned Qwen3-4B matches GPT-4 on specialized tasks. Architecture and training matter more than raw size.

Evidence: Small Models Beating Giants

Small Model | Large Model | Task | Result
TRM 7M | DeepSeek-R1 671B | ARC-AGI-1 | 45% vs 15.8% (3× better)
TRM 7M | GPT-4, Claude, R1 | Hard Sudoku | 87.4% vs 0%
CompressARC 76K | No pretraining | ARC-AGI-1 | 20-34% from scratch
Qwen3-4B fine-tuned | GPT-OSS-120B | 8 benchmarks | Matches/exceeds 7 of 8
Llama 3.2 1B fine-tuned | GPT-4.1 | E-commerce intent | 99% vs 99%
Phi-4 14B | Llama 3.3 70B | MATH benchmark | Phi-4 wins
The 100,000× factor

TRM (7M params) outperforms DeepSeek-R1 (671B params) on ARC reasoning. That's a model roughly 100,000× smaller achieving 3× better performance. The difference? Recursive architecture vs autoregressive generation.

When Small Models Win

Use Small Models (1B-14B)
✓ Specialized, narrow tasks
✓ Structured outputs (code, math, grids)
✓ Latency-critical applications
✓ Edge/mobile deployment
✓ Cost-sensitive production
✓ Privacy-sensitive (local inference)
Use Frontier Models (70B+)
✓ General-purpose, open-ended tasks
✓ Complex multi-step reasoning
✓ Cross-domain knowledge synthesis
✓ Low-volume, high-stakes decisions
✓ When you lack training data
✓ Rapid prototyping before specialization

The Fine-Tuning Multiplier

Fine-tuning impact by model size

Smaller models benefit disproportionately more from fine-tuning than larger ones.

Model Size | Base Performance | After Fine-tuning | Improvement
1B-3B | Low | Often matches 70B base | Dramatic (2-5×)
4B-8B | Moderate | Can match GPT-4 on task | Large (1.5-3×)
14B-32B | Good | Frontier-competitive | Moderate (1.2-2×)
70B+ | Very good | Incremental gains | Small (1.1-1.3×)

Cost Comparison

Approach | Model | ARC-AGI-2 Score | Cost/Task | Ratio
Frontier + refinement | Poetiq + Gemini 3 | 54% | $30.00 | 150×
Frontier thinking | Gemini 3 Deep | 45% | $77.00 | 385×
Small fine-tuned | NVARC Qwen3 4B | 24% | $0.20 | 1× (baseline)
Tiny specialized | TRM 7M | 8% | ~$0.01 | 0.05×
The efficiency frontier

NVARC won ARC Prize 2025 with a 4B model at $0.20/task. Poetiq achieves 54% but at $30/task. The question isn't "which scores higher" — it's "what's the cost-performance sweet spot for your use case?"

Open-Source Small Models (2025-2026)

The latest small models often match or beat much larger models from just a year ago. Here's the complete landscape.

Under 1B Parameters (Edge/Mobile)
Qwen3-0.6B — 100+ languages. Best tiny multilingual.
SmolLM2-360M — HuggingFace. On-device, low-power.
Qwen2.5-0.5B — Multilingual, instruction-following.
1B-3B Parameters (Local/Laptop)
SmolLM3-3B — HuggingFace. Fully open. Beats Qwen3-4B on some tasks.
Ministral 3 3B — Mistral. Multimodal + vision. Apache 2.0.
OLMo 2 1B — AI2. Fully open (data+code). Beats Gemma/Llama 1B.
Llama 3.2 1B — Meta. Tool calling. Mobile-optimized.
StableLM-Zephyr 3B — Stability AI. DPO-tuned. Strong alignment.
4B Parameters (Sweet Spot)
Qwen3-4B-Instruct — Best overall for fine-tuning. ARC Prize winner base.
Phi-4-mini 3.8B — Microsoft. Reasoning rivals 7B-9B models.
Gemma 3 4B — Google. 128K context. Vision capable.
7B-8B Parameters (Workstation)
Llama 3.1 8B — Meta. Best all-around under 10B. Huge ecosystem.
Qwen3-8B — Matches Qwen2.5-14B. Great multilingual.
Ministral 3 8B — Mistral. Beats Gemma 12B. Vision + 256K context.
OLMo 2 7B — AI2. Fully open. Outperforms Llama 3.1 8B.
Mistral 7B — Classic. Sliding window attention. Apache 2.0.
Reasoning Specialists (2025-2026)
Phi-4-reasoning-plus 14B — Microsoft. 77.7% AIME 2025. Rivals DeepSeek-R1.
Phi-4-mini-reasoning 3.8B — Beats o1-mini on Math-500. Edge-deployable.
Ministral 3 14B Reasoning — 85% AIME 2025. SOTA for size class.
OLMo 3 Think — AI2. Explicit reasoning. Fully open.
DeepSeek R1 Distilled (Reasoning Transfer)
R1-Distill-Qwen-1.5B — R1 reasoning in 1.5B. 800K samples.
R1-Distill-Qwen-7B — Strong reasoning transfer.
R1-Distill-Llama-8B — Llama base + R1 reasoning.

Model Comparison Table (2026)

Model | Params | Context | Best For | License
Qwen3-4B | 4B | 32K | Fine-tuning, structured data | Apache 2.0
SmolLM3-3B | 3B | 128K | Transparency, research | Apache 2.0
Phi-4-reasoning-plus | 14B | 16K | Math, science, coding | MIT
Ministral 3 8B | 8B | 256K | Multimodal, long context | Apache 2.0
OLMo 2 7B | 7B | 4K | Full reproducibility | Apache 2.0
Llama 3.1 8B | 8B | 128K | Ecosystem, tooling | Llama 3.1
Gemma 3 4B | 4B | 128K | Multimodal, 140+ languages | Gemma

Benchmark Snapshot (Early 2026)

Model | MMLU | GSM8K | HumanEval | AIME 2025
Phi-4-reasoning-plus 14B | — | — | 68.8% | 77.7%
Ministral 3 14B Reasoning | — | — | — | 85%
Phi-4-mini-reasoning 3.8B | — | — | — | 33.6%
Qwen3-4B | ~75% | ~85% | ~70% | —
SmolLM3-3B | ~68% | ~75% | ~55% | —
OLMo 2 7B | ~63% | ~70% | ~45% | —
2026 Trends

Ministral 3 14B Reasoning achieves 85% on AIME 2025 — matching frontier models from 2024. Phi-4-mini-reasoning (3.8B) beats o1-mini on some benchmarks. The gap between "small" and "frontier" is collapsing rapidly.

Which small model should I choose?
  • Best for fine-tuning? → Qwen3-4B (benchmark leader)
  • Need reasoning? → Phi-4-reasoning-plus or Ministral 3 14B Reasoning
  • Full transparency? → OLMo 2/3 or SmolLM3 (data+code open)
  • Multimodal + vision? → Ministral 3 or Gemma 3
  • Edge/mobile? → Qwen3-0.6B, SmolLM2-360M, or Phi-4-mini
  • Largest ecosystem? → Llama 3.1/3.2
  • R1-style reasoning cheap? → R1-Distill-Qwen-7B

Fine-Tuning Techniques

PARAMETER-EFFICIENT METHODS
LoRA — Low-Rank Adaptation. Train ~0.01% of params. Most popular.
Memory: ~1/3 of full fine-tuning
QLoRA — 4-bit quantized LoRA. Even more efficient.
Memory: ~1/10 of full fine-tuning. Run 70B on single GPU.
Adapters — Add small trainable modules between layers.
Prompt Tuning — Learn soft prompts, freeze all weights.
LoRA Quick Start
  • Rank (r): Start with 8-16. Lower = fewer params, higher = more capacity.
  • Target modules: Usually attention layers (q_proj, v_proj, k_proj, o_proj)
  • Alpha: Scaling factor. Common: alpha = 2× rank
  • Result: a ~19MB adapter vs an ~11GB full model, with comparable performance.
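The arithmetic behind those knobs, in plain numpy (a sketch of the LoRA math, not the PEFT implementation; the hidden size is a toy value):

```python
import numpy as np

# LoRA in plain numpy: the frozen weight W is adapted by a low-rank product
# B @ A, scaled by alpha / r. Only A and B are trained.
rng = np.random.default_rng(0)
d = 512                   # hidden size (toy)
r, alpha = 16, 32         # rank and scaling, alpha = 2 * r as suggested above

W = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01      # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init: delta starts at 0

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=d)
assert np.allclose(lora_forward(x), x @ W.T)   # B = 0, so no change at init

full_params = W.size                    # 262,144 for this toy layer
lora_params = A.size + B.size           # 16,384: ~6% here, <<1% at LLM scale
ratio = lora_params / full_params
```

At real LLM widths (d in the thousands, many layers) the same formula is what produces a ~19MB adapter next to an ~11GB base model.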
Fine-Tuning Tools
PEFT — HuggingFace. LoRA, QLoRA, adapters. Most accessible.
Unsloth — 2× faster fine-tuning. 70% less memory. Free tier.
Axolotl — Config-based fine-tuning. Supports many methods.
LitGPT — Lightning AI. Pretrain + fine-tune from scratch.

Decision Framework

Should I fine-tune a small model or use a frontier API?
  • < 1000 queries/day, general tasks → Use frontier API (simpler)
  • > 10K queries/day, any task → Fine-tune small model (cost)
  • Specialized task with training data → Fine-tune Qwen3-4B or Phi-4
  • Latency < 100ms required → Small model, local inference
  • Privacy/compliance requirements → Self-hosted fine-tuned model
  • Structured reasoning (ARC, math, code) → Consider TRM-style architecture
  • No training data available → Start with frontier, collect data, then fine-tune

Key Papers

Paper | Year | Key Finding
Less is More (TRM) | 2025 | 7M params beats 671B on reasoning via recursion
LoRA | 2021 | Train 0.01% of params, match full fine-tuning
QLoRA | 2023 | 4-bit quantization + LoRA; 70B on single GPU
Phi-4 | 2024 | 14B beats 70B on math via synthetic data
Small vs Large LLMs | 2024 | Specialized 1B matches GPT-4 with ~100 samples

RL Refinement Methods

A comprehensive guide to reinforcement learning techniques for training and refining LLMs — from foundational RLHF to frontier self-improvement methods. These techniques power the reasoning improvements in ARC solvers.

Why RL matters for ARC

The best ARC solvers use RL to improve reasoning at test-time. DeepSeek-R1's GRPO, SOAR's evolutionary refinement, and Poetiq's iterative loops all leverage RL principles to achieve SOTA results.

Taxonomy of RL Methods

RL FOR LLM REFINEMENT
Reward-Based (Online) — PPO, GRPO, REINFORCE++
Train reward model → Generate → Score → Update policy
Reward-Free (Offline) — DPO, IPO, KTO, ORPO, SimPO
Learn directly from preference pairs, no RM needed
Self-Improvement — STaR, SPIN, Self-Rewarding, SOAR
Model improves itself through iteration
Verifiable Rewards — RLVR for code/math
Use execution/tests as ground-truth reward signal

Foundational Methods

RLHF + PPO Industry Standard

The original method that powered ChatGPT. Train a reward model from human preferences, then use PPO to optimize the policy.

  • Components: Policy model, Reference model, Reward model, Value model (4 LLMs total)
  • Pros: Proven at scale, handles complex preferences, can optimize arbitrary rewards
  • Cons: Expensive (4 models), unstable training, requires extensive tuning
  • Use when: You have compute budget and need maximum control over the reward signal
DPO (Direct Preference Optimization) Most Popular

Simplifies RLHF by treating alignment as a classification problem. Used to train Llama 3, Zephyr, NeuralChat.

  • Components: Policy model, Reference model (2 LLMs — no reward/value models)
  • Input: Preference pairs (prompt, chosen response, rejected response)
  • Pros: Simple, stable, efficient, no reward model needed
  • Cons: Offline only (can't explore), requires paired preference data
  • Use when: You have preference data and want simple, reliable alignment
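On a single preference pair, the DPO objective is just a logistic loss on the policy-vs-reference log-ratio margin. The log-probability numbers below are made up for illustration:

```python
import math

# DPO loss for one preference pair:
# loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does:
low = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
               ref_chosen=-12.0, ref_rejected=-12.0)

# Policy prefers the rejected response: the loss is higher.
high = dpo_loss(logp_chosen=-14.0, logp_rejected=-10.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

Because the loss depends only on log-probabilities of fixed responses, no sampling, reward model, or value model is needed, which is exactly why DPO trains offline with just two LLMs.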
DPO vs PPO

DPO is simpler and cheaper. PPO performs better on complex tasks (code generation) when properly tuned. For most use cases, start with DPO.

Preference Optimization Variants

Method | Key Innovation | When to Use | Code
IPO | Regularization to prevent overfitting | DPO overfits your data | TRL
KTO | Uses prospect theory, no pairs needed | Only have thumbs up/down data | TRL
ORPO | No reference model needed | Memory constrained | TRL
SimPO | Reference-free, simpler objective | Want stability + performance | GitHub

GRPO Family: Deep Dive (2024-2026)

GRPO and its variants are the dominant algorithms for training reasoning LLMs in 2025-2026. This section covers the full evolution from GRPO → DAPO → VAPO → Dr. GRPO.

Why GRPO matters

GRPO proved that pure RL can develop reasoning without supervised fine-tuning. DeepSeek-R1 showed emergent behaviors (reflection, self-verification, "aha moments") arose naturally from GRPO training with simple regex rewards.

GRPO (Group Relative Policy Optimization) DeepSeek 2024

The algorithm behind DeepSeek-R1's reasoning capabilities. Cuts PPO compute in half by eliminating the value network.

How GRPO Works

  1. Group Sampling: For each prompt, sample G outputs (typically 16)
  2. Reward Assignment: Score each output (accuracy + format rewards)
  3. Baseline: Use group mean as baseline: A = r - mean(r)
  4. Normalize: Divide by group std for stable gradients
  5. Policy Update: PPO-style clipped objective + KL penalty
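The five steps above reduce to a few lines of arithmetic. The rewards here are made up; a real trainer scores G sampled completions per prompt with accuracy and format checks:

```python
# GRPO's group-relative advantage plus the clipped update, in miniature.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]   # G = 8 samples, one prompt

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / (std + 1e-8) for r in rewards]   # steps 3-4

def clipped_objective(ratio, advantage, eps=0.2):
    # Step 5: PPO-style clipped surrogate for one token of one sample;
    # ratio is pi_new(token) / pi_old(token).
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

obj = clipped_objective(ratio=1.5, advantage=advantages[0])
```

The group mean is the whole trick: it plays the role PPO's learned value network plays, at zero extra memory, which is where the ~50% savings in the table below comes from.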

Key Differences from PPO

Aspect | PPO | GRPO
Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref)
Value estimation | Learned critic network | Group mean baseline
Memory | High | ~50% less
KL penalty | In reward signal | Direct loss term

DeepSeek-R1 Training Config

  • Learning rate: 3e-6
  • KL coefficient: 0.001
  • Clip ratio ε: 10
  • Temperature: 1.0
  • Samples per prompt: 16
  • Max length: 32,768 tokens
  • Batch: 32 prompts × 16 samples = 512
The "Aha Moment"

During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider..." — emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.

DAPO (Decoupled Clip & Dynamic Sampling) ByteDance 2025

GRPO++ — fixes key issues in GRPO. 50 points on AIME 2024 with Qwen2.5-32B, outperforming DeepSeek-R1-Zero.

DAPO's Four Key Improvements

Technique | Problem Solved | How
Clip-Higher | Entropy collapse | Asymmetric clipping (0.2 lower, 0.28 upper)
No KL Loss | Restricts long CoT | Remove KL term entirely for reasoning tasks
Dynamic Sampling | Zero gradients when all correct/wrong | Oversample, filter groups with acc=0 or acc=1
Token-Level Loss | Length bias | Average loss over all tokens, not sequences
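Two of these fixes fit in a few lines. This is a sketch using the clip values from the table, not ByteDance's implementation:

```python
# Clip-Higher: the upper bound (1 + 0.28) is looser than the lower bound
# (1 - 0.2), letting low-probability tokens grow and preserving entropy.
def clip_higher(ratio, eps_low=0.2, eps_high=0.28):
    return max(min(ratio, 1 + eps_high), 1 - eps_low)

examples = [clip_higher(r) for r in (0.70, 1.25, 1.30)]   # clipped to [0.8, 1.28]

# Dynamic Sampling: drop groups where every sample is correct or every
# sample is wrong, since their group-relative advantages are all zero.
groups = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]   # per-sample correctness
useful = [g for g in groups if 0 < sum(g) < len(g)]    # keep mixed groups only
```

Filtering degenerate groups matters more as training converges: late in training most groups become all-correct, and without dynamic sampling the effective batch size quietly collapses.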

DAPO Results

  • AIME 2024: 50% (vs DeepSeek-R1-Zero's 47%)
  • Training efficiency: 50% fewer steps than R1-Zero
  • Model: Qwen2.5-32B base → reasoning model
VAPO (Value Augmented PPO) ByteDance 2025

Returns to value-based methods with better credit assignment. 60.4 on AIME 2024 — current SOTA.

Why Value-Based?

GRPO/DAPO use final reward only. VAPO learns per-step values for precise credit assignment — crucial for long reasoning chains where early errors compound.

VAPO's Key Techniques

  • Refined value learning with better targets
  • Exploration-exploitation balance
  • Smoother training dynamics
  • Better length scaling
  • Faster score growth

VAPO Results

Method | AIME 2024 | Steps to Match
GRPO (DeepSeek-R1) | 47 | baseline
DAPO | 50 | 50% fewer
VAPO | 60.4 | 60% fewer than DAPO
Dr. GRPO 2025

Fixes mathematical biases in GRPO's advantage estimation.

Three Biases Fixed

  • Baseline bias: Using biased baseline without correction
  • Length normalization: Removes length bias causing longer wrong answers
  • Std normalization: Removes division by group std

Result

Prevents models from generating progressively longer incorrect responses. Improves token efficiency.

REINFORCE++ 2025-2026

Simplified alternative to GRPO. More stable training, used by ProRL V2, ScaleRL, Magistral.

Key Difference

  • GRPO: Local normalization within group
  • REINFORCE++: Global normalization across batch
  • Result: GRPO can overfit immediately; REINFORCE++ learns more stably

Single-Sample Efficiency

REINFORCE++ with k=1 achieves top-tier scores while being more token-efficient than group-sampling GRPO.

Algorithm Comparison (2025-2026)

Algorithm | AIME 2024 | Value Model | KL Loss | Best For
GRPO | 47 | No | Yes | General reasoning
DAPO | 50 | No | No | Long CoT, math
VAPO | 60.4 | Yes | — | Complex reasoning, SOTA
Dr. GRPO | — | No | No | Token efficiency
REINFORCE++ | — | No | — | Stability, single-sample
Which GRPO variant should I use?
  • Just starting? → Use GRPO via TRL (simplest)
  • Training long reasoning? → Use DAPO (no KL, dynamic sampling)
  • Need SOTA performance? → Use VAPO (value-based, 60.4 AIME)
  • Seeing length explosion? → Use Dr. GRPO (fixes length bias)
  • Want stability? → Use REINFORCE++ (global normalization)
  • Memory constrained? → DAPO or Dr. GRPO (no value model)

Implementation Libraries

GRPO Family Implementations
TRL — HuggingFace. GRPOTrainer. Easiest to start.
OpenRLHF — Ray + vLLM. GRPO, DAPO, REINFORCE++. Scales to 70B+.
veRL — ByteDance. GRPO, VAPO. 200+ hyperparams. Scales to 671B.
simple_GRPO — Educational. ~200 lines. Understand the algorithm.
GRPO-Zero — DeepSeek-R1 GRPO from scratch.

Key Papers (Chronological)

Paper | Date | Contribution
DeepSeekMath | Feb 2024 | Introduced GRPO for math reasoning
DeepSeek-R1 | Jan 2025 | GRPO for general reasoning, "aha moment"
REINFORCE++ | Jan 2025 | Stabilized critic-free RL
DAPO | Mar 2025 | Four fixes for GRPO, 50 on AIME
Dr. GRPO | Mar 2025 | Unbiased GRPO, fixes length explosion
VAPO | Apr 2025 | Value-based, 60.4 AIME SOTA

Self-Improvement Methods

STaR (Self-Taught Reasoner) Foundational

Bootstraps reasoning ability by learning from self-generated rationales. The model teaches itself to reason.

  • Loop: Generate rationales → Keep correct ones → Fine-tune → Repeat
  • Rationalization: For wrong answers, generate rationale given the correct answer (reason backward)
  • Results: Comparable to 30× larger models on CommonsenseQA
  • Limitation: Requires base model with some reasoning ability (GPT-2 too small)
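The loop can be sketched with a stub generator (a real system samples rationales from an LLM; the arithmetic tasks here are invented for illustration):

```python
# STaR's bootstrap loop in miniature: keep self-generated rationales that
# reach the right answer; for failures, condition on the gold answer and
# reason backward (rationalization).
dataset = [("2+3", 5), ("4+4", 8)]

def generate_rationale(question, hint=None):
    a, b = map(int, question.split("+"))
    answer = a + b if hint is None else hint   # with a hint, reason backward
    return f"{a} plus {b} is {answer}", answer

finetune_set = []
for q, gold in dataset:
    rationale, answer = generate_rationale(q)
    if answer == gold:                          # keep only correct rationales
        finetune_set.append((q, rationale))
    else:                                       # rationalize toward the gold answer
        rationale, _ = generate_rationale(q, hint=gold)
        finetune_set.append((q, rationale))
```

Fine-tuning on `finetune_set` and repeating the loop is what lets the model bootstrap: each round's model generates better rationales for the next round.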
Quiet-STaR Inner Monologue

Extends STaR to teach LLMs to "think before speaking" with internal rationales at every token.

  • Method: Generate inner rationales in parallel before responding
  • Zero-shot gains: GSM8K 5.9% → 10.9%, CommonsenseQA 36.3% → 47.2%
  • Key insight: Reasoning is implicit in almost all text — learn to infer unstated rationales
SPIN (Self-Play Fine-Tuning) Adversarial
GitHub · Paper

LLM plays against previous versions of itself in a GAN-like framework.

  • Mechanism: Current model learns to distinguish its own outputs from human data
  • Roles: Opponent (generates synthetic) + Main Player (discriminates)
  • Results: +5% on HuggingFace benchmarks over 3 iterations
  • Advantage: No preference data or separate reward model needed
Self-Rewarding LMs Meta AI
GitHub · Paper

Model acts as both actor AND judge โ€” provides its own rewards via LLM-as-a-Judge prompting.

  • Why: Human feedback bottlenecks superhuman performance
  • Method: Generate → Self-judge → Iterative DPO
  • Extension: Meta-Rewarding — model judges its own judgments
  • Results: Llama-3-8B win rate 22.9% → 39.4% on AlpacaEval 2
SOAR ARC Prize 2025 2nd Paper
GitHub · Paper · Models

Self-improving evolutionary program synthesis. 52% on ARC-AGI-1 — SOTA for open-source without DSLs.

  • Loop: LLM generates programs → Test → Refine best ones → Fine-tune LLM on attempts
  • Key insight: Failed programs are correct programs for different tasks — use as training data
  • Hindsight: Relabel failed attempts as solutions for tasks they accidentally solve
  • Released: 5 million ARC solutions dataset

Verifiable Rewards (RLVR)

RL with Verifiable Rewards Code & Math

Use deterministic verifiers (code execution, math checking) as ground-truth rewards instead of learned reward models.

  • Domains: Code (unit tests), Math (exact match), Formal proofs
  • Advantages: No reward hacking, objective, scales with compute
  • Finding: RLVR can extend reasoning boundaries for both math and code
  • Caveat: May narrow exploration — models favor known high-reward paths
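In miniature, the reward is just "do the tests pass" (illustrative sketch using Python's `exec` as the verifier; real systems sandbox execution):

```python
# Verifiable reward: run the candidate against unit tests and use pass/fail
# as the reward. No learned reward model, hence nothing to reward-hack.
def reward_from_tests(candidate_src, tests):
    env = {}
    try:
        exec(candidate_src, env)           # define the candidate's solve()
        return 1.0 if all(env["solve"](x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0                         # crashes count as failures

tests = [(2, 4), (5, 25)]
good = reward_from_tests("def solve(x):\n    return x * x", tests)
bad = reward_from_tests("def solve(x):\n    return x + x", tests)
```

These binary rewards plug directly into the group-based algorithms above (GRPO samples G candidates per prompt and scores each one exactly like this).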
ReST-EM Google/DeepMind

Expectation-Maximization framework for self-training with binary rewards.

  • E-step: Generate samples, filter with binary reward
  • M-step: Fine-tune base model on filtered samples
  • Key: Always fine-tune base model (not previous iteration) to prevent drift
  • Results: Outperforms human-data fine-tuning on MATH and APPS
ReST-MCTS* NeurIPS 2024
GitHub · Paper

Combines process reward guidance with tree search for higher-quality reasoning traces.

  • Method: MCTS* search + per-step value training
  • Advantage: Higher accuracy than Best-of-N, Tree-of-Thought at same compute budget
  • Outperforms: ReST-EM and Self-Rewarding LM in iterative improvement

Constitutional AI & RLAIF

Constitutional AI (CAI) Anthropic
Paper · Blog

Train harmless AI using self-improvement with a "constitution" of principles. No human labels for harmful outputs.

  • SL Phase: Generate → Self-critique → Revise → Fine-tune on revised
  • RL Phase: AI evaluates responses → Train preference model → RLAIF
  • Constitution: Natural language principles (e.g., "be helpful, honest, harmless")
  • Result: More harmless with minimal helpfulness impact, increased transparency

Implementation Libraries

Production-Ready Frameworks
TRL — Hugging Face. PPO, DPO, GRPO, KTO, ORPO. Most accessible.
OpenRLHF — Ray + vLLM. Scales to 70B+. PPO, GRPO, REINFORCE++.
veRL — ByteDance. Scales to 671B. GRPO, PPO, DAPO. EuroSys 2025.
trlX — CarperAI. Distributed RLHF. PPO, ILQL.
Research Implementations
SPIN โ€” UCLA. Self-play fine-tuning. ICML 2024.
SimPO โ€” Princeton. Reference-free preference optimization. NeurIPS 2024.
SOAR โ€” INRIA. Self-improving evolutionary synthesis. ICML 2025.
ReST-MCTS* โ€” Tsinghua. Process reward + tree search. NeurIPS 2024.
Specialized Tools
self-rewarding-lm โ€” Self-rewarding framework. LLM-as-judge training.
RL4LMs โ€” Allen AI. Modular RL library for NLG.
trl (legacy) โ€” Original PPO implementation by Leandro von Werra.
LLM-self-play โ€” Minimal SPIN implementation.

Which Method Should I Use?

Decision guide for RL method selection
  • Have preference pairs? → Start with DPO (simple, stable)
  • Only thumbs up/down? → Use KTO (no pairs needed)
  • Memory constrained? → Use ORPO or SimPO (no reference model)
  • Want reasoning improvement? → Use GRPO or STaR
  • Have verifiable tasks (code/math)? → Use RLVR with execution rewards
  • No external data? → Use SPIN or Self-Rewarding
  • Need maximum performance? → Use PPO with careful tuning
  • Building ARC solver? → Combine GRPO + SOAR-style evolutionary refinement
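The guide above can be encoded as a first-cut lookup. This is a heuristic: the priority order below is one reasonable reading of the list, not an official recommendation, and `pick_method` is a made-up helper name.

```python
def pick_method(has_pairs=False, binary_feedback=False, low_memory=False,
                verifiable=False, reasoning=False, no_data=False):
    """Map the decision-guide questions to a starting RL method.
    Priority order is a judgment call; real projects often combine methods."""
    if verifiable:        return "RLVR"                 # execution rewards win
    if reasoning:         return "GRPO / STaR"
    if no_data:           return "SPIN / Self-Rewarding"
    if low_memory:        return "ORPO / SimPO"         # no reference model
    if binary_feedback:   return "KTO"                  # no pairs needed
    if has_pairs:         return "DPO"                  # simple, stable default
    return "PPO (with careful tuning)"                  # maximum performance
```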
Online Methods (PPO, GRPO)
✓ Can explore new behaviors
✓ Higher ceiling with enough compute
✗ More expensive, less stable
Offline Methods (DPO, SimPO)
✓ Simple, stable, efficient
✓ Works well with limited compute
✗ Can't discover new strategies

Key Papers

Paper | Year | Code | Key Contribution
InstructGPT (RLHF) | 2022 | – | Original RLHF pipeline
DPO | 2023 | → | Reward-free preference learning
DeepSeekMath (GRPO) | 2024 | → | Memory-efficient PPO variant
STaR | 2022 | – | Self-taught reasoning
SPIN | 2024 | → | Self-play without rewards
Self-Rewarding | 2024 | → | LLM as its own judge
SOAR | 2025 | → | Evolutionary synthesis, 52% ARC
Constitutional AI | 2022 | – | RLAIF with principles

Learn

Recommended learning path
1
Play ARC puzzles (15 min)
Get intuition before reading theory.
2
Read "On the Measure of Intelligence" (2h)
Chollet's original paper. Understand the philosophy.
3
Explore winning solutions (1h)
Kaggle notebooks from 2024 competition.
4
Read NVARC paper (1h)
Understand test-time training approach.
5
Try arc-dsl (2h)
Write a simple solver using the DSL.
6
Enter ARC Prize (ongoing)
$1M+ in prizes. Open to everyone.

Key papers

Paper | Year | Code | Why read
On the Measure of Intelligence | 2019 | → | The foundational paper
Less is More (TRM) | 2025 | → | 7M params beats 671B models
SOAR | 2025 | → | Self-improving evolution, 52%
CompressARC | 2025 | → | MDL, no pretraining, 76K params
TTT for Abstract Reasoning | 2024 | → | 61.9% with ensemble
GridCoder | 2024 | → | Neurally-guided synthesis
ARC Prize 2025 Report | 2026 | – | Official competition analysis

Communities

Explore further

Tools & DSLs
Synthetic Data
2025 Winners
  • NVARC – 1st score (24%)
  • TRM – 1st paper (7M params)
  • SOAR – 2nd paper (52%)
  • CompressARC – 3rd paper (MDL)
  • Poetiq – 54% verified
2024 Winners
Research Code
  • TRM – recursive 7M model
  • SOAR – evolutionary synthesis
  • CompressARC – MDL approach
  • CodeIt – self-improving
2020 Foundational

nanochat: Build Your Own LLM for $100

Andrej Karpathy's nanochat is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization → pretraining → SFT → RL → evaluation → chat UI. You can train a model with GPT-2 capability for ~$72 in about 3 hours.

Core philosophy

One dial controls everything: depth. Set `--depth=26` for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.

Why nanochat matters for ARC

Architecture Principles

GPT Architecture

Modern transformer with carefully selected components:

  • Rotary Embeddings (RoPE) – no learned position embeddings
  • QK Normalization – stabilizes attention scores
  • Group-Query Attention (GQA) – memory-efficient attention
  • ReLU² activation – relu(x)² in MLP for sparse, efficient computation
  • RMSNorm – no learnable params, faster than LayerNorm
  • No bias – removed from all linear layers
  • Untied embeddings – separate token embedding and lm_head weights
  • Flash Attention 3 – H100-optimized with sliding window support
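Two of these components are small enough to sketch directly. A numpy toy of RMSNorm and the ReLU² MLP (shapes and function names are illustrative, not nanochat's actual code, which is PyTorch):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm with no learnable params: divide by the root-mean-square
    of the features. Cheaper than LayerNorm (no mean subtraction, no gain)."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def mlp_relu2(x, w_in, w_out):
    """Bias-free MLP block with relu(x)^2 activation: squaring keeps the
    ReLU's exact zeros (sparsity) while smoothing the positive branch."""
    h = np.maximum(x @ w_in, 0.0) ** 2
    return h @ w_out
```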
Muon Optimizer

Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.

  • Muon (MomentUm Orthogonalized by Newton-Schulz)
    • Standard SGD-momentum + orthogonalization post-processing
    • Uses Polar Express iteration for fast orthogonalization in bfloat16
    • NorMuon variance reduction – per-neuron adaptive learning rate
    • Cautious weight decay – only decays weights aligned with gradient
  • Distributed version (ZeRO-2 style)
    • Reduce-scatter gradients → compute locally → all-gather params
    • 3-phase async to overlap communication with compute
    • Sharded optimizer state across ranks
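The orthogonalization step can be illustrated with the classic cubic Newton-Schulz iteration. This is a simplified stand-in for clarity, not the tuned quintic "Polar Express" variant nanochat actually runs in bfloat16:

```python
import numpy as np

def orthogonalize(g, steps=25):
    """Newton-Schulz orthogonalization (classic cubic variant): map a
    gradient matrix toward the nearest (semi-)orthogonal matrix, so the
    update has uniform scale across directions."""
    x = g / np.linalg.norm(g)   # Frobenius-normalize so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x   # pushes singular values toward 1
    return x
```

In Muon this post-processes the momentum buffer of each matrix parameter; embeddings and scalars stay on AdamW.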

Scaling Laws Applied

nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:

Hyperparameter | Formula | Reference
Model dim | depth × aspect_ratio (default 64) | Width ∝ depth
Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20×
Batch size | B_ref × (D/D_ref)^0.383 | Power Lines
Learning rate | η_ref × √(B/B_ref) | AdamW scaling
Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper
Reference model: d12

All hyperparameters tuned at d12 (GPT-1 size, ~5min training runs), then transferred to larger depths via muP-style scaling. For quick experiments: d12 (5min), d16 (15min), d20 (1h), d26 (3h).
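The formulas in the table above can be sketched as a single function. All reference constants here (`b_ref`, `lr_ref`, `wd_ref`) are illustrative placeholders rather than nanochat's tuned d12 values, the parameter count is a rough transformer estimate, and the table's D is taken as depth:

```python
import math

def scaled_hparams(depth, d_ref=12, aspect_ratio=64, data_ratio=10.5,
                   b_ref=524288, lr_ref=0.02, wd_ref=0.1):
    """Derive hyperparameters from a single dial (depth), mirroring the
    scaling-law table. Reference constants are illustrative, not nanochat's."""
    dim = depth * aspect_ratio                 # model dim: width grows with depth
    params = 12 * depth * dim ** 2             # rough transformer param count
    tokens = data_ratio * params               # ~10.5 training tokens per param
    batch = b_ref * (depth / d_ref) ** 0.383   # Power Lines batch exponent
    lr = lr_ref * math.sqrt(batch / b_ref)     # LR scales with sqrt(batch)
    wd = wd_ref * math.sqrt(batch / b_ref) * (d_ref / depth)
    return {"dim": dim, "tokens": tokens, "batch": batch, "lr": lr, "wd": wd}
```

At the reference depth everything reduces to the d12-tuned values; larger depths get wider models, more tokens, bigger batches, and a rescaled learning rate and weight decay.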

Training Pipeline

STAGE 1: BASE PRETRAINING
Train on FineWeb (10T tokens)
Outputs: val_bpb (bits per byte), CORE metric
d26 → GPT-2 capability (CORE > 0.2565) in 2.76 hours
STAGE 2: SUPERVISED FINE-TUNING (SFT)
Dataset mixture:
SmolTalk (460K conversations)
MMLU auxiliary (100K multiple choice)
GSM8K (16K math, 2 epochs)
Identity conversations (2K synthetic)
Spelling tasks (280K)
Best-fit packing: no tokens discarded, padding masked
STAGE 3: REINFORCEMENT LEARNING
Simplified GRPO (closer to REINFORCE):
No KL regularization (deleted trust region)
On-policy → no PPO ratio/clip needed
DAPO-style token-level normalization
Mean-centered advantages (not z-score)
Task: GSM8K with verifiable rewards
Metric: Pass@k evaluation
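The advantage computation this stage describes fits in a few lines. A sketch (`grpo_advantages` is a hypothetical helper; the DAPO-style token-level normalization is folded into per-token weights here):

```python
import numpy as np

def grpo_advantages(rewards, token_counts):
    """Simplified GRPO: mean-centered (not z-scored) advantages for a group
    of rollouts from the same prompt, normalized over all tokens in the
    group (DAPO-style) rather than per sequence. No KL term, no clipping."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                        # mean-centering only
    total_tokens = sum(token_counts)          # token-level normalization
    return [np.full(n, a / total_tokens)      # same weight for every token
            for a, n in zip(adv, token_counts)]
```

Each rollout's tokens then contribute `weight * logprob` to a plain REINFORCE loss, which is all that remains of PPO once on-policy sampling removes the ratio/clip machinery.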

Leaderboard: Time to GPT-2

Wall-clock time to exceed GPT-2 CORE score (0.2565) on 8xH100:

# | Time | val_bpb | CORE | Key change | Date
0 | 168h | – | 0.2565 | Original OpenAI GPT-2 (2019) | 2019
1 | 3.04h | 0.7483 | 0.2585 | d24 baseline | Jan 29 2026
2 | 2.91h | 0.7450 | 0.2578 | +FP8 training | Feb 2 2026
3 | 2.76h | 0.7465 | 0.2602 | 1M token batch size | Feb 5 2026

Cost: 8xH100 @ ~$24/hr → ~$72 to train a GPT-2 equivalent (was ~$43,000 in 2019).

Key Code Files

Core Architecture
Training Scripts

Distilled Principles

1. One dial of complexity

Set `--depth` and everything else follows. No hyperparameter tuning needed. This is what compute-optimal training looks like when done right.

2. Muon for matrices, Adam for embeddings

Different parameter types need different optimizers. Matrix parameters (attention, MLP) benefit from orthogonalization. Embeddings/scalars use standard AdamW.

3. Flash Attention + sliding window

FA3 on H100 with alternating window patterns (SSSL = 3 sliding + 1 full context) balances speed and long-range attention.

4. RL is just REINFORCE with mean-centering

Strip PPO to its core: on-policy means no clipping needed. Token-level normalization (DAPO) + mean-centered advantages. No KL penalty.

5. SFT data quality > quantity

~856K rows total. Multiple epochs on high-value data (GSM8K 2×, identity 2×). Best-fit packing to maximize token efficiency.
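Best-fit packing itself is a classic bin-packing heuristic. A minimal sketch (the helper name and longest-first ordering are my own choices, not nanochat's exact implementation):

```python
def best_fit_pack(doc_lens, context=2048):
    """Pack documents into fixed-size context windows: place each document
    in the open bin whose remaining space is smallest but still fits.
    No tokens are discarded; leftover space becomes masked padding.
    Documents longer than the context would be split upstream (not shown)."""
    bins = []                                 # remaining capacity per bin
    for n in sorted(doc_lens, reverse=True):  # longest-first improves fits
        fits = [i for i, cap in enumerate(bins) if cap >= n]
        if fits:
            i = min(fits, key=lambda i: bins[i])   # tightest fit wins
            bins[i] -= n
        else:
            bins.append(context - n)          # open a new bin
    return bins
```

Compared with naive concatenation-and-truncation, this keeps every token while minimizing the padding that must be masked out of the loss.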

Quick Start

# Clone and setup
git clone https://github.com/karpathy/nanochat
cd nanochat && uv sync

# Train GPT-2 equivalent (~3 hours on 8xH100)
bash runs/speedrun.sh

# Or quick experiment with d12 (~5 min)
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- --depth=12 --run="d12"

# Chat with your model
python -m scripts.chat_web

Connection to ARC Methods

nanochat's pipeline directly applies to ARC solving:

nanochat Stage | ARC Application
Base pretraining | General reasoning foundation
SFT | Fine-tune on ARC-style grid tasks, DSL programs
RL (GRPO) | Self-improve with verifiable rewards (grid match)
TTT at inference | Train on each ARC task (like NVARC winner)

The RL methods in nanochat (simplified GRPO/REINFORCE) are the same foundations used by:

Resources