The Abstraction and Reasoning Corpus – distilled from 40+ papers
Last updated: 2026-02-15
ARC-AGI tests whether AI can learn new abstractions from a few examples, the kind of reasoning humans do effortlessly but current AI struggles with, making it the benchmark for measuring progress toward general intelligence.
ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, researcher at Google) to measure machine intelligence in ways that current AI cannot easily game.
Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
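A toy illustration of the task format (the grids and hidden rule here are invented, not an actual ARC task): represent grids as lists of lists of color indices, find a candidate rule consistent with every train pair, and apply it to the test input.

```python
# Toy ARC-style task: grids are lists of lists of color indices (0-9).
# The hidden rule in this invented example is a horizontal flip.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]],      [[6, 5, 4]]),
]
test_input = [[7, 8], [9, 0]]

# Candidate transformations to search over.
def hflip(g):     return [row[::-1] for row in g]
def vflip(g):     return g[::-1]
def transpose(g): return [list(col) for col in zip(*g)]

candidates = [hflip, vflip, transpose]

# Keep the first rule consistent with *all* train pairs, then apply it.
rule = next(f for f in candidates
            if all(f(x) == y for x, y in train_pairs))
print(rule(test_input))  # the inferred rule applied to the test input
```

Real solvers search vastly larger program spaces, but the contract is the same: the rule must explain every demonstration, then generalize to the test grid.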
ARC is designed to be easy for humans but hard for machines: tasks assume only Core Knowledge priors (objects, counting, basic geometry), every task is novel, and memorizing the public training set does not help on the hidden test set.
ARC measures skill-acquisition efficiency: how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.
| 85% | Average human performance on ARC-AGI public eval[1] |
| 55.5% | Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2] |
| $1M | ARC Prize for 85%+ on private eval with open-source code[1] |
Released early 2025, ARC-AGI-2 is significantly harder for AI while remaining easy for humans. Pure LLMs score 0%. Even frontier reasoning systems achieve only single-digit percentages.
ARC-AGI-1 tasks could often be solved instantly by humans. In ARC-AGI-2, every task requires deliberate thinking; average human completion time is 2.7 minutes. But 100% of tasks have been solved by at least 2 humans in under 2 attempts.
| Aspect | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|
| Eval set size | 100 tasks | 120 tasks |
| Human avg score | ~85% | ~60% |
| Human avg time | < 1 min | 2.7 min |
| Pure LLM score | ~5-15% | 0% |
| Best AI (open) | 53.5% | ~24% |
| Best AI (any) | 55.5% | 54% |
Log-linear scaling is insufficient for ARC-AGI-2. New test-time adaptation algorithms or novel AI architectures are needed. You can't just throw more compute at it.
1,455 teams submitted 15,154 entries. More than $125,000 awarded.
NVIDIA Kaggle Grandmasters' winning approach combines Qwen3 4B with Tiny Recursive Models.
Heavyweight LLM reasoning (CoT, tool use, agents) couldn't fit Kaggle's runtime. Instead: train smaller models offline that run fast during evaluation.
Samsung researcher Alexia Jolicoeur-Martineau shows that 7M parameters can beat 671B parameter models.
Recursion substitutes for depth and size. By iteratively reasoning over its own output, TRM simulates a much deeper architecture without the memory cost.
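The recursion-for-depth idea can be sketched with a stand-in problem (this is not TRM itself; the Newton step below is just an illustrative shared update): one small rule applied T times with the same "weights" does the work of a T-layer stack of distinct layers.

```python
# Sketch of recursion substituting for depth (illustrative, not TRM):
# one shared update rule, applied repeatedly, emulates a much deeper
# computation while storing only a single layer's worth of parameters.
def refine(z, x):
    # One shared refinement step (here: a Newton step toward sqrt(x)).
    return 0.5 * (z + x / z)

def recursive_solve(x, steps=16):
    z = 1.0                      # initial latent guess
    for _ in range(steps):       # the same step reused `steps` times
        z = refine(z, x)         # ~ a `steps`-deep net, 1 layer of params
    return z

print(recursive_solve(2.0))      # converges toward sqrt(2)
```

TRM applies the same trick to a learned grid-reasoning update: iterating it deepens the effective computation without growing the parameter count.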
Self-improving evolutionary program synthesis by Julien Pourcel et al. (ICML 2025).
Isaac Liao (CMU) shows lossless compression alone can produce intelligent behavior.
Major evolution from their 2024 autoregressive approach to masked diffusion.
First to break 50% on ARC-AGI-2 through refinement loops on Gemini 3 Pro.
The emergence of the refinement loop โ per-task iterative optimization guided by feedback. This is how the 50% barrier was finally broken.
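The refinement loop can be sketched as propose, verify, repeat (a hypothetical stand-in for Poetiq-style systems; the candidate set and deterministic proposer below are invented for illustration):

```python
# Sketch of a per-task refinement loop: propose a candidate solution,
# score it with a verifier on the train pairs, and use the feedback to
# decide whether to keep refining. Candidates here are toy placeholders.
train_pairs = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]

def identity(g): return [list(r) for r in g]
def vflip(g):    return g[::-1]
def hflip(g):    return [r[::-1] for r in g]

def verifier(program):
    # Feedback signal: fraction of train pairs reproduced exactly.
    return sum(program(x) == y for x, y in train_pairs) / len(train_pairs)

def refinement_loop(candidates, budget=10):
    best, best_score = None, -1.0
    for step in range(budget):
        prog = candidates[step % len(candidates)]  # stand-in for a proposer
        score = verifier(prog)
        if score > best_score:
            best, best_score = prog, score
        if best_score == 1.0:                      # verifier says: done
            break
    return best, best_score

best, score = refinement_loop([identity, vflip, hflip])
```

The key property is that the verifier score, not the model's own confidence, drives iteration, which is what makes per-task optimization reliable.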
| Model | ARC-AGI-2 | Cost/task | Notes |
|---|---|---|---|
| Poetiq + Gemini 3 Pro | 54% | $30 | SOTA verified, open source |
| Gemini 3 Deep Think | 45% | $77 | Previous best |
| Opus 4.5 (Thinking, 64k) | 37.6% | $2.20 | Best single-model commercial |
| Gemini 3 Pro (baseline) | 31% | $0.81 | No refinement |
| NVARC | 24% | $0.20 | Competition winner |
Launching March 25, 2026. The first major format change since ARC was introduced in 2019. ARC-AGI-3 moves from static reasoning to interactive reasoning.
ARC-1 and ARC-2 test static reasoning on grids. ARC-3 tests interactive reasoning in mini-games โ agents must explore, plan, remember, and adapt across multiple steps.
ARC-AGI-1 and ARC-AGI-2 are being "overfit" โ not through direct memorization, but because public train and private test sets share enough similarity that models trained on public data perform better than they should. ARC-3 makes this impossible.
The ARC-AGI-3 preview released video-game-like environments where agents must achieve long-horizon goals.
ARC-AGI-3 will enable formal comparison of human and AI action efficiency (learning efficiency) for the first time. Not just "did you solve it?" but "how efficiently did you learn the rules?"
Play the preview games and see how you compare to current AI systems.
| July 2025 | Preview release (3 games) |
| March 25, 2026 | Full ARC-AGI-3 launch |
| 2026 | ARC Prize 2026 competition |
The uncomfortable truths about current approaches.
| System | ARC-AGI-1 | ARC-AGI-2 | Approach |
|---|---|---|---|
| o3-high | 55.5% | – | Massive compute ($10K+/task) |
| NVARC (2025 winner) | – | ~24% | TTT + TRM ensemble |
| ARChitects (2024 winner) | 53.5% | – | TTT + augmentation |
| marc | 61.9%* | – | TTT + program synthesis |
| TRM (7M params) | 45% | 8% | Recursive reasoning |
| Humans | ~85% | ~60% | – |
*marc ensemble with program synthesis. Scores as of Feb 2025.
Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't accuracy; it's efficiency.
Practical approaches that actually work.
Do you need GPT-4 or can a fine-tuned 4B model do the job? The evidence is clear: for specialized tasks, small fine-tuned models often match or beat frontier models at 1/100th the cost.
A 7M parameter TRM beats 671B DeepSeek-R1 on ARC reasoning (45% vs 15.8%). A fine-tuned Qwen3-4B matches GPT-4 on specialized tasks. Architecture and training matter more than raw size.
| Small Model | Large Model | Task | Result |
|---|---|---|---|
| TRM 7M | DeepSeek-R1 671B | ARC-AGI-1 | 45% vs 15.8% (3× better) |
| TRM 7M | GPT-4, Claude, R1 | Hard Sudoku | 87.4% vs 0% |
| CompressARC 76K | No pretraining | ARC-AGI-1 | 20-34% from scratch |
| Qwen3-4B fine-tuned | GPT-OSS-120B | 8 benchmarks | Matches/exceeds 7 of 8 |
| Llama 3.2 1B fine-tuned | GPT-4.1 | E-commerce intent | 99% vs 99% |
| Phi-4 14B | Llama 3.3 70B | MATH benchmark | Phi-4 wins |
TRM (7M params) outperforms DeepSeek-R1 (671B params) on ARC reasoning. That's a model ~100,000× smaller achieving 3× better performance. The difference? Recursive architecture vs autoregressive generation.
Smaller models benefit disproportionately more from fine-tuning than larger ones.
| Model Size | Base Performance | After Fine-tuning | Improvement |
|---|---|---|---|
| 1B-3B | Low | Often matches 70B base | Dramatic (2-5×) |
| 4B-8B | Moderate | Can match GPT-4 on task | Large (1.5-3×) |
| 14B-32B | Good | Frontier-competitive | Moderate (1.2-2×) |
| 70B+ | Very good | Incremental gains | Small (1.1-1.3×) |
| Approach | Model | ARC-AGI-2 Score | Cost/Task | Ratio |
|---|---|---|---|---|
| Frontier + refinement | Poetiq + Gemini 3 | 54% | $30.00 | 150× |
| Frontier thinking | Gemini 3 Deep | 45% | $77.00 | 385× |
| Small fine-tuned | NVARC Qwen3 4B | 24% | $0.20 | 1× (baseline) |
| Tiny specialized | TRM 7M | 8% | ~$0.01 | 0.05× |
NVARC won ARC Prize 2025 with a 4B model at $0.20/task. Poetiq achieves 54% but at $30/task. The question isn't "which scores higher"; it's "what's the cost-performance sweet spot for your use case?"
The latest small models often match or beat much larger models from just a year ago. Here's the complete landscape.
| Model | Params | Context | Best For | License |
|---|---|---|---|---|
| Qwen3-4B | 4B | 32K | Fine-tuning, structured data | Apache 2.0 |
| SmolLM3-3B | 3B | 128K | Transparency, research | Apache 2.0 |
| Phi-4-reasoning-plus | 14B | 16K | Math, science, coding | MIT |
| Ministral 3 8B | 8B | 256K | Multimodal, long context | Apache 2.0 |
| OLMo 2 7B | 7B | 4K | Full reproducibility | Apache 2.0 |
| Llama 3.1 8B | 8B | 128K | Ecosystem, tooling | Llama 3.1 |
| Gemma 3 4B | 4B | 128K | Multimodal, 140+ languages | Gemma |
| Model | MMLU | GSM8K | HumanEval | AIME 2025 |
|---|---|---|---|---|
| Phi-4-reasoning-plus 14B | – | – | 68.8% | 77.7% |
| Ministral 3 14B Reasoning | – | – | – | 85% |
| Phi-4-mini-reasoning 3.8B | – | – | – | 33.6% |
| Qwen3-4B | ~75% | ~85% | ~70% | – |
| SmolLM3-3B | ~68% | ~75% | ~55% | – |
| OLMo 2 7B | ~63% | ~70% | ~45% | – |
Ministral 3 14B Reasoning achieves 85% on AIME 2025 โ matching frontier models from 2024. Phi-4-mini-reasoning (3.8B) beats o1-mini on some benchmarks. The gap between "small" and "frontier" is collapsing rapidly.
| Paper | Year | Key Finding |
|---|---|---|
| Less is More (TRM) | 2025 | 7M params beats 671B on reasoning via recursion |
| LoRA | 2021 | Train 0.01% of params, match full fine-tuning |
| QLoRA | 2023 | 4-bit quantization + LoRA. 70B on single GPU. |
| Phi-4 | 2024 | 14B beats 70B on math via synthetic data |
| Small vs Large LLMs | 2024 | Specialized 1B matches GPT-4 with ~100 samples |
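The LoRA row above can be sketched concretely (an illustrative toy, not the PEFT library implementation; matrix sizes here are tiny for readability): instead of updating a full d_out × d_in weight W, train two small matrices B (d_out × r) and A (r × d_in) and use W + (α/r)·BA.

```python
# Sketch of the LoRA update. B and A hold r*(d_out + d_in) trainable
# parameters instead of d_out*d_in for a full fine-tune, which is the
# source of the "train 0.01% of params" headline at realistic sizes.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, B, A, alpha):
    r = len(A)                                   # LoRA rank
    delta = matmul(B, A)                         # low-rank update B @ A
    return [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                     # frozen base weight
B = [[1.0], [0.0]]                               # d_out x r, with r = 1
A = [[0.0, 2.0]]                                 # r x d_in
print(lora_weight(W, B, A, alpha=1.0))
```

QLoRA adds 4-bit quantization of the frozen W on top of this, which is what lets a 70B base fit on a single GPU.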
A comprehensive guide to reinforcement learning techniques for training and refining LLMs, from foundational RLHF to frontier self-improvement methods. These techniques power the reasoning improvements in ARC solvers.
The best ARC solvers use RL to improve reasoning at test-time. DeepSeek-R1's GRPO, SOAR's evolutionary refinement, and Poetiq's iterative loops all leverage RL principles to achieve SOTA results.
The original method that powered ChatGPT. Train a reward model from human preferences, then use PPO to optimize the policy.
Simplifies RLHF by treating alignment as a classification problem. Used to train Llama 3, Zephyr, NeuralChat.
DPO is simpler and cheaper. PPO performs better on complex tasks (code generation) when properly tuned. For most use cases, start with DPO.
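The DPO objective itself is a one-liner, which is why it is so much cheaper than PPO. A minimal numeric sketch (the log-prob values below are made up for illustration):

```python
import math

# Sketch of the DPO loss: push the policy's log-prob margin on the chosen
# response over the rejected one, measured relative to a frozen reference.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are summed log-probs of each full response under each model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# As the policy favors the chosen answer more, the loss falls.
loose = dpo_loss(-10.0, -9.0, -10.0, -10.0)  # policy prefers the rejected one
tight = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy prefers the chosen one
```

No reward model and no sampling loop: the preference pair plus a frozen reference model is the entire training signal.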
| Method | Key Innovation | When to Use | Code |
|---|---|---|---|
| IPO | Regularization to prevent overfitting | DPO overfits your data | TRL |
| KTO | Uses prospect theory, no pairs needed | Only have thumbs up/down data | TRL |
| ORPO | No reference model needed | Memory constrained | TRL |
| SimPO | Reference-free, simpler objective | Want stability + performance | GitHub |
GRPO and its variants are the dominant algorithms for training reasoning LLMs in 2025-2026. This section covers the full evolution from GRPO → DAPO → VAPO → Dr. GRPO.
GRPO proved that pure RL can develop reasoning without supervised fine-tuning. DeepSeek-R1 showed emergent behaviors (reflection, self-verification, "aha moments") arose naturally from GRPO training with simple regex rewards.
The algorithm behind DeepSeek-R1's reasoning capabilities. Cuts PPO compute in half by eliminating the value network.
| Aspect | PPO | GRPO |
|---|---|---|
| Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref) |
| Value estimation | Learned critic network | Group mean baseline |
| Memory | High | ~50% less |
| KL penalty | In reward signal | Direct loss term |
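The "group mean baseline" row is the heart of GRPO and fits in a few lines. A minimal sketch (plain Python, no training loop):

```python
# Sketch of GRPO's critic-free advantage: sample a group of responses for
# the same prompt, score each, and normalize by the group's own statistics
# instead of a learned value network.
def grpo_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0                  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# One group of 4 sampled answers; two passed the verifier (reward 1).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers get positive advantages, wrong ones negative, and the advantages are centered within each group, which is exactly what replaces PPO's value network.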
During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider...", emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.
A "GRPO++" that fixes key issues in GRPO: 50 points on AIME 2024 with Qwen2.5-32B, outperforming DeepSeek-R1-Zero.
| Technique | Problem Solved | How |
|---|---|---|
| Clip-Higher | Entropy collapse | Asymmetric clipping (0.2 lower, 0.28 upper) |
| No KL Loss | Restricts long CoT | Remove KL term entirely for reasoning tasks |
| Dynamic Sampling | Zero gradients when all correct/wrong | Oversample, filter groups with acc=0 or acc=1 |
| Token-Level Loss | Length bias | Average loss over all tokens, not sequences |
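Of the four fixes in the table, Dynamic Sampling is the easiest to sketch: groups whose samples are all correct or all wrong have zero group-relative advantage, so they contribute no gradient and should be filtered out.

```python
# Sketch of DAPO's Dynamic Sampling: oversample prompt groups, then keep
# only the groups with mixed outcomes, since uniform groups (acc = 0 or 1)
# produce zero advantages and waste the batch.
def dynamic_sample(groups):
    kept = []
    for rewards in groups:
        acc = sum(rewards) / len(rewards)
        if 0.0 < acc < 1.0:          # mixed outcomes => nonzero advantages
            kept.append(rewards)
    return kept

groups = [[1, 1, 1, 1],   # all correct: filtered out
          [0, 0, 0, 0],   # all wrong:   filtered out
          [1, 0, 1, 0]]   # mixed:       kept
```

In practice this means sampling more groups than the batch needs and discarding the uninformative ones before the update.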
Returns to value-based methods with better credit assignment. 60.4 on AIME 2024, the current SOTA.
GRPO/DAPO use final reward only. VAPO learns per-step values for precise credit assignment, which is crucial for long reasoning chains where early errors compound.
| Method | AIME 2024 | Steps to Match |
|---|---|---|
| GRPO (DeepSeek-R1) | 47 | baseline |
| DAPO | 50 | 50% fewer |
| VAPO | 60.4 | 60% fewer than DAPO |
Fixes mathematical biases in GRPO's advantage estimation.
Prevents models from generating progressively longer incorrect responses. Improves token efficiency.
Simplified alternative to GRPO. More stable training, used by ProRL V2, ScaleRL, Magistral.
REINFORCE++ with k=1 achieves top-tier scores while being more token-efficient than group-sampling GRPO.
| Algorithm | AIME 2024 | Value Model | KL Loss | Best For |
|---|---|---|---|---|
| GRPO | 47 | No | Yes | General reasoning |
| DAPO | 50 | No | No | Long CoT, math |
| VAPO | 60.4 | Yes | – | Complex reasoning, SOTA |
| Dr. GRPO | – | No | No | Token efficiency |
| REINFORCE++ | – | No | – | Stability, single-sample |
| Paper | Date | Contribution |
|---|---|---|
| DeepSeekMath | Feb 2024 | Introduced GRPO for math reasoning |
| DeepSeek-R1 | Jan 2025 | GRPO for general reasoning, "aha moment" |
| REINFORCE++ | Jan 2025 | Stabilized critic-free RL |
| DAPO | Mar 2025 | Four fixes for GRPO, 50 on AIME |
| Dr. GRPO | Mar 2025 | Unbiased GRPO, fixes length explosion |
| VAPO | Apr 2025 | Value-based, 60.4 AIME SOTA |
Bootstraps reasoning ability by learning from self-generated rationales. The model teaches itself to reason.
Extends STaR to teach LLMs to "think before speaking" with internal rationales at every token.
LLM plays against previous versions of itself in a GAN-like framework.
Model acts as both actor AND judge โ provides its own rewards via LLM-as-a-Judge prompting.
Self-improving evolutionary program synthesis. 52% on ARC-AGI-1, SOTA for open-source without DSLs.
Use deterministic verifiers (code execution, math checking) as ground-truth rewards instead of learned reward models.
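A verifiable reward can be as simple as running the candidate against known test cases. A minimal sketch (the toy task, `solve` convention, and test cases are invented for illustration):

```python
# Sketch of a verifiable reward (RLVR): instead of a learned reward model,
# run a deterministic checker. Here: execute candidate source code that is
# expected to define `solve`, and grade it against known test cases.
def verifiable_reward(candidate_src, tests):
    scope = {}
    try:
        exec(candidate_src, scope)               # define the candidate fn
        fn = scope["solve"]
        return float(all(fn(x) == y for x, y in tests))
    except Exception:
        return 0.0                               # any failure scores zero

tests = [(2, 4), (5, 25)]                        # invented spec: square n
good = verifiable_reward("def solve(n): return n * n", tests)
bad  = verifiable_reward("def solve(n): return n + n", tests)
```

Because the reward is ground truth rather than a learned proxy, it cannot be reward-hacked the way a preference model can, which is why DeepSeek-R1's simple regex/checker rewards worked so well.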
Expectation-Maximization framework for self-training with binary rewards.
Combines process reward guidance with tree search for higher-quality reasoning traces.
Train harmless AI using self-improvement with a "constitution" of principles. No human labels for harmful outputs.
| Paper | Year | Code | Key Contribution |
|---|---|---|---|
| InstructGPT (RLHF) | 2022 | – | Original RLHF pipeline |
| DPO | 2023 | – | Reward-free preference learning |
| DeepSeekMath (GRPO) | 2024 | – | Memory-efficient PPO variant |
| STaR | 2022 | – | Self-taught reasoning |
| SPIN | 2024 | – | Self-play without rewards |
| Self-Rewarding | 2024 | – | LLM as its own judge |
| SOAR | 2025 | – | Evolutionary synthesis, 52% ARC |
| Constitutional AI | 2022 | – | RLAIF with principles |
| Paper | Year | Code | Why read |
|---|---|---|---|
| On the Measure of Intelligence | 2019 | – | The foundational paper |
| Less is More (TRM) | 2025 | – | 7M params beats 671B models |
| SOAR | 2025 | – | Self-improving evolution, 52% |
| CompressARC | 2025 | – | MDL, no pretraining, 76K params |
| TTT for Abstract Reasoning | 2024 | – | 61.9% with ensemble |
| GridCoder | 2024 | – | Neurally-guided synthesis |
| ARC Prize 2025 Report | 2026 | – | Official competition analysis |
Andrej Karpathy's nanochat is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization → pretraining → SFT → RL → evaluation → chat UI. You can train a GPT-2-capability model for ~$72 in 3 hours.
One dial controls everything: depth. Set `--depth=26` for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.
Modern transformer with carefully selected components:
Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.
nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:
| Hyperparameter | Formula | Reference |
|---|---|---|
| Model dim | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper |
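The table's formulas can be wired together in a few lines. A sketch (the reference constants `B_ref`, `D_ref`, `eta_ref`, `lambda_ref` below are illustrative placeholders, not nanochat's actual tuned values):

```python
# Sketch of depth-derived hyperparameters following the table's formulas.
# All reference constants are assumed values for illustration only.
def derived_hparams(depth, aspect_ratio=64,
                    B_ref=524_288, D_ref=768, eta_ref=0.02, lambda_ref=0.1):
    model_dim = depth * aspect_ratio               # width scales with depth
    B = B_ref * (model_dim / D_ref) ** 0.383       # batch-size power law
    eta = eta_ref * (B / B_ref) ** 0.5             # lr ~ sqrt(batch) (AdamW)
    lam = lambda_ref * (B / B_ref) ** 0.5 * (D_ref / model_dim)
    return {"model_dim": model_dim, "batch_tokens": round(B),
            "lr": eta, "weight_decay": lam}

hp = derived_hparams(26)   # the speedrun's d26 configuration
```

The point of the design is visible here: every downstream knob is a pure function of depth, so changing one flag rescales the whole run consistently.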
All hyperparameters tuned at d12 (GPT-1 size, ~5min training runs), then transferred to larger depths via muP-style scaling. For quick experiments: d12 (5min), d16 (15min), d20 (1h), d26 (3h).
Wall-clock time to exceed GPT-2 CORE score (0.2565) on 8xH100:
| # | Time | val_bpb | CORE | Key change | Date |
|---|---|---|---|---|---|
| 0 | 168h | – | 0.2565 | Original OpenAI GPT-2 (2019) | 2019 |
| 1 | 3.04h | 0.7483 | 0.2585 | d24 baseline | Jan 29 2026 |
| 2 | 2.91h | 0.7450 | 0.2578 | +FP8 training | Feb 2 2026 |
| 3 | 2.76h | 0.7465 | 0.2602 | 1M token batch size | Feb 5 2026 |
Cost: 8xH100 @ ~$24/hr → ~$72 to train a GPT-2 equivalent (was ~$43,000 in 2019).
Set --depth and everything else follows. No hyperparameter tuning needed. This is what compute-optimal training looks like when done right.
Different parameter types need different optimizers. Matrix parameters (attention, MLP) benefit from orthogonalization. Embeddings/scalars use standard AdamW.
FA3 on H100 with alternating window patterns (SSSL = 3 sliding + 1 full context) balances speed and long-range attention.
Strip PPO to its core: on-policy means no clipping needed. Token-level normalization (DAPO) + mean-centered advantages. No KL penalty.
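The stripped-down objective described above can be sketched directly (a minimal stand-in, not nanochat's actual training code; the log-prob and reward numbers are invented):

```python
# Sketch of the simplified on-policy loss: mean-centered advantages per
# prompt group (no critic), loss averaged over all tokens in the batch
# (DAPO-style token-level normalization), no clipping, no KL term.
def policy_loss(seq_logprobs, seq_rewards):
    # seq_logprobs: one list of chosen-token log-probs per sampled sequence.
    mean_r = sum(seq_rewards) / len(seq_rewards)
    advs = [r - mean_r for r in seq_rewards]     # mean-centered advantages
    total, n_tokens = 0.0, 0
    for lps, a in zip(seq_logprobs, advs):
        total += sum(-lp * a for lp in lps)      # REINFORCE term per token
        n_tokens += len(lps)
    return total / n_tokens                      # average over tokens, not seqs

loss = policy_loss([[-0.5, -0.7], [-0.3, -0.2, -0.9]], [1.0, 0.0])
```

Dividing by total tokens rather than per sequence removes the length bias that sequence-level averaging introduces; on-policy sampling is what makes dropping the clip and KL terms safe.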
~856K rows total. Multiple epochs on high-value data (GSM8K 2ร, identity 2ร). Best-fit packing to maximize token efficiency.
```bash
# Clone and setup
git clone https://github.com/karpathy/nanochat
cd nanochat && uv sync

# Train GPT-2 equivalent (~3 hours on 8xH100)
bash runs/speedrun.sh

# Or quick experiment with d12 (~5 min)
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- --depth=12 --run="d12"

# Chat with your model
python -m scripts.chat_web
```
nanochat's pipeline directly applies to ARC solving:
| nanochat Stage | ARC Application |
|---|---|
| Base pretraining | General reasoning foundation |
| SFT | Fine-tune on ARC-style grid tasks, DSL programs |
| RL (GRPO) | Self-improve with verifiable rewards (grid match) |
| TTT at inference | Train on each ARC task (like NVARC winner) |
The RL methods in nanochat (simplified GRPO/REINFORCE) are the same foundations used by: