The Abstraction and Reasoning Corpus – distilled from 40+ papers
Last updated: 2026-02-15
ARC-AGI tests whether AI can learn new abstractions from a few examples, the kind of reasoning humans do effortlessly but current AI struggles with, making it the benchmark for measuring progress toward general intelligence.
ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, researcher at Google) to measure machine intelligence in ways that current AI cannot easily game.
Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
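A toy illustration of the task format (the grids and hidden rule here are invented, not an actual ARC task): represent grids as lists of lists of color indices, find a candidate rule consistent with every train pair, and apply it to the test input.

```python
# Toy ARC-style task: grids are lists of lists of color indices (0-9).
# The hidden rule in this invented example is a horizontal flip.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]],      [[6, 5, 4]]),
]
test_input = [[7, 8], [9, 0]]

# Candidate transformations to search over.
def hflip(g):     return [row[::-1] for row in g]
def vflip(g):     return g[::-1]
def transpose(g): return [list(col) for col in zip(*g)]

candidates = [hflip, vflip, transpose]

# Keep the first rule consistent with *all* train pairs, then apply it.
rule = next(f for f in candidates
            if all(f(x) == y for x, y in train_pairs))
print(rule(test_input))  # the inferred rule applied to the test input
```

Real solvers search vastly larger program spaces, but the contract is the same: the rule must explain every demonstration, then generalize to the test grid.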
ARC is designed to be easy for humans but hard for machines: tasks assume only Core Knowledge priors (objects, counting, basic geometry), every task is novel, and memorizing the public training set does not help on the hidden test set.
ARC measures skill-acquisition efficiency: how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.
| 85% | Average human performance on ARC-AGI public eval[1] |
| 55.5% | Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2] |
| $1M | ARC Prize for 85%+ on private eval with open-source code[1] |
Released early 2025, ARC-AGI-2 is significantly harder for AI while remaining easy for humans. Pure LLMs score 0%. Even frontier reasoning systems achieve only single-digit percentages.
ARC-AGI-1 tasks could often be solved instantly by humans. In ARC-AGI-2, every task requires deliberate thinking; average human completion time is 2.7 minutes. But 100% of tasks have been solved by at least 2 humans in under 2 attempts.
| Aspect | ARC-AGI-1 | ARC-AGI-2 |
|---|---|---|
| Eval set size | 100 tasks | 120 tasks |
| Human avg score | ~85% | ~60% |
| Human avg time | < 1 min | 2.7 min |
| Pure LLM score | ~5-15% | 0% |
| Best AI (open) | 53.5% | ~24% |
| Best AI (any) | 55.5% | 54% |
Log-linear scaling is insufficient for ARC-AGI-2. New test-time adaptation algorithms or novel AI architectures are needed. You can't just throw more compute at it.
1,455 teams submitted 15,154 entries. More than $125,000 awarded.
NVIDIA Kaggle Grandmasters' winning approach combines Qwen3 4B with Tiny Recursive Models.
Heavyweight LLM reasoning (CoT, tool use, agents) couldn't fit Kaggle's runtime. Instead: train smaller models offline that run fast during evaluation.
Samsung researcher Alexia Jolicoeur-Martineau shows that 7M parameters can beat 671B parameter models.
Recursion substitutes for depth and size. By iteratively reasoning over its own output, TRM simulates a much deeper architecture without the memory cost.
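The recursion-for-depth idea can be sketched with a stand-in problem (this is not TRM itself; the Newton step below is just an illustrative shared update): one small rule applied T times with the same "weights" does the work of a T-layer stack of distinct layers.

```python
# Sketch of recursion substituting for depth (illustrative, not TRM):
# one shared update rule, applied repeatedly, emulates a much deeper
# computation while storing only a single layer's worth of parameters.
def refine(z, x):
    # One shared refinement step (here: a Newton step toward sqrt(x)).
    return 0.5 * (z + x / z)

def recursive_solve(x, steps=16):
    z = 1.0                      # initial latent guess
    for _ in range(steps):       # the same step reused `steps` times
        z = refine(z, x)         # ~ a `steps`-deep net, 1 layer of params
    return z

print(recursive_solve(2.0))      # converges toward sqrt(2)
```

TRM applies the same trick to a learned grid-reasoning update: iterating it deepens the effective computation without growing the parameter count.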
Self-improving evolutionary program synthesis by Julien Pourcel et al. (ICML 2025).
Isaac Liao (CMU) shows lossless compression alone can produce intelligent behavior.
Major evolution from their 2024 autoregressive approach to masked diffusion.
First to break 50% on ARC-AGI-2 through refinement loops on Gemini 3 Pro.
The emergence of the refinement loop โ per-task iterative optimization guided by feedback. This is how the 50% barrier was finally broken.
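The refinement loop can be sketched as propose, verify, repeat (a hypothetical stand-in for Poetiq-style systems; the candidate set and deterministic proposer below are invented for illustration):

```python
# Sketch of a per-task refinement loop: propose a candidate solution,
# score it with a verifier on the train pairs, and use the feedback to
# decide whether to keep refining. Candidates here are toy placeholders.
train_pairs = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]

def identity(g): return [list(r) for r in g]
def vflip(g):    return g[::-1]
def hflip(g):    return [r[::-1] for r in g]

def verifier(program):
    # Feedback signal: fraction of train pairs reproduced exactly.
    return sum(program(x) == y for x, y in train_pairs) / len(train_pairs)

def refinement_loop(candidates, budget=10):
    best, best_score = None, -1.0
    for step in range(budget):
        prog = candidates[step % len(candidates)]  # stand-in for a proposer
        score = verifier(prog)
        if score > best_score:
            best, best_score = prog, score
        if best_score == 1.0:                      # verifier says: done
            break
    return best, best_score

best, score = refinement_loop([identity, vflip, hflip])
```

The key property is that the verifier score, not the model's own confidence, drives iteration, which is what makes per-task optimization reliable.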
| Model | ARC-AGI-2 | Cost/task | Notes |
|---|---|---|---|
| Poetiq + Gemini 3 Pro | 54% | $30 | SOTA verified, open source |
| Gemini 3 Deep Think | 45% | $77 | Previous best |
| Opus 4.5 (Thinking, 64k) | 37.6% | $2.20 | Best single-model commercial |
| Gemini 3 Pro (baseline) | 31% | $0.81 | No refinement |
| NVARC | 24% | $0.20 | Competition winner |
Launching March 25, 2026. The first major format change since ARC was introduced in 2019. ARC-AGI-3 moves from static reasoning to interactive reasoning.
ARC-1 and ARC-2 test static reasoning on grids. ARC-3 tests interactive reasoning in mini-games โ agents must explore, plan, remember, and adapt across multiple steps.
ARC-AGI-1 and ARC-AGI-2 are being "overfit" โ not through direct memorization, but because public train and private test sets share enough similarity that models trained on public data perform better than they should. ARC-3 makes this impossible.
The ARC-AGI-3 preview released video-game-like environments where agents must achieve long-horizon goals.
ARC-AGI-3 will enable formal comparison of human and AI action efficiency (learning efficiency) for the first time. Not just "did you solve it?" but "how efficiently did you learn the rules?"
Play the preview games and see how you compare to current AI systems.
| July 2025 | Preview release (3 games) |
| March 25, 2026 | Full ARC-AGI-3 launch |
| 2026 | ARC Prize 2026 competition |
The uncomfortable truths about current approaches.
| System | ARC-AGI-1 | ARC-AGI-2 | Approach |
|---|---|---|---|
| o3-high | 55.5% | – | Massive compute ($10K+/task) |
| NVARC (2025 winner) | – | ~24% | TTT + TRM ensemble |
| ARChitects (2024 winner) | 53.5% | – | TTT + augmentation |
| marc | 61.9%* | – | TTT + program synthesis |
| TRM (7M params) | 45% | 8% | Recursive reasoning |
| Humans | ~85% | ~60% | – |
*marc ensemble with program synthesis. Scores as of Feb 2025.
Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't accuracy; it's efficiency.
Practical approaches that actually work.
Do you need GPT-4 or can a fine-tuned 4B model do the job? The evidence is clear: for specialized tasks, small fine-tuned models often match or beat frontier models at 1/100th the cost.
A 7M parameter TRM beats 671B DeepSeek-R1 on ARC reasoning (45% vs 15.8%). A fine-tuned Qwen3-4B matches GPT-4 on specialized tasks. Architecture and training matter more than raw size.
| Small Model | Large Model | Task | Result |
|---|---|---|---|
| TRM 7M | DeepSeek-R1 671B | ARC-AGI-1 | 45% vs 15.8% (3× better) |
| TRM 7M | GPT-4, Claude, R1 | Hard Sudoku | 87.4% vs 0% |
| CompressARC 76K | No pretraining | ARC-AGI-1 | 20-34% from scratch |
| Qwen3-4B fine-tuned | GPT-OSS-120B | 8 benchmarks | Matches/exceeds 7 of 8 |
| Llama 3.2 1B fine-tuned | GPT-4.1 | E-commerce intent | 99% vs 99% |
| Phi-4 14B | Llama 3.3 70B | MATH benchmark | Phi-4 wins |
TRM (7M params) outperforms DeepSeek-R1 (671B params) on ARC reasoning. That's a model ~100,000× smaller achieving 3× better performance. The difference? Recursive architecture vs autoregressive generation.
Smaller models benefit disproportionately more from fine-tuning than larger ones.
| Model Size | Base Performance | After Fine-tuning | Improvement |
|---|---|---|---|
| 1B-3B | Low | Often matches 70B base | Dramatic (2-5×) |
| 4B-8B | Moderate | Can match GPT-4 on task | Large (1.5-3×) |
| 14B-32B | Good | Frontier-competitive | Moderate (1.2-2×) |
| 70B+ | Very good | Incremental gains | Small (1.1-1.3×) |
| Approach | Model | ARC-AGI-2 Score | Cost/Task | Ratio |
|---|---|---|---|---|
| Frontier + refinement | Poetiq + Gemini 3 | 54% | $30.00 | 150× |
| Frontier thinking | Gemini 3 Deep | 45% | $77.00 | 385× |
| Small fine-tuned | NVARC Qwen3 4B | 24% | $0.20 | 1× (baseline) |
| Tiny specialized | TRM 7M | 8% | ~$0.01 | 0.05× |
NVARC won ARC Prize 2025 with a 4B model at $0.20/task. Poetiq achieves 54% but at $30/task. The question isn't "which scores higher"; it's "what's the cost-performance sweet spot for your use case?"
The latest small models often match or beat much larger models from just a year ago. Here's the complete landscape.
| Model | Params | Context | Best For | License |
|---|---|---|---|---|
| Qwen3-4B | 4B | 32K | Fine-tuning, structured data | Apache 2.0 |
| SmolLM3-3B | 3B | 128K | Transparency, research | Apache 2.0 |
| Phi-4-reasoning-plus | 14B | 16K | Math, science, coding | MIT |
| Ministral 3 8B | 8B | 256K | Multimodal, long context | Apache 2.0 |
| OLMo 2 7B | 7B | 4K | Full reproducibility | Apache 2.0 |
| Llama 3.1 8B | 8B | 128K | Ecosystem, tooling | Llama 3.1 |
| Gemma 3 4B | 4B | 128K | Multimodal, 140+ languages | Gemma |
| Model | MMLU | GSM8K | HumanEval | AIME 2025 |
|---|---|---|---|---|
| Phi-4-reasoning-plus 14B | – | – | 68.8% | 77.7% |
| Ministral 3 14B Reasoning | – | – | – | 85% |
| Phi-4-mini-reasoning 3.8B | – | – | – | 33.6% |
| Qwen3-4B | ~75% | ~85% | ~70% | – |
| SmolLM3-3B | ~68% | ~75% | ~55% | – |
| OLMo 2 7B | ~63% | ~70% | ~45% | – |
Ministral 3 14B Reasoning achieves 85% on AIME 2025 โ matching frontier models from 2024. Phi-4-mini-reasoning (3.8B) beats o1-mini on some benchmarks. The gap between "small" and "frontier" is collapsing rapidly.
| Paper | Year | Key Finding |
|---|---|---|
| Less is More (TRM) | 2025 | 7M params beats 671B on reasoning via recursion |
| LoRA | 2021 | Train 0.01% of params, match full fine-tuning |
| QLoRA | 2023 | 4-bit quantization + LoRA. 70B on single GPU. |
| Phi-4 | 2024 | 14B beats 70B on math via synthetic data |
| Small vs Large LLMs | 2024 | Specialized 1B matches GPT-4 with ~100 samples |
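The LoRA row above can be sketched concretely (an illustrative toy, not the PEFT library implementation; matrix sizes here are tiny for readability): instead of updating a full d_out × d_in weight W, train two small matrices B (d_out × r) and A (r × d_in) and use W + (α/r)·BA.

```python
# Sketch of the LoRA update. B and A hold r*(d_out + d_in) trainable
# parameters instead of d_out*d_in for a full fine-tune, which is the
# source of the "train 0.01% of params" headline at realistic sizes.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, B, A, alpha):
    r = len(A)                                   # LoRA rank
    delta = matmul(B, A)                         # low-rank update B @ A
    return [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                     # frozen base weight
B = [[1.0], [0.0]]                               # d_out x r, with r = 1
A = [[0.0, 2.0]]                                 # r x d_in
print(lora_weight(W, B, A, alpha=1.0))
```

QLoRA adds 4-bit quantization of the frozen W on top of this, which is what lets a 70B base fit on a single GPU.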
A comprehensive guide to reinforcement learning techniques for training and refining LLMs, from foundational RLHF to frontier self-improvement methods. These techniques power the reasoning improvements in ARC solvers.
The best ARC solvers use RL to improve reasoning at test-time. DeepSeek-R1's GRPO, SOAR's evolutionary refinement, and Poetiq's iterative loops all leverage RL principles to achieve SOTA results.
The original method that powered ChatGPT. Train a reward model from human preferences, then use PPO to optimize the policy.
Simplifies RLHF by treating alignment as a classification problem. Used to train Llama 3, Zephyr, NeuralChat.
DPO is simpler and cheaper. PPO performs better on complex tasks (code generation) when properly tuned. For most use cases, start with DPO.
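The DPO objective itself is a one-liner, which is why it is so much cheaper than PPO. A minimal numeric sketch (the log-prob values below are made up for illustration):

```python
import math

# Sketch of the DPO loss: push the policy's log-prob margin on the chosen
# response over the rejected one, measured relative to a frozen reference.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are summed log-probs of each full response under each model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# As the policy favors the chosen answer more, the loss falls.
loose = dpo_loss(-10.0, -9.0, -10.0, -10.0)  # policy prefers the rejected one
tight = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy prefers the chosen one
```

No reward model and no sampling loop: the preference pair plus a frozen reference model is the entire training signal.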
| Method | Key Innovation | When to Use | Code |
|---|---|---|---|
| IPO | Regularization to prevent overfitting | DPO overfits your data | TRL |
| KTO | Uses prospect theory, no pairs needed | Only have thumbs up/down data | TRL |
| ORPO | No reference model needed | Memory constrained | TRL |
| SimPO | Reference-free, simpler objective | Want stability + performance | GitHub |
GRPO and its variants are the dominant algorithms for training reasoning LLMs in 2025-2026. This section covers the full evolution from GRPO → DAPO → VAPO → Dr. GRPO.
GRPO proved that pure RL can develop reasoning without supervised fine-tuning. DeepSeek-R1 showed emergent behaviors (reflection, self-verification, "aha moments") arose naturally from GRPO training with simple regex rewards.
The algorithm behind DeepSeek-R1's reasoning capabilities. Cuts PPO compute in half by eliminating the value network.
| Aspect | PPO | GRPO |
|---|---|---|
| Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref) |
| Value estimation | Learned critic network | Group mean baseline |
| Memory | High | ~50% less |
| KL penalty | In reward signal | Direct loss term |
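The "group mean baseline" row is the heart of GRPO and fits in a few lines. A minimal sketch (plain Python, no training loop):

```python
# Sketch of GRPO's critic-free advantage: sample a group of responses for
# the same prompt, score each, and normalize by the group's own statistics
# instead of a learned value network.
def grpo_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0                  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# One group of 4 sampled answers; two passed the verifier (reward 1).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers get positive advantages, wrong ones negative, and the advantages are centered within each group, which is exactly what replaces PPO's value network.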
During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider...", emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.
A "GRPO++" that fixes key issues in GRPO: 50 points on AIME 2024 with Qwen2.5-32B, outperforming DeepSeek-R1-Zero.
| Technique | Problem Solved | How |
|---|---|---|
| Clip-Higher | Entropy collapse | Asymmetric clipping (0.2 lower, 0.28 upper) |
| No KL Loss | Restricts long CoT | Remove KL term entirely for reasoning tasks |
| Dynamic Sampling | Zero gradients when all correct/wrong | Oversample, filter groups with acc=0 or acc=1 |
| Token-Level Loss | Length bias | Average loss over all tokens, not sequences |
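Of the four fixes in the table, Dynamic Sampling is the easiest to sketch: groups whose samples are all correct or all wrong have zero group-relative advantage, so they contribute no gradient and should be filtered out.

```python
# Sketch of DAPO's Dynamic Sampling: oversample prompt groups, then keep
# only the groups with mixed outcomes, since uniform groups (acc = 0 or 1)
# produce zero advantages and waste the batch.
def dynamic_sample(groups):
    kept = []
    for rewards in groups:
        acc = sum(rewards) / len(rewards)
        if 0.0 < acc < 1.0:          # mixed outcomes => nonzero advantages
            kept.append(rewards)
    return kept

groups = [[1, 1, 1, 1],   # all correct: filtered out
          [0, 0, 0, 0],   # all wrong:   filtered out
          [1, 0, 1, 0]]   # mixed:       kept
```

In practice this means sampling more groups than the batch needs and discarding the uninformative ones before the update.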
Returns to value-based methods with better credit assignment. 60.4 on AIME 2024, the current SOTA.
GRPO/DAPO use final reward only. VAPO learns per-step values for precise credit assignment, which is crucial for long reasoning chains where early errors compound.
| Method | AIME 2024 | Steps to Match |
|---|---|---|
| GRPO (DeepSeek-R1) | 47 | baseline |
| DAPO | 50 | 50% fewer |
| VAPO | 60.4 | 60% fewer than DAPO |
Fixes mathematical biases in GRPO's advantage estimation.
Prevents models from generating progressively longer incorrect responses. Improves token efficiency.
Simplified alternative to GRPO. More stable training, used by ProRL V2, ScaleRL, Magistral.
REINFORCE++ with k=1 achieves top-tier scores while being more token-efficient than group-sampling GRPO.
| Algorithm | AIME 2024 | Value Model | KL Loss | Best For |
|---|---|---|---|---|
| GRPO | 47 | No | Yes | General reasoning |
| DAPO | 50 | No | No | Long CoT, math |
| VAPO | 60.4 | Yes | – | Complex reasoning, SOTA |
| Dr. GRPO | – | No | No | Token efficiency |
| REINFORCE++ | – | No | – | Stability, single-sample |
| Paper | Date | Contribution |
|---|---|---|
| DeepSeekMath | Feb 2024 | Introduced GRPO for math reasoning |
| DeepSeek-R1 | Jan 2025 | GRPO for general reasoning, "aha moment" |
| REINFORCE++ | Jan 2025 | Stabilized critic-free RL |
| DAPO | Mar 2025 | Four fixes for GRPO, 50 on AIME |
| Dr. GRPO | Mar 2025 | Unbiased GRPO, fixes length explosion |
| VAPO | Apr 2025 | Value-based, 60.4 AIME SOTA |
Bootstraps reasoning ability by learning from self-generated rationales. The model teaches itself to reason.
Extends STaR to teach LLMs to "think before speaking" with internal rationales at every token.
LLM plays against previous versions of itself in a GAN-like framework.
Model acts as both actor AND judge โ provides its own rewards via LLM-as-a-Judge prompting.
Self-improving evolutionary program synthesis. 52% on ARC-AGI-1, SOTA for open-source without DSLs.
Use deterministic verifiers (code execution, math checking) as ground-truth rewards instead of learned reward models.
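A verifiable reward can be as simple as running the candidate against known test cases. A minimal sketch (the toy task, `solve` convention, and test cases are invented for illustration):

```python
# Sketch of a verifiable reward (RLVR): instead of a learned reward model,
# run a deterministic checker. Here: execute candidate source code that is
# expected to define `solve`, and grade it against known test cases.
def verifiable_reward(candidate_src, tests):
    scope = {}
    try:
        exec(candidate_src, scope)               # define the candidate fn
        fn = scope["solve"]
        return float(all(fn(x) == y for x, y in tests))
    except Exception:
        return 0.0                               # any failure scores zero

tests = [(2, 4), (5, 25)]                        # invented spec: square n
good = verifiable_reward("def solve(n): return n * n", tests)
bad  = verifiable_reward("def solve(n): return n + n", tests)
```

Because the reward is ground truth rather than a learned proxy, it cannot be reward-hacked the way a preference model can, which is why DeepSeek-R1's simple regex/checker rewards worked so well.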
Expectation-Maximization framework for self-training with binary rewards.
Combines process reward guidance with tree search for higher-quality reasoning traces.
Train harmless AI using self-improvement with a "constitution" of principles. No human labels for harmful outputs.
| Paper | Year | Code | Key Contribution |
|---|---|---|---|
| InstructGPT (RLHF) | 2022 | – | Original RLHF pipeline |
| DPO | 2023 | – | Reward-free preference learning |
| DeepSeekMath (GRPO) | 2024 | – | Memory-efficient PPO variant |
| STaR | 2022 | – | Self-taught reasoning |
| SPIN | 2024 | – | Self-play without rewards |
| Self-Rewarding | 2024 | – | LLM as its own judge |
| SOAR | 2025 | – | Evolutionary synthesis, 52% ARC |
| Constitutional AI | 2022 | – | RLAIF with principles |
| Paper | Year | Code | Why read |
|---|---|---|---|
| On the Measure of Intelligence | 2019 | – | The foundational paper |
| Less is More (TRM) | 2025 | – | 7M params beats 671B models |
| SOAR | 2025 | – | Self-improving evolution, 52% |
| CompressARC | 2025 | – | MDL, no pretraining, 76K params |
| TTT for Abstract Reasoning | 2024 | – | 61.9% with ensemble |
| GridCoder | 2024 | – | Neurally-guided synthesis |
| ARC Prize 2025 Report | 2026 | – | Official competition analysis |
Andrej Karpathy's nanochat is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization → pretraining → SFT → RL → evaluation → chat UI. You can train a GPT-2-capability model for ~$72 in 3 hours.
One dial controls everything: depth. Set `--depth=26` for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.
Modern transformer with carefully selected components:
Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.
nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:
| Hyperparameter | Formula | Reference |
|---|---|---|
| Model dim | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper |
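The table's formulas can be wired together in a few lines. A sketch (the reference constants `B_ref`, `D_ref`, `eta_ref`, `lambda_ref` below are illustrative placeholders, not nanochat's actual tuned values):

```python
# Sketch of depth-derived hyperparameters following the table's formulas.
# All reference constants are assumed values for illustration only.
def derived_hparams(depth, aspect_ratio=64,
                    B_ref=524_288, D_ref=768, eta_ref=0.02, lambda_ref=0.1):
    model_dim = depth * aspect_ratio               # width scales with depth
    B = B_ref * (model_dim / D_ref) ** 0.383       # batch-size power law
    eta = eta_ref * (B / B_ref) ** 0.5             # lr ~ sqrt(batch) (AdamW)
    lam = lambda_ref * (B / B_ref) ** 0.5 * (D_ref / model_dim)
    return {"model_dim": model_dim, "batch_tokens": round(B),
            "lr": eta, "weight_decay": lam}

hp = derived_hparams(26)   # the speedrun's d26 configuration
```

The point of the design is visible here: every downstream knob is a pure function of depth, so changing one flag rescales the whole run consistently.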
All hyperparameters tuned at d12 (GPT-1 size, ~5min training runs), then transferred to larger depths via muP-style scaling. For quick experiments: d12 (5min), d16 (15min), d20 (1h), d26 (3h).
Wall-clock time to exceed GPT-2 CORE score (0.2565) on 8xH100:
| # | Time | val_bpb | CORE | Key change | Date |
|---|---|---|---|---|---|
| 0 | 168h | – | 0.2565 | Original OpenAI GPT-2 (2019) | 2019 |
| 1 | 3.04h | 0.7483 | 0.2585 | d24 baseline | Jan 29 2026 |
| 2 | 2.91h | 0.7450 | 0.2578 | +FP8 training | Feb 2 2026 |
| 3 | 2.76h | 0.7465 | 0.2602 | 1M token batch size | Feb 5 2026 |
Cost: 8xH100 @ ~$24/hr → ~$72 to train a GPT-2 equivalent (was ~$43,000 in 2019).
Set --depth and everything else follows. No hyperparameter tuning needed. This is what compute-optimal training looks like when done right.
Different parameter types need different optimizers. Matrix parameters (attention, MLP) benefit from orthogonalization. Embeddings/scalars use standard AdamW.
FA3 on H100 with alternating window patterns (SSSL = 3 sliding + 1 full context) balances speed and long-range attention.
Strip PPO to its core: on-policy means no clipping needed. Token-level normalization (DAPO) + mean-centered advantages. No KL penalty.
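The stripped-down objective described above can be sketched directly (a minimal stand-in, not nanochat's actual training code; the log-prob and reward numbers are invented):

```python
# Sketch of the simplified on-policy loss: mean-centered advantages per
# prompt group (no critic), loss averaged over all tokens in the batch
# (DAPO-style token-level normalization), no clipping, no KL term.
def policy_loss(seq_logprobs, seq_rewards):
    # seq_logprobs: one list of chosen-token log-probs per sampled sequence.
    mean_r = sum(seq_rewards) / len(seq_rewards)
    advs = [r - mean_r for r in seq_rewards]     # mean-centered advantages
    total, n_tokens = 0.0, 0
    for lps, a in zip(seq_logprobs, advs):
        total += sum(-lp * a for lp in lps)      # REINFORCE term per token
        n_tokens += len(lps)
    return total / n_tokens                      # average over tokens, not seqs

loss = policy_loss([[-0.5, -0.7], [-0.3, -0.2, -0.9]], [1.0, 0.0])
```

Dividing by total tokens rather than per sequence removes the length bias that sequence-level averaging introduces; on-policy sampling is what makes dropping the clip and KL terms safe.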
~856K rows total. Multiple epochs on high-value data (GSM8K 2ร, identity 2ร). Best-fit packing to maximize token efficiency.
```bash
# Clone and setup
git clone https://github.com/karpathy/nanochat
cd nanochat && uv sync

# Train GPT-2 equivalent (~3 hours on 8xH100)
bash runs/speedrun.sh

# Or quick experiment with d12 (~5 min)
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- --depth=12 --run="d12"

# Chat with your model
python -m scripts.chat_web
```
nanochat's pipeline directly applies to ARC solving:
| nanochat Stage | ARC Application |
|---|---|
| Base pretraining | General reasoning foundation |
| SFT | Fine-tune on ARC-style grid tasks, DSL programs |
| RL (GRPO) | Self-improve with verifiable rewards (grid match) |
| TTT at inference | Train on each ARC task (like NVARC winner) |
The RL methods in nanochat (simplified GRPO/REINFORCE) are the same foundations used by: