
ARC-AGI

The Abstraction and Reasoning Corpus – distilled from 40+ papers
Last updated: 2026-02-14

In one sentence

ARC-AGI tests whether AI can learn new abstractions from a few examples – the kind of reasoning humans do effortlessly but current AI struggles with, making it the benchmark for measuring progress toward general intelligence.

Understand

ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, researcher at Google; → fchollet.com) to measure machine intelligence in ways that current AI cannot easily game. Chollet designed ARC to measure intelligence as skill-acquisition efficiency.

What is an ARC task?

Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
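
Concretely, each task is a small JSON object. A minimal loading sketch, assuming the public task-file format (grids as lists of rows of color indices 0-9; the filename below is illustrative):

    import json

    # Each task file has "train" and "test" lists of {"input", "output"}
    # pairs; a grid is a list of rows of color indices 0-9.
    with open("training/0a1b2c3d.json") as f:   # illustrative filename
        task = json.load(f)

    for pair in task["train"]:
        inp, out = pair["input"], pair["output"]
        print(f"{len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

    test_input = task["test"][0]["input"]   # the grid you must transform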

Why ARC matters

ARC is designed to be:
  • Resistant to memorization – every evaluation task is novel
  • Solvable with only Core Knowledge priors (objects, counting, geometry)
  • Easy for humans but hard for machines

The key insight

ARC measures skill-acquisition efficiency – how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.

3 numbers that matter

85% – Average human performance on ARC-AGI public eval[1]
55.5% – Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2]
$1M – ARC Prize for 85%+ on private eval with open-source code[1]

Maturity map

Maturity levels: working · promising · experimental · research

APPROACHES
working – TTT (Test-Time Training: train the model on each task at inference time; key technique in top solvers; → Paper) + LLM (test-time compute)
promising – Program synthesis + search
experimental – Neurosymbolic (LLM + DSL)
research – Active inference, object-centric

Terms & Glossary

Definitions for the key terms used throughout this page.

Core concepts
ARC – Abstraction and Reasoning Corpus.
TTT – Test-Time Training: train on each task at inference time.
DSL – Domain-Specific Language for grid transforms.
Core Knowledge – Human priors (objects, counting, geometry).
Approaches
Program synthesis – Generate code that solves tasks.
Neurosymbolic – Combine neural and symbolic reasoning.
MCTS – Monte Carlo Tree Search over DSL programs.
Active inference – Model the world to minimize surprise.
Datasets
ARC-AGI-1 – Original 800 tasks (400 train, 400 eval).
ARC-AGI-2 – Harder variant announced in 2025.
RE-ARC – Synthetic task variants for training.
1D-ARC – Simplified one-dimensional version for research.
Key systems
o3 – OpenAI's reasoning model. 55.5% on ARC.
NVARC – NVIDIA's TTT approach. 54.5% public.
TRM – Test-time Retraining Model (top Kaggle entry).
MindsAI – Team behind TRM; won the 2024 prize.

Reality Check

The uncomfortable truths about current approaches.

The o3 asterisk
  • 55.5% at $10K+ per task – not practical[2]
  • Low-compute config: only 25% (worse than many open approaches)
  • Still far from human-level efficiency
  • Does not qualify for the ARC Prize (not open-source)

Current leaderboard (2025)

System         Public   Private   Approach
o3-high        55.5%    –         Chain-of-thought + massive compute
NVARC          54.5%    –         TTT + augmentation (→ Paper)
TRM (MindsAI)  53.5%    40%       TTT ensemble, 2024 Kaggle winner (→ Kaggle)
Humans         ~85%     ~85%      –

Scores as of Feb 2025. Private eval is the official benchmark.

The efficiency gap

Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't just accuracy – it's efficiency.

What works
✓ Test-time training (TTT)
✓ Data augmentation
✓ Ensemble methods
✓ Chain-of-thought prompting
What doesn't (yet)
✗ Pure LLMs without TTT
✗ Memorization
✗ More parameters alone
✗ Pure symbolic search

Build

Practical approaches that actually work.

Which approach should I try first?
  • TTT + LLM: Best results right now. Fine-tune per task.
  • DSL + search: More interpretable, harder to scale.
  • Hybrid: LLM generates DSL, search refines.
WINNING RECIPE (2024-2025)
1. Generate synthetic variants of each task (augmentation)
2. Fine-tune a small model on those variants (TTT)
3. Sample multiple solutions
4. Verify against training examples
5. Ensemble multiple models
Result: 40-55% on private eval (steps 1 and 3-5 are sketched in code below)
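
A minimal sketch of step 1, assuming grids are nested lists and using the geometric-plus-color augmentations commonly reported for TTT solvers (the function name and exact scheme are illustrative):

    import random
    import numpy as np

    def augment(task, n_variants=30, seed=0):
        # Apply one random dihedral transform + color permutation to every
        # grid in the task, so the hidden rule is preserved (recipe step 1).
        rng = random.Random(seed)
        variants = []
        for _ in range(n_variants):
            k = rng.randrange(4)            # rotate by k * 90 degrees
            flip = rng.random() < 0.5       # optional left-right flip
            perm = list(range(10))          # permute colors 1-9, keep 0 fixed
            tail = perm[1:]
            rng.shuffle(tail)
            perm[1:] = tail

            def tf(grid, k=k, flip=flip, perm=perm):
                g = np.rot90(np.asarray(grid), k)
                if flip:
                    g = np.fliplr(g)
                return np.asarray(perm)[g].tolist()   # remap colors

            variants.append({
                "train": [{"input": tf(p["input"]), "output": tf(p["output"])}
                          for p in task["train"]],
                "test": [{"input": tf(p["input"])} for p in task["test"]],
            })
        return variants

Fine-tuning on these variants (step 2) is the expensive part and depends on your model stack; steps 3-5 are sketched after the NVARC notes below.
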
NVARC APPROACH (Nov 2024)
  • 8B parameter base model
  • Generate 30-50 augmented examples per task
  • Fine-tune for ~100 steps
  • Sample 8 solutions, majority vote
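
Steps 3-5 in code, under the assumption that each sample from the fine-tuned model is a candidate program (a grid -> grid function) so it can be checked against the training pairs; `sample_program` is a hypothetical stand-in for one stochastic draw:

    from collections import Counter

    def solve_by_sampling(task, sample_program, n_samples=8):
        # Steps 3-5: sample candidates, verify on train pairs, majority-vote.
        programs = [sample_program(task) for _ in range(n_samples)]

        def verified(prog):
            try:
                return all(prog(p["input"]) == p["output"]
                           for p in task["train"])
            except Exception:           # a sampled program may simply crash
                return False

        pool = [p for p in programs if verified(p)] or programs  # fallback
        test_input = task["test"][0]["input"]
        votes = Counter()
        for prog in pool:
            try:
                votes[tuple(map(tuple, prog(test_input)))] += 1
            except Exception:
                continue
        if not votes:
            return test_input           # nothing ran; punt with the input
        best, _ = votes.most_common(1)[0]
        return [list(row) for row in best]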

Code resources

arc-dsl

Domain-specific language for ARC. 150+ primitives for grid manipulation.
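
To give the flavor without reproducing arc-dsl's actual API (the primitive names below are made up for illustration), here is a toy solver composed from a few grid primitives:

    # Toy primitives in the spirit of a grid DSL; names are illustrative,
    # not arc-dsl's real API.
    def rot90(grid):                 # rotate clockwise
        return [list(row) for row in zip(*grid[::-1])]

    def hmirror(grid):               # flip top-to-bottom
        return grid[::-1]

    def recolor(grid, old, new):     # replace one color with another
        return [[new if v == old else v for v in row] for row in grid]

    def compose(*fns):               # chain primitives into a solver
        def run(grid):
            for fn in fns:
                grid = fn(grid)
            return grid
        return run

    # A hypothetical solver: mirror vertically, then recolor 1 -> 2.
    solver = compose(hmirror, lambda g: recolor(g, 1, 2))
    assert solver([[0, 1], [1, 0]]) == [[2, 0], [0, 2]]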

RE-ARC

Synthetic ARC task generator. Create training data for TTT.
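
A sketch of the idea, assuming the goal is to sample fresh input/output pairs for a fixed transformation rule (the interface is illustrative, not RE-ARC's actual one):

    import random

    def generate_pair(rng, rule, size_range=(3, 8)):
        # Sample a random sparse grid and apply a fixed rule to produce
        # one synthetic input/output pair for that rule.
        h, w = rng.randint(*size_range), rng.randint(*size_range)
        grid = [[rng.randint(1, 9) if rng.random() < 0.25 else 0
                 for _ in range(w)] for _ in range(h)]
        return {"input": grid, "output": rule(grid)}

    # Example rule: transpose the grid.
    transpose = lambda g: [list(row) for row in zip(*g)]

    rng = random.Random(0)
    task = {"train": [generate_pair(rng, transpose) for _ in range(3)],
            "test":  [generate_pair(rng, transpose)]}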

Learn

Recommended learning path
1. Play ARC puzzles (15 min)
Get intuition before reading theory.
2. Read "On the Measure of Intelligence" (2h)
Chollet's original paper. Understand the philosophy.
3. Explore winning solutions (1h)
Kaggle notebooks from the 2024 competition.
4. Read the NVARC paper (1h)
Understand the test-time training approach.
5. Try arc-dsl (2h)
Write a simple solver using the DSL.
6. Enter the ARC Prize (ongoing)
$1M+ in prizes. Open to everyone.

Key papers

Paper                            Year   Why read
On the Measure of Intelligence   2019   The foundational paper
NVARC                            2024   TTT approach, top scores
LLM + DSL hybrid                 2023   Neurosymbolic approach
o3 announcement                  2024   What massive compute achieves

Communities

Explore further