
ARC-AGI

The Abstraction and Reasoning Corpus – distilled from 40+ papers
Last updated: 2026-02-14

In one sentence

ARC-AGI tests whether AI can learn new abstractions from a few examples – the kind of reasoning humans do effortlessly but current AI struggles with, making it the benchmark for measuring progress toward general intelligence.

Understand

ARC (Abstraction and Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, researcher at Google; → fchollet.com) to measure machine intelligence in ways that current AI cannot easily game. Chollet designed ARC to measure intelligence as skill-acquisition efficiency.

What is an ARC task?

Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
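
Concretely, each task is a small JSON object. A minimal loading sketch, assuming the public task-file format (grids as lists of rows of color indices 0-9; the filename below is illustrative):

    import json

    # Each task file has "train" and "test" lists of {"input", "output"}
    # pairs; a grid is a list of rows of color indices 0-9.
    with open("training/0a1b2c3d.json") as f:   # illustrative filename
        task = json.load(f)

    for pair in task["train"]:
        inp, out = pair["input"], pair["output"]
        print(f"{len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

    test_input = task["test"][0]["input"]   # the grid you must transform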

Why ARC matters

ARC is designed to be:
  • Resistant to memorization – every evaluation task is novel
  • Solvable with only Core Knowledge priors (objects, counting, geometry)
  • Easy for humans but hard for machines

The key insight

ARC measures skill-acquisition efficiency – how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.

3 numbers that matter

85% – Average human performance on ARC-AGI public eval[1]
55.5% – Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2]
$1M – ARC Prize for 85%+ on private eval with open-source code[1]

Maturity map

Maturity levels: working · promising · experimental · research

APPROACHES
working – TTT (Test-Time Training: train the model on each task at inference time; key technique in top solvers; → Paper) + LLM (test-time compute)
promising – Program synthesis + search
experimental – Neurosymbolic (LLM + DSL)
research – Active inference, object-centric

Terms & Glossary

Definitions for the key terms used throughout this page.

Core concepts
ARC – Abstraction and Reasoning Corpus.
TTT – Test-Time Training: train on each task at inference time.
DSL – Domain-Specific Language for grid transforms.
Core Knowledge – Human priors (objects, counting, geometry).
Approaches
Program synthesis – Generate code that solves tasks.
Neurosymbolic – Combine neural and symbolic reasoning.
MCTS – Monte Carlo Tree Search over DSL programs.
Active inference – Model the world to minimize surprise.
Datasets
ARC-AGI-1 – Original 800 tasks (400 train, 400 eval).
ARC-AGI-2 – Harder variant announced in 2025.
RE-ARC – Synthetic task variants for training.
1D-ARC – Simplified one-dimensional version for research.
Key systems
o3 – OpenAI's reasoning model. 55.5% on ARC.
NVARC – NVIDIA's TTT approach. 54.5% public.
TRM – Test-time Retraining Model (top Kaggle entry).
MindsAI – Team behind TRM; won the 2024 prize.

Reality Check

The uncomfortable truths about current approaches.

The o3 asterisk
  • 55.5% at $10K+ per task – not practical[2]
  • Low-compute config: only 25% (worse than many open approaches)
  • Still far from human-level efficiency
  • Does not qualify for the ARC Prize (not open-source)

Current leaderboard (2025)

System         Public   Private   Approach
o3-high        55.5%    –         Chain-of-thought + massive compute
NVARC          54.5%    –         TTT + augmentation (→ Paper)
TRM (MindsAI)  53.5%    40%       TTT ensemble, 2024 Kaggle winner (→ Kaggle)
Humans         ~85%     ~85%      –

Scores as of Feb 2025. Private eval is the official benchmark.

The efficiency gap

Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't just accuracy – it's efficiency.

What works
✓ Test-time training (TTT)
✓ Data augmentation
✓ Ensemble methods
✓ Chain-of-thought prompting
What doesn't (yet)
✗ Pure LLMs without TTT
✗ Memorization
✗ More parameters alone
✗ Pure symbolic search

Build

Practical approaches that actually work.

Which approach should I try first?
  • TTT + LLM: Best results right now. Fine-tune per task.
  • DSL + search: More interpretable, harder to scale.
  • Hybrid: LLM generates DSL, search refines.
WINNING RECIPE (2024-2025)
1. Generate synthetic variants of each task (augmentation)
2. Fine-tune a small model on those variants (TTT)
3. Sample multiple solutions
4. Verify against training examples
5. Ensemble multiple models
Result: 40-55% on private eval (steps 1 and 3-5 are sketched in code below)
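
A minimal sketch of step 1, assuming grids are nested lists and using the geometric-plus-color augmentations commonly reported for TTT solvers (the function name and exact scheme are illustrative):

    import random
    import numpy as np

    def augment(task, n_variants=30, seed=0):
        # Apply one random dihedral transform + color permutation to every
        # grid in the task, so the hidden rule is preserved (recipe step 1).
        rng = random.Random(seed)
        variants = []
        for _ in range(n_variants):
            k = rng.randrange(4)            # rotate by k * 90 degrees
            flip = rng.random() < 0.5       # optional left-right flip
            perm = list(range(10))          # permute colors 1-9, keep 0 fixed
            tail = perm[1:]
            rng.shuffle(tail)
            perm[1:] = tail

            def tf(grid, k=k, flip=flip, perm=perm):
                g = np.rot90(np.asarray(grid), k)
                if flip:
                    g = np.fliplr(g)
                return np.asarray(perm)[g].tolist()   # remap colors

            variants.append({
                "train": [{"input": tf(p["input"]), "output": tf(p["output"])}
                          for p in task["train"]],
                "test": [{"input": tf(p["input"])} for p in task["test"]],
            })
        return variants

Fine-tuning on these variants (step 2) is the expensive part and depends on your model stack; steps 3-5 are sketched after the NVARC notes below.
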
NVARC APPROACH (Nov 2024)
  • 8B parameter base model
  • Generate 30-50 augmented examples per task
  • Fine-tune for ~100 steps
  • Sample 8 solutions, majority vote
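
Steps 3-5 in code, under the assumption that each sample from the fine-tuned model is a candidate program (a grid -> grid function) so it can be checked against the training pairs; `sample_program` is a hypothetical stand-in for one stochastic draw:

    from collections import Counter

    def solve_by_sampling(task, sample_program, n_samples=8):
        # Steps 3-5: sample candidates, verify on train pairs, majority-vote.
        programs = [sample_program(task) for _ in range(n_samples)]

        def verified(prog):
            try:
                return all(prog(p["input"]) == p["output"]
                           for p in task["train"])
            except Exception:           # a sampled program may simply crash
                return False

        pool = [p for p in programs if verified(p)] or programs  # fallback
        test_input = task["test"][0]["input"]
        votes = Counter()
        for prog in pool:
            try:
                votes[tuple(map(tuple, prog(test_input)))] += 1
            except Exception:
                continue
        if not votes:
            return test_input           # nothing ran; punt with the input
        best, _ = votes.most_common(1)[0]
        return [list(row) for row in best]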

Code resources

arc-dsl

Domain-specific language for ARC. 150+ primitives for grid manipulation.
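
To give the flavor without reproducing arc-dsl's actual API (the primitive names below are made up for illustration), here is a toy solver composed from a few grid primitives:

    # Toy primitives in the spirit of a grid DSL; names are illustrative,
    # not arc-dsl's real API.
    def rot90(grid):                 # rotate clockwise
        return [list(row) for row in zip(*grid[::-1])]

    def hmirror(grid):               # flip top-to-bottom
        return grid[::-1]

    def recolor(grid, old, new):     # replace one color with another
        return [[new if v == old else v for v in row] for row in grid]

    def compose(*fns):               # chain primitives into a solver
        def run(grid):
            for fn in fns:
                grid = fn(grid)
            return grid
        return run

    # A hypothetical solver: mirror vertically, then recolor 1 -> 2.
    solver = compose(hmirror, lambda g: recolor(g, 1, 2))
    assert solver([[0, 1], [1, 0]]) == [[2, 0], [0, 2]]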

RE-ARC

Synthetic ARC task generator. Create training data for TTT.
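
A sketch of the idea, assuming the goal is to sample fresh input/output pairs for a fixed transformation rule (the interface is illustrative, not RE-ARC's actual one):

    import random

    def generate_pair(rng, rule, size_range=(3, 8)):
        # Sample a random sparse grid and apply a fixed rule to produce
        # one synthetic input/output pair for that rule.
        h, w = rng.randint(*size_range), rng.randint(*size_range)
        grid = [[rng.randint(1, 9) if rng.random() < 0.25 else 0
                 for _ in range(w)] for _ in range(h)]
        return {"input": grid, "output": rule(grid)}

    # Example rule: transpose the grid.
    transpose = lambda g: [list(row) for row in zip(*g)]

    rng = random.Random(0)
    task = {"train": [generate_pair(rng, transpose) for _ in range(3)],
            "test":  [generate_pair(rng, transpose)]}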

Learn

Recommended learning path
1. Play ARC puzzles (15 min)
Get intuition before reading theory.
2. Read "On the Measure of Intelligence" (2h)
Chollet's original paper. Understand the philosophy.
3. Explore winning solutions (1h)
Kaggle notebooks from the 2024 competition.
4. Read the NVARC paper (1h)
Understand the test-time training approach.
5. Try arc-dsl (2h)
Write a simple solver using the DSL.
6. Enter the ARC Prize (ongoing)
$1M+ in prizes. Open to everyone.

Key papers

Paper                            Year   Why read
On the Measure of Intelligence   2019   The foundational paper
NVARC                            2024   TTT approach, top scores
LLM + DSL hybrid                 2023   Neurosymbolic approach
o3 announcement                  2024   What massive compute achieves

Communities

Explore further