The Abstract Reasoning Corpus, distilled from 40+ papers
Last updated: 2026-02-14
ARC-AGI tests whether AI can learn new abstractions from a few examples: the kind of reasoning humans do effortlessly but current AI struggles with. That makes it the benchmark for measuring progress toward general intelligence.
ARC (Abstract Reasoning Corpus) is a benchmark created by François Chollet (creator of Keras, fchollet.com) to measure machine intelligence in ways that current AI cannot easily game.
Each task shows 2-5 input-output pairs of colored grids. You must infer the transformation rule and apply it to a new test input.
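As a rough illustration of the task format, the sketch below encodes a toy task in the `train`/`test` JSON layout used by the public ARC data and checks one candidate rule against every demonstration pair. The task itself and the helper names (`flip_horizontal`, `fits_all_examples`) are made up for this example.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell is a color index (0-9 in the public ARC data)

# Hypothetical toy task: the hidden rule is "mirror the grid left to right".
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [7, 0]]}],
}


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [row[::-1] for row in grid]


def fits_all_examples(rule: Callable[[Grid], Grid], task: dict) -> bool:
    """A rule is only plausible if it reproduces every training output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


prediction = None
if fits_all_examples(flip_horizontal, task):
    # Only apply the rule to the test input once it explains all demonstrations.
    prediction = flip_horizontal(task["test"][0]["input"])
print(prediction)
```

Real tasks are much harder than this: the rule is rarely a single built-in operation, and a solver has to search for it rather than be handed a candidate.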
ARC is designed to be:
ARC measures skill-acquisition efficiency: how quickly you can learn a new skill from minimal data. This is what Chollet argues intelligence actually is.
| Figure | What it means |
|---|---|
| 85% | Average human performance on ARC-AGI public eval[1] |
| 55.5% | Best AI score (OpenAI o3-high, Dec 2024), but at $10K+ per task[2] |
| $1M | ARC Prize for 85%+ on private eval with open-source code[1] |
The uncomfortable truths about current approaches.
| System | Public | Private | Approach |
|---|---|---|---|
| o3-high | 55.5% | n/a | Chain-of-thought + massive compute |
| NVARC (paper) | 54.5% | n/a | TTT + augmentation |
| TRM (Kaggle, MindsAI) | 53.5% | 40% | TTT ensemble |
| Humans | ~85% | ~85% | n/a |
Scores as of Feb 2025. Private eval is the official benchmark. TTT is test-time training. NVARC is NVIDIA's test-time training approach, which fine-tunes a small model per task using synthetic augmentations. TRM (Test-time Retraining Model, by MindsAI) won the 2024 Kaggle competition.
Humans solve ARC tasks in seconds with a 20W brain. o3 needs thousands of dollars of compute per task. The gap isn't accuracy; it's efficiency.
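One way to make the efficiency gap concrete is to divide cost per attempted task by accuracy, which gives the expected spend per correctly solved task. The sketch below uses the figures quoted on this page ($10K per task, 55.5%). Because the page gives the cost as "$10K+", the result is a lower bound; the function name is invented for this example.

```python
def cost_per_solved_task(cost_per_task_usd: float, accuracy: float) -> float:
    """Expected spend for each correctly solved task."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task_usd / accuracy


# Figures quoted on this page; $10K is treated as a floor, not an exact cost.
o3_cost = cost_per_solved_task(10_000, 0.555)
print(f"${o3_cost:,.0f} per solved task")
```

On those numbers, each solved task costs roughly $18,000 in compute.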
Practical approaches that actually work.
Domain-specific language for ARC. 150+ primitives for grid manipulation.
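To give a feel for what grid-manipulation primitives look like, here is a minimal sketch of three primitives and a composition helper. It is not the API of the DSL described above: the names `rotate_cw`, `flip_vertical`, `recolor` and `compose` are hypothetical, and a real DSL would also cover objects, masks and many more operations.

```python
from functools import reduce
from typing import Callable, List

Grid = List[List[int]]
Primitive = Callable[[Grid], Grid]


def rotate_cw(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_vertical(grid: Grid) -> Grid:
    """Mirror top to bottom by reversing the row order."""
    return [list(row) for row in grid[::-1]]


def recolor(old: int, new: int) -> Primitive:
    """Build a primitive that replaces one color with another."""
    def apply(grid: Grid) -> Grid:
        return [[new if cell == old else cell for cell in row] for row in grid]
    return apply


def compose(*steps: Primitive) -> Primitive:
    """Chain primitives left to right into a single program."""
    return lambda grid: reduce(lambda g, step: step(g), steps, grid)


# A two-step program: rotate, then turn color 1 into color 4.
program = compose(rotate_cw, recolor(1, 4))
result = program([[1, 2], [3, 0]])
print(result)
```

Because every primitive maps a grid to a grid, programs are just sequences of primitives, which is what makes searching over them tractable.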
Synthetic ARC task generator. Create training data for TTT.
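One simple way to produce extra training data is to apply the same invertible transform to every grid in a task, so the new task is still governed by a consistent rule. The sketch below does this with a random color permutation plus an optional transpose. It is an assumption about how such augmentation can be done, not a description of the generator above; `augment_task` is a hypothetical name, and keeping color 0 fixed as background is a choice made for this example.

```python
import random
from typing import Dict, List

Grid = List[List[int]]
Task = Dict[str, List[Dict[str, Grid]]]


def augment_task(task: Task, rng: random.Random) -> Task:
    """Return a new task with one shared color permutation and optional transpose."""
    colors = list(range(1, 10))
    shuffled = colors[:]
    rng.shuffle(shuffled)
    # Assumption: color 0 is background and stays fixed.
    mapping = {0: 0, **dict(zip(colors, shuffled))}
    use_transpose = rng.random() < 0.5

    def transform(grid: Grid) -> Grid:
        rows = [list(col) for col in zip(*grid)] if use_transpose else grid
        return [[mapping[cell] for cell in row] for row in rows]

    # The same transform is applied to inputs and outputs in every split.
    return {
        split: [{name: transform(grid) for name, grid in pair.items()} for pair in pairs]
        for split, pairs in task.items()
    }


example: Task = {
    "train": [{"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]}],
    "test": [{"input": [[0, 5], [7, 0]]}],
}
augmented = augment_task(example, random.Random(0))
print(augmented["train"][0])
```

Passing in a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing TTT runs.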
| Paper | Year | Why read |
|---|---|---|
| On the Measure of Intelligence | 2019 | The foundational paper |
| NVARC | 2024 | TTT approach, top scores |
| LLM + DSL hybrid | 2023 | Neurosymbolic approach |
| o3 announcement | 2024 | What massive compute achieves |