A benchmark testing artificial reasoning and abstraction ability through novel visual puzzles
ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark and open-source dataset created by François Chollet in 2019 to evaluate artificial intelligence systems on their ability to reason abstractly and solve novel problems through pattern recognition rather than memorization. The benchmark consists of hundreds of visual puzzles in which an AI system is shown a small number of input-output examples, must infer the underlying rule or transformation, and must then apply it to test inputs it has never seen. Each puzzle involves colored grids and spatial transformations—for instance, given three examples of how a shape is rotated or filled, predict the output for a fourth, unseen input.
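The task structure can be sketched in a few lines of Python. ARC tasks are distributed as JSON with "train" and "test" example pairs, and grids are encoded as nested lists of integer color codes; the 90-degree-rotation rule below is an invented toy example, far simpler than real ARC puzzles.

```python
# Illustrative ARC-style task. Grids are lists of rows of color codes (0-9).
# The hidden rule in this toy task is a 90-degree clockwise rotation; a real
# puzzle's transformation must be inferred from the train pairs alone.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[2, 0], [0, 0]]},
    ],
    "test": [{"input": [[0, 3], [0, 0]]}],
}

def rotate_cw(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

# A solver must find a rule consistent with every train pair...
assert all(rotate_cw(ex["input"]) == ex["output"] for ex in task["train"])
# ...then apply it to the unseen test input.
print(rotate_cw(task["test"][0]["input"]))  # → [[0, 0], [0, 3]]
```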
The design of ARC-AGI is deliberately adversarial to large language models and memorization-based approaches. Puzzles are hand-crafted to be simple for humans (often solvable by children) but to demand genuine reasoning from machines: identifying abstract patterns, generalizing from few examples, and avoiding overfitting to surface-level features. Test sets are kept private to prevent benchmark gaming. Crucially, the puzzles are designed to be novel: no training data contains the exact transformation, forcing the system to rely on reasoning rather than pattern matching. The original dataset (v1.0) contained 400 public training puzzles and 400 public evaluation puzzles, alongside private test sets; subsequent versions (ARC-AGI-2 and ARC-AGI-3, released in 2025-2026) expand the corpus and increase difficulty.
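One common baseline strategy for such puzzles is program search: enumerate candidate transformations and keep whichever is consistent with every training pair. The sketch below is a deliberately minimal illustration with an invented four-primitive search space; real solvers explore far richer spaces of composed programs.

```python
# Candidate transformation primitives (an invented, tiny search space).
def rot90(g):
    """Rotate 90 degrees clockwise."""
    return [list(r) for r in zip(*g[::-1])]

def flip_h(g):
    """Mirror left-right."""
    return [r[::-1] for r in g]

def flip_v(g):
    """Mirror top-bottom."""
    return [list(r) for r in g[::-1]]

def identity(g):
    return [list(r) for r in g]

CANDIDATES = {"rot90": rot90, "flip_h": flip_h, "flip_v": flip_v, "identity": identity}

def induce_rule(train_pairs):
    """Return the name of the first candidate consistent with every train pair,
    or None if no candidate explains all the examples."""
    for name, fn in CANDIDATES.items():
        if all(fn(p["input"]) == p["output"] for p in train_pairs):
            return name
    return None

pairs = [{"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]}]
print(induce_rule(pairs))  # → flip_h
```

With so few examples, several rules can fit the data equally well, which is exactly the generalization pressure ARC-AGI is built around: more training pairs prune spurious hypotheses.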
ARC-AGI has become a proxy for measuring progress toward artificial general intelligence because it tests abstraction ability—the hallmark of flexible, human-like reasoning. In 2024, Chollet announced the ARC Prize, a $2 million competition for the first system to solve 85% of the private test set. As of late 2025, the best AI systems score less than 1% on ARC-AGI-3, while humans easily exceed 85%. This gap highlights a critical limitation of current AI: exceptional performance on benchmarks like ImageNet or MMLU contrasts sharply with near-total failure on novel reasoning tasks. ARC-AGI has influenced benchmark design across the field and galvanized research into inductive reasoning, few-shot learning, and genuine abstraction.