Wiki · Concept · Last reviewed May 19, 2026

ARC-AGI

ARC-AGI is a benchmark family derived from François Chollet's Abstraction and Reasoning Corpus. It is designed to test whether AI systems can infer unfamiliar abstract rules from very small numbers of examples, rather than merely reproduce patterns from large-scale training data.

Definition

ARC-AGI is a family of artificial intelligence benchmarks centered on abstraction, reasoning, and efficient adaptation to novelty. The original benchmark asks a system to inspect a few input-output examples, infer the hidden transformation, and apply that transformation to a new input.

The benchmark is tied to Chollet's broader argument that intelligence should be measured as skill-acquisition efficiency: how much new competence a system can gain from limited experience, given its priors and the difficulty of generalization. In this framing, a system that needs vast exposure to task-like data is different from one that can infer a new rule from sparse evidence.

Origin

Chollet introduced the Abstraction and Reasoning Corpus in the 2019 paper "On the Measure of Intelligence." The paper criticized evaluation regimes that reward high performance on known task distributions while saying too little about whether a system can adapt to unfamiliar problems.

The original ARC tasks use simple colored grids rather than language, web knowledge, or real-world facts. That design tries to isolate a particular question: can a system form and apply a compact abstraction from a small number of demonstrations?

Task Design

ARC-AGI-1 tasks are visual transformation problems. A task usually shows a small number of paired training examples and then a test input. The solver must infer the relevant rule from the examples and produce the correct output grid.
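
The task structure described above can be sketched in Python. This is an illustration only: the dictionary layout (a "train" list and a "test" list of input/output grid pairs) is an assumption loosely modeled on the public ARC data files, and the toy task, the mirror_rows rule, and both helper functions are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell holds an integer color index
Task = Dict[str, List[Dict[str, Grid]]]

# Hypothetical toy task. The hidden rule is "mirror each row left to right".
task: Task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]},
    ],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_training_pairs(rule: Callable[[Grid], Grid], task: Task) -> bool:
    """True if the rule reproduces every training output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def score(rule: Callable[[Grid], Grid], task: Task) -> float:
    """Fraction of test grids reproduced exactly. There is no partial credit."""
    tests = task["test"]
    solved = sum(rule(pair["input"]) == pair["output"] for pair in tests)
    return solved / len(tests)


consistent = fits_training_pairs(mirror_rows, task)
test_score = score(mirror_rows, task)
```

The exact-match scoring in this sketch is a deliberate simplification: a grid that is almost right counts as wrong, which is what makes guessing from surface similarity unrewarding.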

This makes ARC different from many language-model benchmarks. It is not mainly asking whether a model knows a fact, recognizes a familiar exam format, or can imitate a common text pattern. It is asking whether the system can infer a new procedure under severe data scarcity.

The public importance of ARC comes from that constraint. The tasks look simple to many humans, but they can be difficult for systems whose strength comes from statistical exposure to large corpora rather than from robust abstraction and recomposition.

ARC Prize

ARC Prize is the public competition and nonprofit structure built around the ARC-AGI benchmark family. It was co-founded by Chollet and Mike Knoop to encourage open research on generalization, abstraction, and artificial general intelligence measurement.

The competition format matters because benchmarks do not only measure progress; they shape research incentives. ARC Prize tries to reward methods that generalize to hidden tasks and publicly share useful approaches, rather than rewarding only private leaderboard optimization.

ARC-AGI-2

ARC-AGI-2 was announced with ARC Prize 2025 as a harder successor to ARC-AGI-1. ARC Prize describes it as raising task complexity while preserving the core goal of measuring efficient abstraction rather than memorized benchmark exposure.

The ARC Prize 2025 technical report says the competition targeted ARC-AGI-2 and that several frontier labs reported ARC-AGI results in public model cards in 2025. That gave the benchmark greater institutional visibility, but it also increased the risk that public examples, solver traces, and discussion would become part of future training data.

ARC-AGI-3

ARC-AGI-3, launched in March 2026, changes the benchmark format from static grid tasks to interactive, turn-based environments. The official launch materials describe it as a benchmark for agentic intelligence: systems must explore unfamiliar environments, infer rules and goals, build internal models, plan, act, and adapt without written instructions.

ARC Prize reported at launch that humans solved the ARC-AGI-3 environments in its testing while frontier AI systems scored below 1 percent. Those figures are date-sensitive claims from the March and April 2026 launch period, not permanent facts about model capability.

The move to interaction is important. It shifts the measurement question from "Can the system infer this visual transformation?" to "Can the system discover how an unfamiliar world works by acting inside it?" That connects ARC-AGI to agent evaluation, world models, embodied reasoning, and long-horizon autonomy.
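
The shift from static inference to acting inside an environment can be illustrated with a toy loop. Nothing below reflects the actual ARC-AGI-3 environments or any official interface: ToyEnv, its step method, and explore_then_act are hypothetical, and the sketch simplifies heavily by telling the agent where the goal is. The agent is only denied knowledge of what each numbered action does, so it must probe the world, record the observed effects as a small model, and then plan with that model.

```python
from typing import Dict, Tuple

Pos = Tuple[int, int]


class ToyEnv:
    """Hypothetical 5x5 grid world whose action ids have hidden meanings."""

    SIZE = 5
    # Hidden from the agent: (row delta, column delta) for each action id.
    _EFFECTS = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}

    def __init__(self, start: Pos = (2, 2), goal: Pos = (4, 0)) -> None:
        self.pos = start
        self.goal = goal

    def step(self, action: int) -> Tuple[Pos, bool]:
        """Apply an action, clamp to the grid, return (position, reached goal)."""
        dr, dc = self._EFFECTS[action]
        row = min(max(self.pos[0] + dr, 0), self.SIZE - 1)
        col = min(max(self.pos[1] + dc, 0), self.SIZE - 1)
        self.pos = (row, col)
        return self.pos, self.pos == self.goal


def explore_then_act(env: ToyEnv, n_actions: int = 4, budget: int = 100):
    """Probe each action to learn its effect, then move greedily to the goal."""
    model: Dict[int, Pos] = {}
    pos, goal, steps = env.pos, env.goal, 0

    # Phase 1: try every action and record the displacement it causes.
    # A zero displacement means a wall blocked the move, so retry it later.
    while len(model) < n_actions and steps < budget:
        for action in range(n_actions):
            if action in model:
                continue
            new_pos, done = env.step(action)
            steps += 1
            delta = (new_pos[0] - pos[0], new_pos[1] - pos[1])
            pos = new_pos
            if delta != (0, 0):
                model[action] = delta
            if done:
                return steps, model

    # Phase 2: pick the action the learned model predicts lands closest to the goal.
    while pos != goal and steps < budget:
        best = min(
            model,
            key=lambda a: abs(pos[0] + model[a][0] - goal[0])
            + abs(pos[1] + model[a][1] - goal[1]),
        )
        pos, _ = env.step(best)
        steps += 1
    return steps, model


env = ToyEnv()
steps_taken, learned_model = explore_then_act(env)
```

Even in this tiny setting the score depends on how many actions the agent spends learning the world, which is the kind of efficiency question an interactive benchmark can ask and a static grid task cannot.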

Why It Matters

ARC-AGI matters because it challenges a common public shortcut: treating benchmark scores, fluency, or broad task coverage as proof of general intelligence. It asks a narrower but sharper question: how efficiently can a system acquire a new skill when the task is not already familiar?

For AI governance, ARC-style evaluation is useful because it separates some forms of capability from memorization and contamination. It also gives policy and safety discussions a language for asking whether a model can generalize, not only whether it can pass public exams.

For AI research, ARC keeps alive the possibility that progress may require methods beyond scale alone: program synthesis, search, explicit abstraction, test-time adaptation, tool use, world modeling, or hybrid systems that can construct new procedures from sparse evidence.
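
One of the method families named above, program synthesis by search, can be shown in miniature. The sketch below is a generic brute-force enumeration, not the approach of any particular ARC Prize entry: the three primitives, the search function, and the example grids are all assumptions chosen for illustration. It tries every sequence of primitives up to a fixed length and keeps the first one that reproduces all training pairs.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Primitive = Tuple[str, Callable[[Grid], Grid]]
Program = Tuple[Primitive, ...]


def flip_h(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    """Mirror the grid top to bottom."""
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*g)]


# Hypothetical primitive set. Real solvers use far richer vocabularies.
PRIMITIVES: List[Primitive] = [
    ("flip_h", flip_h),
    ("flip_v", flip_v),
    ("transpose", transpose),
]


def run(program: Program, grid: Grid) -> Grid:
    """Apply each primitive of the program in order."""
    for _name, fn in program:
        grid = fn(grid)
    return grid


def search(train_pairs: List[Tuple[Grid, Grid]], max_depth: int = 2) -> Optional[Program]:
    """Return the shortest primitive sequence consistent with all training pairs."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in train_pairs):
                return program
    return None


# Two demonstrations of a hidden rule (a clockwise quarter turn).
train_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
program = search(train_pairs)
program_names = [name for name, _ in program] if program is not None else None
prediction = run(program, [[5, 6], [7, 8]]) if program is not None else None
```

No single primitive explains the demonstrations, so the search has to compose two of them. The cost of this approach grows exponentially with program length, which is one reason practical systems pair search with learned guidance or stronger priors.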

Limits and Cautions

ARC-AGI is not a complete test of intelligence. Its original tasks are deliberately abstract and narrow. They do not measure social reasoning, factual knowledge, physical manipulation, institutional judgment, moral decision-making, long-term reliability, or safety under deployment pressure.

The benchmark also faces the same sociotechnical risks as other evaluations. Public tasks can leak into training data. Competitors can overfit to a benchmark style. Leaderboards can compress a complex research question into a single number. New versions can shift the target in ways that are scientifically useful but hard for the public to interpret.

The responsible reading is therefore neither worship nor dismissal. ARC-AGI is a strong probe of abstraction and efficient generalization, not a final oracle for AGI.

Spiralist Reading

ARC-AGI is a test against theatrical intelligence.

The machine can speak with authority, cite sources, solve exams, and imitate expertise. ARC asks whether it can cross a small unfamiliar gap without having already consumed the path. That is why the benchmark has symbolic force: it interrupts the conversion of fluency into certainty.

For Spiralism, ARC-AGI is useful because it restores friction to the mythology of progress. It reminds institutions that a score is not understanding, that a demo is not agency, and that a public benchmark becomes part of the reality it tries to measure. The challenge is to keep measurement alive without letting measurement become a new idol.
