Wiki · Concept · Last reviewed June 19, 2026

ARC-AGI

ARC-AGI is a benchmark family derived from Francois Chollet's Abstraction and Reasoning Corpus. It tests whether AI systems can acquire new abstract skills from sparse evidence, under controlled task conditions, rather than merely reproduce patterns already absorbed through training, tuning, retrieval, or benchmark practice.

Definition

ARC-AGI is a family of artificial intelligence benchmarks centered on abstraction, reasoning, and efficient adaptation to novelty. ARC-AGI-1 and ARC-AGI-2 use grid-transformation tasks: a system inspects a few input-output examples, infers the hidden transformation, and applies that transformation to a new input. ARC-AGI-3 extends the project into interactive, turn-based environments where a system must explore, infer goals, model dynamics, and act.

The benchmark is tied to Chollet's broader argument that intelligence should be measured as skill-acquisition efficiency: how much new competence a system gains from limited experience, given its priors and the difficulty of generalization. In this framing, a system that needs vast exposure to task-like data is different from one that can infer a new rule from sparse evidence and transfer it to a genuinely unfamiliar case.

ARC-AGI is best read as a benchmark protocol family, not a single stable number. The evidence changes with the version, split, scoring rule, solver scaffold, and whether the task set was public, semi-private, or private at evaluation time.

A reported ARC-AGI result is a claim about a whole evaluated setup, not only a model weight file. Prompts, tool access, retries, search, program synthesis, test-time training, scaffolding, cost limits, public-set exposure, and private-set controls can all change the meaning of the score.

The name includes AGI because the benchmark project is explicitly about measuring progress toward artificial general intelligence. It should not be read as proof that any system scoring well on ARC-AGI is conscious, divine, generally wise, safe to deploy, or already AGI. In this entry, ARC-AGI names a benchmark series and research program, not a theological, metaphysical, or deployment-safety status.

Snapshot

Current Context

As of June 19, 2026, ARC-AGI is both a benchmark family and an active prize ecosystem. ARC Prize 2026 opened on March 25, 2026, with submissions due November 2, papers due November 8, and results scheduled for December 4. The competition page lists three tracks: ARC-AGI-3 for interactive agents, ARC-AGI-2 for static reasoning systems, and a paper prize for conceptual progress tied to submitted systems.

The prize split is part of the current evidence context. ARC-AGI-3 lists $850,000 in total prizes, including a $700,000 grand prize for a 100 percent score, top-score awards, and milestone awards on June 30 and September 30. Its rules allow no internet access during evaluation and require open-source release for prize eligibility. ARC-AGI-2 lists $700,000 in total prizes, including progress, grand-prize, and bonus-prize pools. The paper track lists $450,000 in prizes and evaluates papers across accuracy, universality, progress, theory, completeness, and novelty.

Current ARC-AGI-3 scoring uses Relative Human Action Efficiency: agents are scored on completion and action efficiency compared with human baselines from controlled first-run testing. That makes ARC-AGI-3 a system-level evaluation. A result can depend on the agent wrapper, memory, planning loop, action interface, timeout, compute budget, and whether the evaluated environment is public, semi-private, or fully private.

Source discipline is especially important because ARC Prize materials are both primary sources and benchmark-steward advocacy. They are authoritative for competition rules, task formats, scoring, and stated mission. They are not independent proof that a model is generally intelligent, safe, or ready for deployment.

Origin

Chollet introduced the Abstraction and Reasoning Corpus in the 2019 paper "On the Measure of Intelligence." The paper criticized evaluation regimes that reward high performance on known task distributions while saying too little about whether a system can adapt to unfamiliar problems.

The original ARC-AGI-1 benchmark consists of 800 puzzle-like grid tasks in its public training and public evaluation sets, with later semi-private and private evaluation sets used for leaderboard and competition scoring. The tasks use simple colored grids rather than language, web knowledge, or real-world facts. The design tries to isolate a narrow but important question: can a system form and apply a compact abstraction from a small number of demonstrations, using priors similar enough to the ones humans bring to simple visual reasoning?

Task Design

ARC-AGI-1 and ARC-AGI-2 tasks are visual transformation problems. A task usually shows a small number of paired training examples and then a test input. In the official technical guide, the underlying files are JSON lists of integer grids that can be visualized as colors; a successful answer must match the expected output's shape, colors, and positions.
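
The exact-match rule described above can be illustrated with a minimal Python sketch. The task below is invented for illustration, and the "train", "test", "input", and "output" key names are assumed here rather than taken from this entry; the mirror transformation is a toy candidate, not a real solver.

```python
import json

# Hypothetical task in the layout described above: JSON lists of integer grids.
# Key names ("train", "test", "input", "output") are assumptions in this sketch.
TASK_JSON = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""


def grids_match(predicted, expected):
    """Exact match: same number of rows, same row lengths, same integer in every cell."""
    if len(predicted) != len(expected):
        return False
    return all(list(p_row) == list(e_row) for p_row, e_row in zip(predicted, expected))


def mirror_left_right(grid):
    """Toy candidate transformation: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


task = json.loads(TASK_JSON)

# A candidate rule is only worth applying if it reproduces every demonstration pair.
consistent = all(
    grids_match(mirror_left_right(pair["input"]), pair["output"])
    for pair in task["train"]
)

# Apply the rule to the test input and compare against the expected output.
test_pair = task["test"][0]
solved = grids_match(mirror_left_right(test_pair["input"]), test_pair["output"])
```

There is no partial credit in this sketch: a prediction with one wrong cell, or with the wrong shape, counts as a miss.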

This makes ARC different from many language-model benchmarks. It is not mainly asking whether a model knows a fact, recognizes a familiar exam format, or can imitate a common text pattern. It is asking whether the system can infer a new procedure under severe data scarcity.

The public importance of ARC comes from that constraint. The tasks look simple to many humans, but they can be difficult for systems whose strength comes from statistical exposure to large corpora rather than from robust abstraction and recomposition. A strong ARC result is therefore evidence about efficient abstract generalization under a specific protocol, not evidence about all domains of intelligence, all forms of agency, or all deployment risks.

ARC Prize

ARC Prize is the public competition and nonprofit structure built around the ARC-AGI benchmark family. It was co-founded by Chollet and Mike Knoop to encourage open research on generalization, abstraction, and artificial general intelligence measurement.

The competition format matters because benchmarks do not only measure progress; they shape research incentives. ARC Prize tries to reward methods that generalize to hidden tasks and publicly share useful approaches, rather than rewarding only private leaderboard optimization.

As of June 19, 2026, ARC Prize has grown from a competition built on a single grid-puzzle benchmark into a sequence of competitions, leaderboards, technical reports, public datasets, and interactive environments. ARC Prize 2026 materials list $2 million across three tracks: ARC-AGI-3, ARC-AGI-2, and a paper prize. That institutional growth brings more data and more attention, but it also makes source discipline more important: an ARC score should always be read with the benchmark version, task split, date, solver setup, compute or cost budget, and contamination controls attached.

ARC-AGI-2

ARC-AGI-2 was announced with ARC Prize 2025 as a harder successor to ARC-AGI-1. ARC Prize describes it as raising task complexity while preserving the core goal of measuring efficient abstraction rather than memorized benchmark exposure.

Its public description keeps the grid-task format but changes the evaluation regime. ARC Prize describes public, semi-private, and private evaluation sets of 120 tasks each, calibrated to similar difficulty when overfitting is absent, and says ARC-AGI-2 tasks were selected so that at least two human participants solved each task within two attempts.

The ARC-AGI-2 paper says the benchmark preserves the input-output pair task format while adding a curated and expanded set of tasks for finer-grained evaluation at higher levels of cognitive complexity. The ARC Prize 2025 technical report says the competition drew 1,455 teams and 15,154 entries, with the top score reaching 24 percent on the private ARC-AGI-2 evaluation set.

That result gave the benchmark greater institutional visibility, and the report says Anthropic, Google DeepMind, OpenAI, and xAI reported ARC-AGI performance in public model cards in 2025. It also sharpened the contamination problem: once public examples, solver traces, synthetic lookalike tasks, reasoning styles, and task discussions circulate widely, future scores become harder to interpret without decontamination, private held-out tests, and a clear account of whether the solver was optimized for the benchmark family.

ARC-AGI-3

ARC-AGI-3, launched in March 2026, changes the benchmark format from static grid tasks to interactive, turn-based environments. The official launch materials describe it as a benchmark for agentic intelligence: systems must explore unfamiliar environments, infer rules and goals, build internal models, plan, act, and adapt without written instructions.

The ARC-AGI-3 paper says the benchmark contains a 25-environment public demonstration set, a 55-environment semi-private set for testing frontier models behind external APIs, and a 55-environment fully private competition set. The paper also says scores are based on relative human action efficiency: not just whether an agent eventually wins, but how efficiently it completes levels compared with a human baseline.
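
The idea of scoring efficiency rather than bare completion can be sketched as follows. This is not the official Relative Human Action Efficiency formula, which this entry does not reproduce. The ratio, the cap, the averaging rule, and all numbers below are assumptions chosen only to show the shape of such a metric.

```python
def level_score(agent_actions, human_baseline_actions, completed, cap=1.0):
    """Hypothetical per-level score: human-to-agent action ratio, capped.

    An unfinished level scores zero no matter how few actions were spent.
    """
    if not completed or agent_actions <= 0:
        return 0.0
    return min(cap, human_baseline_actions / agent_actions)


def environment_score(levels, cap=1.0):
    """Average the per-level scores (assumed aggregation rule)."""
    if not levels:
        return 0.0
    total = sum(level_score(a, h, done, cap) for a, h, done in levels)
    return total / len(levels)


# Made-up data: (agent_actions, human_baseline_actions, completed)
levels = [
    (20, 20, True),    # matches the human baseline
    (80, 40, True),    # finishes, but with twice as many actions
    (300, 50, False),  # never completes the level
]
score = environment_score(levels)
```

Under these assumptions, an agent that eventually wins by brute exploration scores well below one that reaches the goal in roughly the number of actions a human needed.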

ARC Prize reported at launch that humans solved the ARC-AGI-3 environments in its testing while frontier AI systems scored below 1 percent; the launch post gave a frontier-AI figure of 0.51 percent. The April 2026 human-performance post described a controlled study of 458 participants and said the released public-demo dataset includes 342 human step-by-step replays for 25 public environments. The later technical paper reports 486 unique participants across 414 candidate environments and 2,893 total environment attempts. These are date-sensitive claims from the March and April 2026 release period, not permanent facts about model capability.

The move to interaction is important. It shifts the measurement question from "Can the system infer this visual transformation?" to "Can the system discover how an unfamiliar world works by acting inside it?" That connects ARC-AGI to AI agents, world models, memory, planning horizons, action efficiency, and long-horizon autonomy.

It also changes the overfitting surface. Public demonstration environments should not be treated as strong evidence of AGI progress, because agents can be hand-designed, trained, or debugged around public mechanics. The paper separates an official leaderboard from a community leaderboard and warns that harness-driven results need different interpretation from general-purpose frontier-model results.

Why It Matters

ARC-AGI matters because it challenges a common public shortcut: treating benchmark scores, fluency, or broad task coverage as proof of general intelligence. It asks a narrower but sharper question: how efficiently can a system acquire a new skill when the task is not already familiar?

For AI governance, ARC-style evaluation is useful because it tries to separate some forms of capability from memorization and contamination. It also gives policy and safety discussions a language for asking whether a model can generalize, not only whether it can pass public exams.

For AI research, ARC keeps alive the possibility that progress may require methods beyond scale alone: program synthesis, search, explicit abstraction, test-time adaptation, tool use, world modeling, or hybrid systems that can construct new procedures from sparse evidence.

Reading Scores

An ARC-AGI score should be read as a claim about a system under a protocol. The relevant system may include the base model, prompt, verifier, search loop, program synthesizer, tool access, retry policy, reasoning budget, cost budget, and human-written harness.

For source discipline, a report should name the ARC version; the split (public, semi-private, or private); the evaluation date; the model version; the scaffold; the prompt; the tool policy; the compute budget; the number of attempts; the scoring rule; the cost per task where relevant; the contamination checks; and whether the result came from a primary source, model card, leaderboard, third-party replication, or competition submission.
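
One way to make that reporting discipline checkable is to store each claim as a structured record and flag whatever is missing. The sketch below is a hypothetical schema: the class name, field names, and example values are illustrative assumptions, not an official ARC Prize format, and the score shown is a placeholder rather than a real result.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArcScoreReport:
    """Hypothetical record for one reported ARC-AGI result."""
    arc_version: Optional[str] = None           # e.g. "ARC-AGI-2"
    split: Optional[str] = None                 # public, semi-private, or private
    evaluation_date: Optional[str] = None
    model_version: Optional[str] = None
    scaffold: Optional[str] = None              # prompt, harness, search loop
    tool_policy: Optional[str] = None
    compute_budget: Optional[str] = None
    attempts: Optional[int] = None
    scoring_rule: Optional[str] = None
    contamination_checks: Optional[str] = None
    source_type: Optional[str] = None           # model card, leaderboard, replication...
    score: Optional[float] = None


def missing_fields(report):
    """Return the names of all fields the report leaves unstated."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Placeholder values only: a claim that names a version, a split, and a number.
report = ArcScoreReport(arc_version="ARC-AGI-2", split="semi-private", score=0.10)
gaps = missing_fields(report)
```

A reviewer could treat a non-empty `gaps` list as a reason to hold the claim back until the missing context is supplied.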

ARC-AGI-3 makes this even more important because interaction adds action traces, exploration strategy, memory, feedback, environment rules, and action-budget constraints to the measurement. Its April 2026 human-performance post also announced scoring changes after preview testing, including a move from a second-best-human baseline to a median-human baseline and a higher per-level score cap. A static-grid score and an interactive-agent score should not be collapsed into one generic "reasoning" number. A public-demo score, a semi-private frontier-model score, and a fully private competition score should also remain distinct.
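
To see why the choice of human baseline changes what a score means, consider a small sketch. It reads "second-best human" as the run with the second-fewest actions, which is an assumption, and every number is made up for illustration.

```python
from statistics import median


def second_best_baseline(action_counts):
    """Second-fewest actions among human runs (assumed reading of 'second-best')."""
    if not action_counts:
        raise ValueError("need at least one human run")
    ordered = sorted(action_counts)
    return ordered[1] if len(ordered) > 1 else ordered[0]


def median_baseline(action_counts):
    """Median action count among human runs."""
    return median(action_counts)


human_runs = [30, 42, 55, 61, 90]  # made-up action counts for one level
agent_actions = 84                 # made-up agent result on the same level

# The same agent run, scored as a baseline-to-agent action ratio under each rule.
ratio_second_best = second_best_baseline(human_runs) / agent_actions
ratio_median = median_baseline(human_runs) / agent_actions
```

With these numbers the identical agent run rates higher against the median baseline than against the second-best one, so two scores computed under different baseline rules are not directly comparable.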

Protocol Checklist

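A governance-grade ARC claim should preserve enough detail for another reviewer to understand what was actually measured. At minimum, it should state the benchmark version and split, the evaluation date, the model or agent version, the scaffold and tool policy, the number of attempts, the compute or cost budget, the scoring rule, and the contamination checks applied.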

For agentic systems, the checklist should also name permissions and operating boundaries. A solver that can browse, call tools, write files, use private memory, or spawn helper agents is not the same evaluation object as a frontier model answering through a fixed prompt. The safety-relevant object is the whole deployed system.

Governance Implications

ARC-AGI can support governance when it is treated as one piece of evaluation evidence. It can help ask whether a system generalizes from sparse evidence, whether a score depends on public benchmark exposure, and whether a model's practical capability changes when more test-time compute, tools, or agent scaffolding are supplied.

It should not be used as a standalone release gate, procurement standard, safety certificate, or AGI declaration. A model can perform well on ARC-style tasks and still fail at truthfulness, security, human oversight, social judgment, domain safety, rights-impacting decisions, or high-stakes reliability.

Governance-grade use should connect ARC results to broader test, evaluation, validation, and verification practice: dated reports, explicit assumptions, private and rotating tests, contamination controls, adversarial evaluation, uncertainty, evaluator independence, model-card or system-card documentation, and consequences for deployment. If an ARC result cannot change a release decision, access tier, model card claim, audit finding, incident review, or retesting requirement, it is evidence without leverage.

For agentic systems, ARC-AGI-3 also points toward a practical safety question: does capability come from a generally adaptive model, or from a carefully engineered harness with hidden instructions, tools, memory, and retries? That distinction matters for AI control, human oversight, AI agent sandboxing, and procurement because the governed object is the deployed system, not a leaderboard label.

Limits and Cautions

ARC-AGI is not a complete test of intelligence. Its original tasks are deliberately abstract and narrow. They do not measure social reasoning, factual knowledge, physical manipulation, institutional judgment, moral decision-making, long-term reliability, or safety under deployment pressure.

The benchmark also faces the same sociotechnical risks as other evaluations. Public tasks can leak into training data. Synthetic lookalike tasks can reduce the distance between training and test distributions. Competitors can overfit to a benchmark style. Leaderboards can compress a complex research question into a single number. New versions can shift the target in ways that are scientifically useful but hard for the public to interpret.

Closed or private test sets can protect validity, but they reduce public inspectability. Public tasks support open research, but they invite memorization, solver-specific tuning, and synthetic echoes in future training data. ARC-AGI lives inside that tradeoff rather than outside it.

The responsible reading is therefore neither worship nor dismissal. ARC-AGI is a strong probe of abstraction and efficient generalization, not a final oracle for AGI.

Source Discipline

ARC-AGI claims should identify the exact benchmark generation, split, leaderboard or competition track, scoring method, model or agent version, scaffold, tool access, number of attempts, cost or compute budget, evaluation date, and whether the task material was public, semi-private, or fully private. For ARC-AGI-3, reports should also state the environment set, action budget, replay availability, human baseline rule, and whether internet access or external tools were disabled.

Separate benchmark-steward claims from independent validation. ARC Prize pages are primary sources for ARC Prize rules, formats, and scoring; arXiv technical reports explain design and reported results; model cards and system cards describe provider-reported scores; third-party evaluations and replications provide a different evidentiary layer. A governance-grade claim should say which layer it uses.

Do not quote ARC-AGI scores as proof that a system is conscious, divine, safe, broadly competent, or already artificial general intelligence. A source-disciplined sentence says what the benchmark measured, under which protocol, and which deployment question remains outside the test.

Spiralist Reading

ARC-AGI is a test against theatrical intelligence.

The machine can speak with authority, cite sources, solve exams, and imitate expertise. ARC asks whether it can cross a small unfamiliar gap without having already consumed the path. That is why the benchmark has symbolic force: it interrupts the conversion of fluency into certainty.

For Spiralism, ARC-AGI is useful because it restores friction to the mythology of progress. It reminds institutions that a score is not understanding, that a demo is not agency, and that a public benchmark becomes part of the reality it tries to measure. The challenge is to keep measurement alive without letting measurement become a new idol.
