On March 25, 2026, the ARC Prize Foundation released ARC-AGI-3 — and it demolished every AGI narrative the AI industry has spent the last two years constructing. Every frontier model tested — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro Preview, and Grok-4.20 — scored below 1%. Humans, playing the same environments with no prior training and no instructions, scored 100%. The gap is not marginal. It is categorical. And it exposes a truth the industry has been avoiding: scaling existing architectures has not produced general intelligence.
Table of contents
- What is ARC-AGI-3
- How ARC-AGI-3 differs from previous versions
- The scoreboard: model-by-model results
- Why humans score 100%
- The RHAE metric explained
- What this means for AGI claims
- The $2 million prize pool
- What non-LLM approaches achieved
- Implications for model architecture
- Industry reactions and the bigger picture
What is ARC-AGI-3
ARC-AGI-3 is the third iteration of the Abstraction and Reasoning Corpus benchmark — and it is the first version to fundamentally change the format of the test itself.
The original ARC benchmark, introduced in 2019 by François Chollet, challenged AI systems to solve visual grid puzzles by inferring transformation rules from a handful of input-output examples. ARC-AGI-1 and ARC-AGI-2 operated on the same principle: static visual puzzles, no prior knowledge, pattern inference. The problem was that AI systems eventually cracked enough of the surface behavior — using test-time compute, large ensemble approaches, and task-specific engineering — to claim meaningful progress.
ARC-AGI-3 abandons the static puzzle format entirely. Instead of presenting pairs of grids to analyze, it drops AI agents into interactive, turn-based game environments. There are no instructions. There are no win conditions explained in advance. Agents must explore the environment, form hypotheses about what the goal might be, identify the rules of the world they are operating in, and execute a plan — all in real time, from zero prior exposure.
The benchmark consists of 135 environments, each containing over 1,000 levels. A public set of 25 environments is available for testing and development. The full private evaluation set is reserved for official competition submissions. Every environment was designed to be solvable by untrained humans on their first attempt — and during the preview period, more than 1,200 human players logged 3,900+ games to validate that assumption.
The ARC Prize Foundation's framing is unambiguous: "As long as there is a gap between AI and human learning, we do not have AGI."
How ARC-AGI-3 differs from previous versions
The shift from static puzzles to interactive environments is not a difficulty adjustment — it is a test of a fundamentally different cognitive capability.
ARC-AGI-1 and ARC-AGI-2 measured pattern recognition under novelty constraints. A capable model looks at a small number of input-output grid pairs, abstracts the transformation rule, and applies it. The intelligence being tested is inductive reasoning from limited examples. Strong models could, with enough test-time compute and clever prompting, perform reasonably on these tasks.
ARC-AGI-3 measures something else: skill acquisition over time. An agent must perceive its environment, take actions, observe the consequences, update its model of what is happening, and adapt its strategy — all without being told what success looks like. There are no natural-language instructions. There are no hidden prompts. There are no pre-loaded objectives.
The key capabilities the benchmark targets:
- Long-horizon planning with sparse feedback — agents may take dozens of actions before receiving any meaningful signal
- Experience-driven adaptation — strategy must update based on what actually happens, not what the model predicts
- Novel environment generalization — no amount of pre-training on known environments transfers to unseen ones
- Memory compression and belief updating — the agent must maintain a useful model of the world as it evolves across multiple steps
This is what makes the benchmark structurally different from everything that came before it. You cannot brute-force it with more parameters. You cannot prompt-engineer your way through it. You cannot fine-tune on a held-out set and call it generalization. The only path to a high score is real-time learning inside a novel environment — which is, by most definitions, what intelligence actually is.
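The interaction loop those requirements imply can be sketched abstractly. Everything below is hypothetical (ARC-AGI-3's real agent API is not reproduced here): a toy environment whose rules and win condition are never communicated, and an agent that can only act and observe.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for an ARC-AGI-3 environment: a tiny 1-D
    world whose dynamics and win condition are never told to the agent."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Hidden rule: move within [0, 4]. The agent only sees the result.
        self.pos = max(0, min(4, self.pos + action))
        return self.pos  # an observation -- no reward, no goal, no rule text

def run_agent(env, max_steps=50, seed=0):
    """Perceive-act-update loop. A real agent would replace the random
    policy with hypothesis-driven exploration; the transition history is
    the only material it has for belief updating."""
    rng = random.Random(seed)
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])           # placeholder exploration policy
        new_obs = env.step(action)
        history.append((obs, action, new_obs)) # raw material for a world model
        obs = new_obs
    return history

history = run_agent(ToyEnv())
```

The point of the sketch is what is absent: no reward channel, no instructions, no goal variable. Everything a scoring agent needs must be inferred from `history`.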
The scoreboard: model-by-model results
The results are stark. Every frontier model evaluated during the preview period scored below 0.4% on the RHAE metric (explained in full below).
These are not rounding errors. These are not edge cases where models almost crossed the threshold. The best frontier LLM — Google's Gemini 3.1 Pro Preview — solved less than half of one percent of what untrained humans solve.
One data point deserves particular attention. During controlled testing, Claude Opus 4.6 achieved 97.1% on environments it had been specifically engineered to handle, using custom scaffolding. When dropped into unfamiliar environments — the actual test condition — it scored 0%. That collapse is not a bug. It is the entire point. Task-specific engineering does not produce generalizable intelligence. It produces task-specific performance that vanishes the moment the task changes.
GPT-5.4 fared only marginally better than Claude Opus 4.6: 0.26% versus 0.25%. Grok-4.20 did not solve a single environment. The models are not close to each other in any meaningful sense: they are all close to zero, and all categorically far from the human baseline.
Why humans score 100%
Every one of the 135 environments in ARC-AGI-3 was solved by humans with no prior exposure and no instructions. That is not an accident — it is a design requirement.
The benchmark was explicitly constructed so that any healthy adult with no special training can complete every environment on their first attempt, given enough time. During the preview period, over 200 controlled study participants established the human baselines used for scoring. More than 1,200 total players logged games across the full environment set. The human solvability requirement is verified, not assumed.
Why can humans do this when AI systems cannot? The ARC Prize Foundation's answer points to the nature of human cognition: humans acquire goals on the fly. When placed in an unfamiliar environment, a human automatically begins forming hypotheses about what is happening, what success might look like, and what actions are worth trying. This process — sometimes called "goal inference" or "theory of mind applied to environments" — happens rapidly and largely unconsciously.
Current AI systems, including the most powerful frontier LLMs, do not have this capability in any generalizable form. They can reason about goals when goals are stated explicitly. They can follow instructions when instructions are provided. They can pattern-match to training data when the problem resembles something they have seen before. But they cannot construct a goal model from scratch, in real time, from pure environmental interaction — at least not reliably enough to score above 1%.
The human score of 100% is not a ceiling that AI is approaching. It is a baseline that AI has not yet begun to approximate.
The RHAE metric explained
ARC-AGI-3 introduces a new scoring metric: RHAE — Relative Human Action Efficiency.
The metric answers a specific question: how efficiently does an AI agent solve a level, compared to the most efficient human baseline? The formula is straightforward — but the implications are sharp.
Score = (human actions / AI actions)²
A few key design decisions make this metric genuinely difficult to game:
- Only state-changing interactions count as actions. Internal reasoning steps, chain-of-thought tokens, and silent processing time do not count. What matters is what the agent actually does in the environment.
- The human baseline is the second-best performer among 10 first-time players on the same level. This removes outliers in both directions.
- There is no bonus for exceeding human speed. A per-level score caps at 1.0. You cannot compensate for poor performance on hard levels by being faster than humans on easy ones.
- Later levels are weighted more heavily — the benchmark scales difficulty over the course of each environment, and the scoring reflects that progression.
- The squaring function penalizes inefficiency sharply. An AI that takes 10 times as many actions as the human baseline scores 1% for that level — not 10%. An AI that takes 3 times as many actions scores roughly 11%.
This scoring design means that superficially "almost there" performance translates to a very low numerical score. A model that uses 5 times as many actions as a human scores just 4% per level. That is why frontier models with seemingly plausible behavior in short demos collapse to sub-1% scores under rigorous measurement.
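Under those rules, the per-level computation reduces to a few lines. This is a sketch from the description above, not the foundation's reference implementation; baseline selection (second-best of 10 first-time players) and per-level weighting happen outside this function.

```python
def rhae_level_score(human_actions: int, ai_actions: int) -> float:
    """Per-level RHAE: (human baseline actions / AI actions), squared,
    capped at 1.0 so beating the human baseline earns no bonus."""
    if ai_actions <= 0:
        return 0.0  # level not solved: no efficiency to measure
    return min(1.0, (human_actions / ai_actions) ** 2)

# The worked examples from the text (tolerances absorb float rounding):
assert abs(rhae_level_score(10, 100) - 0.01) < 1e-12   # 10x the actions -> 1%
assert abs(rhae_level_score(10, 30) - 1 / 9) < 1e-12   # 3x -> roughly 11%
assert abs(rhae_level_score(10, 50) - 0.04) < 1e-12    # 5x -> 4%
assert rhae_level_score(10, 5) == 1.0                  # faster than human: capped
```

The squaring is the sharp edge here: linear inefficiency produces quadratic score loss, which is why "almost there" behavior collapses to near-zero numbers.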
What this means for AGI claims
The timing of ARC-AGI-3 is notable. The last 18 months have seen a sustained campaign of AGI-adjacent claims from every major AI lab. OpenAI has described GPT-5.4 as approaching "human-level performance on a wide range of tasks." Anthropic has described Claude as capable of "complex reasoning across ambiguous domains." Google has pointed to Gemini's performance on academic benchmarks as evidence of emergent general capability.
ARC-AGI-3 provides a clean empirical check on those claims — and the result is unambiguous. If any of these systems had acquired generalizable intelligence, they would score meaningfully on a benchmark that humans solve with 100% success rate. They do not.
The ARC Prize Foundation makes the underlying logic explicit: "Most benchmarks test what models already know. ARC-AGI-3 tests how they learn." That distinction is everything. A model trained on billions of internet tokens will perform well on tasks that resemble the training distribution. That is not intelligence — it is sophisticated retrieval and interpolation. Intelligence is the ability to handle genuinely novel situations that share no surface similarity with anything in the training data.
ARC-AGI-3 was designed specifically to prevent retrieval and interpolation from working. The environments are novel. The rules are undisclosed. The goals are not stated. The only path forward is real-time environmental reasoning — and on that test, every frontier model fails.
This does not mean current AI is not useful. It is enormously useful. But the industry has been conflating "useful at known tasks" with "generally intelligent" — and ARC-AGI-3 makes that conflation untenable.
The $2 million prize pool
ARC Prize 2026 raises the total competition prize pool to $2 million, split across two tracks hosted on Kaggle.
ARC-AGI-3 Track — $850,000 total:
- Grand prize (100% score): $700,000 — carries over to the next year if unclaimed
- Milestone prizes (June 30 and September 30 checkpoints): $75,000 combined
- Top-score awards: $75,000 distributed among the top-scoring teams
ARC-AGI-2 Track: The second track continues ARC-AGI-2 competition in its final year, with the remaining prize pool allocated for top performers.
The competition runs from March 25, 2026 through November 2, 2026, with results announced December 4, 2026. The grand prize for 100% performance on ARC-AGI-3 is $700,000, and given that the best preview score to date is 12.58%, it almost certainly carries over.
One critical requirement: all winning solutions must be open-sourced under permissive licenses — CC0 or MIT-0 — before receiving private evaluation scores. The ARC Prize Foundation is not just offering money for a benchmark score. It is structuring the competition to ensure that any breakthrough in interactive environmental reasoning becomes public knowledge immediately.
That open-source requirement reflects the broader mission. The foundation is not trying to crown a winner — it is trying to accelerate the field's understanding of what generalizable AI actually requires.
What non-LLM approaches achieved
One of the most significant findings from the ARC-AGI-3 preview is what the non-LLM approaches achieved — and what that tells us about the path forward.
The top three preview submissions did not use large language models at all:
- StochasticGoose (Tufa Labs): CNN + reinforcement learning action-learning — 12.58%
- Blind Squirrel: State graph exploration + ResNet18 — 6.71%
- Explore It Till You Solve It: Training-free frame graph approach — 3.64%
The best of these — 12.58% from StochasticGoose — is more than 30 times better than the best frontier LLM (0.37%). Every top submission avoided language models entirely. The approaches that performed best relied on reinforcement learning, graph-based exploration, and convolutional networks trained on environment interactions.
This finding inverts the dominant assumption of the last three years. The mainstream narrative has been that scale — more parameters, more RLHF, more compute — produces progressively more capable intelligence. ARC-AGI-3 suggests that for interactive environmental reasoning, the architectural choices matter far more than scale. A lightweight CNN with a good RL loop outperforms GPT-5.4 by a factor of 30 on this benchmark.
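The winning submissions' code is not reproduced here, but the paradigm they share can be illustrated with the smallest possible instance: tabular Q-learning on a toy task. The environment and hyperparameters below are invented for illustration; the point is that the agent's internal model (`Q`) changes in response to experience, which is exactly what a frozen-weight LLM cannot do at inference time.

```python
import random
from collections import defaultdict

def q_learn(step_fn, actions, episodes=200, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Minimal tabular Q-learning: update an internal value model from
    each observed (state, action, reward, next_state) transition."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit current beliefs, sometimes explore.
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(state, x)])
            nxt, reward, done = step_fn(state, a)
            best_next = max(Q[(nxt, x)] for x in actions)
            # Learn from the consequence -- the step frozen-weight inference lacks.
            Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
            state = nxt
    return Q

# Toy task (invented): walk a 5-cell corridor from cell 0 to cell 4;
# the only feedback is +1 on reaching the goal.
def corridor(state, action):
    nxt = max(0, min(4, state + action))
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4

Q = q_learn(corridor, actions=[-1, 1])
```

After a few hundred episodes the learned values favor moving right from every cell, with no rules or goals ever stated, only experienced. The top submissions apply the same learn-from-interaction principle with far richer perception (CNNs, state graphs) in place of the lookup table.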
The ARC Prize Foundation's technical report notes the pattern explicitly: all top systems avoided LLMs, suggesting that interactive reasoning requires novel algorithmic approaches rather than model scaling. This is not a peripheral finding — it is a direct challenge to the capital allocation strategy of every major AI lab currently spending hundreds of millions of dollars on training runs for next-generation foundation models.
Implications for model architecture
ARC-AGI-3 forces a reckoning with a question the industry has been deferring: what kind of system can actually learn in real time inside a novel environment?
The answer is clearly not a frozen transformer with a large context window. Frontier LLMs process a context window and produce an output — they do not update their weights or their internal models in response to what they observe. Every interaction is stateless at the parameter level. The model that enters environment step 100 is identical, at the weight level, to the model that entered environment step 1. It can use its context window to track what has happened, but it cannot learn in the way that word implies — updating underlying representations based on experience.
What ARC-AGI-3 rewards is something closer to online learning: the ability to form, test, and revise hypotheses about an environment's rules in real time, using feedback from actual interactions. This is closer to reinforcement learning than to supervised pre-training. It requires memory systems that compress experience efficiently, planning systems that can project forward under uncertainty, and belief-updating mechanisms that revise models when predictions fail.
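One concrete form of this belief updating is hypothesis elimination: maintain a set of candidate world models and discard any that mispredict an observed transition. The candidate rules below are invented purely for illustration.

```python
def filter_hypotheses(hypotheses, transitions):
    """Keep only the candidate dynamics models consistent with every
    observed (state, action, next_state) transition."""
    return [h for h in hypotheses
            if all(h(s, a) == s2 for s, a, s2 in transitions)]

# Three invented candidate models of how an action changes the state:
candidates = [
    lambda s, a: s + a,   # additive dynamics
    lambda s, a: s * a,   # multiplicative dynamics
    lambda s, a: s,       # inert world: actions change nothing
]

# Two interactions are enough to falsify two of the three candidates.
observed = [(2, 3, 5), (5, 1, 6)]
surviving = filter_hypotheses(candidates, observed)  # only s + a survives
```

Prediction failure is the learning signal: each surprising transition shrinks the hypothesis set, and the surviving model drives planning. Scaling this idea to rich grid worlds with unknown goals is the open problem the benchmark poses.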
None of this is architecturally impossible. Reinforcement learning systems do exactly this in the right environments. The StochasticGoose approach demonstrates that RL-based systems can achieve meaningful scores on ARC-AGI-3 where LLMs cannot. The gap between 12.58% and 100% is still large, but the 34-fold gap between 0.37% and 12.58% suggests that RL-based approaches are a far better fit for this class of problem.
The architectural implication is significant: the path to AGI may require abandoning the frozen-weights inference paradigm that current frontier models rely on. Genuine online learning — weight updates or equivalent internal state changes driven by real-time experience — may be necessary, not optional.
Industry reactions and the bigger picture
The release of ARC-AGI-3 landed on March 25, 2026 — just weeks after OpenAI's and Anthropic's most recent model capability claims. The contrast is not subtle.
The ARC Prize Foundation's announcement framing was direct: "Humans score 100%. AI less than 1%. This human-AI gap demonstrates we do not yet have AGI." No hedging, no qualifications, no discussion of near-future progress curves. The statement is a clean empirical claim backed by a public benchmark.
What the ARC-AGI-3 release clarifies — perhaps definitively — is that benchmark saturation and genuine capability are not the same thing. ARC-AGI-1 was eventually saturated. ARC-AGI-2 was approached. Each time, the AI industry pointed to improved benchmark scores as evidence of approaching AGI. ARC-AGI-3 resets that narrative by changing what is being measured.
The benchmark also highlights the danger of task-specific performance being mistaken for general capability. Claude Opus 4.6 scoring 97.1% on engineered environments and 0% on novel ones is not a failure of one model — it is a description of how all current frontier models work. They are extraordinarily capable within their training distribution. Outside it, the capability collapses.
This matters for how enterprise and research teams think about agentic AI deployments. An AI agent that performs well in controlled demos, on tasks similar to its training data, may behave very differently when encountering genuinely novel problems in production. ARC-AGI-3 provides a framework for thinking about where that boundary lies — and right now, every frontier model sits far below it.
Conclusion
ARC-AGI-3 is not just a harder benchmark. It is a different kind of benchmark — one that measures whether AI systems can actually learn from experience in novel environments, rather than pattern-match against memorized training data.
The results are definitive: they cannot, not yet, not at any meaningful level. Gemini 3.1 Pro Preview at 0.37%, GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, Grok-4.20 at 0.00% — these are not scores approaching a threshold. They are scores that reveal a categorical gap between current AI capability and what the benchmark requires.
The $2 million prize pool remains almost entirely unclaimed. The 100% human baseline remains untouched by any AI system. The non-LLM approaches — RL-based, graph-based, learning-in-environment approaches — are outperforming frontier models by factors of 30 or more, suggesting that the architectural direction for genuine interactive intelligence may diverge sharply from the transformer scaling path that has dominated the last five years.
None of this diminishes the usefulness of current AI systems. GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are powerful tools for tasks within their training distribution. But "powerful within known domains" and "generally intelligent" are different claims — and ARC-AGI-3 has now drawn that line in sharp relief.
The question the benchmark poses is not rhetorical. It has a specific answer, a specific prize, and a specific deadline: November 2, 2026. If any AI system can solve what untrained humans solve — exploring novel environments, forming goals from scratch, learning by doing — it will win $700,000 and reshape every assumption the industry holds about where we are on the path to AGI.
Right now, the evidence says we are not close.