TL;DR: Google DeepMind upgraded Gemini 3 Deep Think on February 12, 2026 — and the results went far beyond a benchmark refresh. The model solved 18 previously open research problems across mathematics, physics, and computer science, disproved a 2015 mathematical conjecture that had stumped researchers for a decade, and scored 84.6% on ARC-AGI-2 — surpassing the human average. It also achieved Legendary Grandmaster status on Codeforces, gold-medal performance on the International Physics and Chemistry Olympiads, and set a new record of 48.4% on Humanity's Last Exam. Full details at Google's blog and DeepMind's research post.
Table of Contents
- What Just Happened
- The Benchmark Numbers That Changed the Conversation
- 18 Research Problems — What That Actually Means
- The Conjecture That Fell
- How Deep Think Works: The Aletheia Architecture
- Real Scientists, Real Laboratories
- Who Is Yao Shunyu — and Why His Involvement Matters
- What This Means for Scientific Research
- Availability and Access
- The Bigger Picture: AI as a Research Collaborator
- FAQ
What Just Happened
There is a version of the Gemini 3 Deep Think story that sounds like every other AI announcement: a new model, new benchmarks, a press release. And then there is the version where a machine identifies a flaw in a mathematics paper that survived human peer review, where researchers at Duke University achieve a fabrication target for semiconductor crystals that previous methods could not reach, and where 18 problems that had been sitting open in research databases for years get solved in weeks.
Both versions describe the same release. But only one of them captures what is actually happening.
Google DeepMind's upgraded Gemini 3 Deep Think launched on February 12, 2026. By the time the research community had digested the results, the conversation had shifted. This was no longer about whether AI could assist with science. It was about how quickly the boundary between "AI assistant" and "AI collaborator" was going to disappear.
The upgrade targets a specific and historically intractable problem: the kind of research challenge where the problem statement is ambiguous, the data is incomplete, the solution space is enormous, and there is no known answer to check against. These are the problems that define the frontier of human knowledge — and they are precisely the problems that previous AI systems struggled to engage with meaningfully.
Deep Think was built, in part, to change that.
The Benchmark Numbers That Changed the Conversation
Benchmarks are a flawed proxy for intelligence, and anyone who has watched the AI industry cycle through them knows to hold them lightly. But some numbers are hard to dismiss.
Here is where Gemini 3 Deep Think stands against the field (comparison figures are those reported alongside the release):

| Benchmark | Gemini 3 Deep Think | Claude Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- |
| ARC-AGI-2 | 84.6% | 68.8% | 52.9% |
| Humanity's Last Exam (no tools) | 48.4% | 40.0% | 34.5% |
| Codeforces Elo | 3,455 | — | — |
| MMMU-Pro | 81.5% | — | — |

Let's unpack each of these.
ARC-AGI-2 is the benchmark created by the ARC Prize Foundation specifically to resist pattern-matching and test for general fluid reasoning. It consists of visual puzzles that require understanding abstract rules from a tiny number of examples — the kind of task humans find approachable but that consistently defeats AI systems trained on statistical patterns. The human average is approximately 60%. Deep Think's 84.6% is not only well above human average — it is nearly 16 percentage points ahead of the next best result from Claude Opus 4.6. The ARC Prize Foundation verified this score independently.
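To make the task format concrete, here is a toy sketch of what an ARC-style problem demands: inferring an abstract transformation from a few example grids and applying it to a held-out input. The grids and hypothesis set below are illustrative inventions, far simpler than actual ARC-AGI-2 puzzles.

```python
# Toy illustration of the ARC task format (not an actual ARC-AGI-2 puzzle):
# infer an abstract grid transformation from a few examples, then apply it
# to an unseen test input.

def flip_h(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Mirror the grid top-to-bottom."""
    return grid[::-1]

def rotate(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

HYPOTHESES = [flip_h, flip_v, rotate]

def infer_rule(examples):
    """Return the first hypothesis consistent with every example pair."""
    for rule in HYPOTHESES:
        if all(rule(inp) == out for inp, out in examples):
            return rule
    return None

examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),   # the marked cell mirrors across
    ([[0, 0], [2, 0]], [[0, 0], [0, 2]]),
]
rule = infer_rule(examples)
print(rule.__name__)             # which abstract rule fits both examples
print(rule([[3, 0], [0, 0]]))    # apply it to an unseen test grid
```

Real ARC tasks resist exactly this kind of fixed hypothesis enumeration — the rule space is open-ended, which is why high scores are taken as evidence of fluid reasoning rather than pattern lookup.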
Humanity's Last Exam is a benchmark designed to be impossible for current AI: thousands of questions written by subject matter experts, calibrated to be easy for knowledgeable humans but expected to defeat AI. The questions span graduate-level mathematics, physics, chemistry, biology, law, and philosophy. Getting to 48.4% without tools — meaning no web access, no calculators, no code execution — is a result that the benchmark's creators did not expect any model to approach for years. For context: earlier top models were scoring in the low 20s.
Codeforces Elo of 3,455 puts Deep Think in Legendary Grandmaster territory — the top tier of competitive programming. Only a tiny number of human programmers in the world hold ratings at this level. The model is not just writing code. It is solving novel algorithmic problems under competitive constraints at a level that almost no human can match.
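For intuition about what a 3,455 rating implies, the standard Elo expected-score formula is useful. (Codeforces uses its own Elo-derived system; this is the textbook formula, shown only to give the number some texture.)

```python
# Standard Elo expected-score arithmetic, as background for ratings
# like Codeforces'. A 555-point gap translates into near-certain wins.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score (win probability, roughly) of player A vs player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 3,455-rated player against a strong 2,900-rated Grandmaster:
print(round(expected_score(3455, 2900), 3))   # ~0.96 expected score
```

In other words, even against elite Grandmasters a rating gap that size predicts winning the overwhelming majority of contests.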
MMMU-Pro, the multimodal understanding benchmark, shows the one area where Deep Think's advantage narrows — 81.5% versus Gemini 3 Pro's 81.0%, a near-tie. This is actually informative: Google optimized Deep Think for reasoning, not visual processing. The tight gap on MMMU-Pro tells you what they prioritized and what they did not.
18 Research Problems — What That Actually Means
The headline number — 18 previously unsolved research problems — is more concrete than it sounds.
These are not toy problems or simplified versions of real research questions. They are open problems drawn from active research in algorithms, machine learning, combinatorial optimization, information theory, and economics. Many had been sitting unresolved in research databases for years. Some had been attempted by multiple research groups and failed.
Among the documented solutions:
The Max-Cut problem in network optimization, involving partitioning a graph to maximize the weight of edges between the two groups. Variants of this problem appear in circuit design, statistical physics, and machine learning training procedures.
The Steiner Tree problem in high-dimensional geometry, concerning how to connect a set of points with the minimum total length of connecting segments. Hard even in two dimensions; intractable in higher dimensions without heuristics.
Online submodular optimization — a class of problems where you must make decisions sequentially without seeing the future, and where the objective function has diminishing returns. A conjecture from 2015 about this problem class was proven false by Deep Think via a specific three-item combinatorial counterexample. More on this below.
Machine learning noise filtering: a previously unexplained behavior in a class of penalty methods was given a formal theoretical justification.
Extended Revelation Principle for AI token auctions: an economics result about mechanism design was extended from discrete settings to continuous (real-valued) ones — a generalization that changes the applicability of the underlying theory.
Cosmic strings gravitational radiation: analytical solutions to integral problems arising in the physics of topological defects in the early universe.
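To make the first item on this list concrete, here is a classic local-search baseline for Max-Cut — flip any vertex whose move increases the cut until no flip helps. This is a standard textbook heuristic shown only to illustrate the problem; it is not the approach Deep Think used.

```python
# Classic local-search heuristic for Max-Cut: the result is guaranteed
# to capture at least half the total edge weight. Illustrative only.

def local_search_max_cut(n, edges):
    """edges: list of (u, v, weight). Returns (side assignment, cut weight)."""
    side = [0] * n                       # start with every vertex in group 0
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # gain from flipping v = (weight newly cut) - (weight un-cut)
            gain = sum(w if side[a] == side[b] else -w
                       for a, b, w in edges if v in (a, b))
            if gain > 0:
                side[v] ^= 1
                improved = True
    cut = sum(w for a, b, w in edges if side[a] != side[b])
    return side, cut

# Triangle with one light and two heavy edges; the best cut isolates vertex 2.
edges = [(0, 1, 1), (1, 2, 5), (0, 2, 5)]
side, cut = local_search_max_cut(3, edges)
print(cut)
```

Exact Max-Cut is NP-hard, which is why open research variants of it remain live targets rather than solved exercises.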
The team also evaluated 700 problems from Bloom's Erdős conjectures database — a canonical collection of open problems in combinatorics and number theory, posed by the late mathematician Paul Erdős and compiled by mathematician Thomas Bloom. Deep Think solved four autonomously, including one (Erdős-1051) that it later generalized in a follow-up paper.
To contextualize the scale: resolving even a single Erdős conjecture is considered a meaningful contribution in mathematics. Doing four in a batch, plus a generalization, is unusual.
The Conjecture That Fell
One result deserves closer attention.
In 2015, researchers studying data stream optimization proposed a conjecture about the relative value of different data management strategies. Specifically: when processing an arriving stream of items, is it better to move an item to its final destination immediately, or is copying it (keeping the original in place while creating a copy elsewhere) ever actually useful?
The conjecture was that copying was always less efficient — that moving originals would dominate in terms of optimization quality. It was a reasonable hypothesis. It aligned with intuitions from related problems. And it had not been disproved in ten years.
Deep Think found a counterexample.
The counterexample is highly specific: a three-item combinatorial scenario where copying does, in fact, strictly outperform moving. The scenario is narrow enough that it would be easy to miss in a search over simple cases — but once found, it invalidates the general conjecture completely.
This is a good example of what AI-assisted mathematics actually looks like in practice. The model is not generating creative leaps from nothing. It is exhaustive where humans get tired, systematic where humans rely on intuition, and persistent where humans move on to more tractable problems. In the case of a conjecture that had accumulated a decade of unsuccessful counterexample searches, that kind of systematic exhaustion is exactly what was needed.
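That style of systematic exhaustion can be shown in miniature. The target below is a classical toy, not the 2015 streaming conjecture: Euler's observation that n² + n + 41 is prime for small n, which fails at a value a casual check could easily miss.

```python
# Exhaustive counterexample search in miniature: Euler's polynomial
# n^2 + n + 41 yields primes for n = 0..39, then quietly fails.

def is_prime(k: int) -> bool:
    if k < 2:
        return False
    return all(k % d for d in range(2, int(k ** 0.5) + 1))

counterexample = next(n for n in range(1000) if not is_prime(n * n + n + 41))
print(counterexample)            # 40: 40^2 + 40 + 41 = 1681 = 41^2
```

The analogy is loose — Deep Think's counterexample lives in a far richer combinatorial space — but the principle is the same: a machine that never tires of checking one more case will eventually hit the configuration a human search skipped.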
How Deep Think Works: The Aletheia Architecture
Google DeepMind published technical details about the underlying architecture for the mathematics research component. They call it the Aletheia framework — after the Greek concept of unconcealedness, often translated as "truth."
The framework has three components:
Generator: produces candidate solutions — proofs, counterexamples, or intermediate steps — based on the problem statement and any accumulated context.
Verifier: evaluates candidate solutions using natural language assessment rather than formal proof checking. This is a deliberate design choice. Formal proof assistants like Lean or Coq are powerful but require problems to be expressed in a rigid formal language, which limits the range of questions they can address. Natural language verification is less precise but dramatically more general.
Reviser: takes the verifier's critique and either makes targeted corrections to the candidate solution or, if the verifier identifies a fundamental flaw, signals the generator to restart from scratch.
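The three-role loop can be sketched as a simple control flow. Everything below is a reconstruction from the prose description: the function names, the `Verdict` type, and the restart-vs-revise split are my own placeholders, since Google has not published Aletheia's actual interfaces.

```python
# Minimal sketch of a generator/verifier/reviser loop, assuming the
# three-component design described above. Interfaces are hypothetical.

from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool          # verifier accepts the candidate
    fatal: bool       # flaw is fundamental -> restart, don't revise
    critique: str     # natural-language assessment

def aletheia_loop(problem, generate, verify, revise, max_rounds=8):
    """Iterate generate -> verify -> revise until a candidate passes."""
    candidate = generate(problem, context=None)
    for _ in range(max_rounds):
        verdict = verify(problem, candidate)
        if verdict.ok:
            return candidate
        if verdict.fatal:
            # fundamental flaw: feed the critique back and start over
            candidate = generate(problem, context=verdict.critique)
        else:
            # local flaw: targeted correction of the existing candidate
            candidate = revise(candidate, verdict.critique)
    return None       # no verified solution within budget
```

The interesting design decision is the `fatal` branch: distinguishing "patch this step" from "this whole approach is doomed" is what keeps the loop from polishing an unsalvageable draft.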
The system also integrates Google Search and live web browsing during the research process. This addresses one of the persistent failure modes in AI-assisted research: hallucinated citations. By grounding claims against live sources, the system can acknowledge when it does not know something — a capability that turns out to be surprisingly important for research quality. A system that confidently fabricates citations is worse than useless in a scientific context.
For physics and computer science collaboration, the team uses a different approach they call "vibe-proving" — an iterative cycle where the AI and human researchers alternate between proposing directions and stress-testing them. The prompting strategy also includes what they call "balanced prompting": rather than asking the model to prove a statement, they ask it to either prove it or find a refutation. This prevents the model from becoming anchored on a particular direction when the answer might be that the statement is false.
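A balanced prompt is easy to picture. The template below is my own paraphrase of the idea, not DeepMind's published wording:

```python
# "Balanced prompting" as described: ask for a proof OR a refutation,
# so the model is not anchored on one direction. Wording is illustrative.

def balanced_prompt(statement: str) -> str:
    return (
        f"Consider the following statement:\n\n{statement}\n\n"
        "Either prove the statement or refute it with an explicit "
        "counterexample. Do not assume in advance that it is true; "
        "weigh both possibilities before committing to a direction."
    )

print(balanced_prompt("Every bounded monotone sequence converges."))
```

Contrast this with "Prove the following statement," which presupposes the answer — exactly the anchoring the 2015 conjecture result shows can be fatal.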
Real Scientists, Real Laboratories
Beyond the controlled benchmark evaluations, Google DeepMind ran Deep Think in active research settings with real scientists working on real problems.
At Rutgers University, mathematician Lisa Carbone used Deep Think to review a highly technical paper in mathematical physics — specifically, a paper on structures bridging quantum mechanics and gravity. Deep Think identified a subtle logical flaw in the argument that had not been caught during human peer review. The flaw was real, not a false positive. This kind of result is notable because peer review is supposed to catch exactly this: errors in published mathematics. The fact that a model can now sometimes catch what peer reviewers miss has direct implications for how mathematical knowledge gets validated.
At Duke University's Wang Lab, researchers used Deep Think to optimize fabrication methods for a class of semiconductor materials involving complex crystal growth. The challenge was designing a process recipe for growing thin films larger than 100 micrometers — a target that previous methods had consistently failed to reach. Deep Think helped identify a viable approach. A film exceeding 100 μm was subsequently grown.
These are not demonstrations staged for a press release. They are accounts from working scientists who used a tool to advance ongoing research. The institutional names, specific problem descriptions, and verifiable results give them credibility that benchmark numbers alone cannot.
Who Is Yao Shunyu — and Why His Involvement Matters
Gemini 3 Deep Think's release brought attention to one name in particular: Yao Shunyu, a researcher who left Anthropic to join Google and contributed to the model's development.
Yao is not a typical AI hire. He studied physics at Tsinghua University — one of China's most selective institutions — and published research in Physical Review Letters as an undergraduate, presenting the first topological energy-band theory for non-Hermitian systems. He subsequently moved into AI research, becoming known for work on reasoning and agent systems.
His background in physics shapes how he thinks about reasoning benchmarks: not as scorecard metrics but as proxies for the kind of structured problem-solving that makes science possible. His public comments after the release were characteristically direct. On Codeforces performance, he noted that only seven human programmers in the world currently maintain an Elo above the model's 3,455 — a statement that prompted debate about what "defending carbon-based programming" actually means in practice.
Yao's visibility in the coverage of Deep Think's launch reflects a broader shift at Google. The company has been recruiting people who combine deep scientific domain knowledge with AI expertise — researchers who understand not just how to build models but what kinds of problems are actually worth solving. That combination is increasingly apparent in what Deep Think can do.
What This Means for Scientific Research
The question that matters most is not whether Deep Think is impressive. It clearly is. The question is what changes downstream.
Peer review has a new tool — but also a new pressure. If AI can catch errors in peer-reviewed mathematics, then the publishing process for technical results will need to integrate this capability explicitly. Some journals are already discussing AI-assisted review protocols. The Rutgers result suggests this is not a distant hypothetical.
The Erdős database of open conjectures — and similar repositories like Hilbert's problems, the Millennium Prize Problems, and various domain-specific open problem lists — are now legitimate targets for systematic AI-assisted attack. Four Erdős conjectures in a single evaluation pass is a meaningful signal that the approach scales.
Semiconductor and materials science R&D can move faster with AI-assisted design of fabrication protocols. The Duke Wang Lab result is one example among many where the bottleneck is not the physics of the process but the optimization of the parameter space — exactly the kind of problem where AI excels.
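Parameter-space optimization, in its simplest form, looks like the sketch below. The objective function is a synthetic stand-in with an interior sweet spot — not the Duke lab's actual growth model, whose parameters were not disclosed.

```python
# Random search over a toy "process recipe" parameter space. The
# objective is invented for illustration; real fabrication objectives
# are measured experimentally, not evaluated in closed form.

import random

def film_size(temp_c: float, rate_nm_s: float) -> float:
    """Toy objective: film size (um) peaks at an interior sweet spot."""
    return 120 - 0.01 * (temp_c - 850) ** 2 - 40 * (rate_nm_s - 0.5) ** 2

random.seed(0)
best = max(
    ((random.uniform(700, 1000), random.uniform(0.1, 1.0)) for _ in range(2000)),
    key=lambda p: film_size(*p),
)
print(round(film_size(*best), 1))   # close to the 120 um optimum
```

In real labs each evaluation is a days-long experiment, which is why narrowing the search intelligently — rather than sampling blindly — is where an AI collaborator earns its keep.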
Research in information-sparse domains — where problems lack clear guardrails and data is incomplete — was previously the domain where AI was least useful. Deep Think was specifically designed to engage with this category. The Aletheia framework's ability to acknowledge failure and redirect is a genuine capability improvement over models that generate confident but wrong answers.
The team behind Deep Think has proposed a taxonomy for AI-assisted mathematics, ranging from Level 0 (fully autonomous, trivial problems) to Level 4 (landmark breakthroughs requiring significant creative insight). Their honest assessment is that they are currently reaching Level 2 — publishable quality in collaboration with human researchers — and that Levels 3 and 4 remain beyond current reach.
That self-assessment matters. It is easy for an AI lab to oversell results. Framing the current capability as "a force multiplier for human intellect" rather than "a replacement for human researchers" is both accurate and instructive.
Availability and Access
Gemini 3 Deep Think is currently available through two channels:
Gemini app for Google AI Ultra subscribers — the consumer-facing tier for Google's most advanced AI capabilities. Ultra subscribers get direct access to the upgraded Deep Think through the standard Gemini interface.
Gemini API early access program — for researchers, engineers, and enterprises who want to integrate Deep Think into workflows or research pipelines. Access is via a signup form; Google is managing capacity through a selective early access program while scaling infrastructure.
For academic researchers specifically, Google has deployed Deep Think in at least one conference context: the STOC 2026 theoretical computer science conference, where it provided automated feedback on submitted papers — a real-world deployment at a major academic venue.
No pricing has been publicly disclosed for the API tier beyond the Ultra subscription. Given the computational intensity of extended reasoning tasks, enterprise pricing is likely to reflect usage at a different scale than standard API calls.
The Bigger Picture: AI as a Research Collaborator
The framing that keeps appearing in Google DeepMind's coverage of Deep Think — "AI as collaborator," "force multiplier for human intellect" — is not merely marketing language. It reflects a genuine design philosophy that distinguishes this class of system from what came before.
Earlier AI systems were optimized for tasks with known answers: translate this text, summarize this document, generate code that passes these tests. Deep Think is optimized for tasks where the answer is unknown, the problem statement may be ambiguous, and the right move might be to prove the opposite of what you expected.
That shift — from answer-retrieval to answer-seeking — is the core of what makes this release different.
Consider what the Aletheia architecture's "balanced prompting" approach actually implies. When a human mathematician attacks an open problem, they hold two possibilities simultaneously: the statement might be true, or it might be false. They follow evidence in both directions. Most AI systems are trained to converge on a single answer rather than to maintain and explore competing hypotheses. Balanced prompting is an attempt to build that dual-hypothesis orientation into the model's behavior at inference time.
It is a small change on the surface. But it is the kind of small change that, once you understand what it enables, looks like the right idea at the right moment.
The 18 solved research problems and the disproved conjecture are measurable outputs. The methodology that produced them — systematic, patient, capable of acknowledging failure, designed for problems without known answers — is the more durable result.
FAQ
Is Gemini 3 Deep Think available to regular users?
Yes, through the Gemini app for Google AI Ultra subscribers. Developers and researchers can sign up for early API access through Vertex AI.
What is ARC-AGI-2 and why does the 84.6% score matter?
ARC-AGI-2 is a benchmark designed by the ARC Prize Foundation to test general reasoning by presenting visual puzzles that require understanding abstract rules from minimal examples. Humans average about 60%. Deep Think's 84.6% — independently verified — surpasses human average and represents the highest score publicly reported on this benchmark.
What is Humanity's Last Exam?
A benchmark of thousands of expert-level questions across dozens of disciplines, designed to be easy for knowledgeable humans but expected to defeat AI systems. Previous top models scored in the low-to-mid 20s. Deep Think's 48.4% without tools is a substantial jump, though it still leaves significant room before the benchmark is "saturated."
Did Deep Think really solve 18 research problems?
According to Google DeepMind's published research and verified external reports, yes. The problems span algorithms, combinatorial optimization, machine learning theory, information theory, economics (mechanism design), and physics. The list includes four Erdős conjectures from the Bloom database.
What is Codeforces Legendary Grandmaster status?
Codeforces is a competitive programming platform that uses an Elo-based rating system. Legendary Grandmaster (3,000+ Elo) is the highest tier. Deep Think's Elo of 3,455 places it above virtually all human competitors; only a tiny number of humans worldwide hold comparable ratings.
What is the Aletheia framework?
The internal architecture Google DeepMind uses for Deep Think's mathematics research capabilities. It consists of a generator (produces candidate solutions), a verifier (assesses them in natural language), and a reviser (corrects or restarts based on verifier feedback). The system also integrates live web search to prevent hallucinated citations.
How does this compare to what Anthropic and OpenAI are doing in scientific AI?
On the specific benchmarks reported, Deep Think leads Claude Opus 4.6 by 15.8 percentage points on ARC-AGI-2 and 8.4 points on Humanity's Last Exam. GPT-5.2 trails by 31.7 and 13.9 points respectively on those two benchmarks. Neither Anthropic nor OpenAI has published results specifically focused on open research problem solving at the scale reported for Deep Think.
Will this replace human researchers?
Google DeepMind's own taxonomy explicitly stops short of that claim. The current capability is framed as Level 2 — publishable quality in human-AI collaboration — with Levels 3 and 4 (major advances and landmark breakthroughs) not yet reached. The more accurate framing is: AI can now meaningfully accelerate certain categories of research, particularly those involving exhaustive search over large solution spaces.
Sources: Google Blog — Gemini 3 Deep Think · DeepMind Research Post · 9to5Google · The Decoder · Winbuzzer · ARC Prize Leaderboard · Storyboard18