Google DeepMind's Aletheia: The AI Agent That Does Real Math Research
DeepMind's new Aletheia agent moves beyond competition math to autonomously produce publishable research, resolving open conjectures and proposing a taxonomy for AI scientific autonomy.
Winning a math competition is one thing. Producing original research is quite another. Google DeepMind's newly introduced Aletheia agent crosses that line — moving from gold-medal performance at the International Mathematical Olympiad to autonomously generating publishable mathematical research.
From Olympiad Gold to Open Problems
AI models reached IMO gold-medal level in 2025, but competition math and research math are fundamentally different beasts. Competition problems have known solutions and bounded complexity. Research requires navigating vast literature, constructing long-horizon proofs, and — critically — producing novel results.
Aletheia, powered by an advanced version of Gemini Deep Think, is purpose-built for this harder task.
The Architecture: Generate, Verify, Revise
At its core, Aletheia runs a three-part agentic loop:
- Generator — proposes a candidate solution to a research problem
- Verifier — checks the solution for flaws, hallucinations, and logical gaps using natural language reasoning
- Reviser — corrects errors identified by the Verifier, iterating until the output passes verification
This explicit separation matters. DeepMind found that models that generate and verify in a single pass tend to overlook their own mistakes; splitting the roles forces genuine self-critique.
To prevent citation hallucinations — a persistent problem when AI discusses existing literature — Aletheia also uses Google Search and web browsing to ground its work in real mathematical papers.
The Numbers
The results are striking:
- 95.1% accuracy on IMO-Proof Bench Advanced (up from 65.7% previous best)
- 100x reduction in compute needed for IMO-level problems compared to the 2025 Deep Think version
- State-of-the-art on FutureMath Basic, a PhD-level exercise benchmark
But benchmarks tell only part of the story. The real milestone is what Aletheia has produced.
Actual Research Output
Aletheia has already contributed to peer-reviewed work:
- Fully autonomous research (Feng26): Aletheia generated an entire paper calculating structure constants called eigenweights — no human intervention required. This is classified as Level 2 (essentially autonomous, publishable quality).
- Collaborative research (LeeSeo26): The agent provided high-level strategy for proving bounds on independent sets, which human mathematicians then formalized into rigorous proofs.
- The Erdős Conjectures: Deployed against 700 open problems from Paul Erdős's famous collection, Aletheia found 63 technically correct solutions and resolved 4 open questions autonomously.
Resolving even one Erdős conjecture is noteworthy — these problems have stumped mathematicians for decades.
A Taxonomy for AI Autonomy in Science
DeepMind also proposed a classification system for AI contributions to mathematics, reminiscent of autonomous vehicle levels:
- Level 0: Primarily human work, negligible novelty (competition-level solving)
- Level 1: Human-AI collaboration, minor novelty
- Level 2: Essentially autonomous, publishable research
Aletheia's Feng26 paper sits at Level 2 — a landmark for AI in scientific research.
Why This Matters
Aletheia represents a shift from AI as a tool (answering questions, generating code) to AI as a research collaborator that can identify open problems, propose solutions, verify its own work, and iterate toward publishable results.
The implications extend well beyond mathematics. The generate-verify-revise pattern is domain-agnostic. If it works for proving theorems, variants could work for drug discovery, materials science, or any field where hypotheses need rigorous verification.
We're watching AI move from passing tests to doing the actual work the tests were designed to measure.