Google DeepMind's Aletheia: The AI Agent That Does Real Math Research
DeepMind's new Aletheia agent moves beyond competition math to autonomously produce publishable research, resolving open conjectures and proposing a taxonomy for AI scientific autonomy.
Winning a math competition is one thing. Producing original research is quite another. Google DeepMind's newly introduced Aletheia agent crosses that line — moving from gold-medal performance at the International Mathematical Olympiad to autonomously generating publishable mathematical research.
From Olympiad Gold to Open Problems
AI models reached IMO gold-medal level in 2025, but competition math and research math are fundamentally different beasts. Competition problems have known solutions and bounded complexity. Research requires navigating vast literature, constructing long-horizon proofs, and — critically — producing novel results.
Aletheia, powered by an advanced version of Gemini Deep Think, is purpose-built for this harder task.
The Architecture: Generate, Verify, Revise
At its core, Aletheia runs a three-part agentic loop:
- Generator — proposes a candidate solution to a research problem
- Verifier — checks the solution for flaws, hallucinations, and logical gaps using natural language reasoning
- Reviser — corrects errors identified by the Verifier, iterating until the output passes verification
This explicit separation matters. DeepMind found that models that generate and verify in a single pass tend to overlook their own mistakes; splitting the roles forces genuine self-critique.
To prevent citation hallucinations — a persistent problem when AI discusses existing literature — Aletheia also uses Google Search and web browsing to ground its work in real mathematical papers.
The Numbers
The results are striking:
- 95.1% accuracy on IMO-Proof Bench Advanced (up from 65.7% previous best)
- 100x reduction in compute needed for IMO-level problems compared to the 2025 Deep Think version
- State-of-the-art on FutureMath Basic, a PhD-level exercise benchmark
But benchmarks tell only part of the story. The real milestone is what Aletheia has produced.
Actual Research Output
Aletheia has already contributed to peer-reviewed work:
- Fully autonomous research (Feng26): Aletheia generated an entire paper calculating structure constants called eigenweights — no human intervention required. This is classified as Level 2 (essentially autonomous, publishable quality).
- Collaborative research (LeeSeo26): The agent provided high-level strategy for proving bounds on independent sets, which human mathematicians then formalized into rigorous proofs.
- The Erdős Conjectures: Deployed against 700 open problems from Paul Erdős's famous collection, Aletheia found 63 technically correct solutions and resolved 4 open questions autonomously.
Resolving even one Erdős conjecture is noteworthy — these problems have stumped mathematicians for decades.
A Taxonomy for AI Autonomy in Science
DeepMind also proposed a classification system for AI contributions to mathematics, reminiscent of autonomous vehicle levels:
- Level 0: Primarily human work, negligible novelty (competition-level solving)
- Level 1: Human-AI collaboration, minor novelty
- Level 2: Essentially autonomous, publishable research
Aletheia's Feng26 paper sits at Level 2 — a landmark for AI in scientific research.
Why This Matters
Aletheia represents a shift from AI as a tool (answering questions, generating code) to AI as a research collaborator that can identify open problems, propose solutions, verify its own work, and iterate toward publishable results.
The implications extend well beyond mathematics. The generate-verify-revise pattern is domain-agnostic. If it works for proving theorems, variants could work for drug discovery, materials science, or any field where hypotheses need rigorous verification.
We're watching AI move from passing tests to doing the actual work the tests were designed to measure.