𝗠𝗔-𝗣𝗿𝗼𝗼𝗳𝗕𝗲𝗻𝗰𝗵: 𝗚𝗣𝗧-𝟱.𝟱 𝗛𝗶𝘁𝘀 𝟭𝟲% 𝗼𝗻 𝗠𝗮𝘁𝗵 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀
Current AI models struggle with advanced math.
A new benchmark called MA-ProofBench tests theorem proving in mathematical analysis. The results show a massive gap in reasoning skills.
GPT-5.5 led the tests with these scores:
- 16% on undergraduate problems (Level I).
- 5% on PhD-level problems (Level II).
Most other models scored near 0% on PhD-level problems.
The benchmark includes 200 theorems across 6 topics. These topics include measure theory and complex analysis.
Researchers found two main reasons why models fail:
- Mathlib hallucinations: Models write Lean code that looks right but calls Mathlib lemmas or definitions that do not exist.
- Incomplete proofs: Models start a proof correctly but fail to reach the end.
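Here is a small Lean 4 sketch of both failure modes. It is not taken from the benchmark. It uses toy inequalities and assumes a recent Mathlib. The name `Real.sq_nonneg_of_real` is made up for illustration, while `sq_nonneg` and `nlinarith` are real Mathlib items.

```lean
import Mathlib

-- Failure mode 1 (Mathlib hallucination), kept as a comment so the file compiles.
-- `Real.sq_nonneg_of_real` is a made-up name used only for illustration;
-- Lean would reject it as an unknown identifier.
--
--   example (x : ℝ) : 0 ≤ x ^ 2 := Real.sq_nonneg_of_real x

-- Working version: `sq_nonneg` is the actual Mathlib lemma.
example (x : ℝ) : 0 ≤ x ^ 2 := sq_nonneg x

-- Failure mode 2 (incomplete proof): the first step is right,
-- but the goal is never closed. `sorry` marks the gap,
-- so this does not count as a proof.
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  have h : 0 ≤ (a - b) ^ 2 := sq_nonneg (a - b)
  sorry

-- Completed version: the same first step, then a tactic that finishes the goal.
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  have h : 0 ≤ (a - b) ^ 2 := sq_nonneg (a - b)
  nlinarith [h]
```

The hallucinated name reads like a plausible Mathlib lemma, which is why this kind of error is easy to miss until the Lean checker runs.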
There is also a gap between informal and formal reasoning. Models perform better when they write proofs in natural language than when they must produce strict Lean code.
The low scores on PhD-level math point to a ceiling for current AI. Today's frontier models lack the depth for rigorous formal proofs in analysis.
This benchmark will track whether future models from OpenAI or Anthropic can cross the 20% mark on harder problems.
Source: https://arxiv.org
Optional learning community: https://t.me/GyaanSetuAi