Why Standard AI Benchmarks Systematically Underestimate Agent Capabilities

Current AI evaluation methods can understate what frontier models are able to do, because they often mistake a lack of compute budget for a lack of capability. Research from the UK’s AI Security Institute (AISI) indicates that AI agent performance is not a fixed score but a scaling curve that rises sharply with increased test-time compute.

The Compute-Capability Curve

The central finding from the AISI research is that an AI agent's success rate depends heavily on its "test-time compute", meaning the processing power and tokens the agent is allowed to use while working on a task. When researchers apply fixed budget caps to evaluations, they measure a lower bound on a model's capability rather than its ceiling.

This phenomenon is visible across multiple high-stakes domains. In software engineering tasks using benchmarks like TerminalBench 2.0 and SWE-Bench Pro, success rates surged by approximately 25% when the token budget was increased from one million to ten million. Similarly, mathematical and academic tasks in "Humanity's Last Exam" saw a 22% gain when the budget reached five million tokens.
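One way to make the budget dependence concrete is to record, for each task, the fewest tokens any run needed to solve it, and then read off the success rate at several caps. The Python sketch below does this under stated assumptions: the task names and token counts are invented for illustration and are not AISI data, and `success_rate_at` and `success_curve` are hypothetical helper names.

```python
from typing import Optional

# Hypothetical records: tokens used by the cheapest successful run on each task,
# or None if no run solved it at any budget tried. These are NOT AISI data.
MIN_TOKENS_TO_SOLVE: dict[str, Optional[int]] = {
    "task_a": 200_000,
    "task_b": 900_000,
    "task_c": 3_500_000,
    "task_d": 8_000_000,
    "task_e": None,
}


def success_rate_at(budget: int, records: dict[str, Optional[int]]) -> float:
    """Fraction of tasks solvable when every run is capped at `budget` tokens."""
    solved = sum(1 for t in records.values() if t is not None and t <= budget)
    return solved / len(records)


def success_curve(budgets: list[int], records: dict[str, Optional[int]]) -> dict[int, float]:
    """Success rate at each budget cap: a curve rather than a single score."""
    return {b: success_rate_at(b, records) for b in budgets}


if __name__ == "__main__":
    curve = success_curve([1_000_000, 10_000_000], MIN_TOKENS_TO_SOLVE)
    for budget, rate in curve.items():
        print(f"{budget:>12,} tokens: {rate:.0%} solved")
```

With these made-up numbers, the same model looks twice as capable at the larger cap, which is the kind of gap a single fixed-budget score would hide.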

The Power Law of Human vs. AI Task Time

The study established a direct correlation between the time a human expert requires for a task and the token consumption required by an AI agent. This relationship follows a power law: a task that takes a human one minute costs an agent thousands of tokens, while a one-hour task costs millions.
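A power law of the form tokens = a × minutes^b becomes a straight line on log-log axes, so it can be estimated with ordinary least squares on the logarithms. The sketch below shows that fit in plain Python. The sample points are synthetic, chosen only to match the orders of magnitude quoted above (thousands of tokens at one minute, millions at one hour); the coefficient and exponent they imply are not figures from the study.

```python
import math

# Synthetic (human_minutes, agent_tokens) pairs, chosen only to match the
# orders of magnitude in the text. They are NOT measurements from the study.
SAMPLES = [(1, 4_000), (4, 32_000), (16, 256_000), (64, 2_048_000)]


def fit_power_law(samples):
    """Least-squares fit of tokens = a * minutes**b in log-log space.

    Returns the pair (a, b).
    """
    xs = [math.log(minutes) for minutes, _ in samples]
    ys = [math.log(tokens) for _, tokens in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope on log-log axes = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept on log-log axes = log(a)
    return a, b


def predicted_tokens(minutes, a, b):
    """Token cost the fitted power law assigns to a task of this human length."""
    return a * minutes ** b


if __name__ == "__main__":
    a, b = fit_power_law(SAMPLES)
    print(f"a = {a:.0f}, b = {b:.2f}")
```

Because the synthetic points were built from a = 4,000 and b = 1.5, the fit recovers those values. On real data, an exponent above 1 would mean token cost grows faster than human task time.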

This creates a massive blind spot in current testing. For example, the AISI cybersecurity task "The Last Ones" takes a human expert roughly 20 hours. No model tested by the institute could solve it with fewer than 30 million tokens. Standard, lower-budget evaluations therefore exclude the most complex and critical tasks from measurement altogether.

Accelerating Progress and the Three Axes of Improvement

The AISI notes that the "time horizon" of frontier models, meaning the length of task (in human expert time) they can handle, is expanding much faster than previously thought. While earlier estimates suggested the time horizon for cyber tasks doubled every 4.7 months at a fixed 2.5 million token budget, the doubling time shortens significantly at higher budgets. At 50 million tokens, it falls to between 40 and 50 days.
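To see how much the doubling times quoted above differ, it helps to convert each into a growth factor over a common period. The sketch below does that arithmetic for one year. It assumes the exponential trend holds for the full period, which is an extrapolation and not a claim from the research, and it uses an average month length to convert 4.7 months into days.

```python
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # average month length, used only for conversion


def growth_factor(doubling_days: float, period_days: float = DAYS_PER_YEAR) -> float:
    """How many times the time horizon multiplies over `period_days`,
    assuming a constant doubling time (an extrapolation)."""
    return 2 ** (period_days / doubling_days)


slow = growth_factor(4.7 * DAYS_PER_MONTH)  # 4.7-month doubling at 2.5M tokens
fast_low = growth_factor(50)                # 50-day doubling at 50M tokens
fast_high = growth_factor(40)               # 40-day doubling at 50M tokens

if __name__ == "__main__":
    print(f"4.7-month doubling: x{slow:.1f} per year")
    print(f"50-day doubling:    x{fast_low:.0f} per year")
    print(f"40-day doubling:    x{fast_high:.0f} per year")
```

Under that assumption, a 4.7-month doubling time compounds to roughly a sixfold increase per year, while a 40 to 50 day doubling time compounds to well over a hundredfold.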

Newer models (such as the GPT and Claude series tested) show improvement across three specific dimensions:

  • Reach: The ability to tackle increasingly harder tasks.
  • Reliability: The ability to solve the same task more consistently.
  • Efficiency: The ability to solve tasks using fewer tokens.
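The three dimensions above can each be turned into a number computed from run logs. The sketch below shows one possible set of definitions. These definitions, the `Run` record, and the sample data are all assumptions made for illustration; they are not the metrics the AISI used.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    task: str
    human_hours: float  # expert time for the task
    solved: bool
    tokens: int         # tokens consumed by this run


def reach(runs):
    """Longest task (in human hours) solved at least once; 0.0 if none."""
    return max((r.human_hours for r in runs if r.solved), default=0.0)


def reliability(runs):
    """Mean per-task solve rate, over tasks solved at least once."""
    by_task = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r.solved)
    rates = [sum(v) / len(v) for v in by_task.values() if any(v)]
    return sum(rates) / len(rates) if rates else 0.0


def efficiency(runs):
    """Median tokens per successful run (lower is better); None if no successes."""
    used = [r.tokens for r in runs if r.solved]
    return median(used) if used else None


# Hypothetical run log, not real evaluation data.
RUNS = [
    Run("short", 0.1, True, 50_000),
    Run("short", 0.1, True, 70_000),
    Run("medium", 1.0, True, 2_000_000),
    Run("medium", 1.0, False, 5_000_000),
    Run("long", 20.0, False, 10_000_000),
]
```

Keeping the three measures separate matters because a model can improve on one without moving the others, for example solving the same tasks with fewer tokens without reaching any longer ones.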

Implications for AI Safety and Deployment

This research shifts the paradigm of AI evaluation from "fixed scores" to "compute-aware curves." For developers and founders, this means that a model's utility is a function not just of its training but also of how much inference compute is allocated during deployment.

As the cost per token continues to fall, capabilities that previously seemed economically infeasible will become standard. For AI safety and security, this means that risks related to autonomous agents, such as complex cyberattacks, may be significantly underestimated if regulators and companies rely on traditional, low-budget benchmarks.

Key Takeaways

  • Fixed-budget benchmarks are misleading: A fixed token budget captures only a lower bound on a model's performance, systematically underestimating the ceiling of what AI agents can achieve.
  • Compute scales capability: Success rates in software engineering and mathematics jump significantly as the test-time compute budget increases.
  • The doubling time is shrinking: At higher compute budgets, the time horizon of frontier models doubles far faster than previously estimated (every 40 to 50 days at 50 million tokens, against every 4.7 months at 2.5 million).