AI Models Run Nonstop for 19 Days in New MirrorCode Benchmark
The landscape of autonomous software engineering is shifting from simple code snippets to massive, multi-day programming marathons. A new benchmark from Epoch AI and METR, called MirrorCode, reveals that AI models can now tackle complex reimplementation tasks that previously required weeks of human labor.
Challenging AI with MirrorCode
MirrorCode departs from traditional software engineering benchmarks, which typically cap inference costs at $1 to $10 per task. Instead, it requires AI models to reimplement complete, complex programs from scratch, ranging from Unix utilities and cryptography to bioinformatics and data serialization, without access to the original source code. To check functional equivalence, every AI-generated solution must pass hidden end-to-end tests that the model never sees during development.
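The article does not describe how the hidden tests are run, but the general idea of end-to-end equivalence checking can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than MirrorCode's actual harness: the `Case` type, the `pass_fraction` function, and the choice to compare only exit code and standard output are all assumptions.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One hidden test: command-line arguments plus text fed to stdin."""
    args: tuple = ()
    stdin: str = ""


def run(cmd, case):
    """Run a program on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        [*cmd, *case.args],
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_fraction(reference_cmd, candidate_cmd, cases):
    """Fraction of cases where the candidate matches the reference exactly."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(run(reference_cmd, c) == run(candidate_cmd, c) for c in cases)
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in "programs": tiny Python one-liners instead of real binaries.
    reference = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    candidate = [sys.executable, "-c", "import sys; print(sys.stdin.read().lower(), end='')"]
    cases = [Case(stdin="abc"), Case(stdin="123")]
    print(pass_fraction(reference, candidate, cases))
```

In this sketch a task would count as solved only when the fraction is 1.0. A lower value such as 0.9 would correspond to a partial reimplementation that still passes most tests.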
The scale of these tasks is unprecedented. One specific task in the benchmark required an AI model to work continuously for 19 days without any human intervention, resulting in an inference cost of $2,600 for a single run.
Claude Opus 4.7 Leads the Race
The benchmark results highlight a clear hierarchy in current frontier models. Claude Opus 4.7 emerged as the leader with a 56 percent solve rate, significantly outperforming GPT-5.5, which achieved 44 percent, and Gemini 3.1 Pro Preview, which sat at 32 percent.
A standout success involved the bioinformatics toolkit gotree. This program consists of approximately 16,000 lines of Go code and features over 40 distinct commands. While a human engineer would typically need between 2 and 17 weeks to complete such a task, Claude Opus 4.7 reimplemented it in just 14 hours at a cost of $251. Even when models fall short of a complete reimplementation, they pass over 90 percent of the functional tests.
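To put the gotree figures in perspective, the arithmetic can be worked through as a rough sketch. It assumes a 40-hour working week, which the article does not state, and it compares human working hours with the model's wall-clock hours, so the result is only an order-of-magnitude indication.

```python
HOURS_PER_WEEK = 40  # assumption: a standard working week, not given in the article


def speedup_range(human_weeks_low, human_weeks_high, model_hours,
                  hours_per_week=HOURS_PER_WEEK):
    """Return (low, high) ratio of human working hours to model hours."""
    low = human_weeks_low * hours_per_week / model_hours
    high = human_weeks_high * hours_per_week / model_hours
    return low, high


# Figures from the article: 2 to 17 human weeks versus 14 model hours.
low, high = speedup_range(2, 17, 14)
print(f"{low:.1f}x to {high:.1f}x")  # prints "5.7x to 48.6x"
```

Under that assumption, the 14-hour run is somewhere between roughly 6 and 49 times faster than the human estimate.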
The Complexity Gap and Memorization Risks
Despite these leaps, the MirrorCode results reveal a distinct "complexity ceiling." While all tested models reliably handle small programs like uuid or parseqsv, no model fully solves any task in the "large" category. Frontier models still struggle with the largest and most interconnected codebases.
Epoch AI also addressed a critical concern in LLM evaluation: data contamination. Because the benchmark uses open-source programs, models may have memorized the original code during training. Initial findings suggest that performance is not driven purely by memorization, but the researchers acknowledge they cannot entirely rule out its contribution to the current solve rates.
Why This Matters for the AI Industry
MirrorCode signals a transition from "AI as a Copilot" to "AI as an Autonomous Agent." By showing that a model can work on a single task for as long as 19 days and reimplement programs with thousands of lines of code, the benchmark suggests the industry is moving closer to agents capable of managing entire software lifecycles. Inference costs are shifting at the same time: GPT-5.5 costs three times more than its predecessor, while Claude Opus 4.7 has become three times more efficient. The economic viability of autonomous engineering is therefore set to become the next great frontier.
Key Takeaways
- New Scale of Reasoning: MirrorCode pushes AI limits by allowing massive inference budgets, with single tasks costing up to $2,600 and running for 19 days.
- Claude Leads Performance: Claude Opus 4.7 is currently the benchmark leader with a 56% solve rate, and it fully reimplemented gotree, a Go toolkit of roughly 16,000 lines, in 14 hours.
- Complexity Barriers Remain: While small-scale tasks are being solved reliably, no existing model can yet fully crack the most complex, large-scale programming tasks.
