AI Coding Agents Miss Most Critical Code Lines
AI coding agents find the right files but fail at the details.
A new benchmark called SWE-Explore reveals a massive gap in AI coding. Researchers tested 848 bug-fixing tasks from 203 open-source projects. The results show a pattern that model size cannot fix.
The Findings:
• AI agents find the correct source file easily.
• These agents cover only 14% to 19% of the critical code lines.
• They miss 81% to 86% of the lines needed to fix the bug.
• This failure happens across all major agents, including Claude Code and Codex.
The problem is structural. High-performing models from OpenAI, Anthropic, and Google all show the same weakness. They can locate a file, but they cannot pinpoint the exact lines that require changes.
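To make the gap between finding a file and finding the lines concrete, here is a minimal sketch in Python. It assumes line coverage is a simple set overlap between the lines an agent viewed and the lines that matter for the fix. The post does not give SWE-Explore's exact metric, so treat this definition, the function names, and the sample path and line numbers as hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: file path -> set of 1-indexed line numbers.
LineMap = Dict[str, Set[int]]


def file_hit(explored: LineMap, critical: LineMap) -> bool:
    """True if the agent opened every file that contains critical lines."""
    return all(path in explored for path in critical)


def line_coverage(explored: LineMap, critical: LineMap) -> float:
    """Fraction of critical lines the agent actually viewed (assumed metric)."""
    total = sum(len(lines) for lines in critical.values())
    if total == 0:
        return 0.0
    seen = sum(
        len(lines & explored.get(path, set()))
        for path, lines in critical.items()
    )
    return seen / total


# Made-up example: 20 critical lines, and the agent viewed only 3 of them.
critical = {"src/parser.py": set(range(120, 140))}
explored = {"src/parser.py": {10, 11, 12, 120, 121, 122}}

print(file_hit(explored, critical))       # True: the right file was opened
print(line_coverage(explored, critical))  # 0.15: 3 of 20 critical lines seen
```

In this made-up case the agent passes a file-level check yet covers only 15% of the critical lines, which is the kind of result the findings describe.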
Why this matters for you:
Current evaluations focus on whether an agent fixes a bug. SWE-Explore shows that many successful fixes might rely on luck or broad context. If an agent does not see the exact lines causing a problem, it does not truly understand the code.
A model upgrade is not enough. To solve this, developers need new architectures that improve line-level accuracy. An agent that scores above 30% on this benchmark would represent a real breakthrough.
Source: https://the-decoder.com
Optional learning community: https://t.me/GyaanSetuAi