AI Coding Agents Miss Most Critical Code Lines

AI coding agents find the right files but fail at the details.

A new benchmark called SWE-Explore reveals a large gap between finding the right file and finding the right lines. Researchers tested agents on 848 bug-fixing tasks drawn from 203 open-source projects. The results show a pattern that model size cannot fix.

The Findings:

• AI agents find the correct source file easily.
• These agents cover only 14% to 19% of the critical code lines.
• They miss 81% to 86% of the lines needed to fix the bug.
• This failure happens across all major agents, including Claude Code and Codex.
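To make the percentages concrete, here is a minimal sketch of how a line-level coverage score could be computed. It assumes coverage means the share of critical lines that the agent actually viewed; the benchmark's exact formula is not given here. The function name and the line numbers are hypothetical and chosen only for illustration.

```python
def line_coverage(critical: set[int], viewed: set[int]) -> float:
    """Share of critical lines the agent viewed (assumed definition)."""
    if not critical:
        raise ValueError("critical line set must not be empty")
    return len(critical & viewed) / len(critical)


# Hypothetical task: 20 critical lines in one file (lines 100 to 119).
critical_lines = set(range(100, 120))
# Hypothetical agent trace: it viewed lines 90 to 102 of the same file.
viewed_lines = set(range(90, 103))

coverage = line_coverage(critical_lines, viewed_lines)
print(f"covered {coverage:.0%}, missed {1 - coverage:.0%}")
# prints "covered 15%, missed 85%"
```

In this made-up case the agent opened the right file and still saw only 3 of the 20 lines that matter, which is the kind of result the findings describe.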

The problem is structural. High-performing models from OpenAI, Anthropic, and Google all show the same weakness. They can locate a file, but they cannot pinpoint the exact lines that require changes.

Why this matters for you:

Current evaluations focus on whether an agent fixes a bug. SWE-Explore suggests that many successful fixes might rely on luck or broad context. If an agent never sees the exact lines causing a problem, a passing fix says little about whether it understood the code.

A model upgrade is not enough. To solve this, developers need new architectures that improve line-level accuracy. An agent that scores above 30% on this benchmark would represent a real breakthrough.

Source: https://the-decoder.com

Optional learning community: https://t.me/GyaanSetuAi