Why AI Coding Agents Fail: The Critical Gap Between Files and Lines

While AI coding agents are increasingly capable of resolving software bugs, a new study reveals they suffer from a significant "localization" problem. They can navigate to the correct file within a massive codebase, but they frequently fail to identify the specific lines of code required to implement a fix.

Introducing SWE-Explore: Moving Beyond Repair Rates

Historically, the effectiveness of AI coding agents has been measured by a single, binary metric: did the agent fix the bug or not? This approach ignores the "why" behind a failure. A failed repair could mean the agent wrote a bad patch, or it could mean the agent never even looked at the relevant logic.

To address this blind spot, an international research team, including scientists from Shanghai Jiao Tong University, developed SWE-Explore. Unlike traditional benchmarks, SWE-Explore isolates the upstream search phase. It evaluates an agent's ability to take a bug description and return a ranked list of the specific code sections that are actually relevant to the problem. The dataset is extensive, drawing on 848 tasks across 203 open-source projects and ten programming languages. Python is the most heavily represented, with 547 tasks.

The Precision Gap: File Success vs. Line Failure

The study’s most striking finding is the wide gap between file-level and line-level accuracy. General-purpose agents such as Claude Code, Codex, and OpenHands usually located the correct file but covered only a small share of the lines that actually mattered.

Simply upgrading the underlying Large Language Model (LLM) does not close this gap. Whether the agent runs on models from OpenAI, Anthropic, Google, Moonshot, or Zhipu, the pattern is the same: high file hit rates but very low line coverage. The researchers noted that specialized systems like CoSIL outperformed general agents by treating code as a network of interconnected building blocks, which suggests that architectural changes matter more than raw model power.
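To make the difference between the two levels concrete, here is a minimal Python sketch of how a file hit and line coverage could be computed for a single task. The metric definitions, the Region type, and the example paths are assumptions made for illustration; the article does not spell out SWE-Explore's exact formulas, so this should not be read as the benchmark's own scoring code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical type: a contiguous span of lines in one file (1-indexed, inclusive)."""
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def file_hit(predicted: list[Region], gold: list[Region]) -> bool:
    """True if any predicted region is in a file that contains relevant lines."""
    gold_files = {r.path for r in gold}
    return any(r.path in gold_files for r in predicted)


def line_coverage(predicted: list[Region], gold: list[Region]) -> float:
    """Fraction of relevant lines that fall inside at least one predicted region."""
    gold_lines = set().union(*(r.lines() for r in gold))
    if not gold_lines:
        return 0.0
    predicted_lines = set().union(*(r.lines() for r in predicted))
    return len(gold_lines & predicted_lines) / len(gold_lines)


# Assumed example: the relevant logic is 20 lines long (lines 100-119).
gold = [Region("pkg/parser.py", 100, 119)]
# The agent opens the right file but mostly reads the wrong part of it.
predicted = [Region("pkg/parser.py", 10, 30), Region("pkg/parser.py", 115, 124)]

hit = file_hit(predicted, gold)
coverage = line_coverage(predicted, gold)
print(hit, coverage)
```

In this assumed example the agent scores a file hit, yet only 5 of the 20 relevant lines appear in its output, so line coverage is 25%. A file-level metric alone would count this search as a success.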

The Threshold Effect: Why "Reading More" Matters

Through controlled ablation experiments, the researchers discovered a "threshold effect" in how much context an agent needs. By varying the share of core code provided to the model (from 0% to 100%), they found that repair success does not improve linearly.

For easier tasks, there is a clear tipping point: if an agent sees less than 50% of the necessary core regions, the repair success rate stays near zero. A significant jump in successful repairs only occurs once the agent has access to between 50% and 75% of the required context. Crucially, the study found that providing irrelevant "noise" code does not hurt performance as much as missing the critical lines. The takeaway for developers is clear: in the era of AI agents, it is better to provide more context than to risk filtering out the essential details.
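An ablation of this kind can be sketched in a few lines of Python. The function below builds a context that reveals a chosen fraction of the core snippets and mixes in irrelevant ones. The function name, the 25% steps, the snippet strings, and the way noise is added are all assumptions for illustration, not the study's actual protocol.

```python
import random


def build_context(core: list[str], noise: list[str], core_fraction: float,
                  seed: int = 0) -> list[str]:
    """Return a shuffled context with a fraction of the core snippets plus all noise."""
    if not 0.0 <= core_fraction <= 1.0:
        raise ValueError("core_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps each ablation run reproducible
    keep = round(len(core) * core_fraction)
    context = rng.sample(core, keep) + list(noise)
    rng.shuffle(context)
    return context


# Hypothetical snippets standing in for real code regions.
core = ["core_a", "core_b", "core_c", "core_d"]
noise = ["noise_x", "noise_y"]

# Assumed ablation levels: 0%, 25%, 50%, 75%, 100% of the core regions.
core_counts = {}
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(core, noise, fraction)
    core_counts[fraction] = sum(1 for snippet in context if snippet in core)

print(core_counts)
```

Each resulting context would then be handed to the model, and repair success recorded per level. Under the threshold effect described above, the success curve would stay flat at the low levels and jump somewhere between the 50% and 75% settings.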

Key Takeaways