Why AI Coding Agents Fail: The Critical Gap Between Files and Lines

📅2 hours ago⏱3 min read


While AI coding agents are increasingly capable of resolving software bugs, a new study reveals they suffer from a significant "localization" problem. They can navigate to the correct file within a massive codebase, but they frequently fail to identify the specific lines of code required to implement a fix.

Introducing SWE-Explore: Moving Beyond Repair Rates

Historically, the effectiveness of AI coding agents has been measured by a single, binary metric: did the agent fix the bug or not? This approach ignores the "why" behind a failure. A failed repair could mean the agent wrote a bad patch, or it could mean the agent never even looked at the relevant logic.

To address this blind spot, an international research team, including scientists from Shanghai Jiao Tong University, developed SWE-Explore. Unlike traditional benchmarks, SWE-Explore isolates the upstream search phase. It evaluates an agent's ability to take a bug description and return a ranked list of the specific code sections that are actually relevant to the problem. The dataset is extensive, drawing from 848 tasks across 203 open-source projects and ten programming languages, with Python being the most prominent (547 tasks).

The Precision Gap: File Success vs. Line Failure

The study’s most striking finding is the massive disparity between file-level and line-level accuracy. When tested against general-purpose agents like Claude Code, Codex, and OpenHands, the results were telling:

File-level accuracy: Agents perform well, successfully identifying the correct source files and ranking them highly.
Line-level accuracy: Performance collapses. General coding agents covered only 14% to 19% of the actual lines of code that mattered for a fix.
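To make the difference between the two granularities concrete, here is a minimal sketch of how file-level hits and line-level coverage could be computed for a single task. The region format (path, first line, last line), the function names, and the sample paths are assumptions made for illustration; the benchmark's actual scoring may be defined differently.

```python
from typing import List, Set, Tuple

# Hypothetical region format: (file path, first line, last line),
# with 1-indexed, inclusive line numbers.
Region = Tuple[str, int, int]


def to_line_set(regions: List[Region]) -> Set[Tuple[str, int]]:
    """Expand regions into a set of (path, line_number) pairs."""
    lines: Set[Tuple[str, int]] = set()
    for path, start, end in regions:
        for line_no in range(start, end + 1):
            lines.add((path, line_no))
    return lines


def file_hit_rate(predicted: List[Region], gold: List[Region]) -> float:
    """Fraction of gold files that appear anywhere in the prediction."""
    gold_files = {path for path, _, _ in gold}
    predicted_files = {path for path, _, _ in predicted}
    if not gold_files:
        return 0.0
    return len(gold_files & predicted_files) / len(gold_files)


def line_coverage(predicted: List[Region], gold: List[Region]) -> float:
    """Fraction of gold lines that fall inside at least one predicted region."""
    gold_lines = to_line_set(gold)
    if not gold_lines:
        return 0.0
    return len(gold_lines & to_line_set(predicted)) / len(gold_lines)


# Made-up example: the fix needs lines 40-59 of one file (20 lines).
gold = [("pkg/parser.py", 40, 59)]
# The agent opened the right file but mostly read the wrong parts of it.
predicted = [("pkg/parser.py", 1, 10), ("pkg/parser.py", 55, 64)]

print(file_hit_rate(predicted, gold))   # 1.0
print(line_coverage(predicted, gold))   # 0.25
```

In this made-up case the agent scores perfectly at the file level while covering only a quarter of the lines that matter, which is the shape of the gap the study describes.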

Interestingly, simply upgrading the underlying Large Language Model (LLM) does not solve this. Whether the agent runs on models from OpenAI, Anthropic, Google, Moonshot, or Zhipu, the same pattern appears: high file hit rates but low line coverage. The researchers noted that specialized systems like CoSIL outperformed general agents by treating code as a network of interconnected building blocks, suggesting that architectural changes matter more than raw model power.
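The idea of treating code as a network can be illustrated with a small graph walk. This is only a sketch of the general approach, not CoSIL's actual algorithm, which this article does not describe. The graph contents, the function name expand_from_seeds, and the hop limit are all hypothetical.

```python
from collections import deque
from typing import Dict, List


def expand_from_seeds(
    graph: Dict[str, List[str]], seeds: List[str], max_hops: int
) -> List[str]:
    """Breadth-first walk over a dependency graph.

    Returns the seeds plus every code unit reachable within `max_hops`
    edges, in the order they were discovered.
    """
    order = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(order)
    queue = deque((seed, 0) for seed in order)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order


# Hypothetical call graph: each function maps to the functions it calls.
call_graph = {
    "parse_config": ["read_file", "validate"],
    "validate": ["check_schema"],
    "check_schema": ["load_schema"],
}

# Start from the function named in a bug report and follow two hops.
print(expand_from_seeds(call_graph, ["parse_config"], max_hops=2))
# ['parse_config', 'read_file', 'validate', 'check_schema']
```

A walk like this surfaces functions in the same or other files that a file-by-file search might never open, which is one plausible reason graph-aware systems localize more precisely.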

The Threshold Effect: Why "Reading More" Matters

Through controlled ablation experiments, the researchers found a "threshold effect" in how much context an agent needs. By varying the share of core code provided to the model (from 0% to 100%), they found that repair success does not improve linearly.

For easier tasks, there is a clear tipping point: if an agent sees less than 50% of the necessary core regions, the repair success rate stays near zero. A significant jump in successful repairs only occurs once the agent has access to between 50% and 75% of the required context. Crucially, the study found that providing irrelevant "noise" code does not hurt performance as much as missing the critical lines. The takeaway for developers is clear: in the era of AI agents, it is better to provide more context than to risk filtering out the essential details.
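An ablation of this kind can be sketched as follows: reveal a chosen fraction of the core regions, optionally pad the context with irrelevant regions, and then measure repair success at each fraction. The helper names, the region format, and the choice to sample whole regions at random are assumptions for illustration; the study's exact procedure may differ.

```python
import random
from typing import List, Tuple

# Hypothetical region format: (file path, first line, last line).
Region = Tuple[str, int, int]


def reveal_fraction(core: List[Region], fraction: float, seed: int = 0) -> List[Region]:
    """Return a random subset holding roughly `fraction` of the core regions."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    keep = round(len(core) * fraction)
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    return sorted(rng.sample(core, keep))


def build_context(
    core: List[Region], noise: List[Region], fraction: float, seed: int = 0
) -> List[Region]:
    """Combine the revealed core regions with irrelevant 'noise' regions."""
    return reveal_fraction(core, fraction, seed) + list(noise)


# Made-up task with four core regions and one irrelevant region.
core_regions = [
    ("app/models.py", 10, 30),
    ("app/models.py", 80, 95),
    ("app/views.py", 5, 20),
    ("app/utils.py", 40, 60),
]
noise_regions = [("docs/conf.py", 1, 50)]

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(core_regions, noise_regions, fraction)
    revealed = len(context) - len(noise_regions)
    print(f"{fraction:.0%} of core -> {revealed} core region(s) shown")
```

Each resulting context would then be handed to the repair agent, and plotting success rate against the fraction is what would expose a jump rather than a straight line.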

Key Takeaways

Localization is the bottleneck: AI agents are proficient at finding the right file but struggle significantly to pinpoint the specific lines of code required for a fix.
Model scaling isn't a silver bullet: Upgrading to more powerful LLMs does not fix the line-level accuracy gap; specialized architectural approaches like CoSIL are more effective.
The 50% Context Rule: Repair success follows a threshold pattern. On easier tasks it stays near zero until the agent sees at least 50% of the relevant core code, and the significant jump comes between 50% and 75%.

