Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes
I changed a prompt. The next run looked better. Did the change help, or did I get lucky?
For a long time, my answer was "I think so." I would tweak a command, run the pipeline, watch it succeed, and ship it. This is vibes-based engineering. Almost everyone building agents does this because the alternative feels hard.
But coding agents are non-deterministic. You can run the same task twice and get two different results. A single good run tells you nothing. You cannot tell if your change worked or if the dice just rolled well.
I solved this by using machine learning discipline. I built an evaluation framework to wrap my entire system.
Here is how the framework works:
• Target: A frozen codebase. This stays the same so scores remain comparable. • Task: A specific benchmark item with a prompt and an oracle. • Oracle: A deterministic check. These are shell commands that must pass. • Variant: The specific change you are testing, like a new planner. • Trial: A single run. I run every task multiple times to account for randomness.
I use two types of scoring to catch different failures:
- Code Graders (Deterministic): These check test pass rates, cost, time, and file changes.
- LLM Judge (Probabilistic): A separate, fixed model scores spec quality and implementation fidelity.
The code graders tell you if the code runs. The judge tells you if the code is good. You need both.
I also stopped using averages. Means lie about agents. If a task succeeds 2 out of 3 times, it looks okay. But it is not reliable. Instead, I use two metrics:
- pass@k: Did the agent succeed at least once? (Capability)
- pass^k: Did the agent succeed every single time? (Reliability)
A jump in pass^k is the real win. It means you made the agent consistent, not just lucky.
To keep the system sharp, I add hard tasks that require deep understanding. When an agent fails on a real bug, I turn that failure into a permanent task. This creates a closed loop. The benchmark gets harder as the agent gets better.
This infrastructure is a lot of work, but it is the highest leverage thing I built. It turned "I think this is better" into "this is 20% more reliable at a lower cost."
Coding agents are easy to demo but hard to trust. If you want to move past demos, you must decide to measure.
Source: https://dev.to/rickjms/eval-driven-agent-development-how-i-stopped-tuning-prompts-on-vibes-1189
Optional learning community: https://t.me/GyaanSetuAi
