Eval Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Translated for your language. Read the original.

AI-assisted draft.

Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

I changed a prompt. The next run looked better. Did the change help, or did I get lucky?

For a long time, my answer was "I think so." I would tweak a command, run the pipeline, watch it succeed, and ship it. This is vibes-based engineering. Almost everyone building agents does this because the alternative feels hard.

But coding agents are non-deterministic. You can run the same task twice and get two different results. A single good run tells you nothing. You cannot tell if your change worked or if the dice just rolled well.

I solved this by using machine learning discipline. I built an evaluation framework to wrap my entire system.

Here is how the framework works:

• Target: A frozen codebase. This stays the same so scores remain comparable. • Task: A specific benchmark item with a prompt and an oracle. • Oracle: A deterministic check. These are shell commands that must pass. • Variant: The specific change you are testing, like a new planner. • Trial: A single run. I run every task multiple times to account for randomness.

I use two types of scoring to catch different failures:

Code Graders (Deterministic): These check test pass rates, cost, time, and file changes.
LLM Judge (Probabilistic): A separate, fixed model scores spec quality and implementation fidelity.

The code graders tell you if the code runs. The judge tells you if the code is good. You need both.

I also stopped using averages. Means lie about agents. If a task succeeds 2 out of 3 times, it looks okay. But it is not reliable. Instead, I use two metrics:

pass@k: Did the agent succeed at least once? (Capability)
pass^k: Did the agent succeed every single time? (Reliability)

A jump in pass^k is the real win. It means you made the agent consistent, not just lucky.

To keep the system sharp, I add hard tasks that require deep understanding. When an agent fails on a real bug, I turn that failure into a permanent task. This creates a closed loop. The benchmark gets harder as the agent gets better.

This infrastructure is a lot of work, but it is the highest leverage thing I built. It turned "I think this is better" into "this is 20% more reliable at a lower cost."

Coding agents are easy to demo but hard to trust. If you want to move past demos, you must decide to measure.

Source: https://dev.to/rickjms/eval-driven-agent-development-how-i-stopped-tuning-prompts-on-vibes-1189

Optional learning community: https://t.me/GyaanSetuAi

Eval Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Continue reading

𝗟𝗼𝗰𝗮𝗹 𝗖𝗼𝗱𝗶𝗻𝗴 𝗔𝗴𝗲𝗻𝘁𝘀 𝗔𝗿𝗲 𝗔𝗻 𝗘𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁 𝗣𝗿𝗼𝗯𝗹𝗲𝗺

My AI Coding Agent Kept Breaking — What I Changed

ارزیابی‌های شما هم بی‌ثبات هستند: از اعتماد به نرخ موفقیتی که نمی‌توانید بازتولید کنید، دست بردارید

The Agentic Loop: A Practical Field Guide

عامل‌های کدنویسی هوش مصنوعی بیش از پرامپت‌ها به تست نیاز دارند