Your Evals Are Flaky Too: Stop Trusting A Pass Rate You Can't Reproduce
Most people know AI agents are non-deterministic. You send the same prompt, but you get different outputs.
We accepted this. We started using LLMs as judges to grade these agents.
But we made a massive mistake. We assumed our judges were deterministic. They are not.
Your eval suite is a random system grading another random system. If you do not measure how much your grader wobbles, you do not have a quality gate. You have a coin flip.
I saw this happen with a support agent. The dashboard stayed green for weeks. Then, customer complaints spiked. I ran the same eval on 200 old responses. 14 of them changed their verdict. The agent did not change. The judge changed its mind.
A flaky gate is worse than no gate. It gives you false confidence.
There are three reasons your evals fail:
- The judge model: Every LLM judge has variance. Even at temperature 0, providers do not guarantee the same result. A silent model update can ruin your baseline overnight.
- The harness: If your context or tool outputs change between runs, the judge sees a different question. The input drifted.
- The rubric: Vague rules like "is this good?" create variance. Tight, specific rules reduce it.
You must treat flaky evals like flaky software tests. Do not ship them. Quarantine them. Measure the flake rate.
Stop reporting a single pass rate. Start reporting agreement.
Run each judge call multiple times. If the judge cannot agree with itself, the verdict is not a signal. It is UNSTABLE.
An UNSTABLE result should be a first-class outcome in your CI/CD pipeline. It should fail loud.
To fix an unstable eval, you need two things:
- A scoring layer: This computes stability and turns results into PASS, FAIL, or UNSTABLE.
- A tracing layer: You must see the raw bytes, the exact prompt, and the tool outputs for every run.
Without traces, you will think the model is just random. You will lower the temperature and think you fixed it. You did not fix it. You just made the bug quieter.
Follow these rules to build real quality:
- Report agreement, not just averages.
- Make UNSTABLE a failing state in your pipeline.
- Pin your judge model versions.
- Read the traces when a check fails.
A green dashboard you cannot reproduce is not a signal. It is a story you tell yourself.
Optional learning community: https://t.me/GyaanSetuAi
