91% Pass Rate. Gate Green. Shipped. Worst Regression Ever.
We hit a 91% pass rate on an intent-classification test. The threshold was 90%. We cleared the bar. We shipped the code.
It was our worst regression of the quarter.
The problem was our math. Our aggregate evaluation score had held at 96% to 97% for weeks. Then a change broke one specific slice: ambiguous refund requests. That slice dropped from 98% to 74%.
That slice represents 4% of our total traffic. Because we gated on the average, the total score fell only to 91%, which was still above the 90% threshold. The gate stayed green.
Aggregates hide failures inside noise.
The users in that slice did not see 91%. They saw 74%. A static threshold tells you if the whole system falls off a cliff. It does not tell you if one part of your system is dying. If 96 slices are fine and one crashes, a high average hides the crash. You find the error through support tickets instead of your testing tools.
We changed our strategy. We stopped gating on absolute numbers. We now gate against the last successful run.
We use two rules. Both must pass:
- No single slice drops more than 3 points against the baseline.
- The total aggregate drops no more than 1.5 points against the baseline.
In our recent failure, the refund slice dropped 24 points. Rule one would have caught it immediately.
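Here is a minimal sketch of the two rules as a gate function. It assumes scores are stored as percentages per slice; the `EvalRun` type, the `gate` function, and the `order_status` slice are hypothetical names for illustration, not our actual tooling. The baseline aggregate of 96.0 is one value from the 96% to 97% range above.

```python
from dataclasses import dataclass

# Limits from the two rules above, in percentage points.
MAX_SLICE_DROP = 3.0
MAX_AGGREGATE_DROP = 1.5


@dataclass
class EvalRun:
    aggregate: float          # overall pass rate, in percent
    slices: dict              # slice name -> pass rate, in percent


def gate(baseline: EvalRun, candidate: EvalRun) -> list:
    """Return a list of violations. An empty list means the gate is green."""
    violations = []

    # Rule one: no single slice drops more than MAX_SLICE_DROP points.
    for name, base_score in baseline.slices.items():
        if name not in candidate.slices:
            violations.append(f"slice '{name}' is missing from the candidate run")
            continue
        drop = base_score - candidate.slices[name]
        if drop > MAX_SLICE_DROP:
            violations.append(f"slice '{name}' dropped {drop:.1f} points")

    # Rule two: the aggregate drops no more than MAX_AGGREGATE_DROP points.
    agg_drop = baseline.aggregate - candidate.aggregate
    if agg_drop > MAX_AGGREGATE_DROP:
        violations.append(f"aggregate dropped {agg_drop:.1f} points")

    return violations


# Illustrative numbers based on the incident described above.
baseline = EvalRun(96.0, {"ambiguous_refund": 98.0, "order_status": 97.0})
candidate = EvalRun(91.0, {"ambiguous_refund": 74.0, "order_status": 97.0})

for problem in gate(baseline, candidate):
    print(problem)
```

Returning every violation, rather than stopping at the first, tells you which slice to look at before you open the logs.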
Watch out for delta gating traps. If your baseline updates on every run, you can drift into failure. A 0.5-point drop each day stays under both limits, so every run passes. You slowly slide into a bad product.
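A small simulation makes the drift visible. This is a sketch with made-up data: an aggregate that starts at 96.0 and loses 0.5 points a day for 20 days. The `count_green_runs` helper is hypothetical, and it checks only the aggregate rule to keep the example short.

```python
MAX_AGGREGATE_DROP = 1.5  # points, the aggregate limit from rule two


def count_green_runs(scores, start_baseline, rolling):
    """Count consecutive passing runs before the first failure.

    rolling=True  -> every passing run becomes the new baseline
    rolling=False -> the baseline stays pinned at start_baseline
    """
    baseline = start_baseline
    passed = 0
    for score in scores:
        if baseline - score > MAX_AGGREGATE_DROP:
            break  # gate goes red, stop counting
        passed += 1
        if rolling:
            baseline = score
    return passed


# Hypothetical data: 20 days, each 0.5 points worse than the day before.
daily_scores = [96.0 - 0.5 * day for day in range(1, 21)]

rolling_greens = count_green_runs(daily_scores, 96.0, rolling=True)
pinned_greens = count_green_runs(daily_scores, 96.0, rolling=False)

print(rolling_greens)    # 20: every run passes
print(pinned_greens)     # 3: the fourth run is 2.0 points below the baseline
print(daily_scores[-1])  # 86.0: where the rolling gate let the score end up
```

The pinned baseline here stands in for one that only moves by a deliberate decision. With it, the gate goes red on day four. With the rolling baseline, the gate is still green at 86.0.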
Follow these steps to fix your testing:
- Update your baseline only when your main branch is green.
- Require a human to approve any intentional drop in scores.
- Treat your baseline as a record of what works, not just of what happened last.
- Check the variance of your last 5 green runs. If a slice swings by more than your threshold between runs where nothing broke, your threshold is inside the noise and will fire on healthy changes.
- Test your smallest slice. Ask how far it can drop before the aggregate notices. If the answer is a large number, your aggregate is hiding errors.
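The last two checks can be sketched in a few lines. Everything here is illustrative: the run data is invented, both helper names are hypothetical, and the noise check uses the max-minus-min spread across runs as a simple stand-in for variance. The dilution check assumes the aggregate is a weighted average and that a slice's weight in the score matches its share of traffic, which may not hold for your eval set.

```python
def noisy_slices(green_runs, threshold):
    """Return slices whose spread (max - min) across green runs exceeds threshold."""
    noisy = {}
    for name in green_runs[0]:
        values = [run[name] for run in green_runs]
        spread = max(values) - min(values)
        if spread > threshold:
            noisy[name] = spread
    return noisy


def drop_needed_to_trip_aggregate(slice_weight, aggregate_threshold):
    """Points one slice must fall before the aggregate moves by the threshold.

    Assumes a weighted-average aggregate and that all other slices hold steady.
    """
    return aggregate_threshold / slice_weight


# Hypothetical per-slice scores from the last 5 green runs.
last_green_runs = [
    {"ambiguous_refund": 98.0, "order_status": 97.0},
    {"ambiguous_refund": 94.0, "order_status": 97.5},
    {"ambiguous_refund": 97.0, "order_status": 96.5},
    {"ambiguous_refund": 99.0, "order_status": 97.0},
    {"ambiguous_refund": 95.0, "order_status": 97.0},
]

# A 5-point swing on green runs means a 3-point limit would fire on noise.
print(noisy_slices(last_green_runs, threshold=3.0))

# A slice with 4% weight against a 1.5-point aggregate limit: about 37.5 points.
print(drop_needed_to_trip_aggregate(slice_weight=0.04, aggregate_threshold=1.5))
```

If the second number is far larger than your per-slice limit, the aggregate rule alone will never protect that slice. If the first check flags a slice, collect more examples for it or widen its limit before you trust the gate.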
