𝗛𝗼𝘄 𝗜 𝗦𝗲𝘁 𝗨𝗽 𝗥𝗔𝗚 𝗘𝘃𝗮𝗹𝘀 𝗶𝗻 𝗖𝗜/𝗖𝗗 𝘁𝗼 𝗖𝗮𝘁𝗰𝗵 𝗥𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻𝘀

A PR lands. The RAG eval runs in one minute. It shows a green check. You merge the code.

Twelve hours later, support tickets arrive.

The retriever changed its top-1 chunk for a specific query type. Your 30-example dataset never covered that case. Your test suite stayed green because it was checking the wrong things.

Most RAG gates are just smoke tests. They use small datasets and fixed floors. If the mean is above a certain number, they pass. This fails because datasets are not representative and thresholds do not account for noise.

A good gate needs three things: speed, low cost, and statistical significance. You usually only get two.

Here is my framework for a reliable RAG gate.

𝗧𝗵𝗲 𝗧𝗵𝗿𝗲𝗲-𝗧𝗶𝗲𝗿 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲

  1. Every Push (The PR Gate)
  2. Nightly Main (The Full Sweep)
  3. Canary (Production Monitoring)

𝗧𝗵𝗲 𝟱 𝗖𝗼𝗿𝗲 𝗥𝘂𝗯𝗿𝗶𝗰𝘀

To find exactly where a system breaks, split your rubrics by layer:

Note: You must use ground-truth doc IDs (expected_chunks) to score retrieval recall. Without this, you cannot find retrieval bugs.
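A minimal sketch of how retrieval recall could be scored against ground-truth IDs. The function name, the `k` cutoff, and the sample chunk IDs are assumptions for illustration; only the `expected_chunks` field comes from the note above.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], expected_chunks: Iterable[str], k: int = 5) -> float:
    """Fraction of ground-truth chunk IDs found in the top-k retrieved IDs."""
    expected = set(expected_chunks)
    if not expected:
        raise ValueError("expected_chunks must not be empty")
    top_k = set(retrieved[:k])
    return len(expected & top_k) / len(expected)


# Hypothetical eval example; the query and IDs are made up.
example = {
    "query": "How do I rotate an API key?",
    "expected_chunks": ["doc-12#3", "doc-40#1"],
}
# Hypothetical retriever output, ordered by rank.
retrieved_ids = ["doc-12#3", "doc-07#2", "doc-99#0", "doc-40#1", "doc-03#5"]

print(recall_at_k(retrieved_ids, example["expected_chunks"], k=3))  # 0.5
print(recall_at_k(retrieved_ids, example["expected_chunks"], k=5))  # 1.0
```

Scoring retrieval on its own like this is what lets you say the retriever dropped a chunk, instead of only seeing a worse final answer.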

𝗧𝗵𝗲 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗚𝗮𝘁𝗲

Do not just gate on a fixed number. A fixed floor misses slow drift. Use two thresholds: one for statistical significance and one for the minimum drop that matters.

Use a statistical test like Welch's t-test. Only fail the PR if the drop is statistically significant and large enough to matter. This prevents developers from ignoring the gate due to false alarms.
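Here is a rough sketch of such a gate, assuming per-example scores for the baseline and the PR. The names `welch_gate`, `alpha`, and `min_drop`, plus the sample scores, are illustrative assumptions. To stay inside the standard library it computes the Welch t statistic but approximates the p-value with a normal distribution, which is only reasonable for larger samples. If SciPy is available, `scipy.stats.ttest_ind(baseline, candidate, equal_var=False)` runs the actual Welch test.

```python
import math
from statistics import NormalDist, mean, variance


def welch_gate(baseline, candidate, alpha=0.05, min_drop=0.02):
    """Return (fail, drop, p_value).

    Fails only when the drop is both statistically significant
    (p_value < alpha) and large enough to matter (drop >= min_drop).
    """
    n_b, n_c = len(baseline), len(candidate)
    if n_b < 2 or n_c < 2:
        raise ValueError("need at least two scores per group")

    drop = mean(baseline) - mean(candidate)
    # Welch standard error: the two variances are NOT pooled.
    se = math.sqrt(variance(baseline) / n_b + variance(candidate) / n_c)

    if se == 0:
        # No variance in either group, so any drop is deterministic.
        p_value = 0.0 if drop > 0 else 1.0
    else:
        t_stat = drop / se
        # One-sided p-value; normal approximation to the t distribution.
        p_value = 1.0 - NormalDist().cdf(t_stat)

    fail = (p_value < alpha) and (drop >= min_drop)
    return fail, drop, p_value


# Made-up per-example scores, 40 examples each.
baseline_scores = [0.90, 0.85, 0.95, 0.80, 0.90] * 8
real_regression = [0.80, 0.75, 0.85, 0.70, 0.80] * 8
just_noise = [0.89, 0.85, 0.95, 0.80, 0.90] * 8

print(welch_gate(baseline_scores, real_regression)[0])  # True
print(welch_gate(baseline_scores, just_noise)[0])       # False
```

The second case is the point: the mean did go down, but the gate stays green because the drop is neither significant nor large.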

𝗧𝗵𝗲 𝗠𝗼𝘀𝘁 𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗥𝘂𝗹𝗲

Your baseline must be a rolling production window, not a static number.

When production data changes, your dataset must change too. Take failing production traces, cluster them, and feed them into your eval set. This turns your gate into a learning system.
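One way to picture a rolling baseline, as a sketch only: keep the last N production scores and compare each PR against their mean. The class name, the window size, and the daily scores below are assumptions for illustration.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Keeps the most recent `window` production scores and exposes their mean."""

    def __init__(self, window: int = 7):
        # deque with maxlen drops the oldest score automatically.
        self._scores = deque(maxlen=window)

    def add(self, score: float) -> None:
        self._scores.append(score)

    def value(self) -> float:
        if not self._scores:
            raise ValueError("no scores recorded yet")
        return mean(self._scores)


# Made-up daily production scores; only the last three stay in the window.
baseline = RollingBaseline(window=3)
for daily_score in [0.90, 0.88, 0.86, 0.84]:
    baseline.add(daily_score)

print(round(baseline.value(), 2))  # 0.86
```

A static number would still say 0.90 here. The rolling window tracks what production actually looks like this week.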

Source: https://dev.to/kartik-nvjk/how-i-set-up-rag-evals-in-cicd-so-they-actually-catch-regressions-46hb

Optional learning community: https://t.me/GyaanSetuAi