𝗛𝗼𝘄 𝗜 𝗦𝗲𝘁 𝗨𝗽 𝗥𝗔𝗚 𝗘𝘃𝗮𝗹𝘀 𝗶𝗻 𝗖𝗜/𝗖𝗗 𝘁𝗼 𝗖𝗮𝘁𝗰𝗵 𝗥𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻𝘀
A PR lands. The RAG eval runs in one minute. It shows a green check. You merge the code.
Twelve hours later, support tickets arrive.
The retriever changed its top-1 chunk for a specific query type. Your 30-example dataset never covered that case. Your test suite stayed green because it was checking the wrong things.
Most RAG gates are just smoke tests. They run a small dataset against a fixed floor and pass if the mean score clears it. This fails because a small dataset is not representative of production queries, and a fixed threshold cannot tell a real regression from run-to-run noise.
A good gate needs three things: speed, low cost, and statistical significance. You usually only get two.
Here is my framework for a reliable RAG gate.
𝗧𝗵𝗲 𝗧𝗵𝗿𝗲𝗲-𝗧𝗶𝗲𝗿 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲
- Every Push (The PR Gate)
  - Use cheap classifiers like NLI faithfulness and claim support.
  - Use deterministic checks for citation validity and latency.
  - Run 100 to 200 examples in under three minutes.
  - This blocks the merge.
- Nightly Main (The Full Sweep)
  - Run the full LLM-judge stack against your versioned dataset.
  - This takes 15 to 30 minutes.
  - This blocks the next promotion to canary.
- Canary (Production Monitoring)
  - Run the same rubrics on 5 to 10 percent of live traffic.
  - Set alarms for rolling-mean drift.
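Here is a minimal sketch of the canary tier: sample a fixed share of traffic, then alarm when the rolling mean of a rubric score drifts below a baseline. The names `in_canary_sample` and `RollingDriftAlarm`, the window size, and the `max_drop` tolerance are all my assumptions for illustration, not part of any library.

```python
import hashlib
from collections import deque
from statistics import fmean


def in_canary_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


class RollingDriftAlarm:
    """Tracks a rolling mean of rubric scores and flags drift from a baseline."""

    def __init__(self, baseline: float, window: int = 500, max_drop: float = 0.03):
        self.baseline = baseline            # expected mean score (assumed value)
        self.max_drop = max_drop            # tolerated drop before alarming (assumed)
        self.scores = deque(maxlen=window)  # keeps only the newest `window` scores

    def observe(self, score: float) -> bool:
        """Record one sampled score; return True if the alarm should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # window not full yet, mean is not stable
        return self.baseline - fmean(self.scores) > self.max_drop
```

Hashing the trace ID keeps the sample stable across retries. The alarm stays silent until the window fills, which avoids paging on the first few noisy scores.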
𝗧𝗵𝗲 𝟱 𝗖𝗼𝗿𝗲 𝗥𝘂𝗯𝗿𝗶𝗰𝘀
To find exactly where a system breaks, split your rubrics by layer:
- Context Relevance: If this drops but Groundedness stays high, the retriever regressed.
- Groundedness: If this drops but Context Relevance stays high, the generator regressed.
- Answer Relevance
- Citation Validity
- Retrieval Recall
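Note: To score Retrieval Recall you need ground-truth doc IDs (expected_chunks) for every example. Without them you can only judge the chunks that came back, not the ones that should have, so a retriever that silently drops the right chunk goes undetected.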
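A small sketch of the two ideas above: recall scored against `expected_chunks`, and the layer split that tells you whether the retriever or the generator regressed. The function names and the `tol` tolerance are assumptions I picked for the example.

```python
def retrieval_recall(retrieved_ids: list[str], expected_chunks: list[str]) -> float:
    """Fraction of ground-truth chunk IDs that appear in the retrieved set."""
    expected = set(expected_chunks)
    if not expected:
        raise ValueError("expected_chunks must not be empty")
    return len(expected & set(retrieved_ids)) / len(expected)


def diagnose(context_relevance_drop: float, groundedness_drop: float,
             tol: float = 0.02) -> str:
    """Map rubric drops (baseline minus current) to the layer that likely regressed.

    `tol` is an assumed noise tolerance; drops at or below it count as "stayed high".
    """
    retriever_down = context_relevance_drop > tol
    generator_down = groundedness_drop > tol
    if retriever_down and generator_down:
        return "both"
    if retriever_down:
        return "retriever"
    if generator_down:
        return "generator"
    return "none"
```

For example, `retrieval_recall(["a", "b", "x"], ["a", "b", "c"])` finds two of the three expected chunks.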
𝗧𝗵𝗲 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗚𝗮𝘁𝗲
Do not just gate on a fixed number. A fixed floor misses slow drift. Use two thresholds:
- An absolute floor for obvious breaks (e.g., Groundedness ≥ 0.85).
- A delta gate against a 7-day rolling baseline.
Use a statistical test like Welch's t-test. Only fail the PR if the drop is statistically significant and large enough to matter. This prevents developers from ignoring the gate due to false alarms.
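The two-threshold gate can be sketched like this, using only the standard library. The 0.85 floor comes from the example above; `min_drop` and `alpha` are values I assumed. Note one shortcut: the p-value uses a normal approximation to the t distribution, which is reasonable at 100+ examples per side. For an exact Welch test, `scipy.stats.ttest_ind(..., equal_var=False)` is the usual choice.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def welch_one_sided_p(baseline: list[float], candidate: list[float]) -> float:
    """Approximate p-value for H1: candidate mean < baseline mean (Welch's t)."""
    se = sqrt(variance(baseline) / len(baseline)
              + variance(candidate) / len(candidate))
    if se == 0:  # both samples are constant
        return 0.0 if fmean(candidate) < fmean(baseline) else 1.0
    t = (fmean(candidate) - fmean(baseline)) / se
    # Normal approximation to the t distribution (assumes large samples).
    return NormalDist().cdf(t)


def gate(baseline: list[float], candidate: list[float], floor: float = 0.85,
         min_drop: float = 0.02, alpha: float = 0.05) -> tuple[bool, str]:
    """Return (passed, reason) for one rubric's per-example scores."""
    cand_mean = fmean(candidate)
    if cand_mean < floor:
        return False, "below absolute floor"
    drop = fmean(baseline) - cand_mean
    # Fail only when the drop is both large enough to matter and significant.
    if drop >= min_drop and welch_one_sided_p(baseline, candidate) < alpha:
        return False, "significant drop vs rolling baseline"
    return True, "ok"
```

Here `baseline` would hold the per-example scores from the 7-day rolling window and `candidate` the scores from the PR run. Requiring both conditions is what keeps small, noisy dips from failing the build.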
𝗧𝗵𝗲 𝗠𝗼𝘀𝘁 𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗥𝘂𝗹𝗲
Your baseline must be a rolling production window, not a frozen number.
When production data shifts, your dataset must shift with it. Take failing production traces, group them into clusters, and add them to your eval set. This turns your gate into a learning system.
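One way to sketch that loop, assuming each failing trace already carries a `cluster` label from whatever clustering step you run upstream. The dict keys (`query`, `cluster`, `source`) and the `per_cluster` cap are hypothetical names for this example.

```python
from collections import defaultdict


def fold_failures_into_dataset(dataset: list[dict], failed_traces: list[dict],
                               per_cluster: int = 3) -> list[dict]:
    """Return a new eval set extended with a few failing traces per cluster."""
    seen = {example["query"] for example in dataset}
    by_cluster = defaultdict(list)
    for trace in failed_traces:
        by_cluster[trace["cluster"]].append(trace)

    additions = []
    for cluster in sorted(by_cluster):  # sorted for a reproducible order
        picked = 0
        for trace in by_cluster[cluster]:
            if picked == per_cluster:
                break
            if trace["query"] in seen:
                continue  # skip queries the eval set already covers
            seen.add(trace["query"])
            additions.append({"query": trace["query"], "source": f"prod:{cluster}"})
            picked += 1
    return dataset + additions
```

Capping additions per cluster keeps one noisy failure mode from flooding the dataset. New examples still need labeled expected_chunks before they can score retrieval recall.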
Source: https://dev.to/kartik-nvjk/how-i-set-up-rag-evals-in-cicd-so-they-actually-catch-regressions-46hb
Optional learning community: https://t.me/GyaanSetuAi