How I Set Up RAG Evals in CI/CD to Catch Regressions

A PR lands. The RAG eval runs in one minute. It shows a green check. You merge the code.

Twelve hours later, support tickets arrive.

The retriever changed its top-1 chunk for a specific query type. Your 30-example dataset never covered that case. Your test suite stayed green because it was checking the wrong things.

Most RAG gates are just smoke tests. They run a small dataset against a fixed floor and pass whenever the mean score clears it. This fails for two reasons: a small dataset is not representative of production queries, and a fixed threshold does not account for run-to-run noise.

A good gate needs three things: speed, low cost, and statistical significance. You usually only get two.

Here is my framework for a reliable RAG gate.

𝗧𝗵𝗲 𝗧𝗵𝗿𝗲𝗲-𝗧𝗶𝗲𝗿 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲

Every Push (The PR Gate)

Use cheap classifiers like NLI faithfulness and claim support.
Use deterministic checks for citation validity and latency.
Run 100 to 200 examples in under three minutes.
This blocks the merge.
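
The deterministic half of the PR gate can be sketched in a few lines. This is a minimal sketch, not the original setup: the record fields (answer, num_chunks, latency_ms), the [n] citation format, and both thresholds are assumptions chosen for illustration.

```python
import re

# Assumption: answers cite retrieved chunks with 1-based markers like [1], [2].
CITATION = re.compile(r"\[(\d+)\]")

def citations_valid(answer: str, num_chunks: int) -> bool:
    """True if the answer cites at least one chunk and every [n] is in range."""
    cited = [int(n) for n in CITATION.findall(answer)]
    return bool(cited) and all(1 <= n <= num_chunks for n in cited)

def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceiling of 0.95 * n
    return ordered[rank - 1]

def deterministic_gate(records: list[dict],
                       latency_budget_ms: float = 2000.0,  # hypothetical budget
                       min_valid_rate: float = 0.98) -> dict:  # hypothetical floor
    """Run the cheap, repeatable checks over one eval batch."""
    valid = sum(citations_valid(r["answer"], r["num_chunks"]) for r in records)
    valid_rate = valid / len(records)
    latency = p95([r["latency_ms"] for r in records])
    return {
        "valid_rate": valid_rate,
        "p95_latency_ms": latency,
        "passed": valid_rate >= min_valid_rate and latency <= latency_budget_ms,
    }
```

These checks need no model call, so they return the same result on every run and add almost nothing to the three-minute budget.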

Nightly Main (The Full Sweep)

Run the full LLM-judge stack against your versioned dataset.
This takes 15 to 30 minutes.
This blocks the next promotion to canary.

Canary (Production Monitoring)

Run the same rubrics on 5 to 10 percent of live traffic.
Set alarms for rolling-mean drift.
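
A sketch of the canary tier, under stated assumptions: traces are sampled by hashing a trace ID (so the same trace always gets the same decision), and the alarm fires when the rolling mean of a rubric score falls more than a tolerance below a baseline. The names should_sample and RollingDriftAlarm, the window size, and the tolerance are all hypothetical.

```python
import hashlib
from collections import deque

def should_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

class RollingDriftAlarm:
    """Fires when the rolling mean drops below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.03, window: int = 500):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores: deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one rubric score; return True if the alarm should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before judging drift
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance
```

Waiting for a full window trades detection speed for fewer false alarms right after a deploy.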

𝗧𝗵𝗲 𝟱 𝗖𝗼𝗿𝗲 𝗥𝘂𝗯𝗿𝗶𝗰𝘀

To find exactly where a system breaks, split your rubrics by layer:

Context Relevance: If this drops but Groundedness stays high, the retriever regressed.
Groundedness: If this drops but Context Relevance stays high, the generator regressed.
Answer Relevance
Citation Validity
Retrieval Recall
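
The two diagnostic rules above fit in one small function. This is an illustrative sketch: suspect_layer and the 0.05 drop threshold are hypothetical, and the "unclear" result for a simultaneous drop is my assumption, since the rules only cover the cases where one score drops and the other holds.

```python
def suspect_layer(delta_context_relevance: float,
                  delta_groundedness: float,
                  drop: float = 0.05) -> str:
    """Map score deltas (candidate minus baseline) to the layer that likely regressed."""
    context_dropped = delta_context_relevance <= -drop
    grounded_dropped = delta_groundedness <= -drop
    if context_dropped and not grounded_dropped:
        return "retriever"
    if grounded_dropped and not context_dropped:
        return "generator"
    if context_dropped and grounded_dropped:
        return "unclear"  # assumption: both dropped, so inspect both layers
    return "none"
```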

Note: You must use ground-truth doc IDs (expected_chunks) to score retrieval recall. Without this, you cannot find retrieval bugs.
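
Scoring recall against ground truth is short once each example carries its expected_chunks. A minimal sketch, assuming chunk IDs are strings and that recall is measured over the top k retrieved results; the function name and the default k are illustrative.

```python
def recall_at_k(retrieved_ids: list[str],
                expected_chunks: list[str],
                k: int = 5) -> float:
    """Fraction of ground-truth chunks that appear in the top-k retrieved IDs."""
    expected = set(expected_chunks)
    if not expected:
        raise ValueError("expected_chunks must not be empty")
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)
```

For example, recall_at_k(["a", "b", "c"], ["a", "z"], k=2) gives 0.5: one of the two expected chunks was retrieved.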

𝗧𝗵𝗲 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝗮𝗹 𝗚𝗮𝘁𝗲

Do not just gate on a fixed number. A fixed floor misses slow drift. Use two thresholds:

An absolute floor for obvious breaks (e.g., Groundedness ≥ 0.85).
A delta gate against a 7-day rolling baseline.

Use a statistical test like Welch's t-test to compare the PR's scores against the baseline. Only fail the PR if the drop is both statistically significant and large enough to matter. This keeps false alarms rare, so developers do not learn to ignore the gate.
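
Here is a sketch of the two-threshold gate using only the standard library. The 0.85 floor comes from the example above; min_drop, alpha, and the function names are assumptions. The p-value uses a normal approximation to the t distribution, which is reasonable at 100 or more examples per side; with SciPy available, scipy.stats.ttest_ind with equal_var=False gives the exact Welch test.

```python
from statistics import NormalDist, mean, variance

def welch_one_sided_p(baseline: list[float], candidate: list[float]) -> float:
    """Approximate p-value for H1: mean(candidate) < mean(baseline).

    Needs at least two scores per side.
    """
    se = (variance(baseline) / len(baseline)
          + variance(candidate) / len(candidate)) ** 0.5
    if se == 0:
        return 0.0 if mean(candidate) < mean(baseline) else 1.0
    t = (mean(candidate) - mean(baseline)) / se  # Welch's t statistic
    return NormalDist().cdf(t)  # normal approximation, fine for large samples

def gate(baseline: list[float], candidate: list[float],
         floor: float = 0.85,      # absolute floor, e.g. Groundedness
         min_drop: float = 0.02,   # hypothetical: smallest drop worth blocking on
         alpha: float = 0.05) -> str:  # hypothetical significance level
    """Fail on an obvious break, or on a drop that is both large and significant."""
    candidate_mean = mean(candidate)
    if candidate_mean < floor:
        return "fail: below absolute floor"
    drop = mean(baseline) - candidate_mean
    if drop >= min_drop and welch_one_sided_p(baseline, candidate) < alpha:
        return "fail: significant regression"
    return "pass"
```

Here baseline would hold the per-example scores from the 7-day rolling window and candidate the scores from the PR run.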

𝗧𝗵𝗲 𝗠𝗼𝘀𝘁 𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗥𝘂𝗹𝗲

Your baseline must be a rolling production window, not a frozen number.

When production data shifts, your dataset must shift with it. Take failing production traces, cluster them, and fold them into your eval set. This turns your gate into a learning system.
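
A sketch of that feedback loop, assuming the clustering has already happened upstream and each trace carries a cluster_id. The field names (query, failed, cluster_id), the function promote_failures, and the per-cluster cap are all hypothetical. New examples are added with empty expected_chunks because ground-truth labels still have to be filled in by hand.

```python
from collections import defaultdict

def promote_failures(traces: list[dict], eval_set: list[dict],
                     per_cluster: int = 3) -> list[dict]:
    """Return eval_set plus up to `per_cluster` new failing queries per cluster."""
    seen = {example["query"] for example in eval_set}
    by_cluster: dict = defaultdict(list)
    for trace in traces:
        if not trace["failed"] or trace["query"] in seen:
            continue  # skip passing traces and queries already covered
        seen.add(trace["query"])
        by_cluster[trace["cluster_id"]].append(trace)

    added = []
    for cluster_id, members in sorted(by_cluster.items()):
        for trace in members[:per_cluster]:
            added.append({
                "query": trace["query"],
                "expected_chunks": [],  # assumption: labeled manually afterwards
                "source": f"prod-cluster-{cluster_id}",
            })
    return eval_set + added
```

Capping each cluster keeps one noisy failure mode from flooding the dataset.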

Source: https://dev.to/kartik-nvjk/how-i-set-up-rag-evals-in-cicd-so-they-actually-catch-regressions-46hb
