𝗪𝗵𝘆 𝗖𝗼𝗵𝗲𝗻'𝘀 𝗞𝗮𝗽𝗽𝗮 𝗗𝗿𝗶𝗳𝘁𝘀
Your LLM-as-judge kappa changes every week. You check your labellers. They are fine. The problem is your calibration set.
Cohen's kappa is (Po - Pe) / (1 - Pe). Po is the observed agreement. Pe is the agreement expected by chance, computed from how often each rater uses each label. So Pe depends on the label mix in your set.
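A minimal sketch of the formula in Python may help here. The function name `cohen_kappa` and the two label lists are made up for illustration and are not taken from the source article.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Po: fraction of items where the two raters gave the same label.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pe: chance agreement, built from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pe == 1.0:
        raise ValueError("kappa is undefined when both raters only ever use one label")
    return (po - pe) / (1 - pe)

# Illustrative labels only (hypothetical data).
human = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
judge = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
print(round(cohen_kappa(human, judge), 3))
```

Note that Pe is computed only from the label counts. Change the counts and Pe changes, even if the raters behave the same way.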
Last week, 70% of traces were acceptable. This week, 50% are acceptable. Pe shifts. Kappa moves even if your labellers do the same work.
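Here is a small worked example of that shift. It rests on a simplifying assumption that is not from the source: two raters who are each right 90% of the time on both classes and who make their mistakes independently. The function name `kappa_for_prevalence` is hypothetical.

```python
def kappa_for_prevalence(prevalence, accuracy=0.9):
    """Po, Pe and kappa for two raters with equal per-class accuracy and independent errors."""
    # q: how often either rater says "acceptable" at this prevalence.
    q = prevalence * accuracy + (1 - prevalence) * (1 - accuracy)
    # The raters agree when both are right or both are wrong,
    # so Po does not depend on prevalence under this model.
    po = accuracy ** 2 + (1 - accuracy) ** 2
    # Chance agreement from the marginal label rates.
    pe = q ** 2 + (1 - q) ** 2
    return po, pe, (po - pe) / (1 - pe)

for share in (0.7, 0.5):
    po, pe, kappa = kappa_for_prevalence(share)
    print(f"acceptable={share:.0%}  Po={po:.3f}  Pe={pe:.3f}  kappa={kappa:.3f}")
```

Under this model Po stays at 0.82 in both weeks. Pe falls from about 0.55 to 0.50, and kappa rises from about 0.60 to 0.64. The rater behaviour is identical; only the label mix changed.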
Try these three things:
- Sample across time windows. Use a rolling 4-week window. This stops one week from dominating Pe.
- Use per-class precision and recall. One number hides the truth. Per-class metrics show where disagreements happen.
- Use Wilson confidence intervals for sets under 100 traces. On a small set, an interval shows how uncertain the estimate is. A point estimate hides that.
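The three suggestions above could be sketched roughly like this. All helper names (`pool_weeks`, `per_class_precision_recall`, `wilson_interval`) and the weekly data are hypothetical. The sketch also assumes the human label is the reference and the judge label is the prediction, and it applies the Wilson interval to observed agreement, since the source does not say which proportion it has in mind.

```python
import math
from collections import Counter

def pool_weeks(weekly_pairs, window=4):
    """Concatenate the most recent `window` weeks of (reference, predicted) pairs."""
    pooled = [pair for week in weekly_pairs[-window:] for pair in week]
    reference = [ref for ref, _ in pooled]
    predicted = [pred for _, pred in pooled]
    return reference, predicted

def per_class_precision_recall(reference, predicted):
    """Map each label to (precision, recall) of `predicted` against `reference`."""
    pairs = Counter(zip(reference, predicted))
    ref_totals, pred_totals = Counter(reference), Counter(predicted)
    report = {}
    for label in sorted(set(reference) | set(predicted)):
        hits = pairs[(label, label)]
        # None marks a metric whose denominator is zero.
        precision = hits / pred_totals[label] if pred_totals[label] else None
        recall = hits / ref_totals[label] if ref_totals[label] else None
        report[label] = (precision, recall)
    return report

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Five weeks of (human, judge) label pairs; made-up data for illustration.
weeks = [
    [("ok", "ok"), ("ok", "ok"), ("bad", "ok")],
    [("ok", "ok"), ("bad", "bad"), ("ok", "bad")],
    [("ok", "ok"), ("ok", "ok"), ("bad", "bad")],
    [("bad", "bad"), ("ok", "ok"), ("bad", "ok")],
    [("ok", "ok"), ("bad", "bad"), ("ok", "ok")],
]
reference, predicted = pool_weeks(weeks, window=4)
print(per_class_precision_recall(reference, predicted))

agreements = sum(r == p for r, p in zip(reference, predicted))
print(wilson_interval(agreements, len(reference)))
```

With only 12 pooled traces the interval on observed agreement is wide. That is the point: the width tells you how much of a week-to-week change could be noise.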
Source: https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh