𝗪𝗵𝘆 𝗖𝗼𝗵𝗲𝗻'𝘀 𝗞𝗮𝗽𝗽𝗮 𝗗𝗿𝗶𝗳𝘁𝘀

Your LLM-as-judge kappa changes every week. You check your labellers. They are fine. The problem is your calibration set.

Cohen's kappa is (Po - Pe) / (1 - Pe). Po is the observed agreement. Pe is the agreement expected by chance: for each label, multiply the two raters' rates of using that label, then sum over labels. So Pe depends on the label mix in your set.

Last week, 70% of traces were acceptable. This week, 50% are acceptable. Pe shifts. Kappa moves even if your labellers do the same work.
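
The shift is easy to reproduce. Here is a minimal sketch, assuming a binary label and a judge that matches the human on 90% of each class. The function name and both rates are illustrative and not from the source.

```python
def kappa_from_rates(prevalence, agree_pos, agree_neg):
    """Cohen's kappa for two raters on a binary label.

    prevalence: share of traces rater A marks acceptable.
    agree_pos / agree_neg: rate at which rater B matches rater A on
    A's acceptable / unacceptable traces (hypothetical inputs).
    """
    # Joint proportions on the diagonal of the 2x2 agreement table.
    both_pos = prevalence * agree_pos
    both_neg = (1 - prevalence) * agree_neg
    # Rater B's overall share of "acceptable" labels.
    b_pos = both_pos + (1 - prevalence) * (1 - agree_neg)

    po = both_pos + both_neg
    pe = prevalence * b_pos + (1 - prevalence) * (1 - b_pos)
    return po, pe, (po - pe) / (1 - pe)


for week, prevalence in [("last week", 0.70), ("this week", 0.50)]:
    po, pe, kappa = kappa_from_rates(prevalence, agree_pos=0.90, agree_neg=0.90)
    print(f"{week}: Po={po:.3f} Pe={pe:.3f} kappa={kappa:.3f}")
```

Under these assumptions Po stays at 0.900 in both weeks, while Pe falls from 0.564 to 0.500 and kappa rises from 0.771 to 0.800. Nothing about the raters changed. Only the label mix did.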

Try these three things:

  • Sample across time windows. Use a rolling 4-week window. This stops one week from dominating Pe.
  • Use per-class precision and recall. One number hides the truth. Per-class metrics show where disagreements happen.
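  • Use Wilson confidence intervals on your proportions (observed agreement, per-class precision and recall) for sets under 100 traces. With so few traces a point estimate swings on sampling noise, and the interval shows by how much.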
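
The three steps can be combined in one rough sketch. It assumes human labels are the reference, that each week's traces arrive as (human_label, judge_label) pairs, and a 95% interval (z = 1.96). The names wilson_interval, per_class_report and weekly_batches, the "ok"/"bad" labels, and the counts are all made up for illustration.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def per_class_report(pairs):
    """Per-class precision and recall intervals; human label is the reference."""
    pairs = list(pairs)
    hits = Counter(h for h, j in pairs if h == j)
    human_totals = Counter(h for h, _ in pairs)
    judge_totals = Counter(j for _, j in pairs)
    report = {}
    for label in sorted(set(human_totals) | set(judge_totals)):
        report[label] = {
            # Of the traces the judge gave this label, how many did the human?
            "precision": wilson_interval(hits[label], judge_totals[label]),
            # Of the traces the human gave this label, how many did the judge?
            "recall": wilson_interval(hits[label], human_totals[label]),
        }
    return report


# Hypothetical weekly batches, oldest first (counts are invented).
weekly_batches = [
    [("ok", "ok")] * 14 + [("ok", "bad")] * 1 + [("bad", "bad")] * 4 + [("bad", "ok")] * 1,
    [("ok", "ok")] * 12 + [("ok", "bad")] * 2 + [("bad", "bad")] * 5 + [("bad", "ok")] * 1,
    [("ok", "ok")] * 13 + [("ok", "bad")] * 1 + [("bad", "bad")] * 5 + [("bad", "ok")] * 1,
    [("ok", "ok")] * 9 + [("ok", "bad")] * 1 + [("bad", "bad")] * 9 + [("bad", "ok")] * 1,
    [("ok", "ok")] * 10 + [("ok", "bad")] * 1 + [("bad", "bad")] * 8 + [("bad", "ok")] * 1,
]

# Rolling 4-week window: pool the four most recent batches.
window = [pair for batch in weekly_batches[-4:] for pair in batch]

for label, metrics in per_class_report(window).items():
    for name, (low, high) in metrics.items():
        print(f"{label} {name}: [{low:.2f}, {high:.2f}]")
```

Pooling before computing means a single odd week moves the label mix less, and the per-class intervals show which label drives the disagreement and how wide the uncertainty still is.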

Source: https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh