LLM-as-Judge Reliability in 2026

LLM-as-judge tools power most leaderboards and evaluation posts today.

Eight studies from June 2026 point to a serious problem: AI judges often disagree with themselves when the same test is repeated. In the worst cases, a verdict is little better than a coin flip.

The data shows three main failures:

• Low Reliability: One study tested two OpenAI judges on 29 tasks and repeated each test 50 times. The verdicts were so inconsistent that the authors called it "The Coin Flip Judge." A single-run verdict is mostly noise.

• Compute Sensitivity: Measured model performance depends on how much compute the test allows. A model might look bad on a leaderboard simply because the test had a low token cap. Change the budget and the ranking can flip.

• Brand Bias: Judges show a preference for well-known names like GPT or Claude. This bias tilts the results and makes comparisons unfair.
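To make the first and third failures concrete, here is a minimal sketch of how you might measure them on your own data. It is not the method used in any of the studies. The function names (flip_rate, brand_shift) and the verdict lists are hypothetical, and the code assumes you have already collected repeated verdicts from your judge as plain labels.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common verdict and its share of all runs."""
    counts = Counter(verdicts)
    label, n = counts.most_common(1)[0]
    return label, n / len(verdicts)


def flip_rate(verdicts):
    """Fraction of repeated runs that disagree with the majority verdict."""
    _, share = majority_verdict(verdicts)
    return 1.0 - share


def win_rate(verdicts, label):
    """Share of runs in which the judge picked `label`."""
    return sum(v == label for v in verdicts) / len(verdicts)


def brand_shift(named, blind, label):
    """Change in win rate for `label` when model names are shown vs. hidden."""
    return win_rate(named, label) - win_rate(blind, label)


# Hypothetical data: 50 repeated verdicts on the same task.
stable_runs = ["A"] * 48 + ["B"] * 2
noisy_runs = ["A"] * 27 + ["B"] * 23

# Hypothetical data: the same comparisons judged with and without model names.
named_runs = ["A"] * 35 + ["B"] * 15
blind_runs = ["A"] * 25 + ["B"] * 25

print(f"stable flip rate: {flip_rate(stable_runs):.2f}")
print(f"noisy flip rate:  {flip_rate(noisy_runs):.2f}")
print(f"brand shift for A: {brand_shift(named_runs, blind_runs, 'A'):+.2f}")
```

A flip rate near 0.5 on a two-way verdict means repeated runs split almost evenly. A brand shift far from zero suggests the judge reacts to the name and not only to the answer.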

How you should act:

Stop asking which judge scores highest. Ask which judge tool lets you validate its verdicts against human labels most cheaply.
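One common way to validate a judge against humans is to label a small sample by hand and compute chance-corrected agreement. The sketch below uses Cohen's kappa. That choice is an assumption made here, not something the source prescribes. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: share of items where both gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both labeled at random with their own label rates.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both sides used a single identical label, so agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample: ten items labeled by a human and by the judge.
human_labels = ["pass", "pass", "fail", "fail", "pass",
                "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass",
                "fail", "fail", "fail", "pass", "fail"]

# 8 of 10 labels match; with balanced labels, kappa comes out near 0.6.
print(f"kappa: {cohens_kappa(judge_labels, human_labels):.2f}")
```

Kappa near 1 means the judge tracks the human labels closely. Kappa near 0 means agreement is about what chance alone would produce. A hand-labeled sample of a few dozen items is often enough for a first check.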

Source: https://dev.to/bean_bean/llm-as-judge-reliability-in-2026-what-8-june-studies-actually-show-eca