𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 𝗪𝗮𝘀 𝗥𝗶𝗴𝗵𝘁, 𝗕𝘂𝘁 𝗪𝗮𝘀 𝗜𝘁 𝗥𝗶𝗴𝗵𝘁 𝗳𝗼𝗿 𝘁𝗵𝗲 𝗥𝗶𝗴𝗵𝘁 𝗥𝗲𝗮𝘀𝗼𝗻?
I built a benchmark to see if an LLM can interpret clinical genetic variants.
The initial results looked bad. The model scored 60 percent accuracy. I almost concluded the model was mediocre and unfit for use.
I was wrong.
The real insight only appeared when I stopped measuring accuracy and started measuring safety.
In clinical genetics, a wrong answer can be dangerous, but not every wrong answer is equally dangerous. There are two types of errors:
- Safe abstention: The model says "uncertain" when the truth is a confident call. This is safe because a human will investigate.
- Confident error: The model makes the opposite call (e.g., calling a disease-causing variant "benign"). This is a dangerous failure.
My benchmark showed the model had zero confident errors. It never made a dangerous mistake. Every miss was an abstention: it answered "uncertain" when it lacked sufficient evidence.
When I used a simple accuracy metric, I branded a safe, well-calibrated model as a failure. My metric was the problem, not the model.
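Here is a minimal sketch of what scoring safety instead of accuracy can look like. It is not the code from the repository. The label names ("pathogenic", "benign", "uncertain"), the function names, and the toy data are all assumptions made for illustration, and it assumes every ground-truth label is a confident call.

```python
from collections import Counter

# Assumed label vocabulary; a real benchmark may use different names.
CONFIDENT_LABELS = {"pathogenic", "benign"}
ABSTAIN_LABEL = "uncertain"


def classify_outcome(truth: str, prediction: str) -> str:
    """Place one (truth, prediction) pair into a safety-aware bucket."""
    if truth not in CONFIDENT_LABELS:
        raise ValueError(f"expected a confident truth label, got {truth!r}")
    if prediction == truth:
        return "correct"
    if prediction == ABSTAIN_LABEL:
        return "safe_abstention"   # a human will investigate
    if prediction in CONFIDENT_LABELS:
        return "confident_error"   # the opposite call: dangerous
    raise ValueError(f"unknown prediction label {prediction!r}")


def safety_report(pairs):
    """Report accuracy alongside abstention and confident-error rates."""
    counts = Counter(classify_outcome(t, p) for t, p in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to score")
    return {
        "accuracy": counts["correct"] / total,
        "abstention_rate": counts["safe_abstention"] / total,
        "confident_error_rate": counts["confident_error"] / total,
    }


# Toy data, invented for this example (not benchmark results).
toy_pairs = [
    ("pathogenic", "pathogenic"),
    ("benign", "benign"),
    ("benign", "benign"),
    ("pathogenic", "uncertain"),
    ("benign", "uncertain"),
]
report = safety_report(toy_pairs)
print(report)
```

On this toy data, accuracy comes out at 0.6 while the confident-error rate is 0.0. One number says "mediocre"; the other says "never made the dangerous call".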
If you build benchmarks for high-stakes fields like medicine, law, or finance, follow these rules:
- Separate safe failures from dangerous ones. Never put an honest "I don't know" in the same bucket as a confident lie.
- Audit the reasoning. Accuracy alone does not show whether a model is following logic or fabricating evidence.
- Keep your evidence real. Do not inject fake data into your tests. If the evaluation itself contains fabricated evidence, you cannot tell whether the model hallucinates.
- Calibrate your own analysis. Small sample sizes can lie. Do not publish findings before you verify them on a larger sample.
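To make the sample-size point concrete, here is a small sketch that puts a confidence interval around an observed error rate using the Wilson score interval. This is my choice of a standard method, not necessarily what the benchmark uses. The function name and the sample sizes are hypothetical.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical sample sizes: zero observed errors in each case.
for n in (20, 200, 2000):
    low, high = wilson_interval(0, n)
    print(f"0 errors in {n} cases: true rate plausibly up to {high:.1%}")
```

With zero errors in 20 hypothetical cases, the upper bound is still about 16 percent. "Zero confident errors" on a small sample does not mean a zero error rate, which is why the larger run matters.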
In high-stakes domains, a model that knows when to stop is more valuable than a model that guesses.
The code is on GitHub: gbadedata/clinvar-interpretation-benchmark.
Complete post: https://dev.to/gbadedata/your-llm-got-the-variant-right-but-did-it-get-it-right-for-the-right-reason-1oc3