Your LLM Was Right, But Was It Right for the Right Reason?

AI-assisted draft.

4 hours ago · 2 min read

I built a benchmark to see if an LLM can interpret clinical genetic variants.

The initial results looked bad. The model scored 60 percent accuracy. I almost concluded the model was mediocre and unfit for use.

I was wrong.

The real insight only appeared when I stopped measuring accuracy and started measuring safety.

In clinical genetics, a wrong answer can be dangerous. There are two types of errors:

Safe abstention: The model says "uncertain" when the truth is a confident call. This is safe because a human will investigate.
Confident error: The model makes the opposite call (e.g., calling a disease-causing variant "benign"). This is a dangerous failure.

My benchmark showed the model had zero confident errors. Every answer that counted against its accuracy was an abstention, so it never made a dangerous mistake. It simply answered "uncertain" when it lacked sufficient evidence.

When I used a simple accuracy metric, I branded a safe, well-calibrated model as a failure. My metric was the problem, not the model.
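
To make the distinction concrete, here is a minimal sketch of scoring that keeps the two error types apart. The label strings, the function names, and the toy data are assumptions for illustration, not taken from the benchmark repository, and the sketch assumes every ground-truth label is a confident call ("benign" or "pathogenic").

```python
from collections import Counter

# Hypothetical label for an abstention; the real benchmark may use another string.
UNCERTAIN = "uncertain"


def bucket(truth: str, prediction: str) -> str:
    """Place one prediction into exactly one of three buckets."""
    if prediction == truth:
        return "correct"
    if prediction == UNCERTAIN:
        # The model declined to call a variant that has a confident truth label.
        return "safe_abstention"
    # The model made a call, and it disagrees with the truth label.
    return "confident_error"


def score(pairs):
    """Report accuracy next to the two error rates instead of accuracy alone."""
    if not pairs:
        raise ValueError("need at least one (truth, prediction) pair")
    counts = Counter(bucket(truth, pred) for truth, pred in pairs)
    n = len(pairs)
    return {
        "accuracy": counts["correct"] / n,
        "safe_abstention_rate": counts["safe_abstention"] / n,
        "confident_error_rate": counts["confident_error"] / n,
    }


# Made-up toy data (not the benchmark's results): 6 correct calls, 4 abstentions.
toy_pairs = (
    [("pathogenic", "pathogenic")] * 3
    + [("benign", "benign")] * 3
    + [("pathogenic", UNCERTAIN)] * 2
    + [("benign", UNCERTAIN)] * 2
)

result = score(toy_pairs)
print(result)
```

On this toy data the accuracy line alone looks mediocre, while the confident error rate shows that none of the misses was an opposite call. Reporting the three numbers side by side is the whole point of the sketch.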

If you build benchmarks for high-stakes fields like medicine, law, or finance, follow these rules:

Separate safe failures from dangerous ones. Never put an honest "I don't know" in the same bucket as a confident lie.
Audit the reasoning. Accuracy alone does not show whether a model is following sound logic or fabricating evidence.
Keep your evidence real. Do not inject fake data into your tests. If your evaluation uses fake data, you cannot test if the model hallucinates.
Calibrate your own analysis. Small sample sizes can lie. Do not publish findings until you have verified them on a larger sample.
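
One way to see how a small sample can mislead is to put an interval around an observed error rate. The sketch below uses the Wilson score interval, which is a swapped-in standard technique and not something the original benchmark is known to use. The function name and the sample sizes are hypothetical.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical sample sizes: zero observed confident errors in each case.
for n in (20, 100, 500):
    low, high = wilson_interval(0, n)
    print(f"n={n}: true error rate plausibly up to {high:.3f}")
```

With zero observed errors the upper bound simplifies to z^2 / (n + z^2), so at n = 20 it is still about 16 percent. "Zero confident errors" on a small test set is therefore compatible with a true error rate that is far from zero, which is why the finding needs a larger sample before it is published.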

In high-stakes domains, a model that knows when to stop is more valuable than a model that guesses.

The code is on GitHub: gbadedata/clinvar-interpretation-benchmark.

Complete post: https://dev.to/gbadedata/your-llm-got-the-variant-right-but-did-it-get-it-right-for-the-right-reason-1oc3

Optional learning community: https://t.me/GyaanSetuAi
