𝗤𝘄𝗲𝗻 𝟮.𝟱 𝟳𝗕 𝗖𝗼𝗻𝗳𝗶𝗱𝗲𝗻𝗰𝗲 𝗜𝘀 𝗨𝗻𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲

AI-assisted draft.

Large language models often lie about how sure they are.

A new study from the University of Minnesota reports a major flaw in Qwen 2.5 7B. When the model works with structured clinical data, its confidence scores barely move, whether its answer is right or wrong.

The model reports confidence between 0.856 and 0.937. That narrow band holds even when the model is wrong.

Key findings from the research:

The model is epistemically uncalibrated. Its certainty depends on prompt format rather than accuracy.
High confidence does not mean high accuracy.
The model is most confidently wrong on easy cases.
Traditional models like XGBoost outperform LLMs on structured tabular data.
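
One common way to measure this kind of miscalibration is expected calibration error (ECE): group predictions by stated confidence, then compare average confidence with actual accuracy in each group. The sketch below is a minimal illustration, not the study's evaluation code. The function name, the bin count, and the sample data are all assumptions made up for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Made-up data: confidence sits in a narrow high band, but only half
# of the answers are right. These are NOT numbers from the study.
confidences = [0.86, 0.88, 0.90, 0.91, 0.92, 0.93, 0.87, 0.89]
correct = [True, False, True, False, False, True, False, True]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

A well-calibrated model would score close to 0. A model that always says "about 0.9" while being right half the time scores close to 0.4.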

Why does this happen?

LLMs learn from natural language. They lack intuition for rows of clinical numbers. They rely on linguistic patterns instead of the evidence in the data.

This creates a risk in healthcare. If you trust a model's confidence score, you might accept a wrong answer as a fact.

The researchers report two mitigations that do not require retraining the model:

Combine few-shot examples with SHAP attribution injection. This increased accuracy from 49% to 75.3%.
Use a cross-model calibrator. By comparing the LLM to a classical ML model, you can detect when the LLM is unreliable. This method reduced the error rate significantly.
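
The cross-model idea can be sketched as a simple agreement check. This is an assumption-heavy illustration, not the calibrator from the paper, whose details are not given here. The names `cross_check` and `Judgement`, the 0.2 margin, and the choice to fall back to the classical model's label on disagreement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    label: int      # final label to use (0 or 1)
    trusted: bool   # False means the LLM answer should be reviewed
    reason: str


def cross_check(
    llm_label: int,
    llm_confidence: float,
    clf_prob_positive: float,
    margin: float = 0.2,
) -> Judgement:
    """Flag LLM answers that a classical classifier does not support.

    clf_prob_positive is the classical model's probability for class 1,
    for example from a gradient-boosted tree model.
    """
    clf_label = 1 if clf_prob_positive >= 0.5 else 0
    clf_confidence = max(clf_prob_positive, 1.0 - clf_prob_positive)

    if llm_label != clf_label:
        # Assumption: on structured data, defer to the classical model.
        return Judgement(clf_label, False, "LLM and classical model disagree")
    if llm_confidence - clf_confidence > margin:
        # Same label, but the LLM is far more certain than the evidence suggests.
        return Judgement(llm_label, False, "LLM is overconfident relative to classical model")
    return Judgement(llm_label, True, "models agree")


# Hypothetical case: LLM says "positive" at 0.93, classifier gives 0.12.
result = cross_check(llm_label=1, llm_confidence=0.93, clf_prob_positive=0.12)
print(result)
```

The point of the design is that the LLM's own confidence never decides trust by itself. It only counts when an independent model trained on the numbers backs it up.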

The takeaway is simple. Do not trust verbalized confidence scores for structured data. Use hybrid pipelines. Let classical models handle the numbers and use LLMs for reasoning and explanation.
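
A hybrid pipeline along these lines might hand the classical model's prediction and its top feature attributions to the LLM, and ask only for an explanation. The sketch below builds such a prompt. Everything in it is an assumption for illustration: the function name, the prompt wording, the feature names, and the attribution values, which are hard-coded here and would normally come from an attribution method such as SHAP.

```python
from typing import Mapping, Sequence


def build_explanation_prompt(
    features: Mapping[str, float],
    attributions: Mapping[str, float],
    predicted_label: str,
    few_shot: Sequence[str] = (),
    top_k: int = 3,
) -> str:
    """Build a prompt where the classical model decides and the LLM explains."""
    # Keep the top_k features with the largest absolute attribution.
    ranked = sorted(
        attributions, key=lambda name: abs(attributions[name]), reverse=True
    )[:top_k]

    lines = list(few_shot)  # optional worked examples go first
    lines.append(f"Classical model prediction: {predicted_label}")
    lines.append("Most influential features (value, attribution):")
    for name in ranked:
        lines.append(
            f"- {name} = {features[name]} (attribution {attributions[name]:+.2f})"
        )
    lines.append(
        "Explain this prediction in plain language. Do not change the prediction."
    )
    return "\n".join(lines)


# Hypothetical patient row and attribution values, invented for this example.
features = {"age": 71, "systolic_bp": 158, "hba1c": 6.1, "bmi": 24.3}
attributions = {"age": 0.21, "systolic_bp": 0.34, "hba1c": -0.05, "bmi": 0.02}
prompt = build_explanation_prompt(features, attributions, "high risk")
print(prompt)
```

The LLM never sees a request for a confidence score, so there is no verbalized number to over-trust.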

Source: https://arxiv.org/abs/2606.19509
