Qwen 2.5 7B Confidence Is Unreliable
Large language models often misstate how sure they are.
A new study from the University of Minnesota shows a major flaw in Qwen 2.5 7B. When this model works with clinical data, its confidence scores barely change from case to case.
The model reports confidence between 0.856 and 0.937, whether its answer is right or wrong.
Key findings from the research:
- The model is epistemically uncalibrated. Its certainty depends on prompt format rather than accuracy.
- High confidence does not mean high accuracy.
- The model is most confidently wrong on easy cases.
- Traditional models like XGBoost outperform LLMs on structured tabular data.
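One common way to put a number on the gap between confidence and accuracy is expected calibration error (ECE). The sketch below is illustrative only: the source does not say which metric the study used, and the confidence values and correctness flags are made up to fall in the reported 0.856 to 0.937 range.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical data: confidence stays high while half the answers are wrong.
confidences = [0.86, 0.90, 0.93, 0.88, 0.91, 0.87, 0.92, 0.89]
correct = [True, False, True, False, False, True, False, True]

ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.3f}")
```

A well-calibrated model would score close to 0. A model whose confidence sits near 0.9 regardless of correctness scores much higher.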
Why does this happen?
LLMs learn from natural language. They lack intuition for rows of clinical numbers. They rely on linguistic patterns instead of actual data evidence.
This creates a risk in healthcare. If you trust a model's confidence score, you might accept a wrong answer as a fact.
The researchers found a way to fix this without retraining the model:
- Combine few-shot examples with SHAP attribution injection. This increased accuracy from 49% to 75.3%.
- Use a cross-model calibrator. By comparing the LLM to a classical ML model, you can detect when the LLM is unreliable. This method reduced the error rate significantly.
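To make "SHAP attribution injection" concrete, here is a minimal sketch of how a few-shot prompt could carry feature attributions alongside the raw values. The prompt wording, the function names `format_case` and `build_prompt`, and the example numbers are all assumptions for illustration. The attribution scores are assumed to be precomputed by a classical model's explainer, and the paper's actual prompt template may differ.

```python
def format_case(features, attributions, top_k=3):
    """Render one record plus its top-k attributions (by absolute value)."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    feature_text = ", ".join(f"{name}={value}" for name, value in features.items())
    attribution_text = ", ".join(
        f"{name} ({score:+.2f})" for name, score in ranked[:top_k]
    )
    return f"Features: {feature_text}\nTop attributions: {attribution_text}"


def build_prompt(examples, query_features, query_attributions, top_k=3):
    """Assemble labeled few-shot examples followed by the unlabeled query."""
    parts = ["Classify each patient record as positive or negative."]
    for example in examples:
        parts.append(format_case(example["features"], example["attributions"], top_k))
        parts.append(f"Label: {example['label']}")
    parts.append(format_case(query_features, query_attributions, top_k))
    parts.append("Label:")
    return "\n".join(parts)


# Hypothetical records and attribution scores, not taken from the study.
examples = [
    {
        "features": {"glucose": 168, "bmi": 33.1, "age": 54},
        "attributions": {"glucose": 0.41, "bmi": 0.12, "age": 0.05},
        "label": "positive",
    },
    {
        "features": {"glucose": 92, "bmi": 22.4, "age": 31},
        "attributions": {"glucose": -0.38, "bmi": -0.09, "age": -0.04},
        "label": "negative",
    },
]

prompt = build_prompt(
    examples,
    query_features={"glucose": 151, "bmi": 29.8, "age": 47},
    query_attributions={"glucose": 0.29, "bmi": 0.07, "age": 0.02},
    top_k=2,
)
print(prompt)
```

The idea is that the LLM no longer has to guess which numbers matter, because the classical model's evidence is stated in the prompt.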
The takeaway is simple. Do not trust verbalized confidence scores for structured data. Use hybrid pipelines. Let classical models handle the numbers and use LLMs for reasoning and explanation.
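A hybrid pipeline along these lines can be sketched as a simple agreement check. This is not the paper's calibrator, whose details are not given here. It is a hypothetical rule in which the classical model's probability decides the label, the LLM's verbalized confidence is deliberately ignored, and any disagreement or borderline case is flagged for review. The names `Decision` and `cross_check`, plus the `threshold` and `margin` values, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    label: int          # final label, taken from the classical model
    llm_agrees: bool    # whether the LLM predicted the same label
    needs_review: bool  # True when the case should go to a human


def cross_check(llm_label, classical_proba, threshold=0.5, margin=0.1):
    """Compare an LLM label with a classical model's positive-class probability."""
    if not 0.0 <= classical_proba <= 1.0:
        raise ValueError("classical_proba must be in [0, 1]")
    classical_label = int(classical_proba >= threshold)
    llm_agrees = llm_label == classical_label
    # Borderline: the classical model itself is close to the decision boundary.
    borderline = abs(classical_proba - threshold) < margin
    return Decision(
        label=classical_label,
        llm_agrees=llm_agrees,
        needs_review=(not llm_agrees) or borderline,
    )


# Hypothetical cases: (LLM label, classical probability of the positive class).
agree_case = cross_check(llm_label=1, classical_proba=0.90)
disagree_case = cross_check(llm_label=0, classical_proba=0.90)
borderline_case = cross_check(llm_label=1, classical_proba=0.55)

for name, decision in [
    ("agree", agree_case),
    ("disagree", disagree_case),
    ("borderline", borderline_case),
]:
    print(name, decision)
```

In this setup the LLM can still write the explanation for each case, while the numeric decision and the reliability signal come from the classical model.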
Source: https://arxiv.org/abs/2606.19509