𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗗𝗼𝗺𝗮𝗶𝗻-𝗦𝗽𝗲𝗰𝗶𝗳𝗶𝗰 𝗟𝗟𝗠 𝗘𝘃𝗮𝗹 𝗦𝗲𝘁𝘀

📅 2 weeks ago ⏱ 1 min read

You ship a model that scores 91% accuracy on its test set. It still fails on specific cases in production. The test set was a random split, so it shared the blind spots of the training data. It missed the real bugs.

Hand-built eval sets fill this gap. General benchmarks do not measure your specific task. A private set is much less likely to leak into training data, which limits contamination. It can also act as a release gate in your CI pipeline.
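A release gate can be as small as a function that turns eval scores into an exit code. This is a minimal sketch, not a prescribed setup. The `gate` name, the per-example score list, and the 0.85 threshold are all hypothetical.

```python
from statistics import mean


def gate(scores: list[float], threshold: float) -> int:
    """Return a process exit code: 0 if the mean score passes, 1 otherwise."""
    if not scores:
        # An empty eval run should never pass silently.
        return 1
    return 0 if mean(scores) >= threshold else 1


# Hypothetical per-example scores; in CI these would come from your scorer.
example_scores = [1.0, 1.0, 0.0, 1.0, 1.0]
exit_code = gate(example_scores, threshold=0.85)
```

In a CI step you might pass the result to `sys.exit`, so a score below the threshold fails the build.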

A good eval set has these five properties:

Representative: Covers real user inputs.
Hard: Includes cases where strong models fail.
Versioned: Tracked in your repo.
Blind: Not used in training.
Scored automatically: No human votes.
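The Blind property can be checked mechanically. This is a minimal sketch, assuming training and eval inputs are both available as plain strings. The helper names and the normalization rule (lowercase, collapsed whitespace) are illustrative choices. Hash matching only catches verbatim or near-verbatim leaks, not paraphrases.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a normalized form of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_examples(train_inputs: list[str], eval_inputs: list[str]) -> list[str]:
    """Return eval inputs whose fingerprint also appears in the training data."""
    train_hashes = {fingerprint(t) for t in train_inputs}
    return [e for e in eval_inputs if fingerprint(e) in train_hashes]
```

Any input this returns should be dropped from the eval set or replaced before the set is versioned.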

Follow this process:

Sample 400 to 800 production inputs.
Remove PII.
Have two experts label the data independently.
Check their agreement with Cohen's kappa.
Fix the rubric if agreement is low.
Split the data into an eval set and a calibration set.
Add a scorer to your CI.
Re-sample every quarter.
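The agreement check in the process above can be computed by hand. Cohen's kappa is (observed agreement - chance agreement) / (1 - chance agreement) for two annotators. This is a minimal sketch. The function name and the sample labels are illustrative, and what counts as "low" agreement is a threshold your team has to choose.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two experts on four items.
expert_a = ["pass", "pass", "fail", "fail"]
expert_b = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(expert_a, expert_b)  # 0.5
```

Here the experts agree on 3 of 4 items (0.75), chance agreement is 0.5, so kappa is 0.5.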

Avoid these mistakes:

Do not use a model to label its own set.
Do not make the set too easy.
Do not trust a set from a year ago.
Do not rely on a human to say it looks right.

Pick your scorer:

Exact match for classification.
Embedding similarity for paraphrasing.
LLM-as-judge for long answers.
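The first two scorers fit in a few lines. This is a minimal sketch, assuming you already have embedding vectors from a model of your choice. The function names and the normalization in `exact_match` (trim, lowercase) are illustrative assumptions. LLM-as-judge is left out because it depends on your model provider.

```python
import math


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the labels match after trimming and lowercasing, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

For paraphrase tasks you would compare the similarity against a cutoff tuned on labeled data, not a fixed number.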

Wait to build a set if you are still picking a base model. Wait if you have no real user data. Wait if your product changes every week.

Start with real traffic. Fix your rubric. Automate your score.

Source: https://dev.to/tech_nuggets/building-a-domain-specific-llm-evaluation-set-from-scratch-37n3
Optional learning community: https://t.me/GyaanSetuAi
