𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗗𝗼𝗺𝗮𝗶𝗻-𝗦𝗽𝗲𝗰𝗶𝗳𝗶𝗰 𝗟𝗟𝗠 𝗘𝘃𝗮𝗹 𝗦𝗲𝘁𝘀
You ship a model with 91% accuracy. It still fails on specific cases. Your test set was a random split. It missed the real bugs.
Hand-built eval sets fill this gap. General benchmarks do not measure your specific task. Private sets guard against data contamination, because their examples never appear in public training data. They also act as release gates in your CI pipeline.
A good set has these 5 properties:
- Representative: Covers real user inputs.
- Hard: Includes cases where strong models fail.
- Versioned: Tracked in your repo.
- Blind: Not used in training.
- Scored automatically: No human votes.
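The "Blind" property can be checked mechanically. Here is a minimal sketch, assuming you can load both the eval inputs and the training inputs as lists of strings. The names `fingerprint` and `find_leaks` are made up for this example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a lowercased, whitespace-normalized copy of the input."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(eval_inputs: list[str], training_inputs: list[str]) -> list[str]:
    """Return eval inputs whose fingerprint also appears in the training data."""
    seen = {fingerprint(t) for t in training_inputs}
    return [e for e in eval_inputs if fingerprint(e) in seen]


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    leaks = find_leaks(
        ["Reset my  password", "Cancel order"],
        ["reset my password"],
    )
    print(leaks)
```

This only catches exact duplicates after normalization. Paraphrased or near-duplicate leaks need a fuzzier check.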
Follow this process:
- Sample 400 to 800 production inputs.
- Remove PII.
- Label data with experts.
- Check inter-annotator agreement with Cohen's kappa.
- Fix the rubric if agreement is low.
- Split data into eval and calibration.
- Add a scorer to your CI.
- Re-sample every quarter.
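The agreement check in the steps above can be sketched in a few lines. This assumes two annotators labeled the same items in the same order. The function name and the sample labels are illustrative, and what counts as "low" agreement is left to you.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)

    # Observed agreement: fraction of items where the labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used one identical label for everything.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    annotator_1 = ["y", "y", "n", "n"]
    annotator_2 = ["y", "n", "n", "n"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance. Two annotators who both say "n" most of the time agree often even when the rubric is unclear, and kappa discounts that.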
Avoid these mistakes:
- Do not use a model to label its own set.
- Do not make the set too easy.
- Do not trust a set from a year ago.
- Do not rely on a human to say it looks right.
Pick your scorer:
- Exact match for classification.
- Embedding similarity for paraphrasing.
- LLM-as-judge for long answers.
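Two of these scorers fit in a short sketch, along with a pass/fail gate. Assumptions: predictions and references are plain strings, embeddings come from a model of your choice and arrive as lists of floats, and the 0.9 threshold is a placeholder. All function names are hypothetical. LLM-as-judge is left out because it depends on the judge model you pick.

```python
import math


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("Need equal-length, non-empty prediction and reference lists.")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_gate(score: float, threshold: float = 0.9) -> bool:
    """True if the score clears the release threshold (0.9 is an assumed default)."""
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical classification outputs for illustration only.
    score = exact_match_accuracy(["Refund ", "billing"], ["refund", "shipping"])
    print(score, passes_gate(score))
```

In a CI job, a script like this could exit with a non-zero status when `passes_gate` returns `False`, so a failing score blocks the release.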
Wait to build a set if you are still picking a base model. Wait if you have no real user data. Wait if your product changes every week.
Start with real traffic. Fix your rubric. Automate your score.
Source: https://dev.to/tech_nuggets/building-a-domain-specific-llm-evaluation-set-from-scratch-37n3