๐ฆ๐๐ผ๐ฝ ๐ง๐ฒ๐๐๐ถ๐ป๐ด ๐ฌ๐ผ๐๐ฟ ๐๐. ๐ฆ๐๐ฎ๐ฟ๐ ๐ ๐ฒ๐ฎ๐๐๐ฟ๐ถ๐ป๐ด ๐๐.
I built a training simulator as a side project. It uses a language model to write entire choose-your-own-adventure stories. Every scene, character, and choice is generated at once.
This created a massive testing problem. Every time I changed a prompt, I had to regenerate everything to see if it improved. Manual review was impossible because the volume was too high. I could not tell if a one-line change made the story better or worse.
I had to stop testing and start measuring. I built a harness to automate the process.
Here is how I engineered the measurement system:
Use gated pipelines I run tests in stages. I start with cheap, deterministic checks. If the structure is broken, I stop there. I do not waste money on an expensive AI judge if the basic logic fails.
Make structure self-healing Structure has a correct answer. I use code to validate the output. If the model fails, I feed the errors back for a retry. I also use a repair pass to fix broken links. This allows me to use cheaper, less reliable models for the heavy lifting.
Engineer the judge like a scientific instrument The judge is a second AI model that scores quality. To keep it accurate, I follow these rules: โข Use forced tool-use for structured data. โข Set temperature to 0. โข Run at least 3 samples and average the results to reduce noise. โข Use a different model family for the judge than the generator.
Use a deterministic walker A judge might miss a logic loop. I wrote a script that plays through every possible path in the story. This found infinite loops and empty content paths that no human or AI judge noticed.
Read the artifact, not just the score A low score tells you there is a problem. It does not tell you how to fix it. I always look at the actual text. I found that one "warm" character was ruining the tone. The fix was not a better prompt, but giving the model absolute instructions instead of relative ones.
The biggest lesson: No single model wins at everything.
I found that some models are great at structure but poor at writing. Others are brilliant writers but fail at following rules.
Instead of forcing one model to do everything, I split the work: โข Use a cheap, reliable model for structure. โข Use a high-quality model for content. โข Use a powerful, expensive model to act as the judge.
Stop treating a bad score as a bad model. Treat it as a measurement question. Ask if the number is right and if your instrument is reliable.
Source: https://dev.to/aws-builders/stop-testing-your-ai-start-measuring-it-312m
Optional learning community: https://t.me/GyaanSetuAi