๐—ฆ๐˜๐—ผ๐—ฝ ๐—ง๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด ๐—ฌ๐—ผ๐˜‚๐—ฟ ๐—”๐—œ. ๐—ฆ๐˜๐—ฎ๐—ฟ๐˜ ๐— ๐—ฒ๐—ฎ๐˜€๐˜‚๐—ฟ๐—ถ๐—ป๐—ด ๐—œ๐˜.

I built a training simulator as a side project. It uses a language model to write entire choose-your-own-adventure stories. Every scene, character, and choice is generated at once.

This created a massive testing problem. Every time I changed a prompt, I had to regenerate everything to see if it improved. Manual review was impossible because the volume was too high. I could not tell if a one-line change made the story better or worse.

I had to stop testing and start measuring. I built a harness to automate the process.
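A minimal sketch of what such a harness could look like, assuming two prompt variants are compared over the same seeds. The names `compare_prompts`, `fake_generate`, and `fake_score` are hypothetical stand-ins: a real harness would call a language model to generate the story and a real scorer to grade it.

```python
import random
import statistics
from typing import Callable, Dict, List

Story = List[str]


def compare_prompts(
    generate: Callable[[str, int], Story],
    score: Callable[[Story], float],
    prompt_a: str,
    prompt_b: str,
    runs: int = 20,
) -> Dict[str, float]:
    """Score both prompts over the same seeds and report mean, spread, and delta."""
    scores_a = [score(generate(prompt_a, seed)) for seed in range(runs)]
    scores_b = [score(generate(prompt_b, seed)) for seed in range(runs)]
    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "stdev_a": statistics.stdev(scores_a),
        "stdev_b": statistics.stdev(scores_b),
        "delta": mean_b - mean_a,  # positive means prompt_b scored higher
    }


def fake_generate(prompt: str, seed: int) -> Story:
    """Hypothetical stand-in for a model call: longer prompts yield more scenes."""
    rng = random.Random(seed)
    n_scenes = len(prompt.split()) + rng.randint(0, 3)
    return [f"scene {i}" for i in range(n_scenes)]


def fake_score(story: Story) -> float:
    """Hypothetical stand-in for a scorer: here, just the number of scenes."""
    return float(len(story))


report = compare_prompts(
    fake_generate,
    fake_score,
    prompt_a="Write a story",
    prompt_b="Write a branching story with choices",
)
print(report)
```

Reusing the same seeds for both prompts keeps the comparison paired, so the delta reflects the prompt change and not a lucky draw.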

Here is how I engineered the measurement system:

The biggest lesson: No single model wins at everything.

I found that some models are great at structure but poor at writing. Others are brilliant writers but fail at following rules.

Instead of forcing one model to do everything, I split the work:
• Use a cheap, reliable model for structure.
• Use a high-quality model for content.
• Use a powerful, expensive model to act as the judge.
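The split above can be sketched as a small pipeline with one slot per role. This is an illustration under assumptions, not the original implementation: `Pipeline` and the three `stub_*` functions are hypothetical, and each stub would be replaced by a call to whichever model fills that role.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Pipeline:
    """One callable per role, so each role can be swapped independently."""

    structure_model: Callable[[str], str]  # cheap, reliable: premise -> outline
    content_model: Callable[[str], str]    # high quality: outline -> story text
    judge_model: Callable[[str], float]    # powerful, expensive: story -> score

    def run(self, premise: str) -> Tuple[str, float]:
        outline = self.structure_model(premise)
        story = self.content_model(outline)
        return story, self.judge_model(story)


# Hypothetical stubs standing in for real model calls.
def stub_structure(premise: str) -> str:
    return f"OUTLINE[{premise}]: intro -> choice -> ending"


def stub_content(outline: str) -> str:
    return f"STORY from {outline}"


def stub_judge(story: str) -> float:
    # Toy rule: a story passes if it still contains a choice point.
    return 1.0 if "choice" in story else 0.0


pipeline = Pipeline(stub_structure, stub_content, stub_judge)
story, score = pipeline.run("heist")
print(score, story)
```

Keeping the roles as separate callables means a weak score can be traced to one stage and that stage swapped without touching the others.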

Stop treating a bad score as proof of a bad model. Treat it as a measurement question. Ask whether the number is right and whether your instrument is reliable.
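One way to check whether the instrument is reliable is to score the same story several times and see how much the judge disagrees with itself. The sketch below assumes a noisy judge simulated with a seeded random generator; `judge_noise`, `is_meaningful`, and the threshold `k = 2.0` are hypothetical choices for illustration, not a standard.

```python
import random
import statistics
from typing import Callable, Tuple


def judge_noise(
    judge: Callable[[str], float], story: str, repeats: int = 10
) -> Tuple[float, float]:
    """Score the same story repeatedly; return the mean and the sample stdev."""
    scores = [judge(story) for _ in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)


def is_meaningful(delta: float, noise_stdev: float, k: float = 2.0) -> bool:
    """Rule of thumb (an assumption): trust a change only if it exceeds k times the noise."""
    return abs(delta) > k * noise_stdev


# Hypothetical noisy judge: a true score of 7.0 plus up to 0.5 of jitter either way.
_rng = random.Random(0)


def noisy_judge(story: str) -> float:
    return 7.0 + _rng.uniform(-0.5, 0.5)


mean_score, noise = judge_noise(noisy_judge, "the same story every time")
print(f"mean={mean_score:.2f} noise={noise:.2f}")
print(is_meaningful(delta=0.1, noise_stdev=noise))
```

If the change between two prompt versions is smaller than the judge's own run-to-run spread, the number says more about the instrument than about the prompt.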

Source: https://dev.to/aws-builders/stop-testing-your-ai-start-measuring-it-312m
