𝗦𝘁𝗼𝗽 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗔𝗜. 𝗦𝘁𝗮𝗿𝘁 𝗠𝗲𝗮𝘀𝘂𝗿𝗶𝗻𝗴 𝗜𝘁.

📅22 hours ago⏱2 min read

I built a training simulator as a side project. It uses a language model to write entire choose-your-own-adventure stories. Every scene, character, and choice is generated at once.

This created a massive testing problem. Every time I changed a prompt, I had to regenerate everything to see if it improved. Manual review was impossible because the volume was too high. I could not tell if a one-line change made the story better or worse.

I had to stop testing and start measuring. I built a harness to automate the process.

Here is how I engineered the measurement system:

Use gated pipelines: I run tests in stages. I start with cheap, deterministic checks. If the structure is broken, I stop there. I do not waste money on an expensive AI judge if the basic logic fails.
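
To make the gating idea concrete, here is a minimal sketch, not the author's actual harness. It assumes a story is a dict mapping scene ids to a text and a set of choices, and that `judge` is any callable that returns a score; both shapes and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumed story shape: scene id -> {"text": str, "choices": {label: target scene id}}
Story = dict


@dataclass
class GateResult:
    passed: bool
    stage: str                      # the stage at which the run stopped
    errors: list = field(default_factory=list)
    score: Optional[float] = None


def check_structure(story: Story) -> list:
    """Cheap deterministic gate: every choice must point at an existing scene."""
    errors = []
    for scene_id, scene in story.items():
        for label, target in scene.get("choices", {}).items():
            if target not in story:
                errors.append(f"{scene_id}: choice '{label}' -> missing scene '{target}'")
    return errors


def run_gated(story: Story, judge: Callable[[Story], float]) -> GateResult:
    errors = check_structure(story)
    if errors:
        # Stop here: the expensive judge is never called on a broken story.
        return GateResult(passed=False, stage="structure", errors=errors)
    return GateResult(passed=True, stage="judge", score=judge(story))
```

The judge call is the only step that costs money, so it sits behind the free check.
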
Make structure self-healing: Structure has a correct answer. I use code to validate the output. If the model fails, I feed the errors back for a retry. I also use a repair pass to fix broken links. This allows me to use cheaper, less reliable models for the heavy lifting.
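
A rough sketch of the validate, retry, and repair loop follows. The `generate` callable stands in for the model call and is hypothetical. Two further things are assumptions for illustration, not the author's stated design: running the repair pass only after the retries are exhausted, and repairing by dropping dangling choices.

```python
from typing import Callable

# Assumed story shape: scene id -> {"text": str, "choices": {label: target scene id}}
Story = dict


def validate(story: Story) -> list:
    """Return a list of human-readable structural errors (empty means valid)."""
    errors = []
    if "start" not in story:
        errors.append("missing 'start' scene")
    for scene_id, scene in story.items():
        for label, target in scene.get("choices", {}).items():
            if target not in story:
                errors.append(f"{scene_id}: choice '{label}' points to unknown scene '{target}'")
    return errors


def repair_links(story: Story) -> Story:
    """Repair pass (one possible policy): drop choices whose target does not exist."""
    return {
        scene_id: {
            **scene,
            "choices": {l: t for l, t in scene.get("choices", {}).items() if t in story},
        }
        for scene_id, scene in story.items()
    }


def generate_with_retry(generate: Callable[[list], Story], max_attempts: int = 3) -> Story:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    errors = []
    for _ in range(max_attempts):
        story = generate(errors)    # errors from the previous attempt are fed back
        errors = validate(story)
        if not errors:
            return story
    repaired = repair_links(story)  # last resort: fix what code can fix
    remaining = validate(repaired)
    if remaining:
        raise ValueError(f"story still invalid after repair: {remaining}")
    return repaired
```

In practice `generate` would append the error list to the prompt so the model can see exactly which links it broke.
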
Engineer the judge like a scientific instrument: The judge is a second AI model that scores quality. To keep it accurate, I follow these rules:
• Use forced tool use so the judge returns structured data.
• Set temperature to 0.
• Run at least 3 samples and average the results to reduce noise.
• Use a different model family for the judge than the generator.
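
The sampling rule can be sketched as follows. `judge_once` is a hypothetical callable that wraps one judge call (already configured for temperature 0 and structured output) and returns a numeric score; no specific provider API is assumed. Reporting the spread alongside the mean is an addition for illustration.

```python
import statistics
from typing import Callable


def judge_score(judge_once: Callable[[str], float], story_text: str, samples: int = 3) -> dict:
    """Call the judge several times and report the mean and the spread of its scores."""
    if samples < 3:
        raise ValueError("use at least 3 samples so one noisy score cannot dominate")
    scores = [judge_once(story_text) for _ in range(samples)]
    return {
        "mean": statistics.mean(scores),
        "spread": max(scores) - min(scores),  # a large spread means the instrument is noisy
        "scores": scores,
    }
```

A wide spread is a signal to fix the judge prompt or rubric before trusting any comparison between two story versions.
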
Use a deterministic walker: A judge might miss a logic loop. I wrote a script that plays through every possible path in the story. This found infinite loops and empty content paths that no human or AI judge noticed.
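
One way such a walker could look, as a sketch under assumptions: the story uses the hypothetical dict shape below, every link is already valid, and a path stops as soon as it revisits a scene so that cycles do not make the walk endless. The `trapped` and `unreachable` reports are extras added for illustration.

```python
# Assumed story shape: scene id -> {"text": str, "choices": {label: target scene id}}
Story = dict


def walk(story: Story, start: str = "start") -> dict:
    """Follow every choice from `start`; assumes all choice targets exist in `story`."""
    findings = {"endings": set(), "empty_scenes": set(), "loops": []}
    visited = set()

    def visit(scene_id: str, path: tuple) -> None:
        if scene_id in path:
            # This choice leads back to a scene already on the current path: a cycle.
            findings["loops"].append(path[path.index(scene_id):] + (scene_id,))
            return
        visited.add(scene_id)
        scene = story[scene_id]
        if not scene.get("text", "").strip():
            findings["empty_scenes"].add(scene_id)
        choices = scene.get("choices", {})
        if not choices:
            findings["endings"].add(scene_id)
        for target in choices.values():
            visit(target, path + (scene_id,))

    visit(start, ())

    # A cycle is only a bug if the player cannot escape it: mark every reachable
    # scene from which no ending can be reached.
    can_finish = set(findings["endings"])
    changed = True
    while changed:
        changed = False
        for scene_id in visited - can_finish:
            if any(t in can_finish for t in story[scene_id].get("choices", {}).values()):
                can_finish.add(scene_id)
                changed = True
    findings["trapped"] = visited - can_finish
    findings["unreachable"] = set(story) - visited
    return findings
```

Because the walk is pure code, it gives the same answer on every run and costs nothing, which makes it a natural early gate before any judge call.
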
Read the artifact, not just the score: A low score tells you there is a problem. It does not tell you how to fix it. I always look at the actual text. I found that one "warm" character was ruining the tone. The fix was not a better prompt, but giving the model absolute instructions instead of relative ones.

The biggest lesson: No single model wins at everything.

I found that some models are great at structure but poor at writing. Others are brilliant writers but fail at following rules.

Instead of forcing one model to do everything, I split the work:
• Use a cheap, reliable model for structure.
• Use a high-quality model for content.
• Use a powerful, expensive model to act as the judge.
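
The split can be captured in a small configuration object, sketched here under assumptions. The class names, role names, and model identifiers are placeholders, not real models or the author's setup. The check enforces the earlier rule that the judge should come from a different model family than the generators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    family: str   # placeholder family label, e.g. "family-a"
    name: str     # placeholder model name


@dataclass(frozen=True)
class Pipeline:
    structure: ModelChoice   # cheap, reliable model for structure
    content: ModelChoice     # high-quality model for prose
    judge: ModelChoice       # powerful model that scores the result

    def __post_init__(self) -> None:
        # The judge should not grade output from its own model family.
        for role in ("structure", "content"):
            generator = getattr(self, role)
            if generator.family == self.judge.family:
                raise ValueError(f"judge shares a model family with the {role} model")


pipeline = Pipeline(
    structure=ModelChoice("family-a", "small-model"),
    content=ModelChoice("family-a", "large-model"),
    judge=ModelChoice("family-b", "judge-model"),
)
```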

Stop treating a bad score as a bad model. Treat it as a measurement question. Ask if the number is right and if your instrument is reliable.

Source: https://dev.to/aws-builders/stop-testing-your-ai-start-measuring-it-312m

