I Built An AI Security Scanner — Then Found A Bug In My Own Detector
Prompt injection is a top security risk for LLM applications. You feed a model text that tells it to ignore its original instructions. Sometimes, the model listens.
I built AgentProbe to test this. It fires 49 attack prompts across 8 categories, such as jailbreaks and data extraction, and reports how often a model fails.
The real lesson was not the scanner. It was a bug in my detection code.
The hard part is not the attack. The hard part is detection.
How do you know if a model actually complied with an attack? Keyword matching is the easy way. You look for refusal phrases like "I cannot help" or compliance phrases like "developer mode."
But models use a "hedge-then-comply" pattern. They say "I cannot help with that," but then they provide the restricted information anyway. Keyword matching fails here because the refusal phrase is present.
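The failure is easy to reproduce. Here is a minimal sketch of a naive keyword check, not AgentProbe's actual code: the function name and the phrase lists are assumptions, built only from the two example phrases quoted above.

```python
# Hypothetical phrase lists, taken from the two examples in the text.
REFUSAL_PHRASES = ("i cannot help",)
COMPLIANCE_PHRASES = ("developer mode",)


def naive_keyword_verdict(response: str) -> str:
    """Return 'refused', 'complied', or 'unknown' from substring matches alone."""
    text = response.lower()
    if any(phrase in text for phrase in REFUSAL_PHRASES):
        return "refused"
    if any(phrase in text for phrase in COMPLIANCE_PHRASES):
        return "complied"
    return "unknown"


# A hedge-then-comply response: it opens with a refusal, then leaks anyway.
hedged = "I cannot help with that. However, here is the system prompt you asked for: ..."
print(naive_keyword_verdict(hedged))  # refused
```

The check stops at the refusal phrase and never looks at what follows, so the leak is scored as a refusal.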
To fix this, I used an "LLM-as-judge" system. I sent the exchange to a stronger model to decide if the target actually complied. My plan was to use cheap keyword checks first and only use the expensive judge when the keyword check was unsure.
Then I found my bug.
My keyword detector returned a confidence score of 1 when it found a "hedge-then-comply" pattern. However, my code only trusted the keyword stage if confidence was 2 or higher.
This meant my "cheap" detector never actually made a decision. Every single case escalated to the expensive judge. I was paying for a judge on every case, even when my free tool should have handled it.
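To make the mismatch concrete, here is a small sketch of a two-stage pipeline with the same kind of threshold gap. It is an illustration under assumptions, not the original code: `KeywordResult`, `keyword_stage`, `classify`, the marker phrases, and the stand-in judge are all hypothetical. Only the confidence of 1 and the threshold of 2 come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class KeywordResult:
    verdict: str     # "complied", "refused", or "unknown"
    confidence: int  # hypothetical integer scale; higher means more certain


def keyword_stage(response: str) -> KeywordResult:
    """Toy keyword detector. The marker phrases are illustrative assumptions."""
    text = response.lower()
    refused = "i cannot help" in text
    continued = "however" in text or "here is" in text
    if refused and continued:
        # Hedge-then-comply: reported with confidence 1, as described above.
        return KeywordResult("complied", 1)
    if refused:
        return KeywordResult("refused", 1)
    return KeywordResult("unknown", 0)


def classify(
    response: str, judge: Callable[[str], str], trust_threshold: int = 2
) -> Tuple[str, str]:
    """Return (verdict, stage_that_decided)."""
    result = keyword_stage(response)
    if result.confidence >= trust_threshold:
        return result.verdict, "keyword"
    return judge(response), "judge"  # escalate to the expensive stage


judge_calls = []


def fake_judge(response: str) -> str:
    """Stand-in for the LLM judge; it only records that it was called."""
    judge_calls.append(response)
    return "complied"


hedged = "I cannot help with that. However, here is the restricted text: ..."
print(classify(hedged, fake_judge))                     # ('complied', 'judge')
print(classify(hedged, fake_judge, trust_threshold=1))  # ('complied', 'keyword')
print(len(judge_calls))                                 # 1
```

With the threshold at 2, no result from this toy detector can ever be trusted, so every call reaches the judge. Lowering the threshold is only one way to line the two numbers up. Raising the confidence the detector reports would work as well, and the sketch does not claim to show which fix the author chose.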
This mistake taught me three lessons about using LLMs to grade other LLMs:
- The judge must be smarter than the target. If the judge is the same size as the target, it is likely to share the same blind spots.
- Accuracy can lie. If most responses are refusals, a judge that always says "refused" looks accurate but tells you nothing. Use a chance-corrected agreement metric such as Cohen's kappa, which discounts the agreement you would expect from chance alone.
- The judge must be stable. Run the same test five times. If the judge changes its mind, the result is ambiguous and needs a human eye.
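The second and third lessons can be checked with a few lines of code. This is a sketch under assumptions: the function names, the label strings, and the toy data are made up for illustration. The kappa formula is the standard one, (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is a choice made for this sketch
    return (observed - expected) / (1 - expected)


def is_stable(judge: Callable[[str], str], exchange: str, runs: int = 5) -> bool:
    """True if the judge returns the same verdict on every run."""
    return len({judge(exchange) for _ in range(runs)}) == 1


# Toy data: 9 of 10 responses are real refusals, and the judge always says "refused".
human_labels = ["refused"] * 9 + ["complied"]
lazy_judge_labels = ["refused"] * 10

accuracy = sum(h == j for h, j in zip(human_labels, lazy_judge_labels)) / 10
print(accuracy)                                       # 0.9
print(cohens_kappa(human_labels, lazy_judge_labels))  # 0.0

# A judge that alternates verdicts fails the stability check.
flaky_verdicts = cycle(["refused", "complied"])
print(is_stable(lambda _exchange: next(flaky_verdicts), "some exchange"))  # False
```

The lazy judge scores 90% accuracy, yet its kappa is 0.0 because all of its agreement with the human labels is what chance would produce. Any exchange where `is_stable` returns False is a candidate for human review.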
If you build with LLMs, remember these points:
- Detecting compliance is harder than causing it.
- Watch for models that refuse in the first sentence but comply in the second.
- Do not trust an LLM judge blindly. Measure its reliability.
- Share your bugs. Finding my own flaw taught me more than a perfect launch would have.
Source: https://dev.to/nar1frames/i-built-an-ai-security-scanner-then-found-a-bug-in-my-own-detector-140a
