𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹 𝘀𝗽𝗲𝗮𝗸𝘀 𝗘𝗻𝗴𝗹𝗶𝘀𝗵. 𝗬𝗼𝘂𝗿 𝗮𝘁𝘁𝗮𝗰𝗸𝗲𝗿 𝗱𝗼𝗲𝘀𝗻'𝘁.
I learned this the hard way by attacking my own system.
I maintain FIE, an open-source engine that screens prompts before they reach an LLM. My system blocks "Ignore all previous instructions" in English with 82% confidence.
Then I tried the same sentence in Hindi. It walked straight through my security.
Safety training relies too much on English data. Low-resource languages become an accidental way to bypass security. The same malicious intent that fails in English works in Bengali, Swahili, or Javanese.
I spent three weeks fixing this. Here is how I built a three-tier defense:
Tier 1: Script anomaly scoring. I score the Unicode of the prompt. A sudden switch to Devanagari or Arabic script in an English app is a signal. This method is fast and cheap.
Tier 2: Static phrase matching. I added 14 languages to my list. I hand-curated injection phrases in Hindi, Japanese, Korean, Turkish, Dutch, and Polish. This catches common attacks with zero extra cost.
Tier 3: Translate-then-detect. This is the most important part. If a prompt passes the first two tiers, I detect the language and translate it to English. I then run my existing classifier on that translation. An attacker can change the language, but they cannot hide the intent.
To train this, I used Meta's NLLB-200 model. I translated 1,352 attack prompts into 10 languages. This created 13,528 new training examples. I ran this entire process locally on a $300 GPU.
The results on JailbreakBench: • 93.6% recall total. • 100% on JailbreakChat. • 90% on GCG suffixes. • 90.2% on PAIR.
I also track false positives. I would rather report a true 27% false positive rate than show a fake, perfect number. Building security requires honesty.
Sources: Deng et al. (2023). Multilingual Jailbreak Challenges in LLMs. arXiv:2310.06474 NLLB Team (2022). No Language Left Behind. arXiv:2207.04672 Röttger et al. (2023). XSTest. arXiv:2308.01263 Mazeika et al. (2024). HarmBench. arXiv:2402.04249 Chao et al. (2024). JailbreakBench. arXiv:2404.01318
Full post: https://dev.to/ayush_singh_9b0d83152be5b/your-llm-guardrail-speaks-english-your-attacker-doesnt-4bf2
Optional learning community: https://t.me/GyaanSetuAi