你的 LLM 防护栏只懂英语，而攻击者却并不懂。

Translated for your language. 阅读原文.

AI-assisted draft.

𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹 𝘀𝗽𝗲𝗮𝗸𝘀 𝗘𝗻𝗴𝗹𝗶𝘀𝗵. 𝗬𝗼𝘂𝗿 𝗮𝘁𝘁𝗮𝗰𝗸𝗲𝗿 𝗱𝗼𝗲𝘀𝗻'𝘁.

I learned this the hard way by attacking my own system.

I maintain FIE, an open-source engine that screens prompts before they reach an LLM. My system blocks "Ignore all previous instructions" in English with 82% confidence.

Then I tried the same sentence in Hindi. It walked straight through my security.

Safety training relies too much on English data. Low-resource languages become an accidental way to bypass security. The same malicious intent that fails in English works in Bengali, Swahili, or Javanese.

I spent three weeks fixing this. Here is how I built a three-tier defense:

Tier 1: Script anomaly scoring. I score the Unicode of the prompt. A sudden switch to Devanagari or Arabic script in an English app is a signal. This method is fast and cheap.

Tier 2: Static phrase matching. I added 14 languages to my list. I hand-curated injection phrases in Hindi, Japanese, Korean, Turkish, Dutch, and Polish. This catches common attacks with zero extra cost.

Tier 3: Translate-then-detect. This is the most important part. If a prompt passes the first two tiers, I detect the language and translate it to English. I then run my existing classifier on that translation. An attacker can change the language, but they cannot hide the intent.

To train this, I used Meta's NLLB-200 model. I translated 1,352 attack prompts into 10 languages. This created 13,528 new training examples. I ran this entire process locally on a $300 GPU.

The results on JailbreakBench: • 93.6% recall total. • 100% on JailbreakChat. • 90% on GCG suffixes. • 90.2% on PAIR.

I also track false positives. I would rather report a true 27% false positive rate than show a fake, perfect number. Building security requires honesty.

Sources: Deng et al. (2023). Multilingual Jailbreak Challenges in LLMs. arXiv:2310.06474 NLLB Team (2022). No Language Left Behind. arXiv:2207.04672 Röttger et al. (2023). XSTest. arXiv:2308.01263 Mazeika et al. (2024). HarmBench. arXiv:2402.04249 Chao et al. (2024). JailbreakBench. arXiv:2404.01318

Full post: https://dev.to/ayush_singh_9b0d83152be5b/your-llm-guardrail-speaks-english-your-attacker-doesnt-4bf2

Optional learning community: https://t.me/GyaanSetuAi

你的 LLM 防护栏只懂英语，而攻击者却并不懂。

继续阅读

LLM 提示词注入与护栏安全

LLM 漏洞入门 101

提示词注入防御：生产级护栏指南

LLM 网关：路由、回退与语义缓存

防止 LLM 失控的 7 个护栏