๐—ฃ๐—œ๐—œ ๐——๐—ฒ๐˜๐—ฒ๐—ฐ๐˜๐—ถ๐—ผ๐—ป: ๐—ฅ๐—ฒ๐—ด๐—ฒ๐˜… ๐˜ƒ๐˜€ ๐—•๐—˜๐—ฅ๐—ง-๐—ก๐—˜๐—ฅ ๐˜ƒ๐˜€ ๐—˜๐—ป๐˜€๐—ฒ๐—บ๐—ฏ๐—น๐—ฒ

You need to protect sensitive data in your LLM pipeline. But which method works best?

I tested three ways to detect PII across 9 scenarios including medical reports, HR records, and legal contracts.

The methods:

The results:

Key findings:

  1. Medical data kills BERT-NER. BERT-NER only recognizes four entity types, so it misses medical record numbers and the specific date formats used in hospitals. For medical files, Regex is more reliable.

  2. The Ensemble trade-off. The Ensemble uses "OR" logic: if either method finds a match, it counts. This increases hits but also increases false positives. In some medical reports, the Ensemble scored lower than Regex alone because the model flagged non-sensitive names.

  3. Obfuscation is the enemy. All three methods fail when people hide data. If someone writes "zero nine one two" instead of digits, or buries PII in casual chat, these tools return zero results. Only LLM-based solutions handle these semantic tricks.
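
To make the three findings concrete, here is a minimal Python sketch of the "OR" ensemble. Everything in it is an assumption for illustration and not the setup used in the test: the regex patterns are hypothetical formats, and `toy_ner_detect` is a crude stand-in for a real BERT-NER model (it just flags two capitalized words in a row as a person).

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Hypothetical patterns for illustration; real formats vary by institution.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s-]*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{4}-\d{3}-\d{3}\b"),
}

# Stand-in for an NER model: two capitalized words in a row count as a person.
NAME_LIKE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def regex_detect(text: str) -> List[Span]:
    """Return every span matched by one of the hand-written patterns."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def toy_ner_detect(text: str) -> List[Span]:
    """Toy substitute for BERT-NER; a real model would be called here."""
    return [(m.start(), m.end(), "PER") for m in NAME_LIKE.finditer(text)]


def ensemble_detect(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """'OR' logic: a span counts if any detector reports it."""
    found = set()
    for detect in detectors:
        found.update(detect(text))
    return sorted(found)


if __name__ == "__main__":
    detectors = [regex_detect, toy_ner_detect]

    report = "The patient Alice Chen (MRN: 12345678) was admitted on 2024-03-01 to Mayo Clinic."
    for start, end, label in ensemble_detect(report, detectors):
        print(label, repr(report[start:end]))

    # Spelled-out digits: no pattern and no name-like token to match.
    chat = "my number is zero nine one two three four five"
    print(ensemble_detect(chat, detectors))
```

In this toy run the regex side catches the record number and the date, the name detector catches "Alice Chen", and the union also picks up "Mayo Clinic", a non-sensitive name. That is the false-positive cost of "OR" logic in miniature. The spelled-out phone number yields an empty list from both detectors, so the ensemble cannot recover it either.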

Which one should you choose?

Before you build your pipeline, ask yourself:

Source: https://dev.to/jh5_pulse/pii-zhen-ce-shi-ce-regex-vs-bert-ner-vs-ensemble-65o

Optional learning community: https://t.me/GyaanSetuAi