๐—ฃ๐—œ๐—œ ๐——๐—ฒ๐˜๐—ฒ๐—ฐ๐˜๐—ถ๐—ผ๐—ป: ๐—ฅ๐—ฒ๐—ด๐—ฒ๐˜… ๐˜ƒ๐˜€ ๐—•๐—˜๐—ฅ๐—ง-๐—ก๐—˜๐—ฅ ๐˜ƒ๐˜€ ๐—˜๐—ป๐˜€๐—ฒ๐—บ๐—ฏ๐—น๐—ฒ

You need PII detection for your LLM pipeline. Which method works best?

I tested three approaches across nine scenarios, including medical reports, HR records, and business contracts.

The methods:

The Results:

Key Findings:

  1. BERT-NER struggles with medical data. It recognizes only four entity types, so it misses phone numbers, medical record numbers (MRNs), and locale-specific formats. On radiology reports, BERT-NER scored 0.000.

  2. Regex wins on structured data. It handles well-formatted fields such as Taiwan phone numbers and email addresses, and it is the fastest option for CPU environments.

  3. The ensemble comes with a trade-off. Combining methods catches more PII, but it also increases false positives: if BERT-NER misidentifies a hospital name as PII, the ensemble inherits that error.

  4. Obfuscated data kills all methods. When users spell out PII phonetically or hide it in chat context, all three methods fail with a score of 0.000. Only LLM-based solutions can solve these semantic puzzles.
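
To make findings 2 to 4 concrete, here is a minimal sketch, not the code used in the test. The patterns, the `Span` type, and every function name are my own assumptions for illustration. `fake_ner_detect` is a hard-coded stand-in for a real BERT-NER model, and it deliberately flags a hospital name to mimic the false positive described in finding 3.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str


# Assumed patterns, for illustration only; real pipelines need broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Taiwan mobile number: 09 plus 8 digits, dashes optional (0912-345-678).
    "TW_MOBILE": re.compile(r"(?<!\d)09\d{2}-?\d{3}-?\d{3}(?!\d)"),
}


def regex_detect(text: str) -> List[Span]:
    """Return every pattern match as a labeled span, ordered by position."""
    spans = [
        Span(m.start(), m.end(), label, "regex")
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans)


def fake_ner_detect(text: str) -> List[Span]:
    """Hypothetical stand-in for BERT-NER: looks up two hard-coded names."""
    spans = []
    for name, label in (("Alice Chen", "PER"), ("Taipei General Hospital", "ORG")):
        idx = text.find(name)
        if idx != -1:
            spans.append(Span(idx, idx + len(name), label, "ner"))
    return spans


def ensemble_detect(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Union of all detectors; on overlap, the earlier detector's span is kept."""
    merged: List[Span] = []
    for detect in detectors:
        for span in detect(text):
            overlaps = any(span.start < k.end and k.start < span.end for k in merged)
            if not overlaps:
                merged.append(span)
    return sorted(merged)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with its label, working right to left to keep offsets valid."""
    for span in sorted(spans, reverse=True):
        text = text[: span.start] + f"[{span.label}]" + text[span.end :]
    return text
```

Because the ensemble is a plain union, the hospital name flagged by the stand-in NER gets redacted along with the real PII. A phone number spelled out in words matches neither detector, so it passes through untouched.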

My Recommendations:

Decision Checklist:

Source: https://dev.to/jh5_pulse/pii-zhen-ce-shi-ce-regex-vs-bert-ner-vs-ensemble-3c2j
