PII Detection: Regex vs BERT-NER vs Ensemble
You need PII detection for your LLM pipeline. Which method works best?
I tested three approaches across 9 scenarios including medical reports, HR records, and business contracts.
The methods:
- Regex: Uses hardcoded patterns for formats like phone numbers and IDs. Low latency (under 0.3 ms).
- BERT-NER: Uses a named-entity recognition model to find names and locations. High latency (up to 5,900 ms).
- Ensemble: Combines both methods using "OR" logic.
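A minimal sketch of the "OR" logic, assuming each detector returns a set of (start, end, label) spans. The two detectors here are hypothetical stand-ins, not the ones used in the tests: the regex is an illustrative 8+ digit ID pattern, and the "NER" is a fixed name lookup in place of a real model.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)
Detector = Callable[[str], Set[Span]]

def regex_detector(text: str) -> Set[Span]:
    # Illustrative pattern only: any run of 8 or more digits counts as an ID.
    return {(m.start(), m.end(), "ID") for m in re.finditer(r"\d{8,}", text)}

def fake_ner_detector(text: str) -> Set[Span]:
    # Hypothetical stand-in for a BERT-NER model: looks up a fixed name list.
    spans: Set[Span] = set()
    for name in ("Alice Chen",):
        idx = text.find(name)
        if idx != -1:
            spans.add((idx, idx + len(name), "PER"))
    return spans

def ensemble_or(text: str, detectors: List[Detector]) -> Set[Span]:
    # "OR" logic: keep a span if ANY detector reports it (set union).
    found: Set[Span] = set()
    for detect in detectors:
        found |= detect(text)
    return found

text = "Patient Alice Chen, record 12345678."
spans = ensemble_or(text, [regex_detector, fake_ner_detector])
matched = {text[start:end] for start, end, _ in spans}
print(sorted(matched))
```

Because the merge is a plain union, anything either detector flags ends up in the output, correct or not.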
The Results:
- Ensemble achieved the highest average F1 score of 0.662.
- BERT-NER failed on medical text with an F1 score of only 0.167.
- Regex stayed stable for structured data like ID numbers and dates.
Key Findings:
BERT-NER struggles with medical data. It recognizes only four entity types, so it misses phone numbers, medical record numbers (MRNs), and specific local formats. On radiology reports, BERT-NER scored an F1 of 0.000.
Regex wins on structure. Regex handles formatted data like Taiwan phone numbers and email addresses well. It is the fastest option for CPU environments.
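As a rough illustration of format-based detection, here is a sketch with two patterns. Both are assumptions, not the patterns used in the tests: the phone pattern covers only Taiwan mobile numbers written as 09 plus eight digits with optional hyphens, and the email pattern is deliberately simplified.

```python
import re
from typing import List, Tuple

# Assumed, simplified patterns for illustration only.
PATTERNS = {
    # Taiwan mobile: 09 followed by 8 digits, e.g. 0912-345-678 or 0912345678.
    "TW_MOBILE": re.compile(r"(?<!\d)09\d{2}-?\d{3}-?\d{3}(?!\d)"),
    # Simplified email pattern; real-world addresses can be more varied.
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

def find_pii(text: str) -> List[Tuple[str, str]]:
    # Run every pattern over the text and collect (label, matched text) pairs.
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

sample = "Call 0912-345-678 or mail amy@example.com."
hits = find_pii(sample)
print(hits)
```

Landline numbers, international prefixes, and other local formats would each need their own pattern.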
Ensemble has a trade-off. OR logic catches more PII, but it also inherits every false positive. If BERT-NER misidentifies a hospital name as PII, the ensemble reports that error too.
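The trade-off is easy to see with a toy calculation. The values below are invented for illustration and are not results from the tests; scoring by exact match on the detected strings is also an assumption, since the post does not say how F1 was computed.

```python
from typing import Set, Tuple

def prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    # Precision, recall, and F1 over exact-match sets of detected strings.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data: two true PII items in the document.
gold = {"Alice Chen", "0912-345-678"}
regex_hits = {"0912-345-678"}                         # misses the name
ner_hits = {"Alice Chen", "Example General Hospital"}  # hospital is a false positive
ensemble_hits = regex_hits | ner_hits                 # OR logic keeps both hits and the error

for name, hits in [("regex", regex_hits), ("ner", ner_hits), ("ensemble", ensemble_hits)]:
    p, r, f1 = prf(hits, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the ensemble reaches full recall, but its precision stays below 1.0 because the hospital name comes along with the union.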
Obfuscated data defeats all three methods. When users spell PII out phonetically or hide it in chat context, every method scored 0.000. Only LLM-based solutions can solve these semantic puzzles.
My Recommendations:
- If you have a GPU: Use Piiranha. It reached an F1 of 0.9866 in our tests. It is built specifically for PII.
- If you only have a CPU: Use a Regex and BERT-NER ensemble. Use Regex for formats and BERT-NER to catch names.
- If you face heavy obfuscation: You must use an LLM to scan the document context.
Decision Checklist:
- Do your PII items have strict formats? Use Regex.
- Do you need to catch names and organizations? Use BERT-NER.
- Is the data intentionally hidden? Use an LLM.
- What hardware do you have? Use Piiranha on a GPU, or the Ensemble on a CPU.
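One possible reading of the checklist as code. The function name, parameters, and the choice to return every matching tool in checklist order are assumptions made for illustration.

```python
from typing import List

def recommend(strict_formats: bool, needs_names: bool,
              obfuscated: bool, has_gpu: bool) -> List[str]:
    # Hypothetical helper: maps each "yes" answer to the tool the checklist names.
    tools = []
    if strict_formats:
        tools.append("Regex")
    if needs_names:
        tools.append("BERT-NER")
    if obfuscated:
        tools.append("LLM")
    # Hardware question: Piiranha on a GPU, otherwise the Regex + BERT-NER ensemble.
    tools.append("Piiranha" if has_gpu else "Ensemble")
    return tools

print(recommend(strict_formats=True, needs_names=False, obfuscated=False, has_gpu=False))
```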
Source: https://dev.to/jh5_pulse/pii-zhen-ce-shi-ce-regex-vs-bert-ner-vs-ensemble-3c2j