𝗣𝗜𝗜 𝗗𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻: 𝗥𝗲𝗴𝗲𝘅 𝘃𝘀 𝗕𝗘𝗥𝗧 𝗡𝗘𝗥 𝘃𝘀 𝗘𝗻𝘀𝗲𝗺𝗯𝗹𝗲

You need PII detection for your LLM pipeline. Which method works best?

I tested three approaches across 9 scenarios including medical reports, HR records, and business contracts.

The methods:

Regex: Uses hardcoded patterns for structured identifiers such as phone numbers and IDs. Low latency (under 0.3 ms).
BERT-NER: Uses a model to find names and locations. High latency (up to 5900 ms).
Ensemble: Combines both methods using "OR" logic: a span is flagged if either method flags it.
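
To make the "OR" logic concrete, here is a minimal Python sketch. It is not the code used in the tests: the two regex patterns are illustrative only, and `fake_ner_detect` is a hypothetical stand-in for a real BERT-NER model so that the example runs without any model download.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Illustrative patterns only (assumption); real pipelines need locale-specific tuning.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "TW_MOBILE": re.compile(r"\b09\d{2}-?\d{3}-?\d{3}\b"),
}


def regex_detect(text: str) -> Set[Span]:
    """Flag every match of the hardcoded patterns."""
    spans: Set[Span] = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), label))
    return spans


def fake_ner_detect(text: str) -> Set[Span]:
    """Hypothetical stand-in for BERT-NER: flags names from a fixed list."""
    spans: Set[Span] = set()
    for name in ("Alice Chen",):
        start = text.find(name)
        if start != -1:
            spans.add((start, start + len(name), "PER"))
    return spans


def ensemble_or(text: str, detectors: Iterable[Callable[[str], Set[Span]]]) -> List[Span]:
    """OR logic: keep a span if any detector flags it (set union)."""
    merged: Set[Span] = set()
    for detect in detectors:
        merged |= detect(text)
    return sorted(merged)


text = "Contact Alice Chen at alice@example.com or 0912-345-678."
hits = ensemble_or(text, [regex_detect, fake_ner_detect])
flagged = [text[start:end] for start, end, _ in hits]
```

Because the ensemble is a plain union, it inherits every hit from both detectors, correct or not.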

The Results:

Key Findings:

BERT-NER struggles with medical data. It recognizes only four entity types, so it misses phone numbers, medical record numbers (MRN), and specific local formats. In radiology reports, BERT-NER scored 0.000.
Regex wins on structure. Regex handles formatted data like Taiwan phone numbers and email addresses well. It is the fastest option for CPU environments.
Ensemble has a trade-off. Using an ensemble increases detections, but it also increases false positives. If BERT-NER misidentifies a hospital name as PII, the ensemble includes that error.
Obfuscated data kills all methods. When users use phonetic coding or hide PII in chat contexts, all three methods fail with a score of 0.000. Only LLM-based solutions can solve these semantic puzzles.
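
The ensemble trade-off is easy to see with a toy calculation. The sketch below assumes the scores are precision, recall, and F1 over detected entities (the post does not spell out the metric), and the entity sets are invented for illustration. They are not results from the tests.

```python
from typing import Set, Tuple


def prf1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of predicted entities."""
    tp = len(predicted & gold)   # correctly flagged
    fp = len(predicted - gold)   # flagged but not PII
    fn = len(gold - predicted)   # PII that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: two true PII items, one hospital name that is not PII.
gold = {"Alice Chen", "0912-345-678"}
regex_hits = {"0912-345-678"}
ner_hits = {"Alice Chen", "Mercy Hospital"}   # hospital name is a false positive
ensemble_hits = regex_hits | ner_hits         # "OR" logic

regex_scores = prf1(regex_hits, gold)
ner_scores = prf1(ner_hits, gold)
ensemble_scores = prf1(ensemble_hits, gold)
nothing_found = prf1(set(), gold)
```

In this toy case the union reaches full recall, but its precision drops below that of regex alone because it carries over the NER false positive. A detector that finds nothing scores 0.0 on all three metrics.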

My Recommendations:

If you have a GPU: Use Piiranha. It reached an F1 of 0.9866 in my tests. It is built specifically for PII.
If you only have a CPU: Use a Regex and BERT-NER ensemble. Use Regex for formats and BERT-NER to catch names.
If you face heavy obfuscation: You must use an LLM to scan the document context.
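
These recommendations can be written as a small routing function. This is only a sketch: the function name, the return labels, and the choice to check obfuscation before hardware are my assumptions, not part of the test setup.

```python
def choose_detector(has_gpu: bool, heavy_obfuscation: bool) -> str:
    """Map the three recommendations to a detector label (labels are hypothetical)."""
    # Assumption: obfuscation is checked first, since all three tested
    # methods scored 0.000 on obfuscated data regardless of hardware.
    if heavy_obfuscation:
        return "llm"
    if has_gpu:
        return "piiranha"
    return "regex + bert-ner ensemble"


cpu_only_choice = choose_detector(has_gpu=False, heavy_obfuscation=False)
```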

Decision Checklist:

Optional learning community: https://t.me/GyaanSetuAi
