PII Detection: Regex vs BERT-NER vs Ensemble
You need to protect sensitive data in your LLM pipeline. But which method works best?
I tested three ways to detect PII across 9 scenarios including medical reports, HR records, and legal contracts.
The methods:
- Regex: Rules for fixed patterns such as phone numbers and ID numbers. Fast, with low latency.
- BERT-NER: A model that identifies names, organizations, and locations.
- Ensemble: A combination of Regex and BERT-NER.
The results:
- Ensemble achieved the highest average F1 score of 0.662.
- BERT-NER failed on medical text, with an F1 score of only 0.167.
- Regex remains stable for formatted data like ID numbers and dates.
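For readers less familiar with the metric: F1 is the harmonic mean of precision and recall, so a detector that misses most entities scores low even when everything it does flag is correct. Here is a minimal sketch of the arithmetic. The function name `f1_score` and the counts are made up for illustration and are not taken from the tests above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of flagged items that are real PII
    recall = tp / (tp + fn)     # share of real PII that was flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to show the arithmetic.
balanced = f1_score(tp=8, fp=2, fn=2)      # precision 0.8, recall 0.8
low_recall = f1_score(tp=2, fp=0, fn=8)    # precision 1.0, recall 0.2

print(round(balanced, 3))    # 0.8
print(round(low_recall, 3))  # 0.333
```

The second case shows how a model can be "never wrong" about what it flags and still get a poor F1, because it misses most of the entities.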
Key findings:
Medical data kills BERT-NER: the model only recognizes four entity types. It fails to see medical record numbers or the specific date formats used in hospitals. For medical files, Regex is more reliable.
The Ensemble trade-off: the Ensemble uses "OR" logic, so a span counts as detected if either method finds a match. This increases recall but also increases false positives. In some medical reports, the Ensemble scored lower than Regex because the NER model flagged non-sensitive names.
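The "OR" logic can be sketched as a set union over detected spans. This is a toy illustration, not the pipeline used in the tests: the `(start, end, label)` span format, the function names, and the example spans are all assumptions.

```python
# A hypothetical span format: (start offset, end offset, label).
Span = tuple[int, int, str]


def ensemble_or(regex_spans: set[Span], ner_spans: set[Span]) -> set[Span]:
    """Keep every span that either detector reported (union)."""
    return regex_spans | ner_spans


def precision_recall(predicted: set[Span], gold: set[Span]) -> tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: Regex finds both sensitive spans; the NER model adds a
# name that is not sensitive in this document (a false positive).
gold = {(0, 10, "MRN"), (25, 35, "DATE")}
regex_spans = {(0, 10, "MRN"), (25, 35, "DATE")}
ner_spans = {(40, 52, "PER")}

combined = ensemble_or(regex_spans, ner_spans)
print(precision_recall(regex_spans, gold))  # Regex alone: precision 1.0, recall 1.0
print(precision_recall(combined, gold))     # Ensemble: precision drops to 2/3, recall stays 1.0
```

The union can never lower recall, but every extra span from either detector counts against precision, which is how an Ensemble can end up below Regex on some documents.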
Obfuscation is the enemy: all three methods failed when people hid their data. If someone writes "zero nine one two" instead of digits, or buries PII in casual chat, these tools return zero results. Only LLM-based solutions handle these semantic tricks.
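A small sketch of why a pattern-based detector goes blind here. The pattern below is a simplified assumption for a mobile number written as 09xx-xxx-xxx, not the rule set used in the tests, and both sample sentences are invented.

```python
import re

# Simplified, assumed pattern: "09" plus eight more digits, with optional hyphens.
PHONE_PATTERN = re.compile(r"\b09\d{2}-?\d{3}-?\d{3}\b")

plain = "Call me at 0912-345-678 after lunch."
obfuscated = "Call me at zero nine one two, three four five, six seven eight."

print(PHONE_PATTERN.findall(plain))       # ['0912-345-678']
print(PHONE_PATTERN.findall(obfuscated))  # []
```

The spelled-out version carries the same information for a human reader, but it contains no digits for the pattern to match.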
Which one should you choose?
- Use Regex if you have clear formats like Taiwan IDs or phone numbers. It is extremely fast.
- Use BERT-NER if you specifically need to find names and organizations.
- Use an Ensemble if you want a balance of speed and coverage for standard documents.
- Use Piiranha, a model trained specifically for PII detection, if you have a GPU. It achieved an F1 score of 0.9866 in our tests.
- Use an LLM if you must detect PII hidden in conversational text.
Before you build your pipeline, ask yourself:
- Do my PII types have fixed patterns?
- Do I need to catch names?
- Will users try to hide data, for example by spelling out digits as words?
- Do I have a GPU available?
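One way to turn these questions into a starting point is a small helper like the one below. It is only a sketch: the function name, the parameter names, and the order in which suggestions are combined are assumptions layered on top of the recommendations above.

```python
def suggest_detectors(
    fixed_patterns: bool,
    need_names: bool,
    expects_obfuscation: bool,
    has_gpu: bool,
) -> list[str]:
    """Map the four checklist answers to candidate detection methods."""
    suggestions: list[str] = []
    if fixed_patterns and need_names:
        suggestions.append("Ensemble")   # both formatted data and names
    elif fixed_patterns:
        suggestions.append("Regex")      # clear formats only
    elif need_names:
        suggestions.append("BERT-NER")   # names and organizations only
    if has_gpu:
        suggestions.append("Piiranha")   # worth trying when a GPU is available
    if expects_obfuscation:
        suggestions.append("LLM")        # hidden or conversational PII
    return suggestions


print(suggest_detectors(True, False, False, False))  # ['Regex']
print(suggest_detectors(True, True, False, True))    # ['Ensemble', 'Piiranha']
print(suggest_detectors(False, False, True, False))  # ['LLM']
```

Treat the output as a shortlist to benchmark on your own documents, not as a final answer.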
Source: https://dev.to/jh5_pulse/pii-zhen-ce-shi-ce-regex-vs-bert-ner-vs-ensemble-65o