PII Detection: Regex vs BERT-NER vs Ensemble



You need to protect sensitive data in your LLM pipeline. But which method works best?

I tested three ways to detect PII across nine scenarios, including medical reports, HR records, and legal contracts.

The methods:

Regex: Hand-written rules for fixed patterns such as phone numbers and ID numbers. Very low latency.
BERT-NER: A model that identifies names, organizations, and locations.
Ensemble: A combination of Regex and BERT-NER.
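
To make the three methods concrete, here is a minimal sketch. The regex patterns are simplified assumptions (format checks only, no checksum validation), and `ner_detect_stub` is a hypothetical stand-in for a real BERT-NER model so the example runs without downloading anything. It is not the setup used in the tests above.

```python
import re

# Illustrative patterns only (assumptions); production rules need tighter validation.
PATTERNS = {
    "TW_ID": re.compile(r"\b[A-Z][12]\d{8}\b"),
    "TW_MOBILE": re.compile(r"\b09\d{2}-?\d{3}-?\d{3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def regex_detect(text):
    """Return a set of (start, end, label) spans matched by the patterns."""
    return {
        (m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    }

def ner_detect_stub(text):
    """Hypothetical stand-in for BERT-NER: looks up a fixed list of names."""
    known_names = ["Alice Chen"]
    spans = set()
    for name in known_names:
        start = text.find(name)
        if start != -1:
            spans.add((start, start + len(name), "PERSON"))
    return spans

def ensemble_detect(text):
    """OR logic: keep a span if either detector reports it."""
    return regex_detect(text) | ner_detect_stub(text)

text = "Alice Chen (ID A123456789) can be reached at 0912-345-678."
for start, end, label in sorted(ensemble_detect(text)):
    print(label, text[start:end])
```

In a real pipeline, the stub would be replaced by a call to an actual NER model, and overlapping spans from the two detectors would need to be merged rather than simply unioned.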

The results:

Key findings:

Medical data kills BERT-NER: The model recognizes only four entity types, so it misses medical record numbers and the specific date formats hospitals use. For medical files, Regex is more reliable.
The Ensemble trade-off: An Ensemble uses "OR" logic. If either method flags a span, it counts as a detection. This raises recall but also adds false positives. In some medical reports, the Ensemble scored lower than Regex alone because the model flagged non-sensitive names.
Obfuscation is the enemy: All three methods fail when people disguise the data. If someone writes "zero nine one two" instead of digits, or buries PII in casual chat, these tools return nothing. Only LLM-based solutions handle these semantic tricks.
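
The Ensemble trade-off is easy to check with a little arithmetic. The sketch below uses invented toy values (not the results from these tests): Regex finds two of three sensitive items with no false alarms, while the NER side contributes only three non-sensitive names. The helper `f1_score` is written here for illustration.

```python
def f1_score(predicted, gold):
    """F1 over sets of detected items: harmonic mean of precision and recall."""
    tp = len(predicted & gold)   # correctly flagged
    fp = len(predicted - gold)   # flagged but not sensitive
    fn = len(gold - predicted)   # sensitive but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data (assumed for illustration only).
gold = {"MRN-004521", "2024/03/15", "0912-345-678"}
regex_hits = {"MRN-004521", "0912-345-678"}
ner_hits = {"Mayo", "Parkinson", "Glasgow"}  # names inside medical terms, not PII here

ensemble_hits = regex_hits | ner_hits  # "OR" logic

print("Regex F1:   ", round(f1_score(regex_hits, gold), 3))
print("Ensemble F1:", round(f1_score(ensemble_hits, gold), 3))
```

With these numbers, Regex scores 0.8 and the Ensemble drops to 0.5: recall is unchanged at 2/3, but precision falls from 1.0 to 0.4 because the union inherits every false positive from both detectors.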

Which one should you choose?

Use Regex if you have clear formats like Taiwan IDs or phone numbers. It is extremely fast.
Use BERT-NER if you specifically need to find names and organizations.
Use an Ensemble if you want a balance of speed and coverage for standard documents.
Use Piiranha, a dedicated PII-detection model, if you have a GPU. It achieved an F1 score of 0.9866 in my tests.
Use an LLM if you must detect PII hidden in conversational text.

Before you build your pipeline, ask yourself:

