𝗪𝗵𝗲𝗻 𝗥𝗲𝗴𝗲𝘅 𝗙𝗮𝗶𝗹𝘀: 𝗠𝘆 𝗝𝗼𝘂𝗿𝗻𝗲𝘆 𝘁𝗼 𝗔𝗜-𝗣𝗼𝘄𝗲𝗿𝗲𝗱 𝗗𝗮𝘁𝗮 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻

📅 13 hours ago ⏱ 2 min read

I spent three hours trying to fix a regular expression.

It was supposed to pull phone numbers from scraped HTML. It worked on 70% of the data and failed on everything else.

Web data is messy. You see formats like:

(555) 123-4567
555.123.4567
5551234567
Call me at 555-123-4567 after 5

My regex patterns from Stack Overflow missed numbers in long strings. They tripped on international codes. They even matched random numbers inside JavaScript code.
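
To make two of these failure modes concrete, here is a minimal sketch. The pattern is an assumption (a typical permissive US-style regex, not the exact one from the post), and the JavaScript line and the UK-style number are made-up samples.

```python
import re

# Assumption: a typical permissive US-style pattern, not the author's actual regex.
NAIVE_PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "(555) 123-4567",
    "555.123.4567",
    "5551234567",
    "Call me at 555-123-4567 after 5",
    "var ts = 1700000000123;",   # made-up JavaScript timestamp, not a phone number
    "+44 20 7946 0958",          # made-up UK-style number with a country code
]

def find_naive(text):
    """Return every substring the naive pattern accepts."""
    return NAIVE_PHONE.findall(text)

for sample in samples:
    print(f"{sample!r:40} -> {find_naive(sample)}")
```

The four formats listed above all match, but the pattern also accepts the first ten digits of the timestamp and finds nothing in the international number, because its digit grouping is different.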

I tried other methods. I stripped the HTML tags first. I tried spaCy for named entity recognition. The results were still inconsistent. I needed a tool that understood meaning, not just patterns.
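
For reference, tag stripping can be sketched with Python's standard `html.parser`. This is one possible approach, not necessarily the one used here; the class name, helper name, and sample page are hypothetical. Skipping `<script>` and `<style>` removes the JavaScript false positives, but it does nothing for unusual number formats.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)

def visible_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Hypothetical sample page.
page = (
    "<html><body><p>Call me at 555-123-4567 after 5</p>"
    "<script>var ts = 1700000000123;</script></body></html>"
)
print(visible_text(page))
```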

I switched to AI-powered semantic extraction.

Instead of defining what a phone number looks like, I tell the model what I want. I let it find the boundaries.

Here is my new process:

Extract raw text from the web page.
Send small text chunks to an AI model.
Use a clear prompt to request JSON output.
Validate and remove duplicates.
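
The four steps above can be sketched roughly as follows. Everything named here is an assumption for illustration: `call_model` stands in for whatever API client you use (prompt string in, response string out), the prompt wording and the 500-character chunk size are placeholders, and `fake_model` only exists so the sketch runs offline.

```python
import json
import re

# Placeholder prompt; the doubled braces are literal braces for str.format.
PROMPT_TEMPLATE = (
    "Extract every phone number from the text below. "
    'Respond with JSON only, in the form {{"phones": ["..."]}}. '
    'If there are none, respond with {{"phones": []}}.\n\nText:\n{chunk}'
)

def chunk_text(text, max_chars=500):
    """Split text on whitespace into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def extract_phones(text, call_model, max_chars=500):
    """Send each chunk to call_model, parse the JSON reply, drop duplicates."""
    seen, phones = set(), []
    for chunk in chunk_text(text, max_chars):
        raw = call_model(PROMPT_TEMPLATE.format(chunk=chunk))
        try:
            candidates = json.loads(raw).get("phones", [])
        except (json.JSONDecodeError, AttributeError):
            continue  # the reply was not the JSON object that was requested
        if not isinstance(candidates, list):
            continue
        for candidate in candidates:
            if not isinstance(candidate, str):
                continue
            key = re.sub(r"\D", "", candidate)  # digits only, for de-duplication
            if key and key not in seen:
                seen.add(key)
                phones.append(candidate)
    return phones

def fake_model(prompt):
    """Offline stand-in for a real model call (hypothetical)."""
    found = re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", prompt)
    return json.dumps({"phones": found})

text = "Call me at 555-123-4567 after 5. Office: 555.123.4567"
print(extract_phones(text, fake_model))
```

One limitation of splitting on whitespace: a number written with spaces, such as (555) 123-4567, can land on a chunk boundary, so overlapping chunks may be worth considering.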

I use a hybrid approach: regex validates the AI output, which catches obvious junk. This method gave me 95% accuracy on my test set.
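
A minimal sketch of that validation step, assuming US-style 10-digit numbers with an optional leading country code 1. The function names and the candidate list are hypothetical; a real validator would depend on which countries you need to support.

```python
import re

# Only digits, whitespace, and common phone punctuation are allowed.
ALLOWED = re.compile(r"\+?[\d\s().-]+")

def normalize(candidate):
    """Return the 10-digit form of a candidate, or None if it looks like junk."""
    if not ALLOWED.fullmatch(candidate.strip()):
        return None  # contains letters or other unexpected characters
    digits = re.sub(r"\D", "", candidate)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the assumed country code
    return digits if len(digits) == 10 else None

def validate_and_dedupe(candidates):
    """Keep valid candidates in order, one per distinct normalized number."""
    seen, kept = set(), []
    for candidate in candidates:
        digits = normalize(candidate)
        if digits is not None and digits not in seen:
            seen.add(digits)
            kept.append(digits)
    return kept

# Hypothetical model output containing a duplicate and two junk entries.
model_output = [
    "(555) 123-4567",
    "555.123.4567",
    "1700000000123",
    "call 555",
    "+1 555-987-6543",
]
print(validate_and_dedupe(model_output))
```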

Lessons learned:

Regex is for validation. AI is for extraction.
AI understands context. It knows the difference between a footer number and a number in a blog post.
Prompt engineering is vital. Be specific about the format you want.
Watch your costs. API calls add up if you process millions of documents.
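
To see how quickly costs add up, a back-of-the-envelope estimate helps. The function below is a sketch that counts input tokens only, and the numbers in the usage line are placeholders, not real pricing or measurements from this project.

```python
def estimate_cost(num_docs, tokens_per_doc, price_per_million_tokens):
    """Rough input-token cost, in the same currency as the price argument."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# Placeholder numbers for illustration only.
cost = estimate_cost(
    num_docs=1_000_000,
    tokens_per_doc=2_000,
    price_per_million_tokens=0.50,
)
print(f"${cost:,.2f}")  # prints "$1,000.00"
```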

My advice: Start with a hybrid system. Use regex as a fast first pass. Send only the difficult cases to the AI. This reduced my API costs by 60%.
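
One way to sketch that routing: accept a document on the regex pass only when a strict pattern accounts for nearly all of its digits, and send it to the model otherwise. The pattern, the seven-digit threshold, and `call_model` (text in, list of strings out) are all assumptions, and `fake_model` is an offline stand-in that records which documents reach it.

```python
import re

# Strict pattern: separators are required, so only clean formats match.
STRICT = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)

def route(text, call_model, max_leftover_digits=6):
    """Return (phones, source), where source is 'regex' or 'model'."""
    matches = STRICT.findall(text)
    leftover = STRICT.sub(" ", text)
    leftover_digits = sum(ch.isdigit() for ch in leftover)
    if leftover_digits <= max_leftover_digits:
        return matches, "regex"      # nothing phone-sized is left unexplained
    return call_model(text), "model"  # difficult case: let the model decide

calls = []

def fake_model(text):
    """Offline stand-in for a real model call (hypothetical)."""
    calls.append(text)
    return []

docs = [
    "Call me at 555-123-4567 after 5",
    "Reach us on +44 20 7946 0958",
]
results = [route(doc, fake_model) for doc in docs]
print(results)
print(f"{len(calls)} of {len(docs)} documents were sent to the model")
```

How much this saves depends entirely on the share of documents the regex pass can settle on its own.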

How do you handle messy text? Do you stick to regex or use semantic tools?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e
