𝗪𝗵𝗲𝗻 𝗥𝗲𝗴𝗲𝘅 𝗙𝗮𝗶𝗹𝘀: 𝗠𝘆 𝗝𝗼𝘂𝗿𝗻𝗲𝘆 𝘁𝗼 𝗔𝗜-𝗣𝗼𝘄𝗲𝗿𝗲𝗱 𝗗𝗮𝘁𝗮 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻


I spent three hours fighting a regular expression.

I wanted to extract phone numbers from scraped HTML. My regex handled about 70% of the data and failed on the rest.

Web data is messy. I found numbers formatted like this:

(555) 123-4567
555.123.4567
5551234567

My regex also caught random numbers inside JavaScript variables. This caused too many false positives.
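To make the problem concrete, here is a minimal sketch of the kind of naive pattern that runs into this. The pattern and the sample HTML are illustrative assumptions, not the regex from my project. It matches all three formats above, and it also matches a ten-digit order ID inside a script tag.

```python
import re

# Naive pattern (an assumption for illustration): optional "(", 3 digits,
# optional ")", optional separator, 3 digits, optional separator, 4 digits.
NAIVE_PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

# Made-up sample page: three real phone numbers plus a JavaScript variable.
SAMPLE_HTML = """
<p>Call (555) 123-4567 or 555.123.4567, fax 5551234567.</p>
<script>var orderId = 9876543210;</script>
"""


def find_candidates(text: str) -> list[str]:
    """Return every substring that looks like a US-style phone number."""
    return NAIVE_PHONE.findall(text)


if __name__ == "__main__":
    for match in find_candidates(SAMPLE_HTML):
        print(match)
```

The last match is the order ID from the script tag. The pattern has no way to know it is not a phone number.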

I tried several solutions:

Better Regex: I tried more complex patterns. They still missed international codes and tripped on edge cases.
HTML Parsing: I stripped tags and used string operations. It was better but broke on "tel:" links.
NLP Models: I used spaCy. It is great for general text but phone detection was spotty. Training a custom model felt like too much work for a small project.
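For the HTML parsing attempt, a sketch like the following shows one way to strip tags while skipping script content and still collecting "tel:" links. It uses Python's standard html.parser module. The class and function names are hypothetical, and this is not the code I originally wrote.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects visible text and tel: links, ignoring <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.text_parts: list[str] = []
        self.tel_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("tel:"):
                # Keep only the part after the "tel:" scheme.
                self.tel_links.append(href[4:])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_text_and_tel(html: str) -> tuple[str, list[str]]:
    """Return (visible text, list of tel: link targets) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.tel_links
```

Skipping script blocks removes the JavaScript false positives, but the visible text still needs something to decide which digits are phone numbers.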

I needed a tool that understood meaning, not just patterns. I switched to a semantic extraction approach using an AI model API.

Instead of defining what a number looks like, I tell the model what I want. The model finds the boundaries for me. This works even when the surrounding text contains other digits, as in "Please do not call after 9pm."

My new workflow:

Extract raw text from a page.
Send text chunks to an AI model with clear instructions.
Request the response in JSON format.
Validate and remove duplicates.
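The steps above can be sketched roughly as follows. The model call is passed in as a plain function (call_model, a hypothetical name) because the post does not name a specific API. The prompt wording, the chunk size, and the JSON shape are also assumptions.

```python
import json
from typing import Callable

# Assumed prompt; the exact wording matters and should be tuned per project.
PROMPT = (
    "Extract every phone number from the text below. "
    'Reply with JSON only, in the form {"phones": ["..."]}.\n\n'
)


def chunk_text(text: str, size: int = 2000) -> list[str]:
    """Split text into fixed-size chunks.

    Note: a fixed split can cut a number in half at a chunk boundary;
    overlapping chunks would be one way to reduce that risk.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_phones(text: str, call_model: Callable[[str], str]) -> list[str]:
    """Send each chunk to the model, parse its JSON reply, and deduplicate."""
    seen: set[str] = set()
    results: list[str] = []
    for chunk in chunk_text(text):
        reply = call_model(PROMPT + chunk)
        try:
            phones = json.loads(reply).get("phones", [])
        except (json.JSONDecodeError, AttributeError):
            continue  # reply was not the JSON object we asked for
        if not isinstance(phones, list):
            continue
        for phone in phones:
            if not isinstance(phone, str):
                continue
            # Deduplicate on digits only, so "(555) 123-4567" and
            # "555.123.4567" count as the same number.
            key = "".join(ch for ch in phone if ch.isdigit())
            if key and key not in seen:
                seen.add(key)
                results.append(phone)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call, used only to exercise the pipeline."""
    return json.dumps({"phones": ["(555) 123-4567", "555.123.4567"]})
```

Passing the model call in as a function keeps the parsing and deduplication testable without network access, and makes it easy to swap providers later.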

I use a hybrid approach: a simple regex validates the AI output. This filters out junk and keeps accuracy above 95%.
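A validation check of this kind might look like the sketch below. The allowed characters and the 7-digit minimum are assumptions for illustration. The 15-digit maximum follows the E.164 limit for international numbers.

```python
import re

# Assumed shape: optional leading "+", then digits, spaces, parentheses,
# dots, or hyphens. This checks characters only, not a specific layout.
PHONE_SHAPE = re.compile(r"\+?[\d\s().-]{7,20}")


def is_plausible_phone(candidate: str) -> bool:
    """Return True if the model's output looks like a phone number."""
    candidate = candidate.strip()
    if not PHONE_SHAPE.fullmatch(candidate):
        return False
    digits = sum(ch.isdigit() for ch in candidate)
    # 7 is an assumed lower bound; 15 is the E.164 maximum.
    return 7 <= digits <= 15


if __name__ == "__main__":
    for value in ["(555) 123-4567", "+1 555 123 4567", "9pm", "12345"]:
        print(value, is_plausible_phone(value))
```

The regex here only has to reject obvious junk, which is a much easier job than finding numbers in raw HTML.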

Lessons learned:

Use regex for validation, not primary extraction.
AI excels at understanding context.
Prompt engineering is more important than the model choice. A specific prompt gives better results.
Consider cost and speed. API calls add latency and expense.
If you process millions of documents, use local models like Llama 3 to save money.

My advice: Start with a hybrid system. Use regex for a fast first pass. Send only the messy or ambiguous cases to the AI. In my case, this cut API costs by 60%.
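One way to decide what counts as "messy or ambiguous" is sketched below. The rule is an assumption, not the one from my project: a chunk goes to the AI only if digit runs remain after the strict regex has taken its matches. The ai_extract parameter is a hypothetical function standing in for the model call.

```python
import re
from typing import Callable

# Strict pattern: only well-formed "(555) 123-4567" or "555.123.4567" /
# "555-123-4567" style numbers. Anything looser is left for the AI.
STRICT = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\b\d{3}[.-]\d{3}[.-]\d{4}\b")

# Any run of 4+ digits left over is treated as "might be a phone number".
DIGIT_RUN = re.compile(r"\d{4,}")


def route(
    chunk: str, ai_extract: Callable[[str], list[str]]
) -> tuple[list[str], str]:
    """Return (phones, source), where source is "regex" or "ai"."""
    found = STRICT.findall(chunk)
    leftover = STRICT.sub(" ", chunk)
    if DIGIT_RUN.search(leftover):
        # Unexplained digits remain, so pay for the model call.
        return ai_extract(chunk), "ai"
    return found, "regex"
```

Chunks with no leftover digits never reach the API, which is where the savings come from. How much you save depends on how clean your pages are.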

Have you struggled with regex? How do you handle messy data?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e

