When Regex Fails: My Journey to AI-Powered Data Extraction
I spent three hours trying to fix a regular expression.
It was supposed to pull phone numbers from scraped HTML. It worked for 70% of the data. It failed on everything else.
Web data is messy. You see formats like:
- (555) 123-4567
- 555.123.4567
- 5551234567
- Call me at 555-123-4567 after 5
My regex patterns from Stack Overflow missed numbers in long strings. They tripped on international codes. They even matched random numbers inside JavaScript code.
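To make that failure mode concrete, here is a minimal sketch. The pattern is an illustrative stand-in, not any specific Stack Overflow answer, and the last sample is a made-up line of JavaScript.

```python
import re

# Illustrative naive pattern (an assumption for this example):
# three digits, a dash or dot, three digits, a dash or dot, four digits.
NAIVE_PHONE = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

samples = [
    "(555) 123-4567",                   # missed: parentheses and a space
    "555.123.4567",                     # found
    "5551234567",                       # missed: no separators
    "Call me at 555-123-4567 after 5",  # found
    'var build = "2024.118.9931";',     # false positive inside JavaScript
]

def naive_matches(text):
    """Return every substring the naive pattern treats as a phone number."""
    return NAIVE_PHONE.findall(text)

for sample in samples:
    print(repr(sample), "->", naive_matches(sample))
```

The pattern misses two of the four real formats and still matches part of a version string, which is the trade-off between strictness and noise that kept showing up.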
I tried other methods. I stripped HTML tags. I tried using spaCy for named entity recognition. It was still inconsistent. I needed a tool that understood meaning, not just patterns.
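Tag stripping did not solve the problem on its own, but it does remove one source of noise: numbers inside script blocks. A minimal sketch using only the Python standard library follows. The class and function names are hypothetical, and real pages may need a more robust parser.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)

def visible_text(html):
    """Return the page text with tags, scripts, and styles removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()

page = (
    "<p>Call me at 555-123-4567 after 5</p>"
    '<script>var build = "2024.118.9931";</script>'
    "<footer>(555) 123-4567</footer>"
)
print(visible_text(page))
```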
I switched to AI-powered semantic extraction.
Instead of defining what a phone number looks like, I tell the model what I want. I let it find the boundaries.
Here is my new process:
- Extract raw text from the web page.
- Send small text chunks to an AI model.
- Use a clear prompt to request JSON output.
- Validate and remove duplicates.
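The steps above can be sketched roughly as follows. This is not a definitive implementation: `call_model` is a hypothetical callable (prompt string in, reply string out) standing in for whichever AI API you use, and the prompt wording, the `phone_numbers` key, and the chunk sizes are all assumptions. Only deduplication is shown here; validation is a separate step.

```python
import json

# Assumed prompt wording; doubled braces are literal braces for str.format.
PROMPT_TEMPLATE = (
    "Extract every phone number that appears in the text below.\n"
    'Respond with JSON only, in the form {{"phone_numbers": ["..."]}}.\n'
    "If there are none, return an empty list.\n\n"
    "Text:\n{chunk}"
)

def chunk_text(text, max_chars=1000, overlap=50):
    """Split text into overlapping chunks so a number cut at one
    boundary still appears whole in the next chunk."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks

def parse_response(raw):
    """Parse the model reply; anything that is not the expected JSON
    shape is treated as 'no results'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    numbers = data.get("phone_numbers", [])
    if not isinstance(numbers, list):
        return []
    return [n for n in numbers if isinstance(n, str)]

def extract_phone_numbers(text, call_model):
    """Run every chunk through the model and deduplicate the results."""
    seen = set()
    results = []
    for chunk in chunk_text(text):
        reply = call_model(PROMPT_TEMPLATE.format(chunk=chunk))
        for number in parse_response(reply):
            # Deduplicate on digits only, so formatting variants collapse.
            key = "".join(ch for ch in number if ch.isdigit())
            if key and key not in seen:
                seen.add(key)
                results.append(number)
    return results
```

Passing the model call in as a function keeps the sketch independent of any particular provider and makes it easy to test with a stub that returns canned JSON.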
I use a hybrid approach. I use regex to validate the AI output. This catches obvious junk. This method gave me 95% accuracy on my test set.
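A minimal sketch of that validation step, assuming North American style numbers (10 digits, optionally prefixed with the country code 1). Other regions would need different rules, and the function names are hypothetical.

```python
import re

# Assumption: a plausible candidate contains only digits, spaces, and
# common phone punctuation, with an optional leading plus sign.
ALLOWED_CHARS = re.compile(r"\+?[\d\s().-]+")

def normalize(candidate):
    """Return the 10-digit form of a plausible number, or None for junk."""
    if not ALLOWED_CHARS.fullmatch(candidate.strip()):
        return None
    digits = re.sub(r"\D", "", candidate)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits if len(digits) == 10 else None

def validate(candidates):
    """Keep valid numbers in normalized form, without duplicates."""
    seen = set()
    valid = []
    for candidate in candidates:
        digits = normalize(candidate)
        if digits is not None and digits not in seen:
            seen.add(digits)
            valid.append(digits)
    return valid

ai_output = ["(555) 123-4567", "555.123.4567", "+1 555-123-4567", "see footer", "12345"]
print(validate(ai_output))
```

A check like this only catches obvious junk. A 10-digit string that is not actually a phone number still passes, which is why it complements the model rather than replacing it.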
Lessons learned:
- Regex is for validation. AI is for extraction.
- AI understands context. It knows the difference between a footer number and a number in a blog post.
- Prompt engineering is vital. Be specific about the format you want.
- Watch your costs. API calls add up if you process millions of documents.
My advice: Start with a hybrid system. Use regex as a fast first pass for cleanly formatted numbers. Send only the difficult cases to the AI. In my pipeline, this reduced API costs by 60%.
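One way to sketch that routing, under stated assumptions: the strict pattern, the "number-like leftover" heuristic, and the function names are all illustrative choices, not a fixed recipe. A chunk goes to the AI only when it contains digit runs the strict pattern cannot explain.

```python
import re

# Assumed strict pattern: (555) 123-4567, 555-123-4567, or 555.123.4567,
# not glued to other digits on either side.
STRICT_PHONE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.])\d{3}[-.]\d{4}(?!\d)"
)

# Heuristic: seven or more digits, possibly separated by phone punctuation.
DIGIT_RUN = re.compile(r"(?:\d[\s().-]*){7,}")

def route(chunk):
    """Decide how to handle a chunk: 'regex', 'ai', or 'skip'."""
    matches = STRICT_PHONE.findall(chunk)
    leftover = STRICT_PHONE.sub(" ", chunk)
    if DIGIT_RUN.search(leftover):
        # Something number-like remains that the strict pattern missed.
        return "ai", matches
    if matches:
        return "regex", matches
    return "skip", []

def split_workload(chunks):
    """Group chunks by routing decision so only the 'ai' group costs money."""
    routed = {"regex": [], "ai": [], "skip": []}
    for chunk in chunks:
        decision, matches = route(chunk)
        routed[decision].append((chunk, matches))
    return routed

chunks = [
    "(555) 123-4567",
    "Call me at 555-123-4567 after 5",
    "5551234567",
    "+44 20 7946 0958",
    "No numbers here",
]
for decision, items in split_workload(chunks).items():
    print(decision, len(items))
```

How much this saves depends entirely on how much of your data is cleanly formatted, so measure the share of chunks in the `ai` group on your own corpus.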
How do you handle messy text? Do you stick to regex or use semantic tools?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e