When Regex Fails: My Journey to AI-Powered Data Extraction
I spent three hours fighting a regular expression.
I wanted to extract phone numbers from scraped HTML. The regex worked for about 70% of the data, then failed on the rest.
Web data is messy. I found numbers formatted like this:
- (555) 123-4567
- 555.123.4567
- 5551234567
My regex also caught random numbers inside JavaScript variables. This caused too many false positives.
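To make the failure mode concrete, here is a minimal sketch of the kind of pattern that handles the three formats above but also matches digits inside a script. The pattern and the HTML fragment are illustrative assumptions, not the original regex or data.

```python
import re

# Naive pattern (an assumption for illustration): optional parentheses around
# the area code, and an optional space, dot, or hyphen between digit groups.
NAIVE_PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

# The three formats from the list above all match in full.
samples = ["(555) 123-4567", "555.123.4567", "5551234567"]
all_match = all(NAIVE_PHONE.fullmatch(s) for s in samples)

# Hypothetical page fragment: the digits in the script are an ID, not a phone number.
html = "<p>Call (555) 123-4567</p><script>var orderId = 4155550123;</script>"
matches = NAIVE_PHONE.findall(html)
print(matches)  # ['(555) 123-4567', '4155550123']
```

The second match is the false positive: nothing in the pattern can tell a ten-digit ID from a ten-digit phone number.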
I tried several solutions:
- Better Regex: I tried more elaborate patterns. They still missed international codes and tripped on edge cases.
- HTML Parsing: I stripped tags and used string operations. It was better but broke on "tel:" links.
- NLP Models: I used spaCy. It is great for general text but phone detection was spotty. Training a custom model felt like too much work for a small project.
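For the HTML parsing route, one way to avoid both the JavaScript noise and the "tel:" link problem is to walk the markup instead of stripping tags blindly. The sketch below uses Python's standard `html.parser`; the class name `PhoneTextExtractor` and the helper `extract` are hypothetical, and this is not the code from the original attempt.

```python
from html.parser import HTMLParser


class PhoneTextExtractor(HTMLParser):
    """Collects visible text and tel: link targets, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tel_links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            # Attribute values can be None, so fall back to an empty string.
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("tel:"):
                self.tel_links.append(href[4:])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (visible_text, tel_link_numbers) for an HTML string."""
    parser = PhoneTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.tel_links


page = '<p>Call us</p><a href="tel:+15551234567">Phone</a><script>var id = 4155550123;</script>'
text, tels = extract(page)
```

Here `text` holds only the visible words and `tels` holds the number taken from the link target, so the digits inside the script never reach the extraction step.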
I needed a tool that understood meaning, not just patterns. I switched to a semantic extraction approach using an AI model API.
Instead of defining what a number looks like, I tell the model what I want, and the model finds the boundaries for me. This holds up even when the number sits next to text like "Please do not call after 9pm," where the 9 is not part of any phone number.
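As an illustration of "telling the model what I want", a prompt builder might look like the sketch below. The function name `build_extraction_prompt`, the wording, and the `phone_numbers` JSON key are assumptions chosen for the example, not the prompt used in the original project.

```python
def build_extraction_prompt(text_chunk: str) -> str:
    """Build an instruction-style prompt for phone number extraction.

    The wording and the JSON shape are illustrative assumptions.
    """
    return (
        "Extract every phone number that a person could call from the text below.\n"
        "Ignore numbers that are IDs, prices, dates, or times of day.\n"
        'Respond with JSON only, in the form {"phone_numbers": ["..."]}.\n'
        "If there are no phone numbers, return an empty list.\n\n"
        f"Text:\n{text_chunk}"
    )


prompt = build_extraction_prompt("Reach us at 555.123.4567. Please do not call after 9pm.")
```

The instructions describe intent (callable numbers, not IDs or times) instead of a character pattern, which is the shift the paragraph above describes.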
My new workflow:
- Extract raw text from a page.
- Send text chunks to an AI model with clear instructions.
- Request the response in JSON format.
- Validate and remove duplicates.
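The workflow above can be sketched roughly as follows. The model call is passed in as `ask_model`, a hypothetical callable that takes a prompt string and returns the model's raw text response, so no particular provider or SDK is assumed. The chunk size, the prompt wording, and the `phone_numbers` key are also assumptions.

```python
import json
import re
from typing import Callable, Iterator, List


def chunk_text(text: str, max_chars: int = 2000) -> Iterator[str]:
    """Split text into chunks of roughly max_chars, breaking on whitespace."""
    current, size = [], 0
    for word in text.split():
        if current and size + len(word) + 1 > max_chars:
            yield " ".join(current)
            current, size = [], 0
        current.append(word)
        size += len(word) + 1
    if current:
        yield " ".join(current)


def extract_numbers(text: str, ask_model: Callable[[str], str],
                    max_chars: int = 2000) -> List[str]:
    """Send each chunk to the model, parse its JSON reply, and drop duplicates."""
    seen, results = set(), []
    for chunk in chunk_text(text, max_chars):
        prompt = (
            'Return JSON of the form {"phone_numbers": [...]} '
            "listing the phone numbers in this text:\n" + chunk
        )
        try:
            payload = json.loads(ask_model(prompt))
        except json.JSONDecodeError:
            continue  # skip chunks where the reply was not valid JSON
        numbers = payload.get("phone_numbers") if isinstance(payload, dict) else None
        if not isinstance(numbers, list):
            continue
        for number in numbers:
            key = re.sub(r"\D", "", str(number))  # compare by digits only
            if key and key not in seen:
                seen.add(key)
                results.append(str(number))
    return results
```

One limitation of this sketch: because chunks break on whitespace, a number written with spaces can be split across two chunks. Overlapping the chunks slightly would be one way to reduce that risk.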
I use a hybrid approach: a simple regex validates the AI output. This filters out junk and keeps accuracy above 95%.
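A validation pass of this kind might look like the sketch below. The rule it applies (10 digits, or 11 with a leading 1) is an assumption that fits the North American formats shown earlier. It would need loosening for international numbers, and it is not the validation regex from the original project.

```python
import re

# Loose shape check: optional leading +, then only digits and common separators.
LOOSE_PHONE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def is_plausible_phone(candidate: str) -> bool:
    """Return True if the model's output looks like a North American phone number."""
    if not LOOSE_PHONE.match(candidate.strip()):
        return False
    digits = re.sub(r"\D", "", candidate)
    # Assumed rule: 10 digits, or 11 digits starting with country code 1.
    return len(digits) == 10 or (len(digits) == 11 and digits.startswith("1"))


candidates = ["(555) 123-4567", "+1 555.123.4567", "orderId 4155550123", "555-1234"]
kept = [c for c in candidates if is_plausible_phone(c)]
print(kept)  # ['(555) 123-4567', '+1 555.123.4567']
```

The regex here does not try to find numbers in free text. It only rejects model output that cannot be a phone number, which is a much easier job.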
Lessons learned:
- Use regex for validation, not primary extraction.
- AI excels at understanding context.
- Prompt engineering matters more than model choice. A specific prompt gives better results than a vague one.
- Consider cost and speed. API calls add latency and expense.
- If you process millions of documents, use local models like Llama 3 to save money.
My advice: Start with a hybrid system. Use regex for a fast first pass. Send only the messy or ambiguous cases to the AI. This reduces API costs by 60%.
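A rough sketch of that routing idea follows. The strict pattern, the "ambiguous" heuristic (seven or more digits left over after removing regex matches), and the `ask_model` callable (assumed to take a text chunk and return a list of number strings) are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

# Fast first pass: a strict pattern for the common ten-digit formats.
STRICT_PHONE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def route_chunk(chunk: str,
                ask_model: Callable[[str], List[str]]) -> Tuple[List[str], bool]:
    """Return (numbers, used_model) for one chunk of text.

    The model is called only when digits remain that the regex did not
    explain, which is the assumed signal for a messy or ambiguous chunk.
    """
    found = STRICT_PHONE.findall(chunk)
    leftover = STRICT_PHONE.sub(" ", chunk)
    leftover_digits = sum(ch.isdigit() for ch in leftover)
    if leftover_digits < 7:
        return found, False  # regex result is trusted, no API call
    extra = [n for n in ask_model(chunk) if n not in found]
    return found + extra, True


def fake_model(chunk: str) -> List[str]:
    # Stand-in for a real API call, used only to show the control flow.
    return ["+44 20 7946 0958"]


clean = route_chunk("Call 555.123.4567 today", fake_model)
messy = route_chunk("Ring +44 20 7946 0958 or 555.123.4567", fake_model)
```

In this example `clean` is resolved by the regex alone, while `messy` triggers the model because the international number leaves unexplained digits behind. How much this saves depends on how many chunks fall into each bucket.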
Have you struggled with regex? How do you handle messy data?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e