When Regex Fails: My Journey to AI-Powered Data Extraction

I spent three hours fighting a regular expression.

I wanted to extract phone numbers from scraped HTML. My regex worked for 70% of the data and failed on the rest.

Web data is messy. I found numbers formatted like this:

My regex also matched unrelated numbers inside JavaScript variables, which produced too many false positives.

I tried several solutions:

  1. Better Regex: I wrote more complex patterns. They still missed international codes and tripped on edge cases.
  2. HTML Parsing: I stripped tags and used string operations. This worked better but broke on "tel:" links.
  3. NLP Models: I used spaCy. It is great for general text, but its phone detection was spotty. Training a custom model felt like too much work for a small project.
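To make the HTML parsing attempt concrete, here is a minimal sketch using only Python's standard library. It is not the code from this project: the class and function names are made up for illustration. It skips the contents of script and style tags, which is where the JavaScript false positives come from, and it collects "tel:" link targets separately so they are not lost when tags are stripped.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects visible text and tel: link targets, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.tel_links = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("tel:"):
                # Keep the link target itself, minus the "tel:" scheme.
                self.tel_links.append(href[4:])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script>/<style>.
        if self._skip_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def extract_text_and_tel_links(html):
    """Return (visible_text, tel_links) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_chunks), parser.tel_links
```

The visible text can then go to whatever extraction step comes next, while the "tel:" targets are already strong candidates on their own.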

I needed a tool that understood meaning, not just patterns. I switched to a semantic extraction approach using an AI model API.

Instead of defining what a number looks like, I tell the model what I want, and the model finds the boundaries for me. This works even when the surrounding text contains other numbers, as in "Please do not call after 9pm."
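A rough sketch of what "telling the model what I want" could look like in code. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, and the `call_model` parameter, which stands in for whichever AI model API you use (it takes a prompt string and returns the reply text).

```python
import json

# Hypothetical prompt; the wording is an assumption, not a tested recipe.
PROMPT_TEMPLATE = (
    "Extract every phone number a person could dial from the text below.\n"
    "Ignore times, prices, IDs, and other numbers that are not phone numbers.\n"
    "Reply with a JSON array of strings and nothing else.\n\n"
    "Text:\n{text}"
)


def build_prompt(text):
    """Describe the goal in plain language instead of as a pattern."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_model_reply(reply):
    """Parse the reply as a JSON list of strings; return [] on anything else."""
    try:
        data = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, list):
        return []
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]


def extract_phone_numbers(text, call_model):
    """call_model is a placeholder: any callable mapping prompt -> reply text."""
    return parse_model_reply(call_model(build_prompt(text)))
```

Passing the model call in as a function keeps the sketch independent of any specific provider and makes it easy to test with a stub.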

My new workflow:

I use a hybrid approach: the AI model extracts candidate numbers, and a simple regex validates its output. This filters out junk and keeps accuracy above 95%.
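The validation step can be very small. Below is a sketch of one possible sanity check, with assumptions labeled: the allowed separator characters and the 7-digit minimum are my guesses at a reasonable rule, while the 15-digit maximum follows the E.164 numbering standard.

```python
import re

# Assumed shape: optional leading "+", then only digits and common separators.
PHONE_SHAPE = re.compile(r"^\+?[\d\s().-]+$")


def is_plausible_phone(candidate, min_digits=7, max_digits=15):
    """Cheap sanity check for a candidate string returned by the model."""
    if not PHONE_SHAPE.match(candidate.strip()):
        return False
    digit_count = sum(ch.isdigit() for ch in candidate)
    return min_digits <= digit_count <= max_digits


def filter_candidates(candidates):
    """Keep only the candidates that pass the sanity check."""
    return [c for c in candidates if is_plausible_phone(c)]
```

The regex here does not try to find numbers, only to reject output that cannot be one, which is a much easier job than the original extraction problem.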

Lessons learned:

My advice: Start with a hybrid system. Use regex for a fast first pass, and send only the messy or ambiguous cases to the AI. For me, this reduced API costs by 60%.
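The routing idea can be sketched like this. The two patterns and the rule for what counts as "ambiguous" are assumptions chosen for the example, and `ai_extract` is a placeholder for your own function that sends one snippet to the model and returns a list of numbers.

```python
import re

# Assumed "clean" format for the fast path: optional +country code, then 3-3-4 digits.
CLEAN_PHONE = re.compile(r"(?<!\d)(?:\+\d{1,3}[ -])?\d{3}[ -]\d{3}[ -]\d{4}(?!\d)")
# Assumed ambiguity signal: any run of three or more digits.
HAS_DIGIT_RUN = re.compile(r"\d{3,}")


def extract_hybrid(snippets, ai_extract):
    """Return ([(snippet, numbers), ...], ai_calls) using regex first, AI second."""
    results = []
    ai_calls = 0
    for snippet in snippets:
        clean = CLEAN_PHONE.findall(snippet)
        if clean:
            # Fast path: the regex is confident, so no API call is needed.
            results.append((snippet, clean))
        elif HAS_DIGIT_RUN.search(snippet):
            # Digits are present but not in a clean format: ask the model.
            results.append((snippet, ai_extract(snippet)))
            ai_calls += 1
        else:
            # No digit runs at all: nothing to extract, no call needed.
            results.append((snippet, []))
    return results, ai_calls
```

Counting `ai_calls` makes it easy to measure how much traffic the regex pass actually saves on your own data.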

Have you struggled with regex? How do you handle messy data?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e
