Cleaning Messy Data With LLMs

I had to extract data from messy emails. Invoices and purchase orders had no set format.

I started with regex. I wrote patterns for every vendor.

It worked for a week. Then vendors changed their formats. My code broke. I spent two weeks fixing bugs. New fixes broke old cases.
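
Here is a small sketch of why per-vendor regex is so brittle. The vendor names, the patterns, and both sample emails are made up for illustration; they are not my real patterns. The point is only that a pattern tuned to one layout returns nothing once the layout shifts.

```python
import re

# Hypothetical per-vendor patterns; names and formats are illustrative only.
VENDOR_PATTERNS = {
    "acme": re.compile(
        r"Invoice #(?P<number>\d+)\s+Total: \$(?P<total>[\d,]+\.\d{2})"
    ),
}


def extract_invoice(vendor, body):
    """Return the captured fields, or None if the vendor's pattern does not match."""
    pattern = VENDOR_PATTERNS.get(vendor)
    if pattern is None:
        return None
    match = pattern.search(body)
    return match.groupdict() if match else None


# The layout the pattern was written for.
old_email = "Invoice #1042 Total: $1,250.00"
# Same information after a made-up format change by the vendor.
new_email = "Invoice No. 1042 | Grand total: USD 1,250.00"

old_result = extract_invoice("acme", old_email)  # fields are captured
new_result = extract_invoice("acme", new_email)  # pattern no longer matches
```

Loosening the pattern to cover the new layout is where the old cases tend to break.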

I tried other tools. Template matching failed. Training a model took too long.

I switched to LLMs. The process is simple:

It worked immediately.
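
A minimal sketch of what an LLM extraction step can look like, not necessarily the exact process I used. It assumes a `complete` function that takes a prompt string and returns the model's text reply; that function, the field names, and the prompt wording are all hypothetical. Swap in whatever client you actually use.

```python
import json
from typing import Callable, Optional

# Hypothetical schema; pick the fields your documents actually need.
REQUIRED_FIELDS = ("vendor", "invoice_number", "total")

PROMPT_TEMPLATE = (
    "Extract the following fields from the email below and reply with JSON only: "
    "{fields}. Use null for anything that is missing.\n\nEmail:\n{email}"
)


def build_prompt(email: str) -> str:
    """Fill the prompt template with the field list and the raw email text."""
    return PROMPT_TEMPLATE.format(fields=", ".join(REQUIRED_FIELDS), email=email)


def extract_fields(email: str, complete: Callable[[str], str]) -> Optional[dict]:
    """Ask the model for JSON; return it only if it parses and has every field."""
    raw = complete(build_prompt(email))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the reply was not valid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    if any(field not in data for field in REQUIRED_FIELDS):
        return None  # a required key is missing
    return {field: data[field] for field in REQUIRED_FIELDS}


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"vendor": "Acme", "invoice_number": "1042", "total": "1250.00"}'


result = extract_fields("Invoice No. 1042 | Grand total: USD 1,250.00", fake_complete)
```

Passing the model call in as a plain function keeps the parsing and checking code testable without network access.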

Production needs more work:

My advice for you:

How do you extract data from messy text?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/struggling-with-text-extraction-heres-how-i-finally-cleaned-up-messy-data-1nab