Cleaning Messy Data With LLMs
I had to extract data from messy emails. Invoices and purchase orders had no set format.
I started with regex. I wrote patterns for every vendor.
It worked for a week. Then vendors changed their formats. My code broke. I spent two weeks fixing bugs. New fixes broke old cases.
I tried other tools. Template matching failed. Training a model took too long.
I switched to LLMs. The process is simple:
- Define a schema.
- Send raw text and instructions to the model.
- Parse the JSON response.
It worked immediately.
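Here is a minimal sketch of those three steps, not my exact production code. The field names in `INVOICE_SCHEMA` and the `call_model` parameter are assumptions for illustration: `call_model` stands in for whatever client you use and only needs to take a prompt string and return the model's text reply.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> short description shown to the model.
INVOICE_SCHEMA = {
    "vendor": "string, company that issued the document",
    "invoice_number": "string",
    "total": "number, without currency symbol",
    "currency": "string, ISO 4217 code such as USD",
}


def build_prompt(raw_text: str, schema: dict[str, str]) -> str:
    """Combine the instructions, the schema, and the raw email into one prompt."""
    return (
        "Extract the fields below from the email. "
        "Reply with a single JSON object and nothing else. "
        "Use null for any field that is not present.\n\n"
        f"Fields:\n{json.dumps(schema, indent=2)}\n\n"
        f"Email:\n{raw_text}"
    )


def extract(raw_text: str, call_model: Callable[[str], str]) -> dict:
    """Send the prompt to the model and parse its JSON reply."""
    reply = call_model(build_prompt(raw_text, INVOICE_SCHEMA))
    # Replies are sometimes wrapped in prose or code fences,
    # so keep only the outermost {...} span before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("model reply contained no JSON object")
    data = json.loads(reply[start : end + 1])
    # Keep only the schema fields; anything missing becomes None.
    return {key: data.get(key) for key in INVOICE_SCHEMA}
```

Passing the model call in as a function keeps the sketch independent of any one provider and makes it easy to test with a stub.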
Production needs more work:
- Cost: Use cheaper models and batch tasks.
- Speed: Move extraction to background queues.
- Accuracy: Add validation rules to catch hallucinations.
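For the accuracy point, here is a sketch of what validation rules can look like. The function name, the field names, and the three rules are assumptions for illustration. The idea is to check each extracted record against simple business rules and against the source email itself.

```python
def validate_invoice(fields: dict, source_text: str) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    # Rule 1: required fields must be present and non-empty.
    for name in ("vendor", "invoice_number", "total"):
        if fields.get(name) in (None, ""):
            problems.append(f"missing {name}")

    # Rule 2: the total must be a positive number (bool is rejected explicitly).
    total = fields.get("total")
    if total is not None and (
        isinstance(total, bool)
        or not isinstance(total, (int, float))
        or total <= 0
    ):
        problems.append("total is not a positive number")

    # Rule 3: grounding check. The invoice number must appear verbatim
    # in the email, which flags values the model made up.
    number = fields.get("invoice_number")
    if number and str(number) not in source_text:
        problems.append("invoice_number not found in source text")

    return problems
```

Records that come back with problems can go to a retry or a human review queue instead of straight into your database.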
My advice for you:
- Build a hybrid pipeline.
- Use regex for standard emails.
- Send messy emails to an LLM.
- Always validate the output.
- Use the smallest model that does the job.
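A hybrid pipeline can be sketched like this. The regex is a made-up pattern for one well-behaved vendor format, and `llm_extract` and `validate` are hypothetical callables you supply: the first takes the email text and returns a dict of fields, the second takes the fields plus the text and returns a list of problems.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for one standard, predictable email format.
STANDARD_INVOICE = re.compile(
    r"Invoice\s+No[.:]?\s*(?P<invoice_number>[A-Z0-9-]+).*?"
    r"Total[.:]?\s*(?P<total>\d+(?:\.\d{2})?)",
    re.DOTALL,
)


def try_regex(text: str) -> Optional[dict]:
    """Cheap path: return fields if the email matches the known format."""
    match = STANDARD_INVOICE.search(text)
    if match is None:
        return None
    return {
        "invoice_number": match.group("invoice_number"),
        "total": float(match.group("total")),
    }


def extract_hybrid(
    text: str,
    llm_extract: Callable[[str], dict],
    validate: Callable[[dict, str], list],
) -> dict:
    """Try regex first, fall back to the LLM, and validate either way."""
    fields = try_regex(text)
    source = "regex"
    if fields is None:
        # Messy email: only now pay for a model call.
        fields = llm_extract(text)
        source = "llm"
    problems = validate(fields, text)
    return {"source": source, "fields": fields, "problems": problems}
```

Recording which path produced each result makes it easy to see how much traffic still needs the model.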
How do you extract data from messy text?