Cleaning Messy Data With LLMs
I ran a system that processed emails. The emails contained invoices and orders. Formats varied from vendor to vendor.
I tried regex first. I wrote patterns for a few vendors. It worked for a while. Then vendors changed layouts. The regex broke. I spent weeks fixing patterns. One fix broke other cases. It was a mess.
I tried other ways. Templates failed. Custom models took too long.
I tried LLMs. I defined a schema. I sent raw text to the model. The model returned JSON. It worked immediately. No patterns. No training.
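Here is a minimal sketch of that flow: define a schema, send raw text, parse the JSON that comes back. The field names in `INVOICE_SCHEMA`, the prompt wording, and the `call_model` parameter are all assumptions for illustration. `call_model` stands for whatever LLM client you use, and `fake_model` is a stand-in so the sketch runs offline.

```python
import json
from typing import Callable, Dict

# Hypothetical schema: these field names and descriptions are illustrative only.
INVOICE_SCHEMA: Dict[str, str] = {
    "vendor": "string, the company that sent the invoice",
    "invoice_number": "string",
    "total": "number, the amount due",
    "currency": "string, a code such as USD",
}


def build_prompt(email_text: str, schema: Dict[str, str]) -> str:
    """Ask the model for a single JSON object matching the schema."""
    fields = "\n".join(f'- "{name}": {desc}' for name, desc in schema.items())
    return (
        "Extract the following fields from the email below.\n"
        "Reply with one JSON object and nothing else. "
        "Use null for any field you cannot find.\n\n"
        f"Fields:\n{fields}\n\nEmail:\n{email_text}"
    )


def extract(
    email_text: str,
    call_model: Callable[[str], str],
    schema: Dict[str, str] = INVOICE_SCHEMA,
) -> Dict:
    """Send raw text to the model and parse the JSON it returns."""
    reply = call_model(build_prompt(email_text, schema))
    data = json.loads(reply)  # raises json.JSONDecodeError on a malformed reply
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only schema fields so unexpected keys never reach downstream code.
    return {name: data.get(name) for name in schema}


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM client, so the sketch runs without network access.
    return (
        '{"vendor": "Acme", "invoice_number": "INV-1", '
        '"total": 120.5, "currency": "USD", "note": "extra key"}'
    )
```

Calling `extract(email_text, fake_model)` returns a dict with exactly the four schema keys. To use a real model, pass your own function that takes a prompt string and returns the reply text.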
I found a few problems. Costs grew with volume. Calls took seconds. Models made mistakes.
I fixed these by:
- Batching calls to save money.
- Moving tasks to a background queue.
- Adding validation rules.
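The three fixes above can be sketched together. This is one possible reading, not the exact setup I ran: "batching" is assumed to mean several emails per model call, the background queue is a plain `queue.Queue` drained by a thread, and the two validation rules are examples. `extract_batch` and `fake_extract_batch` are hypothetical names; the fake one stands in for a real model call.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple


def validate(record: Dict) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not record.get("invoice_number"):
        errors.append("missing invoice_number")
    total = record.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        errors.append("total must be a non-negative number")
    return errors


def worker(
    jobs: queue.Queue,
    extract_batch: Callable[[List[str]], List[Dict]],
    batch_size: int,
    accepted: List[Tuple[str, Dict]],
    rejected: List[Tuple[str, Dict]],
) -> None:
    """Drain the queue in the background, one model call per batch.

    A None item is the shutdown signal and must be the last item enqueued.
    """
    done = False
    while not done:
        batch = [jobs.get()]  # block until there is work
        while len(batch) < batch_size:
            try:
                batch.append(jobs.get_nowait())
            except queue.Empty:
                break  # send a partial batch rather than wait
        if None in batch:
            done = True
            batch = [email for email in batch if email is not None]
        if not batch:
            continue
        # Assumes extract_batch returns one record per email, in order.
        for email, record in zip(batch, extract_batch(batch)):
            target = rejected if validate(record) else accepted
            target.append((email, record))


def fake_extract_batch(emails: List[str]) -> List[Dict]:
    # Stand-in for one LLM call covering several emails ("<number> <total>").
    records = []
    for email in emails:
        number, total = email.split()
        records.append({"invoice_number": number, "total": float(total)})
    return records
```

To run it, put email strings on the queue, then `None`, and start `worker` in a `threading.Thread`. Rejected records can go to a retry or a human review step.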
Now I use a hybrid system. Regex handles standard emails. LLMs handle messy text. This saves money. It saves time.
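A rough sketch of the routing: try the cheap regex first and call the model only when it finds nothing. The `STANDARD_INVOICE` pattern is a made-up example of one "standard" layout, and `llm_extract` is a hypothetical callable wrapping your model. Neither comes from my production code.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical pattern for one "standard" layout; real vendors will differ.
STANDARD_INVOICE = re.compile(
    r"Invoice\s*#?\s*(?P<invoice_number>[A-Z0-9-]+).*?"
    r"Total:\s*(?P<total>\d+(?:\.\d{2})?)",
    re.DOTALL,
)


def extract_with_regex(text: str) -> Optional[Dict]:
    """Return the fields if the text matches the standard layout, else None."""
    match = STANDARD_INVOICE.search(text)
    if match is None:
        return None
    return {
        "invoice_number": match["invoice_number"],
        "total": float(match["total"]),
    }


def extract_hybrid(text: str, llm_extract: Callable[[str], Dict]) -> Dict:
    """Use regex when it matches; fall back to the model for messy text."""
    record = extract_with_regex(text)
    if record is not None:
        return {**record, "source": "regex"}  # no model call, no cost
    return {**llm_extract(text), "source": "llm"}
```

The `source` key is there so you can track how much traffic still reaches the model.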
My advice for you:
- Keep your schemas clear.
- Validate all output.
- Start with small models.
Source: https://dev.to/__c1b9e06dc90a7e0a676b/struggling-with-text-extraction-heres-how-i-finally-cleaned-up-messy-data-1nab