Stop Parsing PDFs at Render Time

📅 3 hours ago ⏱ 1 min read

Most frontend PDF extraction tools fail.

Developers try to guess document structure from visual output. They look at rendered pixels to find columns, tables, or lists. They use computer vision or pixel proximity to decide where a box starts.

This is the wrong way to build.

A PDF already contains explicit structural data in its operator stream, the sequence of drawing commands in each page's content stream. A table is not just a group of nearby pixels. It is drawn with specific path operators: m (moveTo), l (lineTo), and re (rectangle). The boundaries you want to find are already encoded in the source.
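Here is a minimal sketch of what reading those commands can look like. It assumes the content stream is already decompressed and holds only numbers and operators. A real stream also carries strings, names, arrays, and inline images, which this naive tokenizer does not handle. The function name extract_path_ops and the sample stream are made up for illustration. In PDF syntax, moveTo, lineTo, and rectangle are written as the operators m, l, and re.

```python
from typing import List, Tuple

PathOp = Tuple[str, Tuple[float, ...]]

# Path construction operators: moveTo, lineTo, rectangle, closePath.
PATH_OPERATORS = {"m", "l", "re", "h"}


def extract_path_ops(stream: str) -> List[PathOp]:
    """Collect path construction operators and their numeric operands.

    Assumption: whitespace-separated tokens, numbers before each operator.
    """
    ops: List[PathOp] = []
    operands: List[float] = []
    for token in stream.split():
        try:
            operands.append(float(token))
        except ValueError:
            # Not a number, so treat it as an operator that ends the operand run.
            if token in PATH_OPERATORS:
                ops.append((token, tuple(operands)))
            operands = []
    return ops


# Hypothetical stream: one stroked rectangle, then one stroked horizontal line.
SAMPLE = """
72 700 200 20 re S
72 680 m 272 680 l S
"""

for op, args in extract_path_ops(SAMPLE):
    print(op, args)
```

The rectangle's origin, width, and height come straight from the file. No pixels are involved, so zoom level cannot change the result.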

If your extractor gives you different columns at 100% zoom versus 150% zoom, you are not extracting structure. You are pattern-matching visual artifacts.

Stop using visual heuristics. Start parsing the operator stream.

Why the operator stream is better:

It is deterministic. It works the same way regardless of scale or font hinting.
It uses real data. You use the actual paths and coordinates defined by the creator.
It avoids math errors. For example, using midpoints between text centers to find zones leads to rounding bugs. Using the actual top edge of a bounding box is the only correct way.
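The last point can be illustrated with a small sketch. This is one reading of the midpoint problem, not a reproduction of any specific tool. The Box type, both function names, and the coordinates are invented for the example. It compares a boundary placed halfway between two text centers with a boundary taken from the real top edge of the lower box.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    text: str
    top: float     # y of the top edge in PDF units (y grows upward)
    bottom: float  # y of the bottom edge

    @property
    def center(self) -> float:
        return (self.top + self.bottom) / 2


def boundaries_from_midpoints(rows: List[Box]) -> List[float]:
    """Heuristic: split zones halfway between neighboring text centers."""
    ordered = sorted(rows, key=lambda b: b.top, reverse=True)
    return [(a.center + b.center) / 2 for a, b in zip(ordered, ordered[1:])]


def boundaries_from_top_edges(rows: List[Box]) -> List[float]:
    """Edge-based: each zone starts at the real top edge of the next box."""
    ordered = sorted(rows, key=lambda b: b.top, reverse=True)
    return [b.top for b in ordered[1:]]


ROWS = [
    Box("Header", top=100.0, bottom=76.0),  # tall heading
    Box("Row 1", top=74.0, bottom=66.0),    # small body text
]

print(boundaries_from_midpoints(ROWS))  # boundary lands inside the header box
print(boundaries_from_top_edges(ROWS))  # boundary sits on Row 1's top edge
```

With mixed font sizes, the midpoint boundary cuts through the taller box. The edge-based boundary cannot, because it is one of the box's own coordinates.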

The hard path is the right path.

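You must understand the CTM (current transformation matrix) stack: the q operator saves the graphics state, Q restores it, and cm concatenates a new matrix onto the CTM. You must track matrix states and classify subpaths. You have to read the PDF specification and source code to master it.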
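A rough sketch of that bookkeeping follows. It tracks the current transformation matrix through q (save), Q (restore), and cm (concatenate), then maps each re rectangle into page space. It assumes a decompressed stream with only numeric operands, and it ignores every other operator. The function names, the sample stream, and the 1-unit threshold in classify are assumptions made for illustration, not values from the PDF specification.

```python
from typing import List, Tuple

Matrix = Tuple[float, float, float, float, float, float]  # a b c d e f
BBox = Tuple[float, float, float, float]                  # x0 y0 x1 y1
IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Return m x ctm. The cm operator premultiplies onto the current CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def apply(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def rectangles_in_page_space(stream: str) -> List[BBox]:
    ctm, stack = IDENTITY, []
    operands: List[float] = []
    rects: List[BBox] = []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass
        if token == "q":
            stack.append(ctm)              # save graphics state
        elif token == "Q" and stack:
            ctm = stack.pop()              # restore graphics state
        elif token == "cm" and len(operands) == 6:
            ctm = concat(tuple(operands), ctm)
        elif token == "re" and len(operands) == 4:
            x, y, w, h = operands
            pts = [apply(ctm, px, py) for px, py in
                   ((x, y), (x + w, y), (x + w, y + h), (x, y + h))]
            xs, ys = [p[0] for p in pts], [p[1] for p in pts]
            rects.append((min(xs), min(ys), max(xs), max(ys)))
        operands = []
    return rects


def classify(box: BBox, thin: float = 1.0) -> str:
    """Label a rectangle as a ruling line or a box (threshold is a guess)."""
    width, height = box[2] - box[0], box[3] - box[1]
    if height <= thin:
        return "horizontal rule"
    if width <= thin:
        return "vertical rule"
    return "box"


# Hypothetical stream: the same rectangle drawn inside and outside a 2x scale.
SAMPLE = "q 2 0 0 2 10 20 cm 5 5 50 10 re S Q 5 5 50 10 re S"

for rect in rectangles_in_page_space(SAMPLE):
    print(rect, classify(rect))
```

The two re commands have identical operands but land in different places on the page. An extractor that skips the CTM would report them as the same rectangle.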

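This takes more effort upfront. But it generalizes to the PDFs users actually upload, as long as the page is drawn with vector operators rather than stored as a scanned image. Pixel-based tools tend to work only for the few files in your test suite.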

Build a real extractor, not a demo.

Source: https://dev.to/bonzai2carn/stop-parsing-pdfs-at-render-time-a-better-architecture-for-structured-extraction-5fb8
