Most PDF Extractors Use the Wrong API
Most PDF tools use the wrong data source.
When developers talk about PDF extraction, they usually mean getTextContent(). This method provides text items and their positions. It is the default for almost every browser-side tool.
But getTextContent() is a processed view. It is a simplified version of what is actually available.
There are three levels of data in PDF.js:
- getStructTree(): This tells you what the document means. It contains tables, headings, and formulas.
- getOperatorList(): This tells you what the document draws. It includes lines, paths, and shapes.
- getTextContent(): This is a filtered view of what the document draws.
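To make the three levels concrete, here is a minimal TypeScript sketch that requests all three views of one page and reports how much each one contains. The `PageLike` interface and the `collectPageViews` name are illustrative assumptions. `PageLike` mirrors only the three PDF.js page methods named above, and the field shapes (`children`, `fnArray`, `items`) are trimmed to what the sketch actually reads.

```typescript
// Trimmed-down shapes for illustration; real PDF.js objects carry more fields.
interface ViewStructNode {
  role: string;
  children: Array<ViewStructNode | { type: string }>;
}
interface ViewOperatorList {
  fnArray: number[];    // one numeric op code per drawing instruction
  argsArray: unknown[]; // arguments for each instruction
}
interface ViewTextItem {
  str: string;
  transform: number[];
}
interface ViewTextContent {
  items: Array<ViewTextItem | { type: string }>;
}

// Hypothetical interface: anything exposing the three page methods.
interface PageLike {
  getStructTree(): Promise<ViewStructNode | null | undefined>;
  getOperatorList(): Promise<ViewOperatorList>;
  getTextContent(): Promise<ViewTextContent>;
}

async function collectPageViews(page: PageLike) {
  const [structTree, operatorList, textContent] = await Promise.all([
    page.getStructTree(),
    page.getOperatorList(),
    page.getTextContent(),
  ]);
  return {
    // Assumption: an untagged page yields null, undefined, or an empty tree.
    isTagged: structTree != null && structTree.children.length > 0,
    drawOpCount: operatorList.fnArray.length,
    // Count only entries that carry text, skipping marked-content markers.
    textItemCount: textContent.items.filter((it) => "str" in it).length,
  };
}
```

Comparing `drawOpCount` with `textItemCount` on a real page gives a rough sense of how much drawing information the text view leaves out.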
Most tools use the third option. This works for roughly 80% of documents, such as simple reports. However, it fails on academic papers and complex publications.
Using only getTextContent() creates four major problems:
- You lose table structures. You have to guess where cells are based on text positions.
- You break math equations. Equations typeset with LaTeX often come back as single, giant text blocks.
- You miss column lines. Many layouts use actual lines to separate columns. These lines do not exist in the text content.
- You get the wrong reading order. Text often appears in the order it was drawn, not in the order a human reads it.
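The first and fourth problems both come down to guessing from coordinates. As a rough illustration of what that guessing looks like, the sketch below clusters text items into rows by baseline. It assumes each item carries a PDF.js-style `transform` array in which index 4 is x and index 5 is y. `PositionedText`, `groupIntoRows`, and the 2-unit `yTolerance` default are hypothetical names and values chosen for the example.

```typescript
// Assumed shape: transform[4] is the x origin, transform[5] is the y origin.
interface PositionedText {
  str: string;
  transform: number[];
}

function groupIntoRows(items: PositionedText[], yTolerance = 2): string[][] {
  // PDF y coordinates grow upward, so sort descending to go top to bottom.
  const byY = items.slice().sort((a, b) => b.transform[5] - a.transform[5]);

  const rows: { y: number; cells: PositionedText[] }[] = [];
  for (const item of byY) {
    const y = item.transform[5];
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - y) <= yTolerance) {
      // Close enough to the current row's baseline: treat as the same row.
      last.cells.push(item);
    } else {
      rows.push({ y, cells: [item] });
    }
  }

  // Within each row, order cells left to right.
  return rows.map((row) =>
    row.cells
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((cell) => cell.str),
  );
}
```

The tolerance is a heuristic, not a property of the document. On a two-column page this function would also merge left-column and right-column text that shares a baseline into one row, which is the reading-order failure described above.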
The right way to build a PDF processor is a three-tier system:
- Check getStructTree() first. If the document has a logical structure, use it to find tables and headings immediately.
- Check getOperatorList() next. Use explicit lines and paths to find column boundaries.
- Use getTextContent() as a fallback. Use geometric math only when the first two tiers provide no data.
This approach is not more work. Tiers 1 and 2 act as fast exits. If the document is well-structured, you skip the hard math. You only use complex inference when the document is untagged.
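The fast-exit idea can be sketched as a small dispatcher that asks for each data source only when the previous one came back empty. This is a hedged illustration, not the article's implementation: `TieredPage`, `chooseTier`, and the `Tier` labels are hypothetical, and `pathOpCode` is assumed to be whatever numeric code the caller's PDF.js build uses for path construction (for example `OPS.constructPath`). Checking for the mere presence of path operators is only a stand-in for real line detection, which would need to decode the path arguments.

```typescript
type Tier = "struct-tree" | "operator-list" | "text-geometry";

interface TaggedNode {
  role: string;
  children: Array<TaggedNode | { type: string }>;
}

// Hypothetical interface covering just the two calls the dispatcher makes.
interface TieredPage {
  getStructTree(): Promise<TaggedNode | null | undefined>;
  getOperatorList(): Promise<{ fnArray: number[] }>;
}

async function chooseTier(page: TieredPage, pathOpCode: number): Promise<Tier> {
  // Tier 1: a tagged document exposes a non-empty structure tree.
  const tree = await page.getStructTree();
  if (tree != null && tree.children.length > 0) {
    return "struct-tree"; // fast exit, the operator list is never requested
  }

  // Tier 2: drawn paths may include column rules or table borders.
  const { fnArray } = await page.getOperatorList();
  if (fnArray.some((fn) => fn === pathOpCode)) {
    return "operator-list";
  }

  // Tier 3: nothing explicit to go on, so fall back to geometric inference.
  return "text-geometry";
}
```

Because each tier returns as soon as it finds usable data, a well-tagged page costs one call, and only untagged pages without drawn paths reach the geometric fallback.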
This architecture handles both simple corporate files and complex scientific papers.
Source: https://dev.to/bonzai2carn/most-pdf-extractors-use-the-wrong-api-heres-what-we-built-instead-5dgh
