Taming Long Document Analysis with LLMs

I needed to answer questions from 100-page PDFs. A simple script failed. I fought token limits and high costs for weeks.

First, I sent the full text in one prompt. The model forgot details in the middle. Costs hit 50 cents per call.

Then I tried these methods:

Fixed chunks: The model picked the wrong parts.
Map-reduce: Summaries lost the details.
Sliding window: It was too slow.
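
A minimal sketch of how fixed chunks and a sliding window differ, in Python. Splitting on words, the function names, and the sizes are assumptions for illustration; a real setup may split on tokens or characters instead.

```python
def fixed_chunks(words, size):
    """Split a list of words into consecutive, non-overlapping chunks."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_windows(words, size, stride):
    """Overlapping windows; stride < size means neighbours share words."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    last_start = max(len(words) - size, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail is covered
    return [words[s:s + size] for s in starts]


# Toy document of 10 "words" (illustrative data only).
words = [f"w{i}" for i in range(10)]
fixed = fixed_chunks(words, 4)
windows = sliding_windows(words, 4, 2)
```

Here the fixed split gives 3 chunks and the sliding window gives 4. A fact that spans w3 and w4 is cut in half by the fixed split but sits whole inside the second window. The price is more windows and so more model calls, which is one likely reason a sliding window feels slow.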

I decided to mimic how humans read. Humans skim first. Then they read the parts that matter.

Here is my process:

Create a hierarchy of chunks.
Write a short summary for each chunk.
Store both summaries and raw text in a vector database.
Use hybrid search to find the best summaries.
Fetch the raw text behind those summaries.
Use a strict prompt to curb hallucinations.
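
The six steps above can be sketched end to end in plain Python. This is a rough illustration under stated assumptions, not the actual implementation: a list stands in for the vector database, word-count cosine stands in for embedding similarity, `summarize` takes the first sentence where a real pipeline would call a cheap model, and every name (`Chunk`, `build_index`, `hybrid_search`, `build_prompt`, `alpha`) is hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    section: str   # parent level of the hierarchy
    raw: str       # full text, sent to the final model
    summary: str   # short text, used only for search


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(text):
    # Placeholder: first sentence only. A real pipeline would call a cheap model.
    return text.split(". ")[0].rstrip(".") + "."


def build_index(doc):
    """doc maps section title -> list of paragraphs (a two-level hierarchy)."""
    index = []  # stand-in for a vector database
    for s, (title, paragraphs) in enumerate(doc.items()):
        for p, para in enumerate(paragraphs):
            index.append(Chunk(f"{s}.{p}", title, para, summarize(para)))
    return index


def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(index, question, k=2, alpha=0.5):
    """Blend a similarity score with an exact keyword score over summaries."""
    q = tokens(question)
    scored = []
    for chunk in index:
        s = tokens(chunk.section + " " + chunk.summary)
        vector_score = cosine(q, s)  # stand-in for embedding similarity
        keyword_score = len(set(q) & set(s)) / len(set(q)) if q else 0.0
        scored.append((alpha * vector_score + (1 - alpha) * keyword_score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question, hits):
    # The raw text goes into the prompt, not the summaries.
    context = "\n\n".join(f"[{c.chunk_id}] {c.raw}" for c in hits)
    return (
        "Answer using only the context below. Cite chunk ids. "
        'If the answer is not in the context, reply "Not found in document."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy document (illustrative data only).
doc = {
    "Warranty": ["The warranty lasts 24 months. It covers parts and labor but excludes water damage."],
    "Shipping": ["Orders ship within 3 days. Express delivery costs extra."],
}
question = "How long does the warranty last?"
index = build_index(doc)
hits = hybrid_search(index, question, k=1)
prompt = build_prompt(question, hits)
```

Search runs over the short summaries, but the prompt carries the raw paragraph. That is why a detail such as "excludes water damage" still reaches the final model even though the summary dropped it.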

This changed the results:

Costs dropped by 70 percent.
Accuracy went up.
Technical terms stayed intact.

My tips for you:

Use cheap models for summaries.
Use GPT-4 for the final answer.
Build a test dataset in the first week.
Skip this for docs under 20 pages.
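
These tips could be wired up roughly like this. It is a sketch under assumptions: the summary model name is a made-up placeholder, `choose_strategy`, `EvalCase`, and `accuracy` are hypothetical helpers, and the substring check is the simplest possible scoring rule, not a recommendation.

```python
from dataclasses import dataclass

SUMMARY_MODEL = "cheap-summary-model"  # hypothetical placeholder name
ANSWER_MODEL = "gpt-4"                 # used only for the final answer
MIN_PAGES_FOR_PIPELINE = 20            # below this, skip the pipeline


def choose_strategy(page_count):
    """Short documents go straight into the prompt; long ones use retrieval."""
    if page_count < MIN_PAGES_FOR_PIPELINE:
        return "full_text"
    return "hierarchical_retrieval"


@dataclass
class EvalCase:
    question: str
    must_contain: str  # a fact the answer has to mention


def accuracy(answer_fn, cases):
    """Fraction of cases whose answer contains the expected fact."""
    if not cases:
        return 0.0
    passed = sum(
        case.must_contain.lower() in answer_fn(case.question).lower()
        for case in cases
    )
    return passed / len(cases)


# Tiny test dataset (illustrative data only).
cases = [
    EvalCase("How long is the warranty?", "24 months"),
    EvalCase("Is water damage covered?", "excluded"),
]


def fake_answer(question):
    # Stand-in for the real pipeline so the harness runs offline.
    return "The warranty lasts 24 months."


score = accuracy(fake_answer, cases)
```

Swap `fake_answer` for the real pipeline and rerun `accuracy` after every change. Even a few dozen cases make it clear whether a new chunk size or prompt helped or hurt.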

What is your setup for long docs?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/how-i-finally-tamed-long-document-analysis-with-llms-it-wasnt-simple-chunking-5ed3