𝗧𝗮𝗺𝗶𝗻𝗴 𝗟𝗼𝗻𝗴 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝘄𝗶𝘁𝗵 𝗟𝗟𝗠𝘀

I needed to answer questions from 100-page PDFs. A simple script failed. I fought token limits and high costs for weeks.

First, I sent the full text in one prompt. The model lost details from the middle of the document. Costs hit 50 cents per call.

Then I tried these methods:

  • Fixed chunks: The model picked the wrong parts.
  • Map-reduce: Summaries lost the details.
  • Sliding window: It was too slow.
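To make the difference between the first and third methods concrete, here is a minimal sketch of fixed chunking and sliding-window chunking over a list of words. The function names, sizes, and stride are assumptions for illustration, not the code from the original experiments.

```python
from typing import List, Sequence


def fixed_chunks(words: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split words into back-to-back chunks with no overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_windows(words: Sequence[str], size: int, stride: int) -> List[Sequence[str]]:
    """Split words into overlapping windows that advance by `stride`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    for start in range(0, len(words), stride):
        windows.append(words[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return windows
```

With a stride smaller than the window size, the same text produces more windows than fixed chunks. If each window costs one model call, that overlap is one plausible reason the sliding window felt slow.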

I decided to mimic how humans read. Humans skim first. Then they read.

Here is my process:

  • Create a hierarchy of chunks.
  • Write a short summary for each chunk.
  • Store both summaries and raw text in a vector database.
  • Use hybrid search to find the best summaries.
  • Fetch the raw text behind those summaries.
  • Use a strict prompt to stop hallucinations.
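The steps above can be sketched in plain Python. This is a toy version under loud assumptions: `summarize` is a placeholder that keeps the first sentence instead of calling a cheap model, an in-memory list stands in for the vector database, and a bag-of-words cosine stands in for embedding similarity. Every name here is hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    parent_id: Optional[str]  # None for top-level sections
    raw_text: str
    summary: str


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(text: str) -> str:
    # Placeholder: a real pipeline would call a cheap LLM here.
    return text.split(". ")[0].strip()


def build_hierarchy(document: str) -> List[Chunk]:
    # Level 1: sections split on blank lines. Level 2: one chunk per line.
    chunks = []
    for s, section in enumerate(document.strip().split("\n\n")):
        sec_id = f"s{s}"
        chunks.append(Chunk(sec_id, None, section, summarize(section)))
        for p, para in enumerate(section.split("\n")):
            chunks.append(Chunk(f"{sec_id}.p{p}", sec_id, para, summarize(para)))
    return chunks


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_score(query: str, summary: str, alpha: float = 0.5) -> float:
    # Blend a similarity score with an exact keyword-overlap score.
    q, s = tokenize(query), tokenize(summary)
    similarity = cosine(Counter(q), Counter(s))  # stand-in for embedding similarity
    keyword = len(set(q) & set(s)) / len(set(q)) if q else 0.0
    return alpha * similarity + (1 - alpha) * keyword


def search(chunks: List[Chunk], query: str, top_k: int = 2) -> List[Chunk]:
    # Rank chunks by how well their summaries match; drop non-matches.
    scored = [(hybrid_score(query, c.summary), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def build_prompt(question: str, hits: List[Chunk]) -> str:
    # The prompt carries raw text, not summaries, so details survive.
    excerpts = "\n---\n".join(c.raw_text for c in hits)
    return (
        "Answer using only the excerpts below. "
        "If the answer is not in them, reply exactly: Not found in the document.\n\n"
        f"Excerpts:\n{excerpts}\n\nQuestion: {question}"
    )
```

The key design choice is that search runs over the short summaries, while the prompt is built from the raw text of the matching chunks. That keeps the prompt small without passing lossy summaries to the final model.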

This changed the results:

  • Costs dropped by 70 percent.
  • Accuracy went up.
  • Technical terms stayed intact.
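One way to make "accuracy went up" measurable is to score each version of the pipeline against a small set of questions with known answers. The sketch below assumes a substring match counts as correct, which is a crude stand-in for real grading. `answer_fn` is a hypothetical hook for whatever pipeline is under test.

```python
from typing import Callable, List, Tuple


def accuracy(answer_fn: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Return the fraction of cases whose expected text appears in the answer."""
    if not cases:
        raise ValueError("cases must not be empty")
    correct = sum(
        1 for question, expected in cases
        if expected.lower() in answer_fn(question).lower()
    )
    return correct / len(cases)
```

Running the same cases before and after a pipeline change gives two numbers to compare instead of an impression.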

My tips for you:

  • Use cheap models for summaries.
  • Use GPT-4 for the final answer.
  • Build a test dataset in the first week.
  • Skip this for docs under 20 pages.
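The model and page-count tips can be written down as a small routing rule. This is a sketch, not a recommendation of exact values: the `Plan` type, the strategy labels, and the `cheap-summary-model` placeholder are assumptions, and only the 20-page cutoff and GPT-4 come from the tips above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    strategy: str
    summary_model: Optional[str]  # None when no summaries are needed
    answer_model: str


def plan_for(page_count: int, page_threshold: int = 20) -> Plan:
    """Pick a strategy from the document length."""
    if page_count < page_threshold:
        # Short document: send the full text and skip the pipeline.
        return Plan("full_text", None, "gpt-4")
    # Long document: cheap model for summaries, stronger model for the answer.
    return Plan("hierarchical_retrieval", "cheap-summary-model", "gpt-4")
```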

What is your setup for long docs?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/how-i-finally-tamed-long-document-analysis-with-llms-it-wasnt-simple-chunking-5ed3