𝗥𝗔𝗚 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗳𝗼𝗿 𝗦𝗮𝗮𝗦 𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀 𝗨𝘀𝗶𝗻𝗴 𝗔𝗺𝗮𝘇𝗼𝗻 𝗕𝗲𝗱𝗿𝗼𝗰𝗸

My first production batch on Autowired.ai cost 3x my budget.

I sent full OCR text from 200 documents to a frontier model for every single field. This was a mistake. I was paying for data the model did not need.

I redesigned the architecture and cut costs by 40%. Here is how you can do the same.

  1. Stop using LLMs for everything

Textract is excellent at extracting structured fields like dates and totals. I was using Bedrock to redo work Textract already finished.

The new flow uses three stages: • Use Textract for the bulk of the work. • Send only missing fields to Bedrock for a gap-fill call. • Use Bedrock for a final verification call.

If Textract is confident, Bedrock does less work. This drops your token count immediately.

  1. Use Prompt Caching

System prompts for field definitions and schemas are static. They do not change between documents.

Amazon Bedrock allows you to cache these prompts. The first call in a batch pays a small premium. Every subsequent call in that window hits the cache at 10% of the usual price. This reduced my input costs by 20%.

  1. Filter your context

Do not send the full OCR response to Bedrock.

• For gap-fill: Send only the specific OCR blocks related to the missing fields. • For verification: Send the extracted values, not the raw OCR.

I also cleaned my prompts. Removing redundant instructions reduced my prompt size from 2,400 tokens to 1,100 tokens with zero loss in accuracy.

  1. Match the model to the task

Do not use Claude Sonnet for every task. Sonnet is 5x more expensive than Haiku.

I tested them on specific tasks: • Structured form gap-fill: Haiku was 2% as accurate as Sonnet. I switched to Haiku. • Unstructured contracts: Haiku was less accurate. I kept Sonnet. • Verification: Haiku performed well. I switched to Haiku.

Choose your model based on the task complexity, not the whole system.

  1. Implement application-layer caching

I added a cache in DynamoDB using a hash of the schema and the Textract output. If you run the same test set multiple times while tuning your code, this eliminates 80% to 90% of your Bedrock calls.

ملخص البنية الهندسية الفائزة: • ذاكرة تخزين مؤقت للتطبيق لتجنب استخدام Bedrock في الطلبات المتكررة. • ذاكرة تخزين مؤقت للأوامر (prompt cache) في Bedrock لتعليمات النظام الثابتة. • تقسيم مستويات النماذج لاستخدام Haiku كلما أمكن ذلك. • تصفية السياق لإرسال البيانات الضرورية فقط.

قم بقياس الرموز (tokens) الخاصة بك قبل البدء في التحسين. ستوضح لك البيانات أين تهدر أموالك.

المصدر: https://dev.to/yogieee/rag-architecture-for-saas-applications-using-amazon-bedrock-10df

مجتمع تعليمي اختياري: https://t.me/GyaanSetuAi