𝗥𝗔𝗚 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗳𝗼𝗿 𝗦𝗮𝗮𝗦 𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀 𝗨𝘀𝗶𝗻𝗴 𝗔𝗺𝗮𝘇𝗼𝗻 𝗕𝗲𝗱𝗿𝗼𝗰𝗸
My first production batch on Autowired.ai cost 3x my budget.
I sent full OCR text from 200 documents to a frontier model for every single field. This was a mistake. I was paying for data the model did not need.
I redesigned the architecture and cut costs by 40%. Here is how you can do the same.
- Stop using LLMs for everything
Textract is excellent at extracting structured fields like dates and totals. I was using Bedrock to redo work Textract already finished.
The new flow uses three stages: • Use Textract for the bulk of the work. • Send only missing fields to Bedrock for a gap-fill call. • Use Bedrock for a final verification call.
If Textract is confident, Bedrock does less work. This drops your token count immediately.
- Use Prompt Caching
System prompts for field definitions and schemas are static. They do not change between documents.
Amazon Bedrock allows you to cache these prompts. The first call in a batch pays a small premium. Every subsequent call in that window hits the cache at 10% of the usual price. This reduced my input costs by 20%.
- Filter your context
Do not send the full OCR response to Bedrock.
• For gap-fill: Send only the specific OCR blocks related to the missing fields. • For verification: Send the extracted values, not the raw OCR.
I also cleaned my prompts. Removing redundant instructions reduced my prompt size from 2,400 tokens to 1,100 tokens with zero loss in accuracy.
- Match the model to the task
Do not use Claude Sonnet for every task. Sonnet is 5x more expensive than Haiku.
I tested them on specific tasks: • Structured form gap-fill: Haiku was 2% as accurate as Sonnet. I switched to Haiku. • Unstructured contracts: Haiku was less accurate. I kept Sonnet. • Verification: Haiku performed well. I switched to Haiku.
Choose your model based on the task complexity, not the whole system.
- Implement application-layer caching
I added a cache in DynamoDB using a hash of the schema and the Textract output. If you run the same test set multiple times while tuning your code, this eliminates 80% to 90% of your Bedrock calls.
סיכום הארכיטקטורה המנצחת: • Application cache כדי לדלג על Bedrock בבקשות חוזרות. • Bedrock prompt cache להוראות מערכת סטטיות. • Model tiering לשימוש ב-Haiku במידת האפשר. • Context filtering כדי לשלוח רק את הנתונים הנחוצים.
מדדו את הטוקנים שלכם לפני שאתם מבצעים אופטימיזציה. הנתונים יראו לכם היכן אתם מבזבזים כסף.
מקור: https://dev.to/yogieee/rag-architecture-for-saas-applications-using-amazon-bedrock-10df
קהילת למידה אופציונלית: https://t.me/GyaanSetuAi