RAG Architecture for SaaS Applications Using Amazon Bedrock

My first production batch on Autowired.ai cost 3x my budget.

I sent full OCR text from 200 documents to a frontier model for every single field. This was a mistake. I was paying for data the model did not need.

I redesigned the architecture and cut costs by 40%. Here is how you can do the same.

  1. Stop using LLMs for everything

Textract is excellent at extracting structured fields like dates and totals. I was using Bedrock to redo work Textract already finished.

The new flow uses three stages: • Use Textract for the bulk of the work. • Send only missing fields to Bedrock for a gap-fill call. • Use Bedrock for a final verification call.

If Textract is confident, Bedrock does less work. This drops your token count immediately.

  1. Use Prompt Caching

System prompts for field definitions and schemas are static. They do not change between documents.

Amazon Bedrock allows you to cache these prompts. The first call in a batch pays a small premium. Every subsequent call in that window hits the cache at 10% of the usual price. This reduced my input costs by 20%.

  1. Filter your context

Do not send the full OCR response to Bedrock.

• For gap-fill: Send only the specific OCR blocks related to the missing fields. • For verification: Send the extracted values, not the raw OCR.

I also cleaned my prompts. Removing redundant instructions reduced my prompt size from 2,400 tokens to 1,100 tokens with zero loss in accuracy.

  1. Match the model to the task

Do not use Claude Sonnet for every task. Sonnet is 5x more expensive than Haiku.

I tested them on specific tasks: • Structured form gap-fill: Haiku was 2% as accurate as Sonnet. I switched to Haiku. • Unstructured contracts: Haiku was less accurate. I kept Sonnet. • Verification: Haiku performed well. I switched to Haiku.

Choose your model based on the task complexity, not the whole system.

  1. Implement application-layer caching

I added a cache in DynamoDB using a hash of the schema and the Textract output. If you run the same test set multiple times while tuning your code, this eliminates 80% to 90% of your Bedrock calls.

Summary of the winning architecture: • Application cache to skip Bedrock on repeats. • Bedrock prompt cache for static system instructions. • Model tiering to use Haiku where possible. • Context filtering to send only necessary data.

Measure your tokens before you optimize. The data will show you where you are wasting money.

Source: https://dev.to/yogieee/rag-architecture-for-saas-applications-using-amazon-bedrock-10df

Optional learning community: https://t.me/GyaanSetuAi