I Cut My AI API Costs by 70%
My OpenAI bill jumped from $30 to $150. A small Slack bot caused this. Repeated prompts and retries cost too much.
I tried simple fixes. I used basic caching. I switched models. Nothing worked. Users rephrase questions. A cache keyed on the exact prompt text misses as soon as the words change.
I built an AI proxy. It sits between my app and the API. It does three things:
- Semantic caching. I use embeddings to find similar questions. I serve the cached answer if the similarity score clears a threshold.
- Rate limiting. I use Redis to stop request bursts.
- Retry buffers. The proxy retries failed calls automatically.
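Here is a minimal sketch of the semantic caching idea, not the proxy's actual code. It assumes cosine similarity and a linear scan over stored vectors. `SemanticCache`, `toy_embed`, and the 0.9 threshold are hypothetical names and values picked for illustration; a real setup would call an embedding model and keep the vectors in Redis or a vector index.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache: returns a stored answer when a prompt is close enough."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # assumption: any function mapping text to a vector
        self.threshold = threshold  # assumption: 0.9 is an arbitrary starting point
        self.entries: List[Tuple[Vector, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        # Serve the cached answer only if the best match clears the threshold.
        return best_answer if best_score >= self.threshold else None

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((self.embed(prompt), answer))


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: word counts over a tiny vocabulary."""
    vocab = ["reset", "password", "change", "forgot", "invoice", "billing"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]
```

On a miss, the proxy would call the API and then `put` the new answer. The threshold is the knob to watch: set it too low and similar prompts that need different answers start sharing one.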
This cut my costs by 70%.
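The rate limiting and retry pieces can be sketched the same way. This is a rough illustration under stated assumptions, not the proxy's real code: a fixed-window counter built on `incr` and `expire` (both exist on the redis-py client), plus retries with exponential backoff. `allow_request`, `call_with_retry`, `InMemoryStore`, the key format, and every default value are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InMemoryStore:
    """Hypothetical stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self) -> None:
        self.counts = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int) -> None:
        pass  # real Redis would drop the key after `seconds`; skipped here


def allow_request(store, user: str, limit: int = 5, window_s: int = 60,
                  now: Optional[float] = None) -> bool:
    """Fixed-window limiter: at most `limit` requests per user per window."""
    now = time.time() if now is None else now
    key = f"rl:{user}:{int(now // window_s)}"  # assumption: one counter per window
    count = store.incr(key)
    if count == 1:
        store.expire(key, window_s)  # let old window counters clean themselves up
    return count <= limit


def call_with_retry(fn: Callable[[], T], attempts: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `fn` with exponential backoff, re-raising after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Assumption: every error is retryable. A real proxy would retry
            # only transient failures such as timeouts or HTTP 429/5xx.
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With a real server, `redis.Redis()` would replace `InMemoryStore()`. Keep `attempts` small. Uncapped retries were part of what ran the bill up in the first place.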
There are trade-offs:
- Latency. It adds 200ms per request.
- Memory. Redis needs space for vectors.
- Accuracy. Some similar prompts need different answers. A semantic match can serve the wrong one.
Lessons for you:
- Start with open source tools like LiteLLM.
- Track your data from day one.
- Use message queues for high traffic.
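To make the message queue point concrete, here is a small sketch using Python's built-in `asyncio.Queue` as an in-process stand-in for a real broker. `drain`, `worker`, `fake_llm_call`, the queue size, and the concurrency of 2 are all hypothetical. The idea it shows: requests wait in a bounded queue, and a fixed number of workers call the API, so a traffic spike does not turn into a spike of API calls.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def worker(queue: asyncio.Queue,
                 handle: Callable[[str], Awaitable[str]],
                 results: List[str]) -> None:
    """Pull prompts off the queue one at a time and process them."""
    while True:
        prompt = await queue.get()
        try:
            results.append(await handle(prompt))
        finally:
            queue.task_done()


async def drain(prompts: Iterable[str],
                handle: Callable[[str], Awaitable[str]],
                concurrency: int = 2) -> List[str]:
    """Process all prompts with at most `concurrency` calls in flight."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded: producers wait when full
    results: List[str] = []
    workers = [asyncio.create_task(worker(queue, handle, results))
               for _ in range(concurrency)]
    for prompt in prompts:
        await queue.put(prompt)
    await queue.join()  # wait until every queued prompt has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def fake_llm_call(prompt: str) -> str:
    """Placeholder for the real API call."""
    await asyncio.sleep(0)
    return prompt.upper()
```

Running `asyncio.run(drain(["a", "b", "c"], fake_llm_call))` returns the three answers, in whatever order the workers finish. In production the queue would usually live outside the process so requests survive a restart.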
Stop treating AI APIs as black boxes. They are HTTP endpoints. Use middleware to control them.
What is your setup? Do you use a service or build your own?