𝗜 𝗧𝗿𝗶𝗲𝗱 𝗧𝗼 𝗔𝗱𝗱 𝗔𝗜 𝗖𝗵𝗮𝘁 𝗧𝗼 𝗠𝘆 𝗔𝗽𝗽 𝗔𝗻𝗱 𝗛𝗶𝘁 𝗔 𝗪𝗮𝗹𝗹
I tried to add an AI chat assistant to my project management tool. I wanted users to ask questions about overdue tasks or meeting notes. It seemed easy. I thought I would just call an API and finish. I was wrong.
After 15 messages, the AI became slow and incoherent. The API started throwing errors because the conversation was too long. I used GPT-4 with an 8k token limit. Every message included long descriptions and notes. The history grew too fast.
I tried three different fixes:
- Truncating history: I kept only the last few messages. This saved speed but the AI forgot everything else.
- Summarization: I asked an AI to summarize the chat every 5 messages. This helped memory but increased my costs and latency.
- Relevance scoring: I tried to keep only the most relevant messages. This required a vector store and added too much complexity.
I realized I needed a better strategy. I settled on two methods: streaming and a fixed context window.
Streaming makes the app feel fast. Users see text appear instantly instead of waiting for the full reply. I used Server-Sent Events to send chunks of text as they arrive.
I also split my context into three parts:
- System prompt: A fixed set of instructions.
- Dynamic context: Recent project updates and task states.
- Conversation history: A sliding window of recent messages.
I do not send the whole history every time. I only send enough to answer the current question. This reduced my payload size by 40%. It saved me money and improved speed.
If you build AI features, remember: Streaming buys you speed. A good context strategy buys you intelligence.
How do you manage conversation memory in your apps? Do you use sliding windows or summarization?