𝗜 𝗧𝗿𝗶𝗲𝗱 𝗧𝗼 𝗔𝗱𝗱 𝗔𝗜 𝗖𝗵𝗮𝘁 𝗧𝗼 𝗠𝘆 𝗔𝗽𝗽 𝗔𝗻𝗱 𝗛𝗶𝘁 𝗔 𝗪𝗮𝗹𝗹

I tried to add an AI chat assistant to my project management tool. I wanted users to ask questions about overdue tasks or meeting notes. It seemed easy. I thought I would just call an API and finish. I was wrong.

After 15 messages, the AI became slow and incoherent. The API started throwing errors because the conversation was too long. I used GPT-4 with an 8k token limit. Every message included long descriptions and notes. The history grew too fast.

I tried three different fixes:

  • Truncating history: I kept only the last few messages. This saved speed but the AI forgot everything else.
  • Summarization: I asked an AI to summarize the chat every 5 messages. This helped memory but increased my costs and latency.
  • Relevance scoring: I tried to keep only the most relevant messages. This required a vector store and added too much complexity.

I realized I needed a better strategy. I settled on two methods: streaming and a fixed context window.

Streaming makes the app feel fast. Users see text appear instantly instead of waiting for the full reply. I used Server-Sent Events to send chunks of text as they arrive.

I also split my context into three parts:

  • System prompt: A fixed set of instructions.
  • Dynamic context: Recent project updates and task states.
  • Conversation history: A sliding window of recent messages.

I do not send the whole history every time. I only send enough to answer the current question. This reduced my payload size by 40%. It saved me money and improved speed.

If you build AI features, remember: Streaming buys you speed. A good context strategy buys you intelligence.

How do you manage conversation memory in your apps? Do you use sliding windows or summarization?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/i-tried-to-add-ai-chat-to-my-app-and-hit-a-wall-with-context-tokens-459b