๐ ๐๐น๐บ๐ผ๐๐ ๐๐ฎ๐๐ฒ ๐จ๐ฝ ๐ข๐ป ๐ ๐ ๐๐ ๐๐๐๐ถ๐๐๐ฎ๐ป๐
I spent months building a personal AI assistant. I wanted it to remember my notes and summarize my emails.
It started simple. A few Python scripts and an API. But then the conversations got long. The bot became useless. It forgot what I said. It contradicted itself. It repeated the same advice. My API costs also went up.
I tried three ways to fix it:
- Append everything: This hit token limits fast. The API cut off old messages and broke the flow.
- Sliding window: This kept only the last few messages. The bot lost all long-term memory.
- Constant summarization: This worked but cost too much money and added too much lag.
I needed a system that kept recent messages intact while maintaining a short summary of the past.
I found the solution: Hierarchical Context Management.
The design is simple:
- Recent messages: Keep the last 5 to 10 messages as raw text.
- Older history: Turn these into a single summary string.
The trick is not to summarize after every message. You only summarize when the conversation grows past a certain limit. I set a rule: if I have more than 6 recent messages and enough time has passed, I trigger a summary.
The result: The bot remembers key points from earlier. My token costs stay low. It works for 90% of my needs.
Lessons learned:
- Summary quality is vital. If the summary is bad, the bot gets confused.
- This method is not for legal or medical work. You lose fine details. For those cases, use a vector database.
- For web apps, run summarization as a background task so you do not slow down the user.
- Store your context in a database like Redis. If your server restarts, you do not want to lose the memory.
How do you handle context? Do you use a fixed window or a vector store?