𝟯 𝗟𝗮𝗻𝗴𝗚𝗿𝗮𝗽𝗵 𝗥𝗲𝘄𝗿𝗶𝘁𝗲𝘀: 𝗪𝗵𝗮𝘁 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗖𝗵𝗲𝗰𝗸𝗽𝗼𝗶𝗻𝘁𝗶𝗻𝗴 𝗧𝗮𝘂𝗴𝗵𝘁 𝗠𝗲
For three weeks, our onboarding agent lost every job it was supposed to save. The logs showed nothing. No errors. No warnings. The checkpointer reported success, but the state disappeared on resume.
I rewrote the same LangGraph pipeline three times before I understood the truth. The framework did exactly what I told it to do. I was just giving it the wrong instructions.
If you build stateful agents for production, do not assume defaults will save you. Here is what broke and how I fixed it.
𝗧𝗵𝗲 𝗙𝗶𝗿𝘀𝘁 𝗕𝗿𝗲𝗮𝗸: 𝗦𝗶𝗹𝗲𝗻𝘁 𝗦𝘁𝗮𝘁𝗲 𝗟𝗼𝘀𝘀
My agent handled a five-step onboarding flow. I used Postgres to save progress so users could resume later. But every resume started from step one.
The cause was my state schema. In LangGraph, every node returns an update that merges into the state. If you do not specify how to merge, the default is to overwrite.
I thought my message list would append. Instead, every new node replaced the entire history with just one message. The checkpoint saved the wrong data perfectly.
The fix: Use Annotated fields with explicit reducers.
• Use the add operator for message lists.
• Use a custom merge function for dictionaries.
• Use the default overwrite only for single values like "step".
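Here is a minimal sketch of what such a schema might look like. The field names (messages, profile, step) and the merge_dicts reducer are assumptions for illustration, not the original pipeline. The apply_update helper is hypothetical too: it only imitates the merge rule described above so the sketch runs without LangGraph installed. In a real graph the schema would be passed to StateGraph, and LangGraph would apply the reducers itself.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


def merge_dicts(left: dict, right: dict) -> dict:
    """Custom reducer: shallow-merge the update into the existing dict."""
    return {**left, **right}


class OnboardingState(TypedDict):
    # Explicit reducer: updates are concatenated onto the existing list.
    messages: Annotated[list, operator.add]
    # Explicit reducer: updates are merged key by key.
    profile: Annotated[dict, merge_dicts]
    # No reducer: each update overwrites the previous value.
    step: int


def apply_update(state: dict, update: dict) -> dict:
    """Hypothetical helper that imitates the merge rule for this sketch."""
    hints = get_type_hints(OnboardingState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)
        else:
            merged[key] = value  # default behavior: overwrite
    return merged
```

With this schema, a node that returns one new message extends the history instead of replacing it. Without the Annotated reducer on messages, the same update would leave a single message in the list. For chat messages, LangGraph also ships an add_messages reducer in langgraph.graph.message that may fit better than plain list concatenation.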
𝗧𝗵𝗲 𝗦𝗲𝗰𝗼𝗻𝗱 𝗕𝗿𝗲𝗮𝗸: 𝗗𝗲𝘀𝗲𝗿𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗖𝗼𝗻𝗰𝘂𝗿𝗿𝗲𝗻𝗰𝘆
The second rewrite faced two new issues:
- Corrupt rows: I stored custom objects in the state. The serializer could not handle them. This created rows that existed but were unusable.
- Duplicate keys: WhatsApp retries webhooks if you do not respond fast. If two messages arrived at once, two graph runs tried to write to the same thread. This caused database collisions.
The fixes:
• Strip custom objects. Use plain dicts and standard LangChain types only.
• Handle webhooks outside the graph. Use a queue and an idempotency key to drop duplicates.
• Add a database lock. Ensure only one run happens per thread at a time.
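The three fixes above can be sketched in a few lines. This is an in-memory stand-in, not the production setup: UserProfile, to_plain_state, and WebhookGate are hypothetical names, the set of seen message IDs stands in for a durable idempotency store, and threading.Lock stands in for a database lock (in Postgres, an advisory lock keyed on the thread ID is one option).

```python
import threading
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class UserProfile:
    """Hypothetical custom object that should not be stored in state as-is."""
    name: str
    plan: str


def to_plain_state(profile: UserProfile, step: int) -> dict:
    """Flatten custom objects to plain dicts before they enter graph state."""
    return {"profile": asdict(profile), "step": step}


class WebhookGate:
    """Drops duplicate webhooks and serializes runs per conversation thread."""

    def __init__(self):
        self._seen = set()                      # idempotency keys already accepted
        self._guard = threading.Lock()          # protects _seen and _thread_locks
        self._thread_locks = defaultdict(threading.Lock)

    def handle(self, thread_id, message_id, run_graph):
        """Call run_graph(thread_id) once per message_id; False means duplicate."""
        with self._guard:
            if message_id in self._seen:
                return False
            self._seen.add(message_id)
            lock = self._thread_locks[thread_id]
        # Only one run per thread at a time; other threads are not blocked.
        with lock:
            run_graph(thread_id)
        return True
```

One design choice to note: this sketch records the message ID before the run starts, so a run that crashes is not retried automatically. Whether to record the key before or after the run depends on how retries should behave.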
𝗧𝗵𝗲 𝗧𝗵𝗶𝗿𝗱 𝗥𝗲𝘄𝗿𝗶𝘁𝗲: 𝗧𝗵𝗲 𝗦𝘁𝗮𝗯𝗹𝗲 𝗣𝗮𝘁𝘁𝗲𝗿𝗻
The final version focused on three principles:
• Small graphs: I split one large graph into three smaller subgraphs. This reduces the blast radius of a bug.
• Explicit checkpoints: I stopped checkpointing after every node. I only checkpoint at meaningful resume points. This cut database writes by 60%.
• Idempotent nodes: This one is critical. Every node must produce the same result if it runs twice. If a node sees in the state that a task is already done, it should return immediately. This prevents paying twice for expensive model calls.
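The idempotent-node principle can be sketched as a plain function that checks state before doing expensive work. The names here (make_welcome_node, welcome_text, user_name, call_model) are assumptions for illustration. The node is written as an ordinary function from state to update, which is the shape LangGraph nodes take, but nothing below depends on LangGraph.

```python
from typing import Callable


def make_welcome_node(call_model: Callable[[str], str]) -> Callable[[dict], dict]:
    """Build a node that drafts the welcome text at most once per thread."""

    def welcome_node(state: dict) -> dict:
        cached = state.get("welcome_text")
        if cached is not None:
            # The work is already recorded in state: return the same
            # result without calling the model again.
            return {"welcome_text": cached}
        text = call_model(f"Write a welcome note for {state['user_name']}")
        return {"welcome_text": text}

    return welcome_node
```

If the graph resumes from a checkpoint taken after this node already ran, a second execution returns the stored text and the model is not called again.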
𝗟𝗲𝘀𝘀𝗼𝗻𝘀 𝗳𝗼𝗿 𝘆𝗼𝘂:
- Read the reducer semantics before you write code.
- Do not store custom objects in the state.
- Move concurrency control outside the graph.
- Make every node idempotent.
The framework did not fail. My assumptions were wrong.
Optional learning community: https://t.me/GyaanSetuAi