3 LangGraph Rewrites: What Production Checkpointing Taught Me

For three weeks, our onboarding agent lost every job it was supposed to save. The logs showed nothing. No errors. No warnings. The checkpointer reported success, but the state disappeared on resume.

I rewrote the same LangGraph pipeline three times before I understood the truth. The framework did exactly what I told it to do. I was just giving it the wrong instructions.

If you build stateful agents for production, do not assume defaults will save you. Here is what broke and how I fixed it.

The First Break: Silent State Loss

My agent handled a five-step onboarding flow. I used Postgres to save progress so users could resume later. But every resume started from step one.

The cause was my state schema. In LangGraph, every node returns a partial update that is merged into the state key by key. If a key has no reducer, the new value overwrites the old one.

I thought my message list would append. Instead, each node replaced the entire history with its own single message. The checkpointer then faithfully saved the wrong data.

The fix: Use Annotated fields with explicit reducers.

  • Use the add operator for message lists.
  • Use a custom merge function for dictionaries.
  • Use the default overwrite only for single values like "step".
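The three rules above can be sketched as a state schema. This is a minimal illustration, not the original pipeline: the field names (`messages`, `profile`, `step`) and the `merge_dicts` helper are assumptions, and `apply_update` is a hand-written stand-in that only mimics the per-key merge so the example runs without LangGraph installed. In a real graph, a schema like this would be passed to `StateGraph(OnboardingState)`.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


def merge_dicts(left: dict, right: dict) -> dict:
    """Hypothetical custom reducer: shallow-merge the update into the old dict."""
    return {**left, **right}


class OnboardingState(TypedDict):
    # Reducer = operator.add, so list updates are appended, not replaced.
    messages: Annotated[list, operator.add]
    # Reducer = merge_dicts, so existing keys survive a partial update.
    profile: Annotated[dict, merge_dicts]
    # No reducer: the newest value overwrites the old one.
    step: int


def apply_update(state: dict, update: dict) -> dict:
    """Illustration only: fold one node's update into the state, key by key."""
    hints = get_type_hints(OnboardingState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value  # default behaviour: overwrite
    return merged


state = {"messages": ["hi"], "profile": {"name": "Ada"}, "step": 1}
state = apply_update(
    state,
    {"messages": ["welcome"], "profile": {"plan": "pro"}, "step": 2},
)
print(state)
```

Without the `Annotated` reducers, the same update would leave `messages` holding only `["welcome"]` and `profile` holding only `{"plan": "pro"}`, which is the silent loss described above.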

The Second Break: Deserialization and Concurrency

The second rewrite faced two new issues:

  1. Corrupt rows: I stored custom objects in the state. The serializer could not handle them. This created rows that existed but could not be loaded on resume.
  2. Duplicate keys: WhatsApp retries a webhook if you do not respond fast enough. When two deliveries arrived at once, two graph runs tried to write to the same thread, and the writes collided in the database.
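The first failure mode can be caught before anything reaches the database. The sketch below is an assumption-heavy illustration: `UserProfile` and `to_plain` are hypothetical names, and a JSON round trip is used only as a conservative proxy for "safe to checkpoint". It is not the serializer the checkpointer itself uses.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class UserProfile:  # hypothetical custom object that used to live in state
    name: str
    plan: str


def to_plain(value):
    """Convert a dataclass instance to a dict and check it survives JSON."""
    if is_dataclass(value) and not isinstance(value, type):
        value = asdict(value)
    # json.dumps raises TypeError for anything that is not plain data.
    roundtrip = json.loads(json.dumps(value))
    if roundtrip != value:
        raise ValueError("value changed during the JSON round trip")
    return value


profile = to_plain(UserProfile(name="Ada", plan="pro"))
print(profile)
```

Running a check like this at the edge of each node turns an unusable row discovered at resume time into an exception raised at write time.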

The fixes:

  • Strip custom objects. Use plain dicts and standard LangChain types only.
  • Handle webhooks outside the graph. Use a queue and an idempotency key to drop duplicates.
  • Add a database lock. Ensure only one run happens per thread at a time.
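The last two fixes can be sketched together. This is a single-process stand-in under stated assumptions: `WebhookGate`, `handle`, and `run_graph` are hypothetical names, the in-memory set plays the role of a durable idempotency store, and `threading.Lock` plays the role of the database lock. A multi-worker deployment would need shared equivalents, for example a unique-keyed table and a Postgres advisory lock.

```python
import threading
from collections import defaultdict


class WebhookGate:
    """Drop duplicate deliveries and serialize runs per conversation thread."""

    def __init__(self):
        self._seen = set()                      # idempotency keys already accepted
        self._seen_guard = threading.Lock()
        self._thread_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, thread_id):
        # Guard the defaultdict so two callers never create two locks
        # for the same thread_id.
        with self._locks_guard:
            return self._thread_locks[thread_id]

    def handle(self, thread_id, message_id, run_graph):
        """Return True if the graph ran, False if the delivery was a duplicate."""
        with self._seen_guard:
            if message_id in self._seen:
                return False                    # retry of a known webhook
            self._seen.add(message_id)
        with self._lock_for(thread_id):         # one run per thread at a time
            run_graph(thread_id, message_id)
        return True


calls = []
gate = WebhookGate()
first = gate.handle("thread-1", "msg-1", lambda t, m: calls.append((t, m)))
retry = gate.handle("thread-1", "msg-1", lambda t, m: calls.append((t, m)))
print(first, retry, calls)
```

One design choice to note: the key is recorded before the run starts, so a run that crashes will not be retried by a later delivery. Recording the key only after success trades that for the risk of running twice, which is where idempotent nodes come in.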

The Third Rewrite: The Stable Pattern

The final version focused on three principles:

  • Small graphs: I broke one massive graph into three smaller subgraphs. This reduced the blast radius of bugs.
  • Explicit checkpoints: I stopped checkpointing after every single node. I only checkpoint at meaningful resume points. This cut database writes by 60%.
  • Idempotent nodes: This is vital. Every node must produce the same result if it runs twice. If a node sees that a task is already done in the state, it should return immediately. This prevents double-charging for expensive model calls.
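The idempotent-node principle can be sketched as a plain function. Everything here is illustrative: the `completed` key, the `draft_welcome_message` node, and the `expensive_model_call` stub are assumptions, and the sketch assumes that an empty dict is treated as "no changes" and that `completed` uses a dict-merging reducer.

```python
model_calls = 0


def expensive_model_call(user_name: str) -> str:
    """Stub for a paid model call; counts invocations for the demo."""
    global model_calls
    model_calls += 1
    return f"Welcome aboard, {user_name}!"


def draft_welcome_message(state: dict) -> dict:
    """Hypothetical node: do the work once, then become a no-op on re-runs."""
    if "welcome_message" in state.get("completed", {}):
        return {}  # already done: return an empty update immediately
    text = expensive_model_call(state["user_name"])
    # Record the result under a task key so a resumed run can see it.
    return {"completed": {"welcome_message": text}}


state = {"user_name": "Ada", "completed": {}}

# First run does the work; the update is merged into state by hand here.
update = draft_welcome_message(state)
state["completed"].update(update.get("completed", {}))

# Second run (for example after a resume) sees the task is done.
second_update = draft_welcome_message(state)
print(model_calls, second_update)
```

The check reads from state rather than from a local cache, so it still holds after a process restart as long as the checkpoint was written.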

Lessons for you:

  • Read reducer semantics before you write code.
  • Do not store custom objects in state.
  • Move concurrency control outside the graph.
  • Make every node idempotent.

The framework did not fail. My assumptions did.

Source: https://dev.to/elenarevicheva/three-langgraph-rewrites-what-production-checkpointing-actually-taught-me-ok9
