AI Autopilot Freeze: The Danger of Relying on Timestamps
Our AI agents work 24/7.
They turn requirements into tasks. They write code. They open pull requests. An AI reviewer approves the work. Then, the system merges the code automatically.
One morning, we saw zero completed tasks for three days.
The system looked healthy. Every component reported green status. Plans were generating. Agents were working. The reviewer was approving. The only thing missing was the merge.
The cause was a freshness check we built ourselves.
We wanted to prevent a specific error: merging code that a reviewer had approved and then later asked to change. To prevent it, we told the system: "Only trust reviews that happened after we started waiting."
This logic failed during system recoveries.
If the system restarted or moved a task to a new server, it reset the "start waiting" timestamp to the current time.
If a reviewer approved the code at 10:00 AM, but the system restarted at 10:30 AM, the system would ignore the 10:00 AM approval. It thought the approval was too old.
The approval was still there on GitHub. But our code could not see it. The task stalled permanently, waiting for a new approval that was never going to come.
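The failure is easy to reproduce in a few lines. This is a minimal sketch, not our production code: the `Review` type, the `approved_since` function, and the calendar date are hypothetical and chosen only for illustration. Only the 10:00 approval and the 10:30 restart come from the story above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Review:
    state: str              # "APPROVED" or "CHANGES_REQUESTED" (only two states modeled)
    submitted_at: datetime


def approved_since(reviews: list[Review], wait_started_at: datetime) -> bool:
    """Event-style check: only reviews submitted after the wait began count."""
    fresh = [r for r in reviews if r.submitted_at > wait_started_at]
    if not fresh:
        return False
    newest = max(fresh, key=lambda r: r.submitted_at)
    return newest.state == "APPROVED"


# Arbitrary illustrative date; the reviewer approves at 10:00.
reviews = [Review("APPROVED", datetime(2024, 1, 1, 10, 0))]

# Normal run: the wait began at 09:45, so the 10:00 approval counts.
before_restart = approved_since(reviews, datetime(2024, 1, 1, 9, 45))

# Recovery at 10:30 resets the wait start; the same approval is now filtered out.
after_restart = approved_since(reviews, datetime(2024, 1, 1, 10, 30))
```

Nothing in the second call raises an error or logs a warning. The function simply returns `False` every time it is polled, which is why the system kept looking healthy.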
I learned three hard lessons from this:
Treat external state as a snapshot, not an event. Do not ask "did something new happen?" Instead, ask "what is the current state right now?" If you look at the current state, restarts do not matter.
Do not reset timestamps during recovery. If you reset a "started at" time during a retry, you erase all facts that happened before that moment. Only reset timestamps when the actual work changes.
Monitor where tasks park, not just whether they finish. A "done" count tells you that you have a problem. A distribution of tasks by step tells you where the problem is. We found the issue because we saw 28 tasks stuck at the "wait for review" step.
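The three lessons can be sketched together as follows. Treat this as an illustration under assumptions rather than the fix we shipped: every name here (`Review`, `is_approved_now`, `wait_started_at`, `tasks_by_step`, the reviewer and task IDs, the step names) is hypothetical, and only two review states are modeled.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Review:
    reviewer: str
    state: str              # "APPROVED" or "CHANGES_REQUESTED"
    submitted_at: datetime


def is_approved_now(reviews: list[Review]) -> bool:
    """Lesson 1: snapshot check. Each reviewer's latest review decides;
    no "started waiting" timestamp is involved, so restarts cannot hide it."""
    latest: dict[str, str] = {}
    for r in sorted(reviews, key=lambda r: r.submitted_at):
        latest[r.reviewer] = r.state   # later reviews overwrite earlier ones
    return bool(latest) and all(s == "APPROVED" for s in latest.values())


def wait_started_at(existing: Optional[datetime], now: datetime) -> datetime:
    """Lesson 2: on recovery, keep the original timestamp if one exists."""
    return existing if existing is not None else now


def tasks_by_step(task_steps: dict[str, str]) -> Counter:
    """Lesson 3: count tasks per current step to see where they pile up."""
    return Counter(task_steps.values())


# An approval at 10:00 still counts after any number of restarts.
approval = Review("ai-reviewer", "APPROVED", datetime(2024, 1, 1, 10, 0))
approved = is_approved_now([approval])

# The original worry is still covered: a later change request blocks the merge.
change_request = Review("ai-reviewer", "CHANGES_REQUESTED", datetime(2024, 1, 1, 11, 0))
blocked = not is_approved_now([approval, change_request])

# A retry at 10:30 keeps the 09:45 start time instead of resetting it.
kept = wait_started_at(datetime(2024, 1, 1, 9, 45), datetime(2024, 1, 1, 10, 30))

# Hypothetical task-to-step mapping; the most common step is where work is parked.
pileup = tasks_by_step({"t1": "wait_for_review", "t2": "wait_for_review", "t3": "coding"})
```

The snapshot check still prevents the original error, because a later change request replaces the earlier approval for that reviewer. It just does so without depending on when the waiting began.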
A live process following a wrong rule will never fail a health check. It looks healthy because it is working exactly as programmed.
If you build automated systems, do not just watch for errors. Watch for where your processes pile up.
