𝗔 𝗦𝘂𝗰𝗰𝗲𝘀𝘀𝗳𝘂𝗹 𝗣𝗮𝘆𝗺𝗲𝗻𝘁 𝗧𝗵𝗮𝘁 𝗡𝗲𝘃𝗲𝗿 𝗕𝗲𝗰𝗮𝗺𝗲 𝗮 𝗕𝗼𝗼𝗸𝗶𝗻𝗴
A customer paid. Razorpay showed success. Our webhook endpoint returned HTTP 200. The payment was captured.
Yet the booking stayed stuck on "confirming."
No errors appeared. No exceptions broke the code. No alerts went off. Every metric showed a healthy system.
But the customer had nothing. The creator had no booking.
Accepting money is easy. Ensuring every payment leads to a booking is the real challenge.
Most tutorials suggest this flow:
- Webhook receives event
- Webhook updates booking
This is dangerous. If the business logic lives inside the webhook, you depend entirely on delivery success. Webhooks face retries, duplicates, and partial failures.
We changed our architecture to separate these tasks. Webhooks now only record events. They do not perform business logic.
We introduced three tables, with an event ledger at the center:
- payment_orders: The provider truth
- payment_events: The immutable event ledger
- bookings: The business truth
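A minimal sketch of what these three tables might look like, using SQLite so it runs anywhere. Only the table names come from our system; every column name and status value here is an assumption for illustration. The processed_at column is bookkeeping for the processor, while the stored payload itself is never modified.

```python
import sqlite3

# Hypothetical columns; only the three table names come from the text above.
SCHEMA = """
CREATE TABLE payment_orders (
    order_id   TEXT PRIMARY KEY,   -- id issued by the payment provider
    booking_id TEXT NOT NULL,      -- which booking this order pays for
    status     TEXT NOT NULL       -- what the provider last told us
);
CREATE TABLE payment_events (
    event_id     TEXT PRIMARY KEY, -- unique per delivery, so duplicates are rejected
    order_id     TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    payload      TEXT NOT NULL,    -- raw body, stored as received
    received_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TEXT              -- NULL until the processor has handled it
);
CREATE TABLE bookings (
    booking_id TEXT PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'confirming'
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables on an empty database."""
    conn.executescript(SCHEMA)
```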
The webhook now has one job:
- Verify signature
- Store event
- Return 200
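The three steps above can be sketched roughly like this. It assumes the provider signs the raw request body with HMAC-SHA256 and sends a unique event id with each delivery, which is how Razorpay's documentation describes its webhooks. The function names, the table columns, and the use of SQLite are assumptions made for the example, not our production code.

```python
import hashlib
import hmac
import json
import sqlite3


def init_ledger(conn: sqlite3.Connection) -> None:
    # Hypothetical ledger table; event_id as primary key rejects duplicates.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payment_events ("
        "event_id TEXT PRIMARY KEY, event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, processed_at TEXT)"
    )


def verify_signature(raw_body: bytes, signature: str, secret: str) -> bool:
    # Recompute the HMAC over the exact bytes received, then compare in constant time.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def handle_webhook(conn, raw_body: bytes, signature: str, event_id: str, secret: str) -> int:
    """Verify, store, acknowledge. Returns the HTTP status code to send back."""
    if not verify_signature(raw_body, signature, secret):
        return 400
    event = json.loads(raw_body)
    # INSERT OR IGNORE makes a redelivered event a no-op instead of an error.
    conn.execute(
        "INSERT OR IGNORE INTO payment_events (event_id, event_type, payload) "
        "VALUES (?, ?, ?)",
        (event_id, event.get("event", "unknown"), raw_body.decode()),
    )
    conn.commit()  # acknowledge only after the event is durably stored
    return 200
```

Note what is missing: the handler never touches the bookings table. It commits the event before answering, so a 200 means "recorded", not "booked".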
This protects the system. Once the event is stored, it is safe. If processing fails later, the event is still in the ledger and can be retried.
We also learned that payment state and booking state are different. A captured payment is an input. A confirmed booking is the result. Keeping them separate allows for reconciliation.
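Because the two states live in separate tables, a reconciliation check can be a single query: find payments that were captured but whose booking never reached confirmed. This is a sketch only. The column names and the 'captured' and 'confirmed' status values are assumptions.

```python
import sqlite3

# Hypothetical columns and status values, chosen for illustration.
STUCK_BOOKINGS_SQL = """
SELECT b.booking_id, o.order_id
FROM payment_orders AS o
JOIN bookings AS b ON b.booking_id = o.booking_id
WHERE o.status = 'captured'
  AND b.status <> 'confirmed'
ORDER BY b.booking_id
"""


def find_stuck_bookings(conn: sqlite3.Connection) -> list:
    """Return (booking_id, order_id) pairs where money arrived but no booking was confirmed."""
    return conn.execute(STUCK_BOOKINGS_SQL).fetchall()
```

Any row this returns is exactly the situation described at the top: a successful payment that never became a booking.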
When we investigated the stuck booking, we found the bug. The events existed in the database. The webhook was healthy. The processor code worked whenever it was invoked.
But it was never invoked. Nothing triggered the function that processes pending events.
Decoupling ingestion from processing is good design. But it creates a new requirement: something must trigger the processing.
We implemented a scheduler to run several jobs:
- Process payment events
- Recover missed webhooks
- Validate system consistency
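A minimal sketch of such a scheduler, assuming each job is a plain function that takes no arguments. In production this role is usually filled by cron or a platform scheduler. The function name and parameters here are hypothetical.

```python
import logging
import time
from typing import Callable, Optional, Sequence


def run_scheduler(
    jobs: Sequence[Callable[[], None]],
    interval_seconds: float,
    max_ticks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run every job once per tick. Returns the number of ticks completed.

    max_ticks=None runs forever; a finite value makes the loop testable.
    """
    tick = 0
    while max_ticks is None or tick < max_ticks:
        for job in jobs:
            try:
                job()
            except Exception:
                # One failing job must not stop the others or the next tick.
                logging.exception("job %s failed", getattr(job, "__name__", job))
        tick += 1
        sleep(interval_seconds)
    return tick
```

The jobs listed above would be passed in as the jobs argument. The important property is that a crash in one job does not silence the rest.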
To prevent errors during retries, we use this logic:
- Select unprocessed events
- Use "FOR UPDATE SKIP LOCKED" so multiple workers can claim events without blocking each other
- Make processing idempotent so duplicate deliveries change nothing
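Here is a rough sketch of that logic. The claim query with SKIP LOCKED is PostgreSQL syntax and is shown only as a constant for reference. The runnable part uses SQLite with a single worker, because SQLite has no row-level locks. Table columns, status values, and the shortcut of storing booking_id directly on the event row are all assumptions for the example.

```python
import sqlite3

# PostgreSQL version of the claim query (not executed here). Rows already
# locked by another worker are skipped instead of waited on.
CLAIM_SQL_POSTGRES = """
SELECT event_id, booking_id FROM payment_events
WHERE processed_at IS NULL
ORDER BY received_at
LIMIT 10
FOR UPDATE SKIP LOCKED
"""

# Single-worker SQLite stand-in for the demo.
CLAIM_SQL_SQLITE = """
SELECT event_id, booking_id FROM payment_events
WHERE processed_at IS NULL
ORDER BY event_id
LIMIT 10
"""


def process_pending_events(conn: sqlite3.Connection) -> int:
    """Process unprocessed events. Returns how many bookings were newly confirmed."""
    confirmed = 0
    with conn:  # one transaction: commit on success, roll back on error
        for event_id, booking_id in conn.execute(CLAIM_SQL_SQLITE).fetchall():
            # The status guard makes this idempotent: a duplicate event
            # for an already confirmed booking updates zero rows.
            cur = conn.execute(
                "UPDATE bookings SET status = 'confirmed' "
                "WHERE booking_id = ? AND status <> 'confirmed'",
                (booking_id,),
            )
            confirmed += cur.rowcount
            conn.execute(
                "UPDATE payment_events SET processed_at = CURRENT_TIMESTAMP "
                "WHERE event_id = ?",
                (event_id,),
            )
    return confirmed
```

Running the function twice, or feeding it two events for the same booking, confirms the booking only once. That is what lets the scheduler and webhook retries overlap safely.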
A system that only works when every webhook arrives on time is a fragile system. If your queue has no one to drain it, work waits forever.
Reliability means building for when things fail.