𝗕𝗨𝗜𝗟𝗗𝗜𝗡𝗚 𝗔 𝗖𝗛𝗔𝗢𝗦 𝗧𝗘𝗦𝗧𝗜𝗡𝗚 𝗧𝗢𝗢𝗟 𝗙𝗢𝗥 𝗩𝗜𝗗𝗘𝗢 𝗔𝗣𝗜𝗦

📅 5 days ago · ⏱ 1 min read

A 200 OK response broke our discovery service.

The server cut off the body at 8 KB. Our system parsed the partial data. It wrote empty rows to the database. Users in three regions saw empty results for 40 minutes. No alerts fired. All health checks stayed green.

We assumed failures were honest. We expected 500 errors or connection failures. The server lied to us.

We built a testing rig to simulate these dishonest failures. It injects faults into the read paths.

The rig tests these failures:

Slow responses.
Truncated bodies.
Wrong content lengths.
Garbage data.
Clock skew.
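
A minimal sketch of how such faults could be modeled, not the rig itself. `FakeResponse` and every function name below are hypothetical. Each injector takes a healthy response and returns a dishonest copy, so faults can be stacked.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeResponse:
    """Hypothetical stand-in for an upstream HTTP response."""
    status: int
    headers: dict
    body: bytes
    delay_s: float = 0.0         # how long a rig would stall before replying
    clock_offset_s: float = 0.0  # skew a rig would apply to upstream timestamps


def healthy(payload: bytes) -> FakeResponse:
    """An honest 200 whose Content-Length matches the body."""
    return FakeResponse(200, {"Content-Length": str(len(payload))}, payload)


def truncate(resp: FakeResponse, limit: int) -> FakeResponse:
    """Cut the body but leave status and Content-Length alone (the lying 200)."""
    return replace(resp, body=resp.body[:limit])


def wrong_length(resp: FakeResponse, delta: int) -> FakeResponse:
    """Advertise a Content-Length that disagrees with the real body size."""
    headers = dict(resp.headers)  # copy so the original response is untouched
    headers["Content-Length"] = str(len(resp.body) + delta)
    return replace(resp, headers=headers)


def garbage(resp: FakeResponse, seed: int = 0) -> FakeResponse:
    """Replace the body with seeded random bytes of the same length."""
    rng = random.Random(seed)
    return replace(resp, body=bytes(rng.randrange(256) for _ in resp.body))


def slow(resp: FakeResponse, delay_s: float) -> FakeResponse:
    """Mark the response as delayed by delay_s seconds."""
    return replace(resp, delay_s=delay_s)


def skew(resp: FakeResponse, offset_s: float) -> FakeResponse:
    """Mark the upstream clock as offset by offset_s seconds."""
    return replace(resp, clock_offset_s=offset_s)
```

For example, `truncate(healthy(payload), 8192)` reproduces the shape of the incident above: status 200, a full-size Content-Length, and only 8 KB of body.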

We focus on four goals:

Containment: One bad region must not poison others.
Honesty: Serve stale data instead of partial data.
Latency: Slow responses must trip a deadline.
Recovery: The system must heal without manual steps.
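
The honesty and containment goals can be sketched as a verify-then-commit step. This is an illustration under assumptions: the payload is JSON, the cache is a plain dict keyed by region, and `parse_verified`, `refresh`, and `PartialResponse` are hypothetical names.

```python
import json


class PartialResponse(Exception):
    """Raised when a response cannot be trusted, whatever its status code."""


def parse_verified(status, headers, body):
    """Parse the body only if the response is internally consistent."""
    if status != 200:
        raise PartialResponse(f"status {status}")
    declared = headers.get("Content-Length")
    # Compare as strings so a malformed header counts as a mismatch.
    if declared is not None and declared != str(len(body)):
        raise PartialResponse(f"declared {declared} bytes, received {len(body)}")
    try:
        return json.loads(body)
    except ValueError as exc:  # covers JSON and byte-decoding errors
        raise PartialResponse(f"unparseable body: {exc}") from exc


def refresh(cache, region, status, headers, body):
    """Overwrite a region's entry only with verified data, else keep stale."""
    try:
        cache[region] = parse_verified(status, headers, body)
    except PartialResponse:
        pass  # keep the last good value; a real system would also emit a metric
    return cache.get(region)
```

Because the write happens only after verification and touches a single region key, a truncated 200 leaves the last good value in place and cannot overwrite another region's data.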

This process revealed hidden bugs. Our retry logic failed during alternating errors. Future-dated timestamps broke our cache. Slow trickles starved our workers.
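
Two of these bugs have small defensive counterparts, sketched here under assumptions. The post does not say how the cache broke; one plausible failure is that an entry stamped in the future never looks old enough to expire. `read_with_deadline`, `is_fresh`, and the `max_skew_s` tolerance are hypothetical.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a whole-body read runs past its time budget."""


def read_with_deadline(chunks, budget_s, clock=time.monotonic):
    """Drain an iterable of byte chunks with a budget on the whole body.

    A per-read timeout never fires on a slow trickle, because each small
    chunk arrives in time. This check bounds the total instead. It runs
    only when a chunk arrives, so a per-read timeout is still needed for
    connections that stall completely.
    """
    deadline = clock() + budget_s
    parts = []
    for chunk in chunks:
        if clock() > deadline:
            raise DeadlineExceeded(f"body not complete within {budget_s}s")
        parts.append(chunk)
    return b"".join(parts)


def is_fresh(stored_at, now, ttl_s, max_skew_s=5.0):
    """Treat entries stamped too far in the future as stale, not immortal."""
    age = now - stored_at
    if age < -max_skew_s:
        return False  # timestamp is ahead of our clock beyond tolerance
    return age <= ttl_s
```

Passing `clock` as a parameter lets a test drive the deadline with a fake clock instead of sleeping.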

Do not trust your upstreams. Build a tool that lies to your code. Find the bugs before they wake you up at 2 a.m.

Source: https://dev.to/ahmet_gedik778845/building-a-chaos-testing-harness-for-multi-region-video-api-endpoints-1oh3
