๐๐จ๐๐๐๐๐ก๐ ๐ ๐๐๐๐ข๐ฆ ๐ง๐๐ฆ๐ง๐๐ก๐ ๐ง๐ข๐ข๐ ๐๐ข๐ฅ ๐ฉ๐๐๐๐ข ๐๐ฃ๐๐ฆ
A 200 OK response broke our discovery service.
The server cut off the body at 8 KB. Our system parsed the partial data. It wrote empty rows to the database. Users in three regions saw empty results for 40 minutes. No alerts fired. All health checks stayed green.
We assumed failures were honest. We expected 500 errors or connection failures. The server lied to us.
We built a testing rig to simulate these dishonest failures. It injects faults into the read paths.
The rig tests these failures:
- Slow responses.
- Truncated bodies.
- Wrong content lengths.
- Garbage data.
- Clock skew.
We focus on four goals:
- Containment: One bad region must not poison others.
- Honesty: Serve stale data instead of partial data.
- Latency: Slow responses must trip a deadline.
- Recovery: The system must heal without manual steps.
This process revealed hidden bugs. Our retry logic failed during alternating errors. Future dated timestamps broke our cache. Slow trickles starved our workers.
Do not trust your upstreams. Build a tool to lie to your code. Find bugs before they wake you up at 2am.