Building a Multi-Region Health-Check Aggregator

A user in São Paulo hits a dead edge node. They do not file a bug report. They close the tab and watch something else.

A normal uptime monitor misses this. Most monitors probe from a single location. From that one spot, everything looks green.

Our status page used to say 100% uptime while real users saw timeouts. One global health check was lying to us.

Here is how we built a system that tells the truth.

The Problem: Sampling Bias

If your monitor lives in one data center, it only sees one reality. You might report green even if your Singapore and São Paulo edges are dropping connections.

Video traffic makes this worse. Common regional failures include:

A single "200 OK" response tells you almost nothing.

Our Three Rules for Health

We moved beyond status codes. We define health using three metrics:

The Solution: Multi-Region Probing

We stopped using one big monitor. Instead, we deploy tiny Go binaries to cheap regional VPS instances.

Each prober:

We use SQLite for storage. It is simple and handles our workload without a separate database server to run. We store raw samples instead of pre-aggregated data. This allows us to re-score history or debug specific failures later.

The Secret: Quorum

Networks are noisy. One dropped packet is not an outage.

We use a quorum system to prevent false alarms. We only declare an edge "down" when multiple regions agree. If one region sees a failure but others do not, we do not page the team. This design choice removed 90% of our false alerts.
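The quorum rule can be sketched in a few lines of Go. This is an illustration under assumptions, not the original implementation: the `Report` type, the `EdgeDown` function, the region names, and the threshold of two regions are hypothetical.

```go
package main

import "fmt"

// Report is one region's latest verdict about a single edge.
// The type and its fields are hypothetical.
type Report struct {
	Region  string
	Healthy bool
}

// EdgeDown reports whether at least minFailing distinct regions
// currently see the edge as unhealthy. Repeated reports from the
// same region count once, so one noisy prober cannot form a quorum.
func EdgeDown(reports []Report, minFailing int) bool {
	if minFailing < 1 {
		minFailing = 1 // never declare an outage with zero evidence
	}
	failing := make(map[string]bool)
	for _, r := range reports {
		if !r.Healthy {
			failing[r.Region] = true
		}
	}
	return len(failing) >= minFailing
}

func main() {
	reports := []Report{
		{Region: "fra", Healthy: true},
		{Region: "sin", Healthy: false},
		{Region: "gru", Healthy: true},
	}
	// Only one region disagrees, so nobody gets paged.
	fmt.Println(EdgeDown(reports, 2)) // false
}
```

Counting distinct regions rather than raw failures is the design choice that matters here: a single prober on a flaky link can produce many failed samples, but it still counts as one vote.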

Key Lessons:

You do not need a heavy observability platform. You need local probes, raw data, and a rule that refuses to panic over noise.

Source: https://dev.to/ahmet_gedik778845/building-a-multi-region-health-check-aggregator-for-video-cdn-edges-2865