𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗠𝘂𝗹𝘁𝗶 𝗥𝗲𝗴𝗶𝗼𝗻 𝗛𝗲𝗮𝗹𝘁𝗵 𝗖𝗵𝗲𝗰𝗸 𝗔𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗼𝗿

📅3 hours ago⏱2 min read

A user in São Paulo hits a dead edge node. They do not file a bug report. They close the tab and watch something else.

A normal uptime monitor misses this. Most monitors probe from a single location. From that one spot, everything looks green.

Our status page used to say 100% uptime while real users saw timeouts. One global health check was lying to us.

Here is how we built a system that tells the truth.

The Problem: Sampling Bias

If your monitor lives in one data center, it only sees one reality. You might report green even if your Singapore and São Paulo edges are dropping connections.

Video traffic makes this worse. Common regional failures include:

Bad BGP routes affecting one continent.
Cache evictions forcing slow origin fallbacks.
Disk errors causing TLS handshake timeouts.
DNS issues at specific local resolvers.

A single "200 OK" response tells you almost nothing.

Our Three Rules for Health

We moved beyond status codes. We define health using three metrics:

Reachability: TCP and TLS handshakes must finish within 800ms.
Latency: We track p95 Time-to-First-Byte (TTFB). Averages hide the slow tail that annoys users.
Correctness: The response body must contain an expected marker. A 200 OK that returns an error page is a failure.
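The three rules above can be sketched as small Go functions. This is a minimal illustration, not the original implementation: the `Sample` type, its field names, and the nearest-rank percentile method are assumptions made for the example. Only the 800ms budget and the idea of a body marker come from the rules themselves.

```go
package main

import (
	"sort"
	"strings"
	"time"
)

// Sample is one probe result. The type and its field names are illustrative.
type Sample struct {
	Handshake time.Duration // TCP connect plus TLS handshake time
	TTFB      time.Duration // time to first response byte
	Status    int           // HTTP status code
	Body      string        // response body (or a prefix of it)
}

// handshakeBudget is the 800ms reachability limit from the rules above.
const handshakeBudget = 800 * time.Millisecond

// Reachable reports whether the handshakes completed within the budget.
// A zero duration is treated as "never completed".
func Reachable(s Sample) bool {
	return s.Handshake > 0 && s.Handshake <= handshakeBudget
}

// Correct reports whether the response is a 200 whose body contains marker.
// A 200 that serves an error page fails this check.
func Correct(s Sample, marker string) bool {
	return s.Status == 200 && strings.Contains(s.Body, marker)
}

// P95TTFB returns the 95th percentile TTFB using the nearest-rank method
// (an assumed choice). It returns 0 for an empty slice.
func P95TTFB(samples []Sample) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	ttfbs := make([]time.Duration, len(samples))
	for i, s := range samples {
		ttfbs[i] = s.TTFB
	}
	sort.Slice(ttfbs, func(i, j int) bool { return ttfbs[i] < ttfbs[j] })
	rank := (95*len(ttfbs) + 99) / 100 // integer ceil(0.95 * n)
	return ttfbs[rank-1]
}
```

With 20 samples, nearest-rank picks the 19th smallest TTFB, so a single slow outlier does not set the number but a consistently slow tail does.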

The Solution: Multi-Region Probing

We stopped using one big monitor. Instead, we deploy tiny Go binaries to cheap regional VPS instances.

Each prober:

Checks the edges from a local vantage point.
Uses httptrace to get real TTFB data.
Posts results to a central aggregator.
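Here is a rough sketch of what the measuring part of a prober might look like with `net/http/httptrace`. The `ProbeResult` type, the `Probe` function, its parameters, and the 64 KiB body cap are all assumptions for illustration. The original prober's code is not shown here, and posting the result to the aggregator is left out.

```go
package main

import (
	"context"
	"crypto/tls"
	"io"
	"net/http"
	"net/http/httptrace"
	"strings"
	"sync"
	"time"
)

// ProbeResult is an illustrative result type, not the original schema.
type ProbeResult struct {
	ConnReady   time.Duration // request start until a connection is usable (DNS + TCP + TLS)
	TLS         time.Duration // TLS handshake alone; zero for plain HTTP
	TTFB        time.Duration // request start until the first response byte
	Status      int
	MarkerFound bool
	Err         string
}

// Probe issues one GET and records timings through httptrace hooks.
func Probe(url, marker string, timeout time.Duration) ProbeResult {
	var (
		mu       sync.Mutex // trace hooks may run on other goroutines
		res      ProbeResult
		tlsStart time.Time
	)
	set := func(f func()) { mu.Lock(); defer mu.Unlock(); f() }
	fail := func(err error) ProbeResult {
		set(func() { res.Err = err.Error() })
		mu.Lock()
		defer mu.Unlock()
		return res
	}

	start := time.Now()
	trace := &httptrace.ClientTrace{
		GotConn:           func(httptrace.GotConnInfo) { set(func() { res.ConnReady = time.Since(start) }) },
		TLSHandshakeStart: func() { set(func() { tlsStart = time.Now() }) },
		TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
			if err == nil {
				set(func() { res.TLS = time.Since(tlsStart) })
			}
		},
		GotFirstResponseByte: func() { set(func() { res.TTFB = time.Since(start) }) },
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		return fail(err)
	}

	// A fresh transport without keep-alives, so every probe pays the full
	// connect and handshake cost instead of reusing a warm connection.
	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}
	resp, err := client.Do(req)
	if err != nil {
		return fail(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(io.LimitReader(resp.Body, 64<<10)) // cap is an assumption
	if err != nil {
		return fail(err)
	}
	mu.Lock()
	defer mu.Unlock()
	res.Status = resp.StatusCode
	res.MarkerFound = strings.Contains(string(body), marker)
	return res
}
```

A call such as `Probe("https://edge.example.com/health", "edge-ok", 5*time.Second)` (hypothetical URL and marker) returns one raw sample. Note that `ConnReady` includes DNS lookup time, so it is a slightly stricter measure than the handshakes alone.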

We use SQLite for storage. It is simple, runs in-process with no separate database server to operate, and handles our workload comfortably. We store raw samples instead of pre-aggregated data. This allows us to re-score history or debug specific failures later.
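To make the raw-sample idea concrete, here is one possible shape for the storage layer. The table name, column names, and the `RawSample` type are invented for this sketch. It uses only `database/sql`, so the caller is assumed to open the `*sql.DB` with whichever SQLite driver they prefer.

```go
package main

import (
	"context"
	"database/sql"
)

// schemaSQL is a hypothetical table layout: one row per probe, nothing aggregated.
const schemaSQL = `
CREATE TABLE IF NOT EXISTS samples (
	id           INTEGER PRIMARY KEY,
	probed_at    INTEGER NOT NULL, -- Unix seconds
	region       TEXT    NOT NULL,
	edge         TEXT    NOT NULL,
	handshake_ms INTEGER NOT NULL,
	ttfb_ms      INTEGER NOT NULL,
	status       INTEGER NOT NULL,
	marker_found INTEGER NOT NULL  -- 0 or 1
);`

// RawSample mirrors one row of the hypothetical samples table.
type RawSample struct {
	ProbedAt    int64
	Region      string
	Edge        string
	HandshakeMs int64
	TTFBMs      int64
	Status      int
	MarkerFound bool
}

// InitSchema creates the table if it does not exist yet.
func InitSchema(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, schemaSQL)
	return err
}

// InsertSample stores one raw probe result.
func InsertSample(ctx context.Context, db *sql.DB, s RawSample) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO samples
		 (probed_at, region, edge, handshake_ms, ttfb_ms, status, marker_found)
		 VALUES (?, ?, ?, ?, ?, ?, ?)`,
		s.ProbedAt, s.Region, s.Edge, s.HandshakeMs, s.TTFBMs, s.Status, s.MarkerFound)
	return err
}

// Rescore counts how many stored samples would fail under a given handshake
// budget. Because the rows are raw, the budget can change after the fact.
func Rescore(samples []RawSample, handshakeBudgetMs int64) (failed int) {
	for _, s := range samples {
		if s.HandshakeMs > handshakeBudgetMs || s.Status != 200 || !s.MarkerFound {
			failed++
		}
	}
	return failed
}
```

The point of `Rescore` is the design choice, not the code: had only a pass/fail flag been stored, tightening the budget from 800ms to 500ms would mean waiting for new data instead of replaying old samples.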

The Secret: Quorum

Networks are noisy. One dropped packet is not an outage.

We use a quorum system to prevent false alarms. We only declare an edge "down" when multiple regions agree. If one region sees a failure but others do not, we do not page the team. This design choice removed 90% of our false alerts.
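A quorum rule like this fits in a few lines. The exact threshold is not stated above, so this sketch assumes one: an edge is "down" only when at least `minFailing` regions report failure and those regions are a strict majority of the regions reporting. The `Verdict` type and `Quorum` function are names made up for the example.

```go
package main

// Verdict summarizes one quorum decision for a single edge.
type Verdict struct {
	Failing   int  // regions whose latest probe failed
	Reporting int  // regions that reported at all
	Down      bool // true only when the quorum rule is met
}

// Quorum takes the latest health result per region (true means healthy)
// and applies the assumed rule: at least minFailing failing regions,
// and more than half of the reporting regions failing.
func Quorum(healthyByRegion map[string]bool, minFailing int) Verdict {
	v := Verdict{Reporting: len(healthyByRegion)}
	for _, healthy := range healthyByRegion {
		if !healthy {
			v.Failing++
		}
	}
	v.Down = v.Failing >= minFailing && v.Failing*2 > v.Reporting
	return v
}
```

With `minFailing` set to 2, a lone failing region out of three never pages anyone, while two out of three does. Two failing regions out of five stay quiet under this rule, which may or may not be what you want. The threshold is the knob to tune against your own false-alert history.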

Key Lessons:

Probe what users hit, not a synthetic path.
Track tail latency (p95), not averages.
Use disposable, cheap probers in many regions.
Use quorum to avoid pager fatigue.
Keep your storage stack simple.

You do not need a heavy observability platform. You need local probes, raw data, and a rule that refuses to panic over noise.

Source: https://dev.to/ahmet_gedik778845/building-a-multi-region-health-check-aggregator-for-video-cdn-edges-2865