Building Video Heatmaps With HyperLogLog In PostgresQ

You want to show editors exactly where viewers stop watching a video.

To do this, you need a heatmap. This heatmap shows unique viewers for every second of a video.

At TrendVidStream, we process 40 million events every day. We use a simple stack: PHP 8.4 and Postgres.

The problem is cardinality. Counting unique viewers per second is hard.

The naive way:

  • Use a raw events table.
  • Run COUNT(DISTINCT viewer_hash).
  • Group by video and time bucket.

This failed quickly. For 90 days of data, you need 3.6 billion rows. The queries took 9 seconds. Also, unique counts do not roll up. You cannot sum "uniques per hour" to get "uniques per day" because the same person appears in both.

The solution: HyperLogLog (HLL).

HLL creates a small "sketch" of a dataset. It is tiny, around 10 KB, regardless of how many millions of people watch.

Why HLL works for heatmaps:

  • Fixed size: It stays small.
  • Mergeable: You can union two sketches to get a global count. This solves the rollup problem.

Our Implementation:

We use the postgresql-hll extension. We use two tables.

  1. A staging table: An append-only table where PHP writes raw heartbeats. We do not use indexes here to keep writes fast.
  2. A sketch table: This stores the HLL data.

We use a cron job to move data from staging to the sketch table. We use this command:

INSERT INTO video_heatmap (...) SELECT ..., hll_add_agg(hll_hash_text(viewer_hash), 14, 5) FROM beacon_staging GROUP BY ... ON CONFLICT (...) DO UPDATE SET viewers = video_heatmap.viewers || EXCLUDED.viewers;

The "||" operator merges the new data into the existing sketch.

The Results:

  • Query speed: Our heatmap queries take 15 to 40 ms. The old way took 9 seconds.
  • Storage: Our 90-day dataset is 40 GB. The old way would have taken 1.2 TB.
  • Accuracy: We see about 0.6% to 1.4% error. For a heatmap, this is perfect.

Lessons learned:

  • Hash inside the database. Do not hash in your app.
  • Batch your updates. Do not update sketches one row at a time or you will create write hotspots.
  • Use a daily salt for hashes. This protects privacy and keeps your daily sketches clean.

If your dashboard is slow because of COUNT(DISTINCT), you do not need a massive data warehouse. You need HLL.

Source: https://dev.to/ahmet_gedik778845/building-per-second-video-heatmap-analytics-with-hyperloglog-in-postgres-2p22