Building Video Heatmaps With HyperLogLog In PostgresQ
You want to show editors exactly where viewers stop watching a video.
To do this, you need a heatmap. This heatmap shows unique viewers for every second of a video.
At TrendVidStream, we process 40 million events every day. We use a simple stack: PHP 8.4 and Postgres.
The problem is cardinality. Counting unique viewers per second is hard.
The naive way:
- Use a raw events table.
- Run COUNT(DISTINCT viewer_hash).
- Group by video and time bucket.
This failed quickly. For 90 days of data, you need 3.6 billion rows. The queries took 9 seconds. Also, unique counts do not roll up. You cannot sum "uniques per hour" to get "uniques per day" because the same person appears in both.
The solution: HyperLogLog (HLL).
HLL creates a small "sketch" of a dataset. It is tiny, around 10 KB, regardless of how many millions of people watch.
Why HLL works for heatmaps:
- Fixed size: It stays small.
- Mergeable: You can union two sketches to get a global count. This solves the rollup problem.
Our Implementation:
We use the postgresql-hll extension. We use two tables.
- A staging table: An append-only table where PHP writes raw heartbeats. We do not use indexes here to keep writes fast.
- A sketch table: This stores the HLL data.
We use a cron job to move data from staging to the sketch table. We use this command:
INSERT INTO video_heatmap (...) SELECT ..., hll_add_agg(hll_hash_text(viewer_hash), 14, 5) FROM beacon_staging GROUP BY ... ON CONFLICT (...) DO UPDATE SET viewers = video_heatmap.viewers || EXCLUDED.viewers;
The "||" operator merges the new data into the existing sketch.
The Results:
- Query speed: Our heatmap queries take 15 to 40 ms. The old way took 9 seconds.
- Storage: Our 90-day dataset is 40 GB. The old way would have taken 1.2 TB.
- Accuracy: We see about 0.6% to 1.4% error. For a heatmap, this is perfect.
Lessons learned:
- Hash inside the database. Do not hash in your app.
- Batch your updates. Do not update sketches one row at a time or you will create write hotspots.
- Use a daily salt for hashes. This protects privacy and keeps your daily sketches clean.
If your dashboard is slow because of COUNT(DISTINCT), you do not need a massive data warehouse. You need HLL.
