๐—•๐˜‚๐—ถ๐—น๐—ฑ๐—ถ๐—ป๐—ด ๐—ฎ ๐—ฉ๐—ถ๐—ฑ๐—ฒ๐—ผ ๐—จ๐—ฅ๐—Ÿ ๐—–๐—ฎ๐—ป๐—ผ๐—ป๐—ถ๐—ฐ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ

One YouTube video can have a dozen different URLs.

You might see a mobile link, a short link, an embed link, or a link with tracking parameters. If your system treats these as different videos, your discovery page will show duplicates. Your search rankings will suffer. Your database will grow too fast.

I run DailyWatch, a video discovery platform. I learned that URL canonicalization is not just about cleaning strings. It is about finding the true identity of a video.

Here is the pipeline we use to solve this.

๐—ง๐—ต๐—ฒ ๐——๐—ฒ๐˜€๐—ถ๐—ด๐—ป ๐—ฆ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป

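The biggest mistake is treating the URL as the identity. The URL is just a way to find the video. You must separate your logic into two jobs: working out which video a link points to (the platform and its video ID), and building the link you display.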

By keeping these separate, you can change how you display links without breaking your data.
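
To make the split concrete, here is a minimal sketch, not DailyWatch's actual code. The `VideoIdentity` class, the `platform:video_id` key format, and the two URL helpers are hypothetical names chosen for illustration, and the video ID is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoIdentity:
    """Job 1: what the video is (platform plus the platform's own ID)."""
    platform: str   # e.g. "youtube"
    video_id: str   # e.g. "abc123XYZ_-" (made-up ID)

    @property
    def key(self) -> str:
        # Hypothetical storage key; this is what the database would index on.
        return f"{self.platform}:{self.video_id}"


# Job 2: how the video is linked. These templates can change at any time
# without touching the stored identity keys.
def watch_url(identity: VideoIdentity) -> str:
    if identity.platform != "youtube":
        raise ValueError(f"no URL template for {identity.platform!r}")
    return f"https://www.youtube.com/watch?v={identity.video_id}"


def embed_url(identity: VideoIdentity) -> str:
    if identity.platform != "youtube":
        raise ValueError(f"no URL template for {identity.platform!r}")
    return f"https://www.youtube.com/embed/{identity.video_id}"
```

Both helpers take the same identity, so switching the site from watch links to embed links is a display change only. The stored key stays `youtube:abc123XYZ_-` either way.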

๐—ง๐—ต๐—ฒ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ ๐—ฆ๐˜๐—ฒ๐—ฝ๐˜€

  1. Normalization. You must apply transformations in a strict order. I strip tracking parameters like utm_source. I collapse mobile subdomains to the main host. I sort query parameters alphabetically. This ensures that two URLs with the same parameters in different orders result in the same string.

  2. Identity Extraction. I use specific patterns to pull the platform and the video ID. For YouTube, I look for the 11-character ID. I reject any ID that does not match this exact shape. A bad identity is worse than no identity because it merges two different videos into one.

  3. Canonical URL Building. Instead of storing the messy URL I found, I regenerate a clean one from the identity. This guarantees every video uses the same link format, which helps with SEO and cache hits.

  4. Database Deduplication. I use a unique constraint on the identity key in SQLite. During ingestion, I use an "upsert" command. If the video exists, I simply update the "last seen" timestamp. If it is new, I insert it. This prevents duplicates even when multiple processes run at once.
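
Steps 1 to 3 can be sketched with Python's standard library. This is an illustration under stated assumptions, not the production pipeline: the tracking-parameter list, the host alias table, the URL shapes handled, and the function names are all hypothetical, and the ID pattern (11 characters from letters, digits, `_` and `-`) is the commonly observed YouTube shape rather than a documented guarantee.

```python
import re
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; a real list would be longer.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "feature"}
# Assumed host aliases that collapse to the main host.
HOST_ALIASES = {"m.youtube.com": "www.youtube.com", "youtube.com": "www.youtube.com"}
# Assumed shape of a YouTube video ID.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def normalize(url: str) -> str:
    """Step 1: strip tracking params, collapse hosts, sort the query."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    # Assumption: force https and drop the port and fragment.
    return urlunsplit(("https", host, parts.path, urlencode(query), ""))


def extract_youtube_id(normalized: str) -> Optional[str]:
    """Step 2: pull the ID, rejecting anything that is not the exact shape."""
    parts = urlsplit(normalized)
    candidate = None
    if parts.hostname == "youtu.be":
        candidate = parts.path.lstrip("/")
    elif parts.hostname == "www.youtube.com":
        if parts.path == "/watch":
            candidate = dict(parse_qsl(parts.query)).get("v")
        elif parts.path.startswith(("/embed/", "/shorts/")):
            candidate = parts.path.split("/")[2]
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return candidate
    return None  # no identity is better than a wrong one


def canonical_url(video_id: str) -> str:
    """Step 3: rebuild one clean link format from the identity."""
    return f"https://www.youtube.com/watch?v={video_id}"


def canonicalize(url: str) -> Optional[str]:
    video_id = extract_youtube_id(normalize(url))
    return canonical_url(video_id) if video_id else None
```

With the made-up ID `abc123XYZ_-`, the inputs `https://m.youtube.com/watch?v=abc123XYZ_-&utm_source=newsletter`, `https://youtu.be/abc123XYZ_-` and `https://www.youtube.com/embed/abc123XYZ_-` all come back as `https://www.youtube.com/watch?v=abc123XYZ_-`, while a URL whose `v` value is not 11 characters returns `None`.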

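The deduplication step can be sketched with the standard `sqlite3` module. The table name, column names, and `ingest` helper are hypothetical; the `ON CONFLICT ... DO UPDATE` clause is SQLite's upsert syntax and needs SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema: the UNIQUE constraint on identity_key is what
# makes duplicates impossible, whatever the ingest code does.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    identity_key  TEXT NOT NULL UNIQUE,
    canonical_url TEXT NOT NULL,
    first_seen    TEXT NOT NULL,
    last_seen     TEXT NOT NULL
)
"""

# Insert a new video, or only bump last_seen if the key already exists.
UPSERT = """
INSERT INTO videos (identity_key, canonical_url, first_seen, last_seen)
VALUES (:key, :url, :now, :now)
ON CONFLICT(identity_key) DO UPDATE SET last_seen = excluded.last_seen
"""


def ingest(conn: sqlite3.Connection, key: str, url: str,
           now: Optional[str] = None) -> None:
    """Record one sighting of a video in a single atomic statement."""
    if now is None:
        now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, {"key": key, "url": url, "now": now})
```

Because the check and the write happen in one statement, two workers that see the same video at the same moment cannot both insert it. Ingesting the same key twice leaves one row whose `first_seen` is unchanged and whose `last_seen` holds the later timestamp.
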
๐—ง๐—ต๐—ฒ ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€

Stop treating URLs as strings. Start treating them as pointers to an identity.

Source: https://dev.to/ahmet_gedik778845/building-a-video-url-canonicalization-pipeline-for-a-discovery-platform-528l