𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗩𝗶𝗱𝗲𝗼 𝗨𝗥𝗟 𝗖𝗮𝗻𝗼𝗻𝗶𝗰𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
A single YouTube video can have a dozen different URLs.
One URL uses a mobile host. Another uses a short link. A third includes tracking parameters. If you treat these as separate rows, your platform breaks. You get duplicate cards on your discovery page. Your search ranking suffers. Your deduplication job gets slower every week.
I run DailyWatch, a video discovery platform. I learned that URL canonicalization is not just about cleaning strings. It is about finding the true identity of a video.
Here is the pipeline we use to keep our data clean.
𝗧𝗵𝗲 𝗗𝗲𝘀𝗶𝗴𝗻 𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻
A common mistake is to treat this as a string-cleaning problem: strip the parameters, lowercase the host, and call it done. This fails because a URL is only one of many ways to address the same video. Two cleaned strings can still differ while pointing at the same content.
You must separate two distinct jobs:
- Normalization: Create a clean URL string for humans to see and click.
- Identity Extraction: Create a unique key like youtube:dQw4w9WgXcQ.
The identity key is your primary key for deduplication. It is not a URL. It is a stable identifier. If YouTube changes its URL structure, your deduplication history stays intact because the identity does not change.
𝗧𝗵𝗲 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗦𝘁𝗲𝗽𝘀
𝟭. 𝗡𝗼𝗿𝗺𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻
We apply a fixed sequence of transformations. We strip tracking noise like utm_source and fbclid. For known platforms we go further and use an allowlist of query parameters. For YouTube, we keep only v, t, and list. We also sort query parameters alphabetically. This ensures that ?a=1&b=2 and ?b=2&a=1 result in the same string.
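A minimal Python sketch of this step might look like the following. The host set, the extra tracking key gclid, and the function name normalize_url are assumptions for illustration. Only the v, t, and list allowlist and the utm_source and fbclid examples come from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of YouTube hosts; extend it for your own sources.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
# Allowlist named in the text: keep only these parameters for YouTube.
YOUTUBE_ALLOWED = {"v", "t", "list"}
# Assumed denylist for hosts we do not recognize.
TRACKING_KEYS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def normalize_url(raw: str) -> str:
    """Return a cleaned, deterministic URL string for display."""
    parts = urlsplit(raw.strip())
    host = (parts.hostname or "").lower()
    # Keep an explicit port if one was given; userinfo is dropped.
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if host in YOUTUBE_HOSTS:
        # Known platform: allowlist.
        pairs = [(k, v) for k, v in pairs if k in YOUTUBE_ALLOWED]
    else:
        # Unknown platform: only strip known tracking noise.
        pairs = [
            (k, v)
            for k, v in pairs
            if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
        ]

    # Sort so that parameter order never produces a different string.
    pairs.sort()
    # The fragment is dropped on purpose in this sketch.
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(pairs), ""))
```

Note that this step only produces a tidy string. It does not decide whether two URLs are the same video.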
𝟮. 𝗜𝗱𝗲𝗻𝘁𝗶𝘁𝘆 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻
We use platform-specific patterns to pull out the platform and the video ID. We validate the ID shape strictly. If an ID is malformed, we reject it and produce no identity at all. A bad identity is worse than no identity, because it can merge two different videos by mistake.
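As a sketch of strict extraction, assuming YouTube IDs are 11 characters drawn from letters, digits, underscore, and hyphen (a widely observed shape, not something stated above), and assuming only the youtu.be and /watch URL forms need handling. The function name extract_identity is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed ID shape: exactly 11 characters from [A-Za-z0-9_-].
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")
# Assumed hosts that serve the /watch?v=... form.
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_identity(url: str) -> Optional[str]:
    """Return a key like 'youtube:<id>', or None if we are not sure."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    candidate = None
    if host == "youtu.be":
        # Short link: the ID is the first path segment.
        candidate = parts.path.lstrip("/").split("/")[0]
    elif host in WATCH_HOSTS and parts.path == "/watch":
        values = parse_qs(parts.query).get("v", [])
        # Reject ambiguous URLs that carry more than one v parameter.
        candidate = values[0] if len(values) == 1 else None

    # fullmatch rejects IDs that are too short, too long, or malformed.
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return f"youtube:{candidate}"
    return None
```

Returning None rather than a best guess is the point: a URL without an identity can be retried or reviewed later, while a wrong identity silently merges rows.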
𝟯. 𝗖𝗮𝗻𝗼𝗻𝗶𝗰𝗮𝗹 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴
Once we have a clean identity, we regenerate the URL from scratch. We do not store the messy URL the user provided. We build the official version. This helps with SEO and ensures every video uses the same link format.
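Rebuilding from the identity can be as small as a template lookup. In this sketch the template string and the function name build_canonical_url are assumptions. The youtube:<id> key format follows the example given earlier.

```python
# Assumed canonical template per platform; only YouTube is shown here.
CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={video_id}",
}


def build_canonical_url(identity: str) -> str:
    """Turn an identity key such as 'youtube:dQw4w9WgXcQ' into one URL."""
    platform, sep, video_id = identity.partition(":")
    if not sep or not video_id:
        raise ValueError(f"not an identity key: {identity!r}")
    template = CANONICAL_TEMPLATES.get(platform)
    if template is None:
        raise ValueError(f"unknown platform: {platform!r}")
    # Nothing from the user-supplied URL survives into the output.
    return template.format(video_id=video_id)
```

Because the output depends only on the identity, a change in link format means editing one template rather than rewriting stored rows.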
𝟰. 𝗗𝗲𝗱𝘂𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻
We put a unique constraint on the identity key in our database and write with upsert logic. If the video already exists, we just update its last-seen timestamp. Because the database enforces the constraint, this prevents duplicate rows even when many workers run at once.
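The upsert can be sketched with SQLite standing in for whatever database you run. The table layout, column names, and helper names below are hypothetical, and the ON CONFLICT ... DO UPDATE syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) videos table keyed by identity."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS videos (
            identity      TEXT PRIMARY KEY,  -- e.g. 'youtube:dQw4w9WgXcQ'
            canonical_url TEXT NOT NULL,
            first_seen    INTEGER NOT NULL,
            last_seen     INTEGER NOT NULL
        )
        """
    )
    return conn


def record_sighting(
    conn: sqlite3.Connection, identity: str, canonical_url: str, seen_at: int
) -> None:
    """Insert the video, or only bump last_seen if it already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO videos (identity, canonical_url, first_seen, last_seen)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(identity) DO UPDATE SET last_seen = excluded.last_seen
            """,
            (identity, canonical_url, seen_at, seen_at),
        )
```

The check and the write happen in a single statement, so there is no gap between "does it exist?" and "insert it" for two workers to race through.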
𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
This system solved several problems at once:
- Search results no longer show duplicates.
- Cache hit rates increased because we only cache one version of a page.
- Trending data is accurate because signals accumulate on one row.
- We reduce our exposure to server-side request forgery because we follow redirects only for known shorteners.
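The redirect rule in the last bullet can be sketched as a gate that runs before any network request. The shortener list, the port and credential checks, and the function name may_follow_redirects are assumptions for illustration, not a complete SSRF defense.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; use the shorteners you actually see in your data.
KNOWN_SHORTENERS = frozenset({"youtu.be", "bit.ly", "t.co"})


def may_follow_redirects(url: str) -> bool:
    """Return True only for plain http(s) URLs on an allowlisted host."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # Reject 'user@host' tricks such as https://bit.ly@evil.example/.
    if parts.username or parts.password:
        return False
    if port not in (None, 80, 443):
        return False
    return (parts.hostname or "").lower() in KNOWN_SHORTENERS
```

Every other URL skips the network entirely and goes straight to identity extraction. A fuller version would also check the host of each hop's target before following it.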
Keep your identity separate from your strings. It is the only way to survive the messiness of the web.