Building a Video URL Canonicalization Pipeline
A single YouTube video can have a dozen different URLs.
One URL uses a mobile host. Another uses a short link. A third includes tracking parameters. If you treat these as separate rows, your platform breaks. You get duplicate cards on your discovery page. Your search ranking suffers. Your deduplication job gets slower every week.
I run DailyWatch, a video discovery platform. I learned that URL canonicalization is not just about cleaning strings. It is about finding the true identity of a video.
Here is the pipeline we use to keep our data clean.
The Design Decision
Most people make the mistake of treating this as a string cleaning problem. They strip parameters and lowercase the host. This fails because a URL is only one of many addresses for the same video. Two cleaned strings can still differ while pointing at the same content.
You must separate two distinct jobs:
- Normalization: Create a clean URL string for humans to see and click.
- Identity Extraction: Create a unique key like youtube:dQw4w9WgXcQ.
The identity key is your primary key for deduplication. It is not a URL. It is a stable identifier. If YouTube changes its URL structure, your deduplication history stays intact because the identity does not change.
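To make the split concrete, here is a minimal sketch of how the two outputs could be modeled. The class names `VideoIdentity` and `VideoRecord` are illustrative assumptions, not taken from any real codebase; only the `youtube:dQw4w9WgXcQ` key format comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoIdentity:
    """Stable identifier used for deduplication (hypothetical model)."""
    platform: str   # e.g. "youtube"
    video_id: str   # e.g. "dQw4w9WgXcQ"

    @property
    def key(self) -> str:
        # Key format follows the example in the text: "<platform>:<id>".
        return f"{self.platform}:{self.video_id}"


@dataclass(frozen=True)
class VideoRecord:
    """Pairs the identity with a URL meant for humans to see and click."""
    identity: VideoIdentity
    display_url: str


# Two different URL strings, one identity.
a = VideoRecord(VideoIdentity("youtube", "dQw4w9WgXcQ"),
                "https://www.youtube.com/watch?v=dQw4w9WgXcQ")
b = VideoRecord(VideoIdentity("youtube", "dQw4w9WgXcQ"),
                "https://youtu.be/dQw4w9WgXcQ")
print(a.identity.key == b.identity.key)  # the dedup key matches
print(a.display_url == b.display_url)    # the strings do not
```

Keeping the key as a derived property means the display URL can change format later without touching anything that deduplication depends on.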
The Pipeline Steps
1. Normalization
We apply a fixed sequence of transformations. We strip tracking noise like utm_source and fbclid. We use an allowlist for known platforms. For YouTube, we keep only v, t, and list. We also sort query parameters alphabetically. This ensures that ?a=1&b=2 and ?b=2&a=1 result in the same string.
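A minimal sketch of such a normalization pass, using only Python's standard library. The host alias table, the `gclid` entry, the forced `https` scheme, and the function name `normalize_url` are assumptions added for illustration; the YouTube allowlist (`v`, `t`, `list`) and the alphabetical sort come from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Tracking parameters to drop everywhere (utm_* and fbclid are from the text;
# gclid is an assumed extra example).
TRACKING_PARAMS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)

# Assumption: mobile and bare hosts are folded onto one display host.
HOST_ALIASES = {
    "m.youtube.com": "www.youtube.com",
    "youtube.com": "www.youtube.com",
}

# Per-platform allowlist of query parameters worth keeping.
PARAM_ALLOWLIST = {
    "www.youtube.com": {"v", "t", "list"},
}


def normalize_url(raw: str) -> str:
    parts = urlsplit(raw.strip())
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    allowed = PARAM_ALLOWLIST.get(host)  # None means "unknown platform"

    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in TRACKING_PARAMS or key.startswith(TRACKING_PREFIXES):
            continue  # tracking noise
        if allowed is not None and key not in allowed:
            continue  # not on this platform's allowlist
        kept.append((key, value))

    kept.sort()  # alphabetical order makes the output deterministic
    # Scheme is forced to https and the fragment is dropped (assumptions).
    return urlunsplit(("https", host, parts.path, urlencode(kept), ""))


print(normalize_url(
    "https://m.youtube.com/watch?v=dQw4w9WgXcQ&utm_source=x&fbclid=abc&t=42"
))
```

Unknown hosts still get tracking parameters stripped and the rest sorted, but nothing else is removed, since dropping an unrecognized parameter on an unknown site could change which page loads.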
2. Identity Extraction
We use specific patterns to pull the platform and video ID. We validate the ID shape strictly. A bad identity is worse than no identity. If an ID is malformed, we reject it. We do not want to merge two different videos by mistake.
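A sketch of strict extraction for YouTube, under stated assumptions: the 11-character `[A-Za-z0-9_-]` shape reflects what YouTube IDs commonly look like rather than a documented guarantee, and the host list, path patterns, and the name `extract_identity` are illustrative.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed ID shape: 11 characters from the URL-safe base64 alphabet.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
# Assumed path styles that carry the ID as a path segment.
PATH_ID = re.compile(r"/(?:embed|shorts)/([^/]+)/?")


def extract_identity(url: str) -> Optional[str]:
    """Return a key like 'youtube:<id>', or None if nothing valid is found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Short link: the ID is the first path segment.
        candidate = parts.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            values = parse_qs(parts.query).get("v", [])
            candidate = values[0] if values else None
        else:
            match = PATH_ID.fullmatch(parts.path)
            candidate = match.group(1) if match else None

    # Reject anything that does not match the expected shape exactly.
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return f"youtube:{candidate}"
    return None


print(extract_identity("https://youtu.be/dQw4w9WgXcQ?si=abc"))
print(extract_identity("https://www.youtube.com/watch?v=too-short"))
```

Returning `None` instead of a best guess keeps the failure visible: a rejected URL can be logged and reviewed, while a wrong merge silently corrupts two rows.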
3. Canonical Building
Once we have a clean identity, we regenerate the URL from scratch. We do not store the messy URL the user provided. We build the official version. This helps with SEO and ensures every video uses the same link format.
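Rebuilding from the identity can be as small as a template lookup. In this sketch the template table and the function name `build_canonical_url` are assumptions. The watch-URL format is the familiar public one, chosen here as the canonical form for illustration.

```python
from urllib.parse import quote

# One canonical template per platform (illustrative table).
CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={video_id}",
}


def build_canonical_url(identity_key: str) -> str:
    """Turn 'platform:id' into the single URL format we want to publish."""
    platform, sep, video_id = identity_key.partition(":")
    if not sep or not video_id:
        raise ValueError(f"malformed identity key: {identity_key!r}")

    template = CANONICAL_TEMPLATES.get(platform)
    if template is None:
        raise ValueError(f"no canonical template for platform: {platform!r}")

    # Percent-encode defensively; a validated ID passes through unchanged.
    return template.format(video_id=quote(video_id, safe=""))


print(build_canonical_url("youtube:dQw4w9WgXcQ"))
```

Because nothing from the submitted URL reaches the output, stray parameters or odd hosts cannot leak into the stored link.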
4. Deduplication
We use the identity key as a unique constraint in our database. We use an upsert logic. If the video exists, we just update the last seen timestamp. This prevents duplicate rows even when many workers run at once.
The Results
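The upsert pattern can be sketched with SQLite, used here only as a stand-in because the actual database is not named. The table name, column names, and helper functions are assumptions. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # The identity key is the primary key, so the database itself
    # enforces "one row per video".
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS videos (
            identity_key  TEXT PRIMARY KEY,
            canonical_url TEXT NOT NULL,
            first_seen    INTEGER NOT NULL,
            last_seen     INTEGER NOT NULL
        )
        """
    )


def upsert_video(conn: sqlite3.Connection, identity_key: str,
                 canonical_url: str, seen_at: int) -> None:
    # Insert a new row, or only bump last_seen if the key already exists.
    conn.execute(
        """
        INSERT INTO videos (identity_key, canonical_url, first_seen, last_seen)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(identity_key) DO UPDATE SET last_seen = excluded.last_seen
        """,
        (identity_key, canonical_url, seen_at, seen_at),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_db(conn)
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
upsert_video(conn, "youtube:dQw4w9WgXcQ", url, seen_at=100)
upsert_video(conn, "youtube:dQw4w9WgXcQ", url, seen_at=200)
print(conn.execute(
    "SELECT COUNT(*), MIN(first_seen), MAX(last_seen) FROM videos"
).fetchone())
```

Pushing the conflict check into a single statement is what makes this safe under concurrency: a separate "select, then insert" would leave a window where two workers both see no row.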
This system solved several problems at once:
- Search results no longer show duplicates.
- Cache hit rates increased because we only cache one version of a page.
- Trending data is accurate because signals accumulate on one row.
- We reduce exposure to server-side request forgery (SSRF) because we only follow redirects for known shorteners.
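The redirect rule in the last point comes down to a host check before any request is made. A minimal sketch, assuming an illustrative shortener list and a hypothetical function name `may_follow_redirect`; a production guard would likely add further checks, such as validating the resolved IP address.

```python
from urllib.parse import urlsplit

# Illustrative allowlist; the real list of trusted shorteners is an assumption.
KNOWN_SHORTENERS = frozenset({"bit.ly", "t.co"})


def may_follow_redirect(url: str) -> bool:
    """Allow a redirect lookup only for exact matches on trusted hosts."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # rules out file://, ftp://, and similar schemes
    host = (parts.hostname or "").lower()
    # Exact match only: "bit.ly.evil.example" must not pass.
    return host in KNOWN_SHORTENERS


print(may_follow_redirect("https://bit.ly/abc"))
print(may_follow_redirect("http://169.254.169.254/latest/meta-data/"))
```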
Keep your identity separate from your strings. It is the only way to survive the messiness of the web.