Building a Video URL Canonicalization Pipeline
One YouTube video can have a dozen different URLs.
You might see a mobile link, a short link, an embed link, or a link with tracking parameters. If your system treats these as different videos, your discovery page will show duplicates. Your search rankings will suffer. Your database will grow too fast.
I run DailyWatch, a video discovery platform. I learned that URL canonicalization is not just about cleaning strings. It is about finding the true identity of a video.
Here is the pipeline we use to solve this.
The Design Separation
The biggest mistake is treating the URL as the identity. The URL is just a way to find the video. You must separate your logic into two jobs:
- Normalization: Create a clean, human-readable URL for display.
- Identity Extraction: Create a unique key (like youtube:dQw4w9WgXcQ) for your database.
By keeping these separate, you can change how you display links without breaking your data.
The Pipeline Steps
Normalization: I apply transformations in a strict order. First, I strip tracking parameters like utm_source. Next, I collapse mobile subdomains to the main host. Finally, I sort query parameters alphabetically, so two URLs with the same parameters in different orders produce the same string.
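The normalization step could be sketched roughly as follows. This is a minimal illustration, not DailyWatch's actual code: the function name normalize_url, the tracking list beyond utm_source, the host mapping, and the choice to drop the fragment are all assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed lists: the post only names utm_source explicitly.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}
MOBILE_HOSTS = {"m.youtube.com": "www.youtube.com"}


def normalize_url(url: str) -> str:
    """Return a clean display URL; hypothetical helper for illustration."""
    parts = urlsplit(url.strip())

    # 1. Strip tracking parameters.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]

    # 2. Collapse mobile subdomains to the main host (hosts are case-insensitive).
    host = parts.netloc.lower()
    host = MOBILE_HOSTS.get(host, host)

    # 3. Sort the remaining parameters so ordering never changes the result.
    params.sort()

    # The fragment is dropped here, which is an assumption of this sketch.
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(params), ""))
```

With this sketch, a mobile link carrying utm_source and a desktop link with the same parameters in a different order both come out as the same string.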
Identity Extraction: I use platform-specific patterns to pull out the platform and the video ID. For YouTube, I look for the 11-character ID and reject anything that does not match this exact shape. A bad identity is worse than no identity because it can merge two different videos into one.
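Here is one way the extraction could look for YouTube. Treat it as a sketch under assumptions: the function name extract_identity, the list of hosts and path shapes, and the ID alphabet (letters, digits, underscore, hyphen) are mine; the post only specifies the 11-character length and the youtube:ID key format.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed alphabet for the 11-character ID; the length comes from the post.
YOUTUBE_ID = re.compile(r"[A-Za-z0-9_-]{11}")
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_identity(url: str) -> Optional[str]:
    """Return a key like 'youtube:<id>', or None if nothing valid is found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Short links carry the ID as the whole path.
        candidate = parts.path.strip("/")
    elif host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            candidate = parse_qs(parts.query).get("v", [None])[0]
        elif parts.path.startswith(("/embed/", "/shorts/")):
            candidate = parts.path.split("/")[2]

    # Reject anything that is not exactly the expected shape.
    if candidate and YOUTUBE_ID.fullmatch(candidate):
        return f"youtube:{candidate}"
    return None
```

Returning None instead of a best guess keeps a malformed link out of the database rather than letting it collide with a real video.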
Canonical URL Building: Instead of storing the messy URL I found, I regenerate a clean one from the identity. This guarantees that every video uses the same link format, which helps with SEO and cache hits.
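Regenerating the link from the identity can be as small as a template lookup. The sketch below assumes a hypothetical build_canonical_url helper and a standard watch-page template for YouTube; the post does not say which format DailyWatch actually emits.

```python
from typing import Optional

# Assumed template; one entry per supported platform.
CANONICAL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={video_id}",
}


def build_canonical_url(identity: str) -> Optional[str]:
    """Turn a key like 'youtube:<id>' back into a single canonical link."""
    platform, _, video_id = identity.partition(":")
    template = CANONICAL_TEMPLATES.get(platform)
    if template is None or not video_id:
        return None  # Unknown platform or malformed key.
    return template.format(video_id=video_id)
```

Because the URL is derived from the key, changing the display format later only means editing the template, not rewriting stored data.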
Database Deduplication: I put a unique constraint on the identity key in SQLite. During ingestion, I use an "upsert" command. If the video already exists, I simply update its "last seen" timestamp. If it is new, I insert it. This prevents duplicates even when multiple processes run at once.
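A minimal sketch of the upsert, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration, and the ON CONFLICT ... DO UPDATE syntax requires SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical schema: the UNIQUE constraint on identity does the deduplication.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    identity      TEXT NOT NULL UNIQUE,
    canonical_url TEXT NOT NULL,
    first_seen    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_seen     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""

# Insert a new row, or only refresh last_seen when the identity already exists.
UPSERT = """
INSERT INTO videos (identity, canonical_url)
VALUES (?, ?)
ON CONFLICT(identity) DO UPDATE SET last_seen = CURRENT_TIMESTAMP
"""


def record_video(conn: sqlite3.Connection, identity: str, canonical_url: str) -> None:
    # Using the connection as a context manager commits on success
    # and rolls back if the statement raises.
    with conn:
        conn.execute(UPSERT, (identity, canonical_url))
```

Since the insert and the conflict handling happen in one statement, there is no gap between "check if it exists" and "insert" for a second process to slip through.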
The Results
- Search results never show the same video twice.
- Cache hit rates increased because we no longer cache six versions of the same page.
- Trending data is accurate because all views for one video hit a single row.
Stop treating URLs as strings. Start treating them as pointers to an identity.