๐—•๐˜‚๐—ถ๐—น๐—ฑ๐—ถ๐—ป๐—ด ๐—ฎ ๐—ฉ๐—ถ๐—ฑ๐—ฒ๐—ผ ๐—จ๐—ฅ๐—Ÿ ๐—–๐—ฎ๐—ป๐—ผ๐—ป๐—ถ๐—ฐ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ

A single YouTube video can have a dozen different URLs.

One URL uses a mobile host. Another uses a short link. A third includes tracking parameters. If you treat these as separate rows, your platform breaks. You get duplicate cards on your discovery page. Your search ranking suffers. Your deduplication job gets slower every week.

I run DailyWatch, a video discovery platform. I learned that URL canonicalization is not just about cleaning strings. It is about finding the true identity of a video.

Here is the pipeline we use to keep our data clean.

๐—ง๐—ต๐—ฒ ๐——๐—ฒ๐˜€๐—ถ๐—ด๐—ป ๐——๐—ฒ๐—ฐ๐—ถ๐˜€๐—ถ๐—ผ๐—ป

Most people treat this as a string-cleaning problem. They strip parameters and lowercase the host. This fails because a URL is only one of many ways to point at a video. A short link and a full watch link for the same video are still different strings after cleaning.

You must separate two distinct jobs: identifying the video, and building the URL you show for it.

The identity key is your primary key for deduplication. It is not a URL. It is a stable identifier. If YouTube changes its URL structure, your deduplication history stays intact because the identity does not change.
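One way to picture an identity key is as a small immutable value built from a platform name and a video ID. The sketch below is an illustration only: the `VideoIdentity` class, its field names, and the `platform:id` key format are assumptions, not the schema DailyWatch uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoIdentity:
    """Hypothetical identity value: which platform, and which video on it."""

    platform: str
    video_id: str

    @property
    def key(self) -> str:
        # Assumed storage format for the deduplication key, e.g. "youtube:<id>".
        return f"{self.platform}:{self.video_id}"


# Two sightings of the same video through different URLs share one identity.
sightings = {
    "https://youtu.be/dQw4w9WgXcQ": VideoIdentity("youtube", "dQw4w9WgXcQ"),
    "https://m.youtube.com/watch?v=dQw4w9WgXcQ": VideoIdentity("youtube", "dQw4w9WgXcQ"),
}
unique_videos = set(sightings.values())  # frozen dataclasses are hashable
```

Because the key never contains a URL, a change in the platform's link format would only affect how URLs are parsed and rebuilt, not the stored keys.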

๐—ง๐—ต๐—ฒ ๐—ฃ๐—ถ๐—ฝ๐—ฒ๐—น๐—ถ๐—ป๐—ฒ ๐—ฆ๐˜๐—ฒ๐—ฝ๐˜€

๐Ÿญ. ๐—ก๐—ผ๐—ฟ๐—บ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป We apply a fixed sequence of transformations. We strip tracking noise like utm_source and fbclid. We use an allowlist for known platforms. For YouTube, we keep only v, t, and list. We also sort query parameters alphabetically. This ensures that ?a=1&b=2 and ?b=2&a=1 result in the same string.

๐Ÿฎ. ๐—œ๐—ฑ๐—ฒ๐—ป๐˜๐—ถ๐˜๐˜† ๐—˜๐˜…๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐—ถ๐—ผ๐—ป We use specific patterns to pull the platform and video ID. We validate the ID shape strictly. A bad identity is worse than no identity. If an ID is malformed, we reject it. We do not want to merge two different videos by mistake.

๐Ÿฏ. ๐—–๐—ฎ๐—ป๐—ผ๐—ป๐—ถ๐—ฐ๐—ฎ๐—น ๐—•๐˜‚๐—ถ๐—น๐—ฑ๐—ถ๐—ป๐—ด Once we have a clean identity, we regenerate the URL from scratch. We do not store the messy URL the user provided. We build the official version. This helps with SEO and ensures every video uses the same link format.

๐Ÿฐ. ๐——๐—ฒ๐—ฑ๐˜‚๐—ฝ๐—น๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป We use the identity key as a unique constraint in our database. We use an upsert logic. If the video exists, we just update the last seen timestamp. This prevents duplicate rows even when many workers run at once.

๐—ง๐—ต๐—ฒ ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€

This system solved several problems at once:

Keep your identity separate from your strings. It is the only way to survive the messiness of the web.

Source: https://dev.to/ahmet_gedik778845/building-a-video-url-canonicalization-pipeline-for-a-discovery-platform-528l