Deduplicating Video URLs at Scale

You see one video ten times in your feed. It looks like a bug. It is a URL problem.

Different links point to one video. You have tracking codes. You have mobile links. You have short links.

This ruins your data. Search results show duplicates. Trending lists are wrong.

Use a two-step pipeline to fix this.

Step 1: String Normalization. Clean the URL. Remove tracking junk. Sort the query strings.
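A minimal sketch of Step 1 in Python, using only the standard library. The function name `normalize_url` and the tracking parameter list are assumptions for illustration, not taken from the source article. Tune the list to your own traffic.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking parameters; extend this for your own data.
TRACKING_PARAMS = {"fbclid", "gclid", "si", "feature"}
TRACKING_PREFIXES = ("utm_",)
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a cleaned URL: lowercase scheme and host, no tracking
    parameters, sorted query string, no fragment."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Drop tracking junk, then sort so parameter order no longer matters.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    pairs.sort()

    return urlunsplit((scheme, netloc, parts.path, urlencode(pairs), ""))
```

Two links that differ only in tracking codes or parameter order now compare equal as strings. This step does not know what a video is. That is Step 2.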

Step 2: Identity Extraction. Pull the video ID. Different URL shapes point to one ID. The ID is the source of truth.
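A rough sketch of Step 2 for YouTube links only. The function name `extract_youtube_id` is hypothetical. The 11-character ID pattern is an assumption based on how YouTube IDs commonly look, not a documented guarantee. Other platforms need their own rules.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, parse_qs

# Assumption: IDs are 11 characters from this alphabet.
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
_PATH_PREFIXES = ("/embed/", "/shorts/", "/v/")


def extract_youtube_id(url: str) -> Optional[str]:
    """Return the video ID for known YouTube URL shapes, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Short link shape: https://youtu.be/<id>
        candidate = parts.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parts.path == "/watch":
            # Desktop and mobile shape: /watch?v=<id>
            candidate = parse_qs(parts.query).get("v", [None])[0]
        else:
            # Path shapes: /embed/<id>, /shorts/<id>, /v/<id>
            for prefix in _PATH_PREFIXES:
                if parts.path.startswith(prefix):
                    candidate = parts.path[len(prefix):].split("/")[0]
                    break

    if candidate and _ID_RE.match(candidate):
        return candidate
    return None
```

Desktop, mobile, short, and embed links all collapse to one ID. Anything unrecognized returns None, so the caller decides what to do with it.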

Store a canonical key. Use a format like youtube:id. Use an upsert in your database. This keeps your writes idempotent.
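One way to sketch the canonical key and the upsert, using SQLite from the Python standard library. The table name `videos`, its columns, and the helper names are assumptions for illustration. The `ON CONFLICT` clause needs SQLite 3.24 or newer. Your database will have its own upsert syntax.

```python
import sqlite3


def canonical_key(platform: str, video_id: str) -> str:
    """Build a key such as 'youtube:dQw4w9WgXcQ'."""
    return f"{platform.lower()}:{video_id}"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the hypothetical videos table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS videos (
            canonical_key TEXT PRIMARY KEY,
            last_seen_url TEXT NOT NULL
        )
        """
    )
    return conn


def upsert_video(conn: sqlite3.Connection, key: str, url: str) -> None:
    """Insert the video, or update the existing row for the same key.
    Replaying the same write leaves the table unchanged."""
    conn.execute(
        """
        INSERT INTO videos (canonical_key, last_seen_url)
        VALUES (?, ?)
        ON CONFLICT(canonical_key)
        DO UPDATE SET last_seen_url = excluded.last_seen_url
        """,
        (key, url),
    )
    conn.commit()
```

The primary key does the deduplication. Ten links to one video produce one row, no matter how many times ingest runs.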

Handle redirects separately. Resolving a short link takes a network request. Cache the resolved targets in a table. This avoids network lag during ingest.
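A possible shape for the redirect cache, again with SQLite. The table name `redirect_cache` and the function names are hypothetical. The resolver is passed in as a plain callable, so the sketch makes no claim about which HTTP client you use.

```python
import sqlite3
from typing import Callable


def open_redirect_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache and create the hypothetical redirect_cache table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS redirect_cache (
            short_url  TEXT PRIMARY KEY,
            target_url TEXT NOT NULL
        )
        """
    )
    return conn


def resolve_cached(
    conn: sqlite3.Connection,
    short_url: str,
    resolver: Callable[[str], str],
) -> str:
    """Return the redirect target, calling resolver only on a cache miss."""
    row = conn.execute(
        "SELECT target_url FROM redirect_cache WHERE short_url = ?",
        (short_url,),
    ).fetchone()
    if row is not None:
        return row[0]

    # Cache miss: this is the only place a network call would happen.
    target = resolver(short_url)
    conn.execute(
        "INSERT OR REPLACE INTO redirect_cache (short_url, target_url) "
        "VALUES (?, ?)",
        (short_url, target),
    )
    conn.commit()
    return target
```

Each short link costs one lookup the first time and a table read after that. A real cache would likely also need an expiry policy, which this sketch leaves out.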

The takeaway:

Separate string cleaning from identity. Route every write through one boundary. Your data stays honest.

Source: https://dev.to/ahmet_gedik778845/deduplicating-video-urls-at-scale-with-a-php-canonicalization-pipeline-2dgm