Deduplicating Video URLs at Scale
You see one video ten times in your feed. It looks like a bug. It is a URL problem.
Different links point to one video. You have tracking codes. You have mobile links. You have short links.
This ruins your data. Search results show duplicates. Trending lists are wrong.
Use a two-step pipeline to fix this.
Step 1: String Normalization. Clean the URL. Remove tracking junk. Sort the query parameters.
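Here is a minimal sketch of Step 1 in Python. It uses only the standard library. The tracking parameter names are an assumption. Swap in the ones you actually see in your traffic.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; this list is illustrative, not exhaustive.
TRACKING_PARAMS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params and fragment, sort the rest."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    query.sort()  # stable parameter order, so equal URLs compare equal as strings
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # also lowercases userinfo, acceptable for a sketch
        parts.path,
        urlencode(query),
        "",  # drop the fragment
    ))
```

Two links that differ only in tracking codes or parameter order now normalize to the same string.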
Step 2: Identity Extraction. Pull the video ID. Different URL shapes point to one ID. The ID is the source of truth.
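Here is a rough sketch of Step 2 for YouTube links. The host list and URL shapes are a representative subset, not a complete catalog. The function name extract_youtube_id is hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed host and path shapes; real traffic may include more.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
PATH_PREFIXES = ("/embed/", "/shorts/")


def extract_youtube_id(url: str) -> Optional[str]:
    """Return the video ID for known YouTube URL shapes, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    if host == "youtu.be":
        # Short link: the ID is the first path segment.
        video_id = parts.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in YOUTUBE_HOSTS:
        if parts.path == "/watch":
            # Desktop and mobile watch pages carry the ID in the v parameter.
            values = parse_qs(parts.query).get("v")
            return values[0] if values else None
        for prefix in PATH_PREFIXES:
            if parts.path.startswith(prefix):
                video_id = parts.path[len(prefix):].split("/")[0]
                return video_id or None

    return None
```

Desktop, mobile, short, and embed links for one video all return the same ID. Unknown hosts return None, so the caller decides what to do with them.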
Store a canonical key. Use a format like youtube:id. Use an upsert in your database. This keeps your writes idempotent: ingesting the same video twice leaves one row.
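Here is one way the canonical key and upsert could look, sketched with SQLite. The table name, column names, and helper names are assumptions for illustration. The ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3


def canonical_key(platform: str, video_id: str) -> str:
    """Build a key such as youtube:abc123DEF45."""
    return f"{platform}:{video_id}"


def upsert_video(conn: sqlite3.Connection, platform: str, video_id: str, url: str) -> None:
    """Insert the video once; later writes for the same key update the row in place."""
    conn.execute(
        """
        INSERT INTO videos (canonical_key, last_url)
        VALUES (?, ?)
        ON CONFLICT(canonical_key) DO UPDATE SET last_url = excluded.last_url
        """,
        (canonical_key(platform, video_id), url),
    )


# Hypothetical schema: the primary key on canonical_key is what enforces one row per video.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos (canonical_key TEXT PRIMARY KEY, last_url TEXT NOT NULL)"
)

# Two different links to one video produce a single row.
upsert_video(conn, "youtube", "abc123DEF45", "https://youtu.be/abc123DEF45")
upsert_video(conn, "youtube", "abc123DEF45", "https://www.youtube.com/watch?v=abc123DEF45")
row_count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
```

The uniqueness constraint lives in the database, not in application code. Retries and replays cannot create duplicates.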
Handle redirects separately. Cache resolved short links in a table. Resolve them outside the ingest path, so ingest never waits on the network.
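A redirect cache could be sketched like this. The table layout and function names are assumptions. Ingest only reads the table. A separate worker, not shown here, would follow the redirects and call record_redirect.

```python
import sqlite3
from typing import List, Optional


def init_redirect_cache(conn: sqlite3.Connection) -> None:
    # A NULL target_url marks a short link that has been seen but not resolved yet.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS redirect_cache ("
        "short_url TEXT PRIMARY KEY, target_url TEXT)"
    )


def lookup_redirect(conn: sqlite3.Connection, short_url: str) -> Optional[str]:
    """Return the cached target, or None. Never touches the network."""
    row = conn.execute(
        "SELECT target_url FROM redirect_cache WHERE short_url = ?", (short_url,)
    ).fetchone()
    if row is None:
        # First sighting: queue it for the background resolver.
        conn.execute(
            "INSERT OR IGNORE INTO redirect_cache (short_url, target_url) VALUES (?, NULL)",
            (short_url,),
        )
        return None
    return row[0]


def pending_redirects(conn: sqlite3.Connection) -> List[str]:
    """Short links the background resolver still has to follow."""
    rows = conn.execute(
        "SELECT short_url FROM redirect_cache WHERE target_url IS NULL"
    ).fetchall()
    return [r[0] for r in rows]


def record_redirect(conn: sqlite3.Connection, short_url: str, target_url: str) -> None:
    """Called by the resolver once it knows where the short link points."""
    conn.execute(
        "INSERT INTO redirect_cache (short_url, target_url) VALUES (?, ?) "
        "ON CONFLICT(short_url) DO UPDATE SET target_url = excluded.target_url",
        (short_url, target_url),
    )
```

A cache miss costs one local query. The slow network call happens later, off the write path.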
Results:
- Row counts dropped.
- Search results are clean.
- Trending ranks are accurate.
- Sitemaps are smaller.
Separate string cleaning from identity. Route every write through one boundary. Your data stays honest.