Building a Video URL Canonicalization Pipeline

I ran a query on my production SQLite database and found a massive problem.

The videos table had 41,283 rows, but only 29,000 unique titles. We were storing the same videos multiple times.
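A check like this takes only a few lines. The sketch below is illustrative rather than the exact query I ran: it assumes a `videos` table with a `title` column, and `duplicateStats` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Compares the total row count with the number of distinct titles.
 * A large gap between the two hints that the same video is stored more than once.
 * Assumes a table named `videos` with a `title` column.
 */
function duplicateStats(PDO $pdo): array
{
    $row = $pdo->query(
        'SELECT COUNT(*) AS total, COUNT(DISTINCT title) AS unique_titles FROM videos'
    )->fetch(PDO::FETCH_ASSOC);

    return [
        'total' => (int) $row['total'],
        'unique_titles' => (int) $row['unique_titles'],
    ];
}
```

Titles are only a rough signal, since two different videos can share a title, but the gap is enough to show that something is wrong.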

Why did this happen? The same YouTube video arrived in different URL formats:

• Short links: youtu.be/ID
• Desktop links: youtube.com/watch?v=ID
• Mobile links: m.youtube.com/watch?v=ID
• Shorts: youtube.com/shorts/ID
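All four shapes carry the same ID, so one function can reduce them to it. This is a minimal sketch, not the production code: `extractVideoId` and `YOUTUBE_HOSTS` are hypothetical names, the `www.youtube.com` host is an added assumption, and so is the ID pattern (11 characters from letters, digits, `_` and `-`).

```php
<?php
declare(strict_types=1);

// Hosts are compared whole, never by suffix.
// Assumption: www.youtube.com is accepted in addition to the hosts listed above.
const YOUTUBE_HOSTS = ['youtu.be', 'youtube.com', 'www.youtube.com', 'm.youtube.com'];

/**
 * Returns the video ID for a recognized URL shape, or null otherwise.
 */
function extractVideoId(string $url): ?string
{
    // Accept scheme-less input such as "youtu.be/ID".
    if (preg_match('~^https?://~i', $url) !== 1) {
        $url = 'https://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Hostnames are case-insensitive, so lowercasing the host is safe.
    $host = strtolower($parts['host']);
    if (!in_array($host, YOUTUBE_HOSTS, true)) {
        return null;
    }

    $path = $parts['path'] ?? '';
    $candidate = null;

    if ($host === 'youtu.be') {
        // youtu.be/ID
        $candidate = trim($path, '/');
    } elseif ($path === '/watch') {
        // youtube.com/watch?v=ID and m.youtube.com/watch?v=ID
        parse_str($parts['query'] ?? '', $query);
        $candidate = $query['v'] ?? null;
    } elseif (str_starts_with($path, '/shorts/')) {
        // youtube.com/shorts/ID
        $candidate = rtrim(substr($path, strlen('/shorts/')), '/');
    }

    // The ID itself is never lowercased: case is significant.
    // Assumed ID shape: exactly 11 characters from [A-Za-z0-9_-].
    if (is_string($candidate) && preg_match('/^[A-Za-z0-9_-]{11}$/D', $candidate) === 1) {
        return $candidate;
    }

    return null;
}
```

Anything that does not match a known shape returns null, so the caller can skip the row instead of guessing.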

One video could therefore mean up to four database rows and four near-identical web pages. This wasted my crawl budget and triggered errors in Google Search Console.

I run TrendVidStream. We pull trending data from eight different regions. Every region can surface the same viral video using different URL shapes.

I built a pipeline to fix this using PHP 8.4 and SQLite.

Here is how the pipeline works:

  1. Extract: Turn any URL variant into a stable 11-character video ID.
  2. Normalize: Create one single canonical URL from that ID.
  3. Validate: Run quick checks before touching the database.
  4. Upsert: Use SQLite UPSERT to merge data instead of adding new rows.
  5. Emit: Add canonical tags and 301 redirects so search engines find one URL.
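The normalize, validate, and emit steps can be sketched together once an ID is in hand. Everything named here is an assumption for illustration: the `https://example.com` origin, the `/video/{id}` page path, the ID pattern, and the helper names `isValidVideoId`, `canonicalUrl`, `canonicalLinkTag`, and `redirectTarget`.

```php
<?php
declare(strict_types=1);

// Hypothetical site origin and page path; replace with the real ones.
const SITE_BASE = 'https://example.com';

/** Validate: a cheap format check that runs before any database work. */
function isValidVideoId(string $id): bool
{
    // Assumed ID shape: exactly 11 characters from [A-Za-z0-9_-].
    return preg_match('/^[A-Za-z0-9_-]{11}$/D', $id) === 1;
}

/** Normalize: every valid ID maps to exactly one page URL. */
function canonicalUrl(string $id): string
{
    if (!isValidVideoId($id)) {
        throw new InvalidArgumentException('Invalid video ID');
    }

    return SITE_BASE . '/video/' . $id;
}

/** Emit: the tag that tells search engines which URL is the canonical one. */
function canonicalLinkTag(string $id): string
{
    $href = htmlspecialchars(canonicalUrl($id), ENT_QUOTES);

    return '<link rel="canonical" href="' . $href . '">';
}

/**
 * Emit: returns the URL to 301-redirect to, or null when the requested
 * path is already the canonical path. Query strings are ignored here;
 * the canonical tag covers those variants.
 */
function redirectTarget(string $requestUri, string $id): ?string
{
    $canonical = canonicalUrl($id);
    $canonicalPath = parse_url($canonical, PHP_URL_PATH);
    $requestPath = parse_url($requestUri, PHP_URL_PATH);

    return $requestPath === $canonicalPath ? null : $canonical;
}
```

In a request handler, a non-null result from `redirectTarget` would be sent with `header('Location: ' . $target, true, 301);` followed by `exit;`.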

Key technical decisions:

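• Use an exact host allowlist. Do not use suffix matching: a naive suffix check for "youtube.com" also accepts a look-alike host such as evilyoutube.com.
• Never lowercase IDs. YouTube IDs are case-sensitive, so lowercasing them can merge different videos into one row.
• Use a UNIQUE constraint on the video_id. This is your strongest defense against duplicates, because the database rejects a second row even when application code slips.
• Use SQLite WAL mode. Readers and writers do not block each other, which keeps writes fast and reliable while pages are being served.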
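The two database decisions fit in a short sketch. The schema is hypothetical apart from the UNIQUE constraint on `video_id`, and `openDatabase` and `upsertVideo` are illustrative names. It assumes the `pdo_sqlite` extension and SQLite 3.24 or newer, which introduced the `ON CONFLICT ... DO UPDATE` syntax.

```php
<?php
declare(strict_types=1);

/**
 * Opens the database, switches to WAL, and creates the table if needed.
 */
function openDatabase(string $path): PDO
{
    $pdo = new PDO('sqlite:' . $path);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // WAL applies to file-backed databases; an in-memory database
    // keeps its own "memory" journal mode and ignores this request.
    $pdo->exec('PRAGMA journal_mode = WAL');

    // TEXT columns use BINARY collation by default, so the UNIQUE
    // constraint treats IDs that differ only in case as different videos.
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS videos (
            id INTEGER PRIMARY KEY,
            video_id TEXT NOT NULL UNIQUE,
            title TEXT NOT NULL
        )'
    );

    return $pdo;
}

/**
 * Inserts the video, or updates the existing row when video_id is already
 * present. Running it twice with the same arguments leaves the same state.
 */
function upsertVideo(PDO $pdo, string $videoId, string $title): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO videos (video_id, title)
         VALUES (:video_id, :title)
         ON CONFLICT(video_id) DO UPDATE SET title = excluded.title'
    );
    $stmt->execute([':video_id' => $videoId, ':title' => $title]);
}
```

Because the constraint lives in the schema, a second write for the same ID updates the existing row instead of adding one, whichever region or URL shape it arrived from.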

The result:

Our video table dropped from 41,283 rows to 28,094 rows. We lost zero videos. Google Search Console errors dropped from 412 to 9. Search results became cleaner because we stopped showing the same video four times.

The lesson is simple: when you aggregate data from many sources, build identity into your system first. Extract a stable ID, enforce it with database constraints, and make every write idempotent.

You do not need complex tools. PHP and SQLite are enough.

Source: https://dev.to/ahmet_gedik778845/building-a-video-url-canonicalization-pipeline-in-php-84-with-sqlite-32ne