Building a Video URL Canonicalization Pipeline in PHP 8.4 with SQLite

php dev.to

Three months ago I ran a sanity query against our production SQLite database and got an answer that made no sense: 41,283 rows in the videos table, but only about 29,000 distinct titles. We were not discovering 41,000 videos. We were discovering 29,000 videos and storing a third of them twice — sometimes four times — because the same YouTube video kept arriving under different URL shapes: youtu.be/dQw4w9WgXcQ from one regional feed, youtube.com/watch?v=dQw4w9WgXcQ&feature=share from another, m.youtube.com/watch?v=dQw4w9WgXcQ from a mobile-sourced list, and youtube.com/shorts/dQw4w9WgXcQ when the same upload trended as a Short. Four URLs, one video, four database rows, four near-identical public pages. Google Search Console was flagging hundreds of them as "Duplicate without user-selected canonical," which is GSC's polite way of telling you your crawl budget is on fire.

I run TrendVidStream, a global video streaming discovery site that ingests trending data from eight regions on a staggered cron schedule. Every region is a separate fetch pass, every pass can surface the same globally viral video, and every upstream source formats its URLs slightly differently. This article walks through the canonicalization pipeline I built to fix the mess: how we extract a stable video ID from any URL variant, how we collapse duplicates at write time with SQLite UPSERTs, how we backfilled 41,000 existing rows without downtime, and the edge cases that only showed up in production. The stack is PHP 8.4 and SQLite with FTS5, deployed over plain FTP to shared LiteSpeed hosting. No Kubernetes, no Redis, no excuses.

Why the Same Video Arrives Under Six Different URLs

If you only ever consume the official Data API, you mostly get clean video IDs. The moment you ingest from more than one source type — trending feeds, RSS, sitemaps, scraped "related" lists, user submissions — you inherit the full zoo of YouTube URL formats:

  • https://www.youtube.com/watch?v=<id> — the classic desktop form, usually dragging &feature=, &si=, or &pp= tracking params behind it
  • https://youtu.be/<id> — the share-button short link, often with ?si=<share-token> appended
  • https://www.youtube.com/shorts/<id> — same 11-character ID, different path shape
  • https://www.youtube.com/embed/<id> and https://www.youtube-nocookie.com/embed/<id> — harvested from embedded players on third-party pages
  • https://m.youtube.com/watch?v=<id> and https://music.youtube.com/watch?v=<id> — host variants
  • https://www.youtube.com/live/<id> — live streams, which keep the same ID after the broadcast ends
  • https://www.youtube.com/attribution_link?u=%2Fwatch%3Fv%3D<id> — the truly cursed one, with a URL-encoded URL nested inside a query parameter

All of these point at the same underlying resource. The 11-character ID is the only stable identity. On a single-region site you might get away with naive string matching because one source tends to be internally consistent. On a multi-region pipeline you will not: our eight regional fetch passes run at different hours, hit different feed endpoints, and routinely surface the same viral video in five or six regions within the same day. Without canonicalization, every region writes its own row.

The Architecture in One Paragraph

The pipeline is five pure stages: extract (turn any URL variant into an 11-character video ID or reject it), normalize (build exactly one canonical source URL and one canonical internal path from that ID), validate (cheap syntactic checks before anything touches the database), upsert (one row per video ID, enforced by a UNIQUE constraint and an ON CONFLICT clause that merges region metadata instead of duplicating), and emit (canonical <link> tags plus 301 redirects so search engines converge on one URL per video). Each stage is a function with no side effects except the upsert, which makes the whole thing trivially testable and — more importantly for a cron-driven system — idempotent. Running the same regional fetch twice produces zero new rows.

Stage 1 Extracting the Video ID Without Regex Soup

My first version was a single 200-character regex. It worked until it didn't: it matched IDs out of playlist URLs that had no video, and it happily extracted IDs from evil.example.com/watch?v=... because it never checked the host. The rewrite uses parse_url() and an explicit host allowlist, which is both faster and auditable:

<?php

declare(strict_types=1);

final class VideoUrlCanonicalizer
{
    private const VALID_HOSTS = [
        'youtube.com', 'www.youtube.com', 'm.youtube.com',
        'music.youtube.com', 'youtu.be',
        'youtube-nocookie.com', 'www.youtube-nocookie.com',
    ];

    private const PATH_PREFIXES = ['/shorts/', '/embed/', '/live/', '/v/'];

    public function extractId(string $url): ?string
    {
        $url = trim($url);
        if ($url === '') {
            return null;
        }
        if (!str_contains($url, '://')) {
            $url = 'https://' . $url; // feeds love scheme-less URLs
        }

        $parts = parse_url($url);
        if ($parts === false || !isset($parts['host'])) {
            return null;
        }

        $host = strtolower($parts['host']);
        if (!in_array($host, self::VALID_HOSTS, true)) {
            return null; // unknown host: reject, never guess
        }

        $path = $parts['path'] ?? '/';

        if ($host === 'youtu.be') {
            return $this->validateId(explode('/', ltrim($path, '/'))[0]);
        }

        if ($path === '/watch' && isset($parts['query'])) {
            parse_str($parts['query'], $query);
            return $this->validateId((string)($query['v'] ?? ''));
        }

        foreach (self::PATH_PREFIXES as $prefix) {
            if (str_starts_with($path, $prefix)) {
                $segment = explode('/', substr($path, strlen($prefix)))[0];
                return $this->validateId($segment);
            }
        }

        // attribution_link?u=%2Fwatch%3Fv%3D<id> — recurse once on the inner URL
        if ($path === '/attribution_link' && isset($parts['query'])) {
            parse_str($parts['query'], $query);
            if (isset($query['u']) && str_starts_with((string)$query['u'], '/')) {
                return $this->extractId('https://www.youtube.com' . $query['u']);
            }
        }

        return null;
    }

    private function validateId(string $id): ?string
    {
        return preg_match('/^[A-Za-z0-9_-]{11}$/', $id) === 1 ? $id : null;
    }
}
Enter fullscreen mode Exit fullscreen mode

A few deliberate decisions in there. The host allowlist is exact-match, not str_ends_with(), because suffix matching is how you end up accepting notyoutube.com. The attribution_link branch recurses exactly once and only when the inner value starts with /, so a malicious u=https://attacker.example can never smuggle a foreign host through. And the ID validation is a single anchored character-class regex — the one place a regex genuinely belongs. Note what is not in validateId(): no strtolower(). YouTube IDs are case-sensitive base64url. Lowercasing them merges unrelated videos, and yes, I learned that from a bug report.

Stage 2 Deciding What the Canonical URL Actually Is

Extraction gives you identity. Canonicalization is a policy decision on top of it, and you need two separate answers:

  • Canonical source URL — what we store as the upstream reference: always https://www.youtube.com/watch?v=<id>. Shorts, embeds, mobile hosts all collapse into this one form.
  • Canonical internal URL — what we serve and what search engines index: /watch/<id>-<slug> on our own domain, where the slug is a transliterated, hyphenated title.

The controversial part is what to throw away. We drop everything that is not the ID: t= timestamps, list= and index= playlist context, si= share tokens, every utm_* and feature= param. Timestamps were the only debatable one — a deep link to 4:32 of a video is arguably distinct content — but for a discovery site the page is the video, not the moment, so keeping t= would just mint duplicate pages with offset players. If your product is clip-level (chapters, highlights), store the offset in a separate column instead of letting it leak into URL identity.

Slugs deserve one warning: they are presentation, not identity. Titles change after upload — clickbait A/B testing is rampant in trending content — so our router matches only on the 11-character ID prefix of the path and 301-redirects to the current slug if the slug portion is stale or missing. The slug can rot freely; the URL still resolves and link equity consolidates onto one address.

Stage 3 One Row Per Video with SQLite UPSERT

The database schema enforces what the code promises. The UNIQUE constraint on video_id is the actual guarantee — application-level checks are a race condition with eight cron jobs writing on staggered schedules:

CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    video_id TEXT NOT NULL UNIQUE,
    canonical_url TEXT NOT NULL,
    title TEXT NOT NULL,
    slug TEXT NOT NULL,
    region_first_seen TEXT NOT NULL,
    regions_seen TEXT NOT NULL DEFAULT '[]',  -- JSON array
    fetch_count INTEGER NOT NULL DEFAULT 1,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL
);

CREATE VIRTUAL TABLE videos_fts USING fts5(
    title, description,
    content='videos', content_rowid='id'
);
Enter fullscreen mode Exit fullscreen mode

The FTS5 table is an external-content index over videos, kept in sync by triggers (omitted here for length). This matters for dedup more than you would expect: before canonicalization, our FTS index carried every duplicate row, so a search for a viral title returned the same video four times with four different URLs. Collapsing the base table shrank the FTS index by roughly 30% and fixed result quality for free.

The ingest write is a single UPSERT. When a video arrives from a region we have not seen it in before, we append that region to the JSON array; when it is a repeat, we just bump the freshness columns:

$pdo->exec('PRAGMA journal_mode=WAL');

$stmt = $pdo->prepare(<<<'SQL'
INSERT INTO videos
    (video_id, canonical_url, title, slug,
     region_first_seen, regions_seen, first_seen_at, last_seen_at)
VALUES
    (:vid, :url, :title, :slug,
     :region, json_array(:region2), :now, :now2)
ON CONFLICT(video_id) DO UPDATE SET
    last_seen_at = excluded.last_seen_at,
    fetch_count  = fetch_count + 1,
    title        = excluded.title,
    regions_seen = CASE
        WHEN EXISTS (
            SELECT 1 FROM json_each(videos.regions_seen)
            WHERE value = json_extract(excluded.regions_seen, '$[0]')
        )
        THEN videos.regions_seen
        ELSE json_insert(videos.regions_seen, '$[#]',
                         json_extract(excluded.regions_seen, '$[0]'))
    END
SQL);

$pdo->beginTransaction();
foreach ($fetchedVideos as $v) {
    $id = $canonicalizer->extractId($v['url']);
    if ($id === null) {
        $rejected++;
        continue; // log it, never store it
    }
    $stmt->execute([
        ':vid'     => $id,
        ':url'     => 'https://www.youtube.com/watch?v=' . $id,
        ':title'   => $v['title'],
        ':slug'    => slugify($v['title']),
        ':region'  => $region,
        ':region2' => $region,
        ':now'     => $nowUtc,
        ':now2'    => $nowUtc,
    ]);
}
$pdo->commit();
Enter fullscreen mode Exit fullscreen mode

The trick worth stealing is reading the incoming region back out of excluded.regions_seen with json_extract(..., '$[0]') inside the conflict clause — it keeps the whole merge in one statement with no read-modify-write round trip, so two regional crons colliding on the same video cannot lose an update. Everything runs inside one transaction per regional batch (typically 50–200 videos), which on SQLite in WAL mode is the difference between a 40-second pass and a sub-second one. We deliberately let title = excluded.title overwrite on conflict, because the freshest feed usually has the current title after the uploader's inevitable retitle.

Stage 4 Serving Exactly One Page Per Video

Storage dedup fixes your database; it does not fix what Google already crawled. Three things close the loop on the serving side:

  1. Every watch page emits <link rel="canonical"> pointing at its own canonical path — absolute URL, current slug, no query string.
  2. The router 301-redirects any non-canonical variant: stale slug, missing slug, uppercase path, trailing junk params.
  3. The sitemap generator reads from the deduplicated table, so it physically cannot advertise two URLs for one video.

The router-side enforcement is small enough to show whole:

// Route: /watch/{idAndSlug} — match on ID, enforce slug by redirect
$videoId = substr($idAndSlug, 0, 11);
$video = $repo->findByVideoId($videoId); // 404s if unknown

$canonicalPath = '/watch/' . $video['video_id'] . '-' . $video['slug'];
if (parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH) !== $canonicalPath) {
    header('Location: ' . $canonicalPath, true, 301);
    exit;
}
Enter fullscreen mode Exit fullscreen mode

One subtlety on a cached stack: we run a three-layer cache (LiteSpeed page cache, then a PHP file-based page cache, then a data cache), and redirects must be decided before the page cache lookup. Early on, a non-canonical URL got cached as a full 200 page, and that cached duplicate kept resurfacing in GSC for two weeks after the fix shipped. The redirect check now lives in the front controller above all cache reads, and every regional cron pass finishes by clearing the page cache so stale duplicates cannot outlive the data that spawned them.

Backfilling 41,000 Rows Without Downtime

New writes were now clean, but the table still held ~12,000 historical duplicates. Before touching anything I wrote a read-only audit script — in Python, because for one-shot data forensics I want a REPL and collections rather than another PHP CLI entry point:

#!/usr/bin/env python3
"""Audit: group video rows by canonical ID, report duplicates."""
import re
import sqlite3
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

VALID_ID = re.compile(r'^[A-Za-z0-9_-]{11}$')
PREFIXES = ('shorts', 'embed', 'live', 'v')

def extract_id(url: str) -> str | None:
    p = urlparse(url if '://' in url else f'https://{url}')
    host = (p.hostname or '').lower()
    cand = ''
    if host == 'youtu.be':
        cand = p.path.strip('/').split('/')[0]
    elif host in ('youtube.com', 'www.youtube.com', 'm.youtube.com',
                  'music.youtube.com', 'www.youtube-nocookie.com'):
        if p.path == '/watch':
            cand = parse_qs(p.query).get('v', [''])[0]
        else:
            seg = p.path.strip('/').split('/')
            if len(seg) > 1 and seg[0] in PREFIXES:
                cand = seg[1]
    return cand if VALID_ID.match(cand) else None

conn = sqlite3.connect('data/videos.db')
groups = defaultdict(list)
for row_id, url, seen in conn.execute(
        'SELECT id, source_url, first_seen_at FROM videos'):
    vid = extract_id(url)
    groups[vid if vid else f'__bad_{row_id}'].append((row_id, seen))

dupes = {k: sorted(v, key=lambda r: r[1])
         for k, v in groups.items()
         if len(v) > 1 and not k.startswith('__bad_')}
bad = sum(1 for k in groups if k.startswith('__bad_'))
print(f'{len(dupes)} video IDs with duplicates, '
      f'{sum(len(v) - 1 for v in dupes.values())} rows to merge, '
      f'{bad} unparseable rows')
Enter fullscreen mode Exit fullscreen mode

The audit found 11,904 mergeable rows and 283 rows whose URLs could not be parsed at all (truncated feed data, mostly 10-character IDs from a source that clipped the last character — those got deleted, since a 10-character ID is unrecoverable). The merge itself followed three rules: keep the oldest row (it carries the original first_seen_at, which feeds our trending-age logic), union the regions_seen arrays and sum fetch_count into the survivor, and write every deleted row's old internal path into a redirect_map table so historical inbound links 301 to the survivor instead of 404ing. The whole merge ran as one transaction per 500 groups, followed by an FTS5 rebuild and PRAGMA optimize.

Shipping It Over FTP

Our deploys are FTP-based — shared hosting, no SSH, no migration runner. That constraint shapes the rollout order more than any framework choice would. The sequence that works: first deploy code that tolerates both states (the canonicalizer plus a router that serves old duplicate URLs if the redirect_map table does not exist yet), then trigger the migration through an authenticated task route hit by the cron user, then deploy the cleanup that removes the compatibility branch. Because eight regional cron jobs fire on staggered offsets around the clock, there is no quiet window — the idempotent UPSERT is what makes a mid-deploy cron pass harmless rather than corrupting. If your deploy mechanism cannot atomically swap files, idempotent writes are not an optimization, they are a survival requirement.

Edge Cases That Actually Happened

Everything above sounds tidy. These are the production incidents behind it:

  • Case-sensitive IDs. An early "normalization" step lowercased everything. dQw4w9WgXcQ and dqw4w9wgxcq are different videos (or one real and one nonexistent). Two days of confused 404s.
  • Double-encoded attribution links. Some scraped sources URL-encode the already-encoded u= parameter. We decode at most once and reject if the result still contains %3F. Decoding in a loop "until clean" is an invitation for crafted input.
  • Playlist URLs with no v=. youtube.com/playlist?list=... extracts no ID and must be rejected, not stored with an empty ID. Our UNIQUE constraint turned the first occurrence of this bug into a hard error instead of silent garbage — constraints are cheap monitoring.
  • Live URLs outliving the stream. /live/<id> keeps resolving after the broadcast ends and the same ID later shows up as /watch?v=<id> in feeds. Canonicalizing on ID merged them automatically; a URL-string dedup never would have.
  • Shorts trending in multiple regions under both shapes. The same Short arrived as /shorts/<id> from two Asian regional feeds and watch?v=<id> from a European one within six hours. This single pattern accounted for almost a quarter of our duplicates.
  • Third-party shorteners. bit.ly and friends wrapping YouTube links. We reject them outright rather than following redirects from untrusted feed data inside a cron job. If a video matters, it reappears as a direct link within a fetch cycle or two.

Results

Three weeks after rollout: the videos table went from 41,283 rows to 28,094 with zero lost videos; GSC duplicate-canonical warnings dropped from 412 affected URLs to 9 (all crawl-lag stragglers that cleared on their own); the FTS5 index shrank ~30% and search results stopped showing the same viral video four times; and indexed-page coverage actually rose, because crawl budget that had been burned on duplicates moved to real pages.

The broader lesson is boring and durable: when you aggregate content from many sources — and especially from many regions, where the same item arrives repeatedly in different costumes — identity is the first thing to engineer, not an afterthought. Extract a stable ID, enforce it with a database constraint rather than application discipline, make every write idempotent, and let 301s plus canonical tags clean up history. None of this needed anything heavier than PHP, SQLite, and a cron schedule. The pipeline has now survived three months of eight-region ingest without creating a single duplicate row, which is exactly the kind of metric nobody celebrates and everybody depends on.

Source: dev.to

arrow_back Back to Tutorials