Building a Video URL Canonicalization Pipeline for a Discovery Platform
php
dev.to
A single YouTube video can reach our crawler under any of a dozen different URLs: https://www.youtube.com/watch?v=dQw4w9WgXcQ, https://youtu.be/dQw4w9WgXcQ?t=43, https://m.youtube.com/watch?v=dQw4w9WgXcQ&feature=share&utm_source=newsletter, the /shorts/ variant, the embed iframe src, and the consent-redirect wrapper that Google bounces EU traffic through. All of them point at the same 3 minutes 33 seconds of video. If you treat those as distinct rows, you end up with six near-duplicate cards on a discover