How to Detect Duplicate Content with Python Using Fuzzy String Matching


Struggling with duplicate content issues on your site? I wrote a small Python script that checks for near-duplicates using fuzzy string matching. It's helped me catch pages that were too similar and needed merging.

python
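from difflib import SequenceMatcher

import requests
from bs4 import BeautifulSoup


def fetch_text(url):
    """Download a page and return its text without scripts or styles."""
    # The 10-second timeout is an arbitrary choice; it keeps the script
    # from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    # Fail on 4xx/5xx responses instead of comparing error pages
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Remove script and style elements
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text()


def similarity_ratio(text1, text2):
    """Return a similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, text1, text2).ratio()


url1 = 'https://example.com/page1'
url2 = 'https://example.com/page2'
text1 = fetch_text(url1)
text2 = fetch_text(url2)
ratio = similarity_ratio(text1, text2)
print(f'Similarity: {ratio:.2%}')
if ratio > 0.8:
    print('These pages might need attention!')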
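
One thing to watch: `get_text()` keeps the newlines and indentation that come from the HTML, so two pages with the same wording but different markup can score lower than expected. Also, `SequenceMatcher` has an `autojunk` heuristic that treats very frequent items as junk once a sequence reaches 200 elements, which can skew character-level ratios on long pages. Here is a minimal sketch of a cleanup step, assuming that lowercasing and collapsing whitespace is acceptable for your content. The names `normalize` and `normalized_similarity` are hypothetical helpers, not part of the script above.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Collapse runs of whitespace (newlines, tabs, spaces) into single
    # spaces and lowercase, so layout differences don't affect the ratio
    return re.sub(r'\s+', ' ', text).strip().lower()


def normalized_similarity(text1, text2):
    # autojunk=False disables difflib's "popular item" heuristic;
    # this is slower on long texts but avoids skewed ratios
    matcher = SequenceMatcher(
        None, normalize(text1), normalize(text2), autojunk=False
    )
    return matcher.ratio()


# Same wording, different whitespace and casing
a = 'Fuzzy   matching\nfinds near-duplicates.'
b = 'fuzzy matching finds near-duplicates.'
print(normalized_similarity(a, b))  # 1.0
```

Turning off `autojunk` makes the comparison slower on long pages, so treat it as a trade-off rather than a default.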

This is a quick check, but for larger sites, I'd recommend using a dedicated tool like SERPSpur's content analysis feature to pinpoint exact duplicates. What's your strategy for handling duplicate content?
https://serpspur.com/
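
For a handful of pages, the same ratio can be run over every pair before reaching for a bigger tool. Below is a rough sketch, assuming the page texts have already been fetched into a dict keyed by URL. The `pages` dict, its sample strings, and the `find_near_duplicates` name are all made up for illustration. The number of comparisons grows quadratically with the number of pages, so this only suits small sets.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(pages, threshold=0.8):
    """Return (url_a, url_b, ratio) for every pair above the threshold."""
    matches = []
    # combinations() yields each unordered pair of pages exactly once
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio > threshold:
            matches.append((url_a, url_b, ratio))
    # Most similar pairs first
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical sample data standing in for fetched page text
pages = {
    'https://example.com/page1': 'Our guide to fuzzy string matching in Python.',
    'https://example.com/page2': 'Our guide to fuzzy string matching with Python.',
    'https://example.com/page3': 'Contact us for pricing and support options.',
}

for url_a, url_b, ratio in find_near_duplicates(pages):
    print(f'{url_a} <-> {url_b}: {ratio:.2%}')
```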

Source: dev.to
