Struggling with duplicate content issues on your site? I wrote a small Python script that checks two pages for near-duplicate text using fuzzy string matching (difflib's SequenceMatcher). It's helped me catch pages that were too similar and needed merging.
```python
from difflib import SequenceMatcher

import requests
from bs4 import BeautifulSoup


def fetch_text(url):
    """Download a page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # don't compare error pages by accident
    soup = BeautifulSoup(response.text, 'html.parser')
    # Remove script and style elements
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Collapse whitespace so markup layout doesn't skew the comparison
    return ' '.join(soup.get_text(separator=' ').split())


def similarity_ratio(text1, text2):
    """Return a similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, text1, text2).ratio()


url1 = 'https://example.com/page1'
url2 = 'https://example.com/page2'
text1 = fetch_text(url1)
text2 = fetch_text(url2)
ratio = similarity_ratio(text1, text2)
print(f'Similarity: {ratio:.2%}')
if ratio > 0.8:
    print('These pages might need attention!')
```
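If you want to run the same idea across more than two pages, here's a rough sketch of a pairwise version. It assumes you've already extracted the text for each page; the `normalize` and `find_near_duplicates` helpers and the sample page labels are made up for illustration, and the 0.8 threshold is just the one from the script, not a tuned value.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't skew the ratio."""
    return " ".join(text.lower().split())


def find_near_duplicates(pages, threshold=0.8):
    """Return (label1, label2, ratio) for every pair scoring above threshold.

    `pages` maps a label (for example a URL) to already-extracted page text.
    """
    cleaned = {label: normalize(text) for label, text in pages.items()}
    matches = []
    for (label1, text1), (label2, text2) in combinations(cleaned.items(), 2):
        ratio = SequenceMatcher(None, text1, text2).ratio()
        if ratio > threshold:
            matches.append((label1, label2, ratio))
    # Most similar pairs first
    return sorted(matches, key=lambda match: match[2], reverse=True)


# Hypothetical sample data standing in for fetched page text
pages = {
    "/red-widgets": "Buy red widgets online. Free shipping on all orders.",
    "/red-widgets-2": "Buy  red widgets online.\nFree shipping on all orders!",
    "/about": "We are a small family business founded in 1998.",
}

for label1, label2, ratio in find_near_duplicates(pages):
    print(f"{label1} vs {label2}: {ratio:.2%}")
```

Keep in mind that this compares every page against every other page, so n pages means n(n-1)/2 comparisons. That's fine for a handful of URLs but gets slow quickly as the list grows.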
A script like this is fine for a quick check on a pair of pages, but for larger sites I'd recommend a dedicated tool like SERPSpur's content analysis feature to pinpoint exact duplicates. What's your strategy for handling duplicate content?
https://serpspur.com/