Detecting Content Dilution: Using Python for Thin and Duplicate Content Analysis
Thin and duplicate content (material that provides little unique value or is copied from existing sources) is a serious concern for SEO and content integrity. Manual checking is impractical at scale, but Python's text-processing libraries can automate much of the detection. This guide walks through a three-phase approach, using Python to analyze content saved in Markdown format.
📝 Prerequisites and Setup
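Before starting, ensure you have the necessary libraries installed. We will use scikit-learn for vectorization and similarity scoring, pandas for tabulating the results, and the standard library (`re`) for text cleaning. The code in this guide does not call nltk; install it only if you want to swap in its tokenizers.
```bash
pip install scikit-learn pandas
```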
The Workflow Overview
Our detection process will be three-fold:
- Cleaning: Converting Markdown syntax into clean, plain text.
- Duplication: Using cosine similarity on TF-IDF vectors to detect lexical overlap (near-duplicates).
- Thinness: Implementing simple metrics (word count, sentence count) to assess content depth.
🧹 Phase 1: Markdown Cleaning and Preprocessing
Markdown includes syntax elements (#, *, [text](link)) that must be stripped away to analyze the pure textual content.
```python
import re


def clean_markdown_to_text(markdown_content: str) -> str:
    """Strips Markdown syntax and performs basic text cleaning."""
    # 1. Remove header markers (e.g., "# Heading 1", "## Heading 2")
    text = re.sub(r'^#+ ', '', markdown_content, flags=re.MULTILINE)
    # 2. Remove links and images (e.g., [text](url) or ![alt](src))
    text = re.sub(r'!?\[.*?\]\(.*?\)', '', text)
    # 3. Remove list markers at the start of a line, then emphasis/code markers
    text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'[*_`]', '', text)
    # 4. Normalize whitespace (note: this also collapses paragraph breaks)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Example Usage:
markdown_example = """
# My Amazing Article
This is the main content. It talks about Python programming and how it is fun.
## Subtopic
We can detect things using Python tools like NLTK. It is essential stuff.
- Bullet point 1
- Bullet point 2
"""
cleaned_text = clean_markdown_to_text(markdown_example)
print(cleaned_text)
```
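Since the guide targets content saved as Markdown files, a small loader is a natural companion to the cleaning step. The sketch below is one way to do it with the standard library only; the function name `load_markdown_files`, the `*.md` pattern, and the UTF-8 encoding are assumptions rather than part of the original workflow.

```python
from pathlib import Path


def load_markdown_files(root: str, pattern: str = "*.md") -> dict[str, str]:
    """Return {relative path: raw Markdown} for every matching file under root.

    Hypothetical helper: the name, glob pattern, and encoding are assumptions.
    """
    base = Path(root)
    documents = {}
    # rglob walks subdirectories; sorted() keeps the order stable between runs
    for path in sorted(base.rglob(pattern)):
        if path.is_file():
            key = path.relative_to(base).as_posix()
            documents[key] = path.read_text(encoding="utf-8")
    return documents
```

The dictionary keys can double as document IDs, and each value can be passed to `clean_markdown_to_text` before analysis.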
🧬 Phase 2: Detecting Near-Duplicate Content (Semantic Similarity)
Exact duplication is easy to catch, but lightly reworded copies require measuring how much vocabulary two documents share. We use TF-IDF (Term Frequency-Inverse Document Frequency) combined with cosine similarity. Note that this captures lexical overlap, not meaning: a thorough paraphrase that uses different wording will score low.
- TF-IDF: Weights words by how important they are to a specific document relative to the entire corpus.
- Cosine Similarity: Measures the cosine of the angle between two document vectors. A value close to 1 means high similarity; 0 means the documents share no weighted terms.
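To make the two definitions above concrete, here is a minimal sketch that computes cosine similarity by hand on raw word counts. It deliberately skips the IDF weighting and stop-word removal that scikit-learn applies, so its scores will not match `TfidfVectorizer` output; the helper name `cosine_from_counts` is illustrative.

```python
import math
from collections import Counter


def cosine_from_counts(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using plain lowercase word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # The dot product only needs the words both texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_from_counts("python handles big data", "python handles big data"))    # 1.0
print(cosine_from_counts("python handles big data", "tomatoes need sunlight"))     # 0.0
print(cosine_from_counts("python handles big data", "python handles small data"))  # 0.75
```

In the third case three of the four words match, and each vector has length 2, giving 3 / (2 × 2) = 0.75.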
Implementation Steps
Assume you have a list of articles, each stored as clean text.
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def detect_duplicates(texts: list[str], similarity_threshold: float = 0.8) -> pd.DataFrame:
    """
    Analyzes a list of clean texts to find pairs exceeding the similarity threshold.
    """
    columns = ["Article A Index", "Article B Index", "Similarity Score", "Overlap"]
    if len(texts) < 2:
        # Nothing to compare; return an empty frame so the return type stays consistent
        return pd.DataFrame(columns=columns)

    # 1. Initialize the TfidfVectorizer
    # We use ngram_range=(1, 2) to consider not just single words, but pairs of words (bigrams),
    # which helps capture context ("great machine" vs "great" and "machine").
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)

    # 2. Calculate the cosine similarity matrix
    cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # 3. Find the highly similar pairs (upper triangle only, so each pair appears once)
    duplicates = []
    num_docs = len(texts)
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            similarity = cosine_sim_matrix[i, j]
            if similarity >= similarity_threshold:
                duplicates.append({
                    "Article A Index": i,
                    "Article B Index": j,
                    "Similarity Score": round(float(similarity), 4),
                    "Overlap": "Potential Duplicate/Near-Duplicate Content",
                })
    return pd.DataFrame(duplicates, columns=columns)


# --- Example Usage ---
article_a = "Python is a powerful language, perfect for building web applications. It handles big data and machine learning tasks easily."
article_b = "Web development using Python is very powerful. It handles both big data and machine learning tasks with ease."  # High overlap
article_c = "Gardening is a peaceful hobby. Planting tomatoes requires sunlight and patience. Use good soil."  # Low overlap
corpus = [article_a, article_b, article_c]

# One-sentence samples with bigram features score well below the 0.8 default,
# so this demo uses a lower threshold to surface the A/B pair.
duplicate_df = detect_duplicates(corpus, similarity_threshold=0.4)
print("\n--- Duplication Analysis Results ---")
print(duplicate_df)
```
💡 Tuning Tip: The `similarity_threshold` is critical. A score of 0.8 or higher is usually a strong indicator of significant overlap. Adjust this based on the tolerance level of your content policy.
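One way to pick a threshold is to check how many pairs each candidate value would flag before committing to one. The sketch below assumes you already have the pairwise scores as plain floats (for example, the upper triangle of the similarity matrix); the function name and the sample scores are illustrative, not output from a real corpus.

```python
def count_flagged_pairs(scores: list[float], thresholds: list[float]) -> dict[float, int]:
    """For each candidate threshold, count the scores at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Hypothetical pairwise scores, not taken from a real corpus
sample_scores = [0.12, 0.47, 0.81, 0.93, 0.05, 0.78]
for threshold, flagged in count_flagged_pairs(sample_scores, [0.7, 0.8, 0.9]).items():
    print(f"threshold={threshold:.1f} -> {flagged} pair(s) flagged")
```

If the count barely changes between two adjacent thresholds, the choice between them matters little for that corpus; a sharp jump is worth a manual look at the pairs in between.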
📉 Phase 3: Detecting Thin Content
Thin content refers to pages that lack sufficient unique depth or value. Because raw Markdown syntax can inflate a naive word count, we run these checks on the cleaned text and apply a short sequence of threshold checks.
Core Metrics:
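- Minimum Word Count: A hard floor (e.g., flag anything under 150 words).
- Paragraph Density: Too few paragraphs for the word count suggests fluff or poor structure. Because the cleaning step collapses line breaks, the function below approximates this with a count of sentence-ending punctuation.
- Uniqueness Score: How the document compares with the corpus average; a score far below the rest may indicate overly general text. The function below does not compute this metric.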
```python
import re
from typing import Tuple


def analyze_thinness(
    clean_text: str,
    doc_id: str,
    min_words: int = 150,
    min_units: int = 5,
) -> Tuple[bool, str]:
    """
    Analyzes a document for thin content based on word count and sentence structure.
    Returns (is_thin, report).
    """
    # Simple word count (whitespace-separated tokens)
    word_count = len(clean_text.split())

    # Cleaned text has no paragraph breaks left, so count runs of
    # sentence-ending punctuation as a rough proxy for structural units (heuristic)
    unit_count = len(re.findall(r'[.!?]+(?=\s|$)', clean_text))

    if word_count < min_words:
        return True, f"[{doc_id}] FAIL: Word Count ({word_count} words). Below the minimum threshold of {min_words}."
    if unit_count < min_units:
        return True, f"[{doc_id}] WARNING: Low Paragraph Density ({unit_count} structural units). Content feels shallow."
    # If it passes basic checks
    return False, f"[{doc_id}] PASS: Meets minimum structural and word count requirements."


# --- Example Usage ---
# 1. Thin Content Example (Too short)
thin_article_text = "This is a very short piece. It talks about cats. Cats are fluffy. They nap a lot."
is_thin_1, report_1 = analyze_thinness(thin_article_text, "Article A")

# 2. Robust Content Example
# This sample is still well under 150 words, so the demo lowers min_words;
# with the default threshold it would also be reported as thin.
robust_article_text = "Content must be deep and comprehensive. It requires multiple perspectives and extensive research. For instance, writing about Python's use cases in machine learning involves discussing TensorFlow and PyTorch. A thorough discussion ensures the reader gains maximum value, establishing authority and trust. Remember to use strong paragraph breaks and transition phrases to guide the reader logically through complex ideas. This structured approach enhances readability significantly."
is_thin_2, report_2 = analyze_thinness(robust_article_text, "Article B", min_words=50)

print("\n--- Thin Content Analysis ---")
print(f"Thinness Check: {report_1}")
print(f"Thinness Check: {report_2}")
```
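The uniqueness score from the metrics list is not computed by `analyze_thinness`, and its exact definition is left open. One possible reading, sketched below, compares each document's type-token ratio (distinct words divided by total words) with the corpus average. Both function names and this interpretation are assumptions.

```python
def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for empty text."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def relative_uniqueness(texts: list[str]) -> list[float]:
    """Each document's type-token ratio divided by the corpus mean ratio."""
    ratios = [type_token_ratio(t) for t in texts]
    mean_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    if mean_ratio == 0:
        return [0.0 for _ in ratios]
    return [r / mean_ratio for r in ratios]


docs = [
    "cats nap cats nap cats nap",           # repetitive: 2 distinct words out of 6
    "planting tomatoes requires sunlight",  # every word distinct: 4 out of 4
]
print(relative_uniqueness(docs))
```

Values well below 1.0 mark documents that repeat themselves more than the corpus norm. Type-token ratio tends to fall as texts get longer, so this comparison is most meaningful between documents of similar length.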
🛠️ Summary and Integration
By combining these three functions, you can build a robust auditing pipeline:
- Clean all incoming Markdown files.
- Pass the clean texts through `detect_duplicates` to flag areas of structural overlap.
- Pass the clean texts through `analyze_thinness` to ensure minimum depth.
A piece of content flagged by both the duplicate detector and the thinness analyzer represents high-risk, low-value material that requires immediate human review and overhaul.
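That final triage step, finding documents flagged by both checks, can be sketched as a set intersection. The input shapes here are assumptions: duplicate pairs as `(index_a, index_b)` tuples and thin documents as a set of indices. Adapt them to whatever your pipeline actually returns.

```python
def high_risk_documents(
    duplicate_pairs: list[tuple[int, int]],
    thin_indices: set[int],
) -> list[int]:
    """Return sorted indices of documents flagged as both duplicate and thin."""
    # A document counts as duplicate-flagged if it appears in any pair
    duplicate_indices = {idx for pair in duplicate_pairs for idx in pair}
    return sorted(duplicate_indices & thin_indices)


# Hypothetical results from the two checks
pairs = [(0, 1), (3, 4)]
thin = {1, 2, 4}
print(high_risk_documents(pairs, thin))  # [1, 4]
```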