Tracking Core Web Vitals Over Time with Python
Core Web Vitals (CWVs) are not just buzzwords; they are measurable metrics that reflect the real-world user experience of a webpage. Google uses them as one signal in search ranking. This guide tracks three of them: Largest Contentful Paint (LCP), First Input Delay (FID, which Google replaced with Interaction to Next Paint as a Core Web Vital in March 2024), and Cumulative Layout Shift (CLS).
However, simply tracking a snapshot of CWVs is insufficient. To truly improve performance, you need historical data: how did LCP change month-over-month? Did the CLS improvement from last quarter hold up? Python provides a powerful, programmatic way to scrape, store, process, and visualize this crucial time-series performance data.
This guide details the workflow for using Python to track Core Web Vitals over time, moving from basic data collection to actionable analysis.
⚙️ The Core Workflow
Tracking CWVs over time generally involves four distinct phases:
- Data Collection: Acquiring the raw CWV data for a specific URL at a specific point in time.
- Data Storage: Storing the collected data into a structured, time-series database.
- Data Processing: Cleaning, normalizing, and calculating trends from the stored data.
- Data Visualization: Presenting the historical trends in an understandable format.
🐍 Phase 1: Data Collection (The Scraper)
CWVs are measured inside the browser while a page loads and renders, so collecting them yourself means loading the page in a real browser engine.
Choosing Your Tool
While you could try direct API calls (like those from Google Search Console, which are often rate-limited or lack historical granular data), a more universal approach is to use a headless browser framework.
Recommended Library: Selenium or Playwright. Playwright is often preferred for its speed and modern API.
Example: Basic Scraper Setup (Conceptual)
A basic script would initialize the browser, navigate to the target URL, wait for key elements to load, and then capture performance metrics.
```python
# Note: the right implementation depends on your metrics source
# (Lighthouse, an API, or in-page performance observers as used here).
from datetime import datetime, timezone

from playwright.sync_api import sync_playwright

# JavaScript run inside the page. It reads buffered LCP and layout-shift
# entries. This is a lab approximation, not Google's field measurement.
COLLECT_JS = """
() => new Promise((resolve) => {
    let lcp = null;
    let cls = 0;
    new PerformanceObserver((list) => {
        const entries = list.getEntries();
        lcp = entries[entries.length - 1].startTime;
    }).observe({type: 'largest-contentful-paint', buffered: true});
    new PerformanceObserver((list) => {
        for (const entry of list.getEntries()) {
            if (!entry.hadRecentInput) cls += entry.value;
        }
    }).observe({type: 'layout-shift', buffered: true});
    // Give the observers a moment to deliver their buffered entries.
    setTimeout(() => resolve({lcp, cls}), 500);
})
"""


def collect_cwv_data(url: str) -> dict:
    """Navigates to a URL and captures approximate lab metrics."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Load the page and wait for network activity to settle
            page.goto(url, wait_until="networkidle")
            observed = page.evaluate(COLLECT_JS)
        finally:
            browser.close()

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "lcp": observed["lcp"],  # milliseconds, or None if no entry was seen
        "cls": observed["cls"],  # simple sum of layout shifts
        "fid": None,  # FID needs a real user interaction, so it stays empty here
    }


# Example usage
data = collect_cwv_data("https://example.com")
print(data)
```
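💡 Pro-Tip: Using Lighthouse: For more authoritative lab scores, consider running Google Lighthouse (for example through its CLI or the PageSpeed Insights API) instead of hand-rolled measurements. Lighthouse is designed to calculate LCP and CLS consistently. Because FID depends on a real user interaction, Lighthouse reports Total Blocking Time as a lab proxy instead.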
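As a minimal sketch of the Lighthouse route, assume you already have a Lighthouse JSON report loaded into a Python dict. The function name `extract_lab_metrics` and the trimmed sample report are hypothetical; the audit keys are the ones Lighthouse uses in its JSON output. Total Blocking Time is read in place of FID, which Lighthouse does not report.

```python
from typing import Any, Optional


def extract_lab_metrics(report: dict[str, Any]) -> dict[str, Any]:
    """Pull timestamp, URL, and lab metrics out of a Lighthouse JSON report."""
    audits = report.get("audits", {})

    def numeric(audit_id: str) -> Optional[float]:
        # Each metric audit stores its raw number under "numericValue".
        return audits.get(audit_id, {}).get("numericValue")

    return {
        "timestamp": report.get("fetchTime"),
        "url": report.get("requestedUrl"),
        "lcp": numeric("largest-contentful-paint"),  # milliseconds
        "cls": numeric("cumulative-layout-shift"),   # unitless score
        "tbt": numeric("total-blocking-time"),       # milliseconds, lab proxy
    }


# Hypothetical, heavily trimmed report used only to exercise the function.
sample_report = {
    "fetchTime": "2024-01-01T00:00:00.000Z",
    "requestedUrl": "https://example.com/",
    "audits": {
        "largest-contentful-paint": {"numericValue": 2300.5},
        "cumulative-layout-shift": {"numericValue": 0.07},
        "total-blocking-time": {"numericValue": 150.0},
    },
}

lab_metrics = extract_lab_metrics(sample_report)
print(lab_metrics)
```

A missing audit yields `None` rather than an exception, so a partially failed run can still be stored and inspected later.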
💾 Phase 2: Data Storage (The Time-Series Database)
As you run your scraper daily or weekly across many URLs, the records add up quickly. At that point it helps to store them in a database that handles time-ordered data well.
Recommended Tools
- PostgreSQL with TimescaleDB Extension: Excellent balance of power and reliability. It handles time-series queries incredibly well.
- InfluxDB: A purpose-built time-series database. Highly optimized for metrics and performance tracking.
- SQLite (for local testing): Suitable if your tracking is limited to a single machine or small scale.
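For the local-testing option, a minimal sketch using Python's built-in `sqlite3` module might look like the following. The function name `store_metrics_sqlite` and the in-memory database are assumptions made for illustration; the table and column names mirror the ones used in this guide's PostgreSQL example.

```python
import sqlite3

# Same columns as the PostgreSQL table, expressed with SQLite types.
SCHEMA = """
CREATE TABLE IF NOT EXISTS core_web_vitals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    url TEXT NOT NULL,
    lcp REAL,
    cls REAL,
    fid REAL
)
"""


def store_metrics_sqlite(conn: sqlite3.Connection, metrics: dict) -> None:
    """Create the table if needed, then insert one metrics record."""
    conn.execute(SCHEMA)
    # Named placeholders are filled from the dict, which avoids SQL injection.
    conn.execute(
        "INSERT INTO core_web_vitals (timestamp, url, lcp, cls, fid) "
        "VALUES (:timestamp, :url, :lcp, :cls, :fid)",
        metrics,
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
store_metrics_sqlite(
    conn,
    {
        "timestamp": "2024-01-01T00:00:00+00:00",
        "url": "https://example.com",
        "lcp": 2100.0,
        "cls": 0.07,
        "fid": 100.0,
    },
)
rows = conn.execute("SELECT url, lcp, cls, fid FROM core_web_vitals").fetchall()
print(rows)
```

Swapping `":memory:"` for a file path persists the data between runs.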
Implementation using psycopg2 (for PostgreSQL)
Assuming you are using PostgreSQL, keep the table simple and index it on URL and timestamp, since most queries filter by URL and order by time.
```python
import psycopg2

DB_CONFIG = {
    "host": "localhost",
    "database": "cwv_metrics",
    "user": "user",
    "password": "password",
}


def store_metrics(metrics: dict) -> None:
    """Inserts collected CWV metrics into the PostgreSQL database."""
    conn = None
    try:
        conn = psycopg2.connect(**DB_CONFIG)
        # Parameterized query: values are passed separately from the SQL text,
        # which protects against SQL injection.
        sql = """
            INSERT INTO core_web_vitals (
                timestamp, url, lcp, cls, fid
            ) VALUES (%s, %s, %s, %s, %s);
        """
        with conn.cursor() as cursor:
            cursor.execute(
                sql,
                (
                    metrics["timestamp"],
                    metrics["url"],
                    metrics["lcp"],
                    metrics["cls"],
                    metrics["fid"],
                ),
            )
        conn.commit()
        print("Data successfully stored.")
    except Exception as e:
        # Undo the partial transaction before reporting the failure.
        if conn:
            conn.rollback()
        print(f"Error storing data: {e}")
    finally:
        if conn:
            conn.close()
```
Database Schema Snippet:
| Column | Data Type | Description |
| :--- | :--- | :--- |
| id | SERIAL | Primary Key |
| timestamp | TIMESTAMP | When the data was collected |
| url | VARCHAR | The tested URL |
| lcp | REAL | Largest Contentful Paint (ms) |
| cls | REAL | Cumulative Layout Shift Score |
| fid | REAL | First Input Delay (ms) |
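One way to turn the schema above into DDL is sketched below. The `NOT NULL` constraints, the index name `idx_cwv_url_time`, and the helper `ensure_schema` are assumptions rather than part of the original schema. The helper accepts any DB-API connection, such as the one returned by `psycopg2.connect`.

```python
# Table definition matching the schema table (PostgreSQL syntax).
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS core_web_vitals (
    id        SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL,
    url       VARCHAR NOT NULL,
    lcp       REAL,
    cls       REAL,
    fid       REAL
)
"""

# Assumed index: supports "all rows for one URL, ordered by time" queries.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_cwv_url_time
    ON core_web_vitals (url, timestamp)
"""


def ensure_schema(conn) -> None:
    """Create the table and index if they do not exist yet."""
    cursor = conn.cursor()
    try:
        cursor.execute(CREATE_TABLE_SQL)
        cursor.execute(CREATE_INDEX_SQL)
    finally:
        cursor.close()
    conn.commit()
```

Calling `ensure_schema(conn)` once at startup makes the collection script safe to run against a fresh database.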
📊 Phase 3 & 4: Analysis and Visualization
Once the data is consistently stored, the real power of Python comes into play for analysis. We use pandas for data manipulation and matplotlib or plotly for visualization.
Retrieving and Processing Data
The function below returns the historical data for a single URL as a DataFrame. It uses hard-coded sample values as a placeholder for a real database query.
```python
import pandas as pd


# A real implementation would connect to the database and query it here.
def get_historical_data(target_url: str) -> pd.DataFrame:
    """Fetches all CWV data for a given URL and returns a DataFrame."""
    # --- Placeholder: in reality, this function queries your database ---
    data = {
        "timestamp": pd.to_datetime(["2023-10-01", "2023-11-01", "2023-12-01", "2024-01-01"]),
        "lcp": [2800, 2500, 2300, 2100],  # Improving LCP (lower is better)
        "cls": [0.15, 0.12, 0.09, 0.07],  # Improving CLS (lower is better)
        "fid": [180, 150, 120, 100],      # Improving FID (lower is better)
    }
    df = pd.DataFrame(data)
    df["url"] = target_url
    df = df.sort_values(by="timestamp").reset_index(drop=True)
    return df


# Get the data
history_df = get_historical_data("https://your-tracked-site.com")
print("Historical Data Snapshot:")
print(history_df)
```
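The processing phase can be sketched with a few pandas operations: a month-over-month percentage change per metric, and a rating against Google's published "good" and "poor" thresholds. The names `THRESHOLDS`, `rate`, and `add_trend_columns` are hypothetical, and the sample frame repeats the placeholder values used in this guide.

```python
import pandas as pd

# (good, poor) boundaries published by Google for each metric.
THRESHOLDS = {
    "lcp": (2500, 4000),  # milliseconds
    "cls": (0.1, 0.25),   # unitless score
    "fid": (100, 300),    # milliseconds
}


def rate(metric: str, value: float) -> str:
    """Classify one value as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"


def add_trend_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add a period-over-period % change and a rating for each metric."""
    out = df.sort_values("timestamp").reset_index(drop=True)
    for metric in THRESHOLDS:
        # Negative values mean the metric improved since the previous row.
        out[f"{metric}_pct_change"] = out[metric].pct_change() * 100
        out[f"{metric}_rating"] = out[metric].map(lambda v, m=metric: rate(m, v))
    return out


sample = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-10-01", "2023-11-01", "2023-12-01", "2024-01-01"]),
    "lcp": [2800, 2500, 2300, 2100],
    "cls": [0.15, 0.12, 0.09, 0.07],
    "fid": [180, 150, 120, 100],
})

trends = add_trend_columns(sample)
print(trends[["timestamp", "lcp", "lcp_pct_change", "lcp_rating"]])
```

The first row has no previous measurement, so its percentage change is `NaN`.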
Visualizing Trends with Plotly
For interactive dashboards, Plotly is often more convenient than standard matplotlib: its charts support hovering, zooming, and panning without extra code.
```python
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots


def visualize_cwv_trends(df: pd.DataFrame) -> None:
    """Generates interactive trend lines for LCP, CLS, and FID."""
    fig = make_subplots(
        rows=3,
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.05,
        subplot_titles=("LCP Trend (ms)", "CLS Trend (Score)", "FID Trend (ms)"),
    )

    # One subplot per metric. The goal for all three is a decreasing line.
    panels = [("lcp", "LCP", "ms"), ("cls", "CLS", "Score"), ("fid", "FID", "ms")]
    for row, (column, label, unit) in enumerate(panels, start=1):
        fig.add_trace(
            go.Scatter(x=df["timestamp"], y=df[column], mode="lines+markers", name=label),
            row=row,
            col=1,
        )
        # Label each y-axis separately; a single layout-level title
        # would only apply to the first subplot.
        fig.update_yaxes(title_text=unit, row=row, col=1)

    fig.update_xaxes(title_text="Date", row=3, col=1)
    fig.update_layout(
        title_text="Core Web Vitals Performance Over Time",
        height=800,
    )

    fig.show()  # In a Jupyter/Colab environment, this displays the graph inline
```
💡 Summary and Best Practices
By combining scraping, time-series storage, and advanced visualization, you transform raw performance numbers into actionable insights.
- Automation is Key: Schedule the entire process (Collection → Storage) using a job scheduler like cron (Linux) or Airflow.
- Define Baselines: Always track against a clear baseline (e.g., the performance metrics collected the month before the major site redesign).
- Correlate with Changes: When a metric changes dramatically, immediately cross-reference the timestamp with recent code deployments or content changes to determine root causes.
- Focus on Trends: Don't panic over a single bad score. Focus on the long-term trend line. Consistent improvement, even slight, indicates successful performance engineering efforts.
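The baseline idea from the list above can be sketched as a small comparison helper. The function name `compare_to_baseline`, the 10% tolerance, and the sample values are assumptions chosen for illustration; pick a tolerance that matches the run-to-run noise you see in your own measurements.

```python
from typing import Optional

METRICS = ("lcp", "cls", "fid")


def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.10) -> dict:
    """Return the relative change per metric and flag regressions.

    All three metrics are "lower is better", so a metric counts as
    regressed when it grew by more than `tolerance` over the baseline.
    """
    report = {}
    for metric in METRICS:
        base, now = baseline[metric], current[metric]
        change: Optional[float]
        if not base:
            # A zero or missing baseline has no meaningful relative change.
            change = None
        else:
            change = round((now - base) / base, 4)
        report[metric] = {
            "change": change,
            "regressed": change is not None and change > tolerance,
        }
    return report


# Hypothetical measurements taken before and after a redesign.
baseline = {"lcp": 2800, "cls": 0.15, "fid": 180}
current = {"lcp": 2100, "cls": 0.18, "fid": 100}

result = compare_to_baseline(baseline, current)
for metric, info in result.items():
    print(metric, info)
```

In this sample, LCP and FID improved while CLS grew by about 20%, so only CLS is flagged.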