Establishing Baseline Health Metrics for New Domains

Without a statistical baseline, every metric fluctuation looks like an incident and every incident looks like noise. SREs and SEO engineers onboarding a new domain into an audit pipeline have no anomaly thresholds, no trend context, and no reproducible reference point — alert fatigue sets in within days. This workflow is part of Technical Audit Fundamentals & Scope Mapping, the broader framework that governs how domains are scoped, scored, and monitored. Follow this guide end-to-end before connecting a new domain to any alerting or scoring system.

Prerequisites & Environment Setup

Before running any collection, lock your toolchain versions to guarantee reproducibility across runs.

Required tools (pinned versions):

Tool	Pinned version	Purpose
Python	3.11.x	Crawler and normalization scripts
Scrapy	2.11.x	HTTP crawl engine
pandas	2.2.x	Z-score filtering and rolling windows
TimescaleDB	2.14.x	Time-series metric storage
psycopg2-binary	2.9.x	PostgreSQL adapter
PyArrow	15.x	Parquet artifact serialisation

Pin these in requirements.txt and commit the lockfile:

pip freeze > requirements.txt
git add requirements.txt && git commit -m "chore: pin baseline pipeline dependencies"

Required environment variables — export before any pipeline invocation:

export BASELINE_DB_HOST="timescale.internal"
export BASELINE_DB_PORT="5432"
export BASELINE_DB_NAME="site_health"
export BASELINE_DB_USER="pipeline_rw"
export BASELINE_DB_PASS="<from secret manager>"
export BASELINE_S3_BUCKET="audit-baselines-prod"
export BASELINE_ENV_TAG="production"   # or 'staging'
export BASELINE_TZ="UTC"

Store secrets in your CI secret manager (GitHub Actions secrets.*, AWS Secrets Manager, or Vault). Never commit credentials — use OIDC where possible.

Step 1 — Domain Discovery and Crawl Initialization

The crawler must build an accurate URL inventory before any metric can be recorded. Scope boundaries defined during crawl depth and scope configuration for enterprise sites feed directly into the spider's allowed domain list and depth limit.

# baseline_spider.py (pinned Scrapy 2.11, Python 3.11)
from datetime import datetime, timezone

import scrapy
from scrapy.http import TextResponse


class BaselineSpider(scrapy.Spider):
    name = "domain_baseline"
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,        # conservative for a new domain
        "DOWNLOAD_DELAY": 0.25,          # 250 ms floor; raise to the site's Crawl-Delay if larger
        "ROBOTSTXT_OBEY": True,          # honours Allow/Disallow rules (Crawl-Delay is not read)
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
        "DEPTH_LIMIT": 6,                # prevent unbounded recursion
        "HTTPCACHE_ENABLED": True,       # idempotent re-runs
        "HTTPCACHE_DIR": "/var/cache/scrapy/baseline",
        "USER_AGENT": "SiteHealthAuditBot/1.0 (+https://site-health-audit.com/bot)",
        "FEEDS": {
            "/var/data/baseline_crawl_%(time)s.jsonl": {"format": "jsonlines"}
        },
    }

    def __init__(self, start_url: str, allowed_domain: str, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [allowed_domain]

    def start_requests(self):
        # Try the sitemap first; any failure is routed to the homepage fallback.
        yield scrapy.Request(
            url=self.start_urls[0].rstrip("/") + "/sitemap.xml",
            callback=self.parse_sitemap,
            priority=10,
            errback=self.fallback_to_homepage,
        )

    def parse_sitemap(self, response):
        # Sitemap files declare a default XML namespace, so a bare //loc
        # matches nothing until the namespaces are stripped.
        response.selector.remove_namespaces()
        for loc in response.xpath("//loc/text()").getall():
            yield scrapy.Request(url=loc.strip(), callback=self.parse_page, priority=5)

    def fallback_to_homepage(self, failure):
        self.logger.warning("Sitemap not found, falling back to homepage crawl")
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse_page)

    def parse_page(self, response):
        is_text = isinstance(response, TextResponse)  # selectors fail on binary bodies
        yield {
            "url": response.url,
            "status": response.status,
            "content_type": response.headers.get("Content-Type", b"").decode(),
            "canonical": (
                response.xpath('//link[@rel="canonical"]/@href').get() if is_text else None
            ),
            # Record the crawl time in UTC. The Date header is server-supplied,
            # not ISO 8601, and is replayed unchanged from the HTTP cache.
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        }
        # Follow links so the homepage fallback discovers further pages;
        # allowed_domains and DEPTH_LIMIT keep the crawl bounded.
        if is_text:
            yield from response.follow_all(css="a[href]", callback=self.parse_page)

Common initialization mistakes:

Ignoring Crawl-Delay directives in robots.txt: this can get the crawler's IP blocked on hosts with aggressive rate protection. Scrapy's ROBOTSTXT_OBEY enforces Allow/Disallow rules only and does not read Crawl-Delay, so set DOWNLOAD_DELAY to at least the directive's value yourself.
Failing to exclude parameterized URLs (e.g. ?sort=, ?page=): these generate thousands of near-duplicate rows and inflate the baseline storage cost. Scrapy's default RFPDupeFilter only drops requests whose canonicalized URLs are identical, so strip or normalize such parameters with a custom URL normalizer before requests are scheduled.
Using a user-agent string without a contact URL: always include one so server admins can reach you before resorting to a block.
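
The custom URL normalizer mentioned above could look roughly like the sketch below. It uses only the standard library; the parameter deny-list and the function name are assumptions made for illustration, not part of the pipeline described here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list: query parameters that only reorder or paginate content.
NOISE_PARAMS = frozenset({"sort", "page", "utm_source", "utm_medium", "utm_campaign"})


def normalize_url(url: str, noise_params: frozenset = NOISE_PARAMS) -> str:
    """Drop noise parameters, sort the remaining ones, and strip the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in noise_params
    ]
    query = urlencode(sorted(kept))  # stable order so equivalent URLs compare equal
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


if __name__ == "__main__":
    print(normalize_url("https://Example.com/shoes?sort=price&color=red&page=3#reviews"))
```

Applying a function like this to each URL before it is scheduled lets the duplicate filter see one URL per page variant instead of one per parameter combination.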

Step 2 — Core Configuration: Metric Ingestion Schema

Once the crawl produces a JSONL artifact, the ingestion layer maps raw crawl data into structured time-series records. Establish the TimescaleDB schema once, then run the ingestion idempotently on every subsequent crawl cycle.

Key ingestion parameters

Parameter	Type	Default	Purpose
`env_tag`	`TEXT`	`'unknown'`	Separates synthetic from field data rows
`lcp_ms`	`FLOAT`	`NULL`	Largest Contentful Paint in milliseconds
`cls_score`	`FLOAT`	`NULL`	Cumulative Layout Shift (unitless, ≥ 0; can exceed 1)
`inp_ms`	`FLOAT`	`NULL`	Interaction to Next Paint in milliseconds
`status_code`	`INT`	`NULL`	HTTP response code for health bucket mapping
`time`	`TIMESTAMPTZ`	required	UTC timestamp; hypertable partition key

Schema creation (run once)

-- TimescaleDB 2.14 — run outside any migration framework
CREATE TABLE IF NOT EXISTS public.health_metrics (
    time        TIMESTAMPTZ NOT NULL,
    url         TEXT        NOT NULL,
    status_code INT,
    lcp_ms      FLOAT,
    cls_score   FLOAT,
    inp_ms      FLOAT,
    env_tag     TEXT        DEFAULT 'unknown'
);

SELECT create_hypertable('public.health_metrics', 'time', if_not_exists => TRUE);

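CREATE INDEX IF NOT EXISTS idx_health_metrics_url_time
    ON public.health_metrics (url, time DESC);

-- ON CONFLICT DO NOTHING only skips duplicates when a unique index exists.
-- Unique indexes on a hypertable must include the partitioning column (time).
CREATE UNIQUE INDEX IF NOT EXISTS uq_health_metrics_url_env_time
    ON public.health_metrics (url, env_tag, time);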

Ingestion query (idempotent insert)

-- Populate from raw_crawl_logs staging table
INSERT INTO public.health_metrics (time, url, status_code, lcp_ms, cls_score, inp_ms, env_tag)
SELECT
    (payload->>'timestamp')::timestamptz,    -- already an absolute instant; AT TIME ZONE would drop the zone
    payload->>'url',
    (payload->>'status_code')::INT,
    (payload->>'lcp')::FLOAT / 1000.0,       -- microseconds → milliseconds
    (payload->>'cls')::FLOAT,
    (payload->>'inp')::FLOAT / 1000.0,       -- microseconds → milliseconds
    COALESCE(payload->>'environment', current_setting('app.env_tag', TRUE), 'unknown')
FROM raw_crawl_logs
WHERE payload->>'status_code' IS NOT NULL
  AND (payload->>'timestamp')::timestamptz > NOW() - INTERVAL '48 hours'
ON CONFLICT DO NOTHING;                       -- idempotent only if a unique index covers (url, env_tag, time)

Tag every row with env_tag = 'synthetic' for Lighthouse runs and env_tag = 'field' for Chrome UX Report data. Mixing the two without a tag will inflate standard deviation and corrupt the Z-score filtering in Step 4.
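
To make the inflation effect concrete, here is a small standard-library sketch. The LCP values are made up for illustration, and the helper name is hypothetical; it simply compares the spread inside each env_tag group with the spread of the mixed population.

```python
from statistics import mean, pstdev

# Illustrative LCP samples in milliseconds (made-up values).
rows = [
    ("synthetic", 1800.0), ("synthetic", 1820.0), ("synthetic", 1790.0), ("synthetic", 1810.0),
    ("field", 2600.0), ("field", 3400.0), ("field", 2900.0), ("field", 3100.0),
]


def spread_by_tag(samples):
    """Return {env_tag: (mean, population std)} computed separately per tag."""
    grouped = {}
    for tag, value in samples:
        grouped.setdefault(tag, []).append(value)
    return {tag: (mean(values), pstdev(values)) for tag, values in grouped.items()}


per_tag = spread_by_tag(rows)
mixed_std = pstdev([value for _, value in rows])

for tag, (avg, std) in per_tag.items():
    print(f"{tag:<9} mean={avg:7.1f} std={std:6.1f}")
print(f"{'mixed':<9} std={mixed_std:6.1f}")
```

With these numbers the mixed standard deviation is larger than either per-tag value, which is why a Z-score computed on untagged data becomes less sensitive.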

Cross-reference severity thresholds with the risk scoring frameworks for technical debt guide before deciding which HTTP status bucket maps to which severity level in your alerting system.
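
As a starting point for that mapping, the sketch below groups status codes into coarse buckets. The bucket names are hypothetical, and the severity assigned to each bucket is deliberately left out because it belongs to the risk scoring guide.

```python
from collections import Counter


def status_bucket(status_code: int) -> str:
    """Map an HTTP status code to a coarse health bucket (hypothetical names)."""
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 500:
        return "client_error"
    if 500 <= status_code < 600:
        return "server_error"
    return "unknown"


# Example: summarize the status codes seen in one crawl.
observed = [200, 200, 301, 404, 503, 200]
print(Counter(status_bucket(code) for code in observed))
```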

Step 3 — Execution & Scheduling

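The pipeline runs weekly by default: each window spans a full weekday/weekend cycle, yet runs are frequent enough to detect regressions before they propagate to a release. A flock guard prevents overlapping executions if the previous run overruns, but only when runs share a filesystem (a self-hosted runner or a VM). GitHub-hosted runners start every run on a fresh machine, so there a workflow- or job-level concurrency group is what serializes runs.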

# .github/workflows/baseline-recalculation.yml
name: Baseline Recalculation Pipeline

on:
  schedule:
    - cron: '0 02 * * 1'    # 02:00 UTC every Monday
  workflow_dispatch:          # allow manual trigger for initial setup

jobs:
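  compute-baseline:
    runs-on: ubuntu-22.04
    timeout-minutes: 45
    concurrency:                # let a new run wait instead of overlapping an active one
      group: baseline-recalculation
      cancel-in-progress: false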
    env:
      BASELINE_DB_HOST: ${{ secrets.BASELINE_DB_HOST }}
      BASELINE_DB_PASS: ${{ secrets.BASELINE_DB_PASS }}
      BASELINE_S3_BUCKET: ${{ secrets.BASELINE_S3_BUCKET }}
      BASELINE_ENV_TAG: "production"
      BASELINE_TZ: "UTC"

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run baseline pipeline (single instance)
        run: |
          set -euo pipefail
          # flock must wrap the command itself: a lock taken in a separate
          # step is released as soon as that step's shell exits.
          # The script location is an assumption; adjust it to your repo layout.
          flock -n /var/lock/baseline.lock \
            python "${GITHUB_WORKSPACE}/run_baseline.py" \
              --output-dir /tmp/baseline_output \
              --output-format parquet \
              --env-tag "${BASELINE_ENV_TAG}"

      - name: Upload artifacts to S3
        run: |
          set -euo pipefail
          aws s3 cp /tmp/baseline_output/ \
            "s3://${BASELINE_S3_BUCKET}/$(date -u +%Y-%W)/" \
            --recursive \
            --sse aws:kms

      - uses: actions/upload-artifact@v4
        with:
          name: baseline-snapshot-${{ github.run_id }}
          path: /tmp/baseline_output/baseline_*.parquet
          retention-days: 90

For local or VM-based cron alternatives, see the guide on setting up a quarterly technical audit schedule, which covers flock-based cron guards with timezone-safe execution.

Timezone handling: always collect and store in UTC. Convert to local time only at the reporting layer. Storing in a non-UTC zone causes duplicate or missing rows when DST transitions occur mid-window.
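
The duplicate-row case can be reproduced in a few lines. This sketch hard-codes the two UTC offsets that Europe/Berlin uses around the 2025-10-26 clock change, so it does not depend on a time zone database; the zone and date are only an example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for Europe/Berlin on 2025-10-26, the night the
# clocks go back from CEST (UTC+2) to CET (UTC+1) at 01:00 UTC.
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

before = datetime(2025, 10, 26, 0, 30, tzinfo=timezone.utc)  # still CEST locally
after = datetime(2025, 10, 26, 1, 30, tzinfo=timezone.utc)   # already CET locally

# Dropping tzinfo mimics storing a naive local timestamp.
local_before = before.astimezone(CEST).replace(tzinfo=None)
local_after = after.astimezone(CET).replace(tzinfo=None)

print(before.isoformat(), "->", local_before.isoformat())
print(after.isoformat(), "->", local_after.isoformat())
# Two distinct UTC instants collapse onto one naive local timestamp.
print("collision:", local_before == local_after)
```

Stored as UTC, the two measurements stay an hour apart; stored as naive local time, they become indistinguishable.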

Step 4 — Statistical Normalization and Artifact Storage

Raw metrics from a new domain always contain outliers — bot traffic spikes, synthetic test runs, deployment anomalies. Removing these before calculating thresholds is mandatory; otherwise P90 values will be artificially inflated and alert thresholds will be too permissive.

Z-score outlier filtering

# normalize.py (pandas 2.2, numpy 1.26)
from pathlib import Path

import numpy as np
import pandas as pd


def normalize_and_filter(
    df: pd.DataFrame,
    metric_col: str,
    window: int = 7,
    z_thresh: float = 2.5,
) -> pd.DataFrame:
    """
    Apply a rolling Z-score filter to a single metric column.

    Args:
        df: Input DataFrame with a DatetimeIndex.
        metric_col: Column name to filter.
        window: Rolling window size in days.
        z_thresh: Rows with |z_score| > z_thresh are quarantined.

    Returns:
        DataFrame with outlier rows removed. Quarantined rows are written to
        /var/data/quarantine/<metric_col>_<today>.parquet for manual review.
    """
    # Time-based windows need a monotonic index; sort_index() also returns a copy.
    df = df.sort_index()
    # An offset window ("7D") spans calendar days. A bare integer would count
    # rows instead, which contradicts the documented unit.
    rolling = df[metric_col].rolling(window=f"{window}D", min_periods=1)
    rolling_mean = rolling.mean()
    rolling_std = rolling.std(ddof=0)
    # A zero std (constant window) becomes NaN, so the comparison below is
    # False and the row is kept rather than dividing by zero.
    z_scores = (df[metric_col] - rolling_mean) / rolling_std.replace(0, np.nan)

    outlier_mask = z_scores.abs() > z_thresh
    quarantine = df[outlier_mask]

    if not quarantine.empty:
        out_dir = Path("/var/data/quarantine")
        out_dir.mkdir(parents=True, exist_ok=True)
        # Timestamp.utcnow() is deprecated as of pandas 2.2.
        today = pd.Timestamp.now(tz="UTC").date()
        quarantine.to_parquet(out_dir / f"{metric_col}_{today}.parquet", index=True)

    return df[~outlier_mask]

Apply normalize_and_filter separately per device class (mobile, desktop) and per page template type. Applying global normalization across heterogeneous templates masks segment-level regressions — a 4 000 ms LCP on a product detail page is a regression; on a video landing page it may be baseline.
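
The masking effect is easy to see with a toy dataset. The sketch below uses a plain (non-rolling) population Z-score and made-up LCP values, so it only illustrates the segmentation argument, not the rolling filter itself; the template names are hypothetical.

```python
from statistics import mean, pstdev

# Illustrative LCP samples in ms; the last product_detail value is the regression.
samples = {
    "product_detail": [1900, 2000, 2100, 2000, 1950, 2050, 2000, 2000, 4000],
    "video_landing": [3900, 4000, 4100, 4000, 3950, 4050, 4000, 4000, 4100],
}


def z_score(value: float, population: list) -> float:
    """Population Z-score of value against the given samples."""
    return (value - mean(population)) / pstdev(population)


suspect = samples["product_detail"][-1]
everything = [value for values in samples.values() for value in values]

global_z = z_score(suspect, everything)                   # both templates pooled
segment_z = z_score(suspect, samples["product_detail"])   # product pages only

print(f"global z  = {global_z:.2f}")
print(f"segment z = {segment_z:.2f}")
```

Pooled with the video pages, the 4 000 ms product page sits within one standard deviation of the mean; within its own segment it exceeds the 2.5 cut-off.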

Percentile threshold calculation

# thresholds.py
import pandas as pd


def calculate_thresholds(df: pd.DataFrame, metric_col: str) -> dict:
    """Return P50, P90, P95 and alert bounds at +15% (warning) and +30% (critical) above P90."""
    p50 = df[metric_col].quantile(0.50)
    p90 = df[metric_col].quantile(0.90)
    p95 = df[metric_col].quantile(0.95)
    return {
        "p50": round(p50, 3),
        "p90": round(p90, 3),
        "p95": round(p95, 3),
        "alert_warn":     round(p90 * 1.15, 3),   # +15% → warning
        "alert_critical": round(p90 * 1.30, 3),   # +30% → critical
    }

Prometheus alert rule (version-controlled)

# prometheus_alerts.yml — committed to the repo, tagged on each baseline update
groups:
  - name: domain_health_baselines
    rules:
      - alert: HighLCPDeviation
        expr: |
          avg_over_time(lcp_ms[24h]) > baseline_p90_lcp_ms * 1.15
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "LCP deviates >15% above baseline P90"
          description: 'Current LCP: {{ $value }}ms. Baseline P90: {{ with query "baseline_p90_lcp_ms" }}{{ . | first | value }}{{ end }}ms.'

      - alert: CriticalLCPDeviation
        expr: |
          avg_over_time(lcp_ms[24h]) > baseline_p90_lcp_ms * 1.30
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "LCP deviates >30% above baseline P90"

Versioned Parquet artifact storage

After thresholds are calculated, version the snapshot they were derived from. The guide on storing and versioning crawl artifacts in cloud storage covers the full S3/GCS lifecycle policy; at minimum, apply these conventions to baseline snapshots:

#!/usr/bin/env bash
set -euo pipefail

SNAPSHOT_FILE="/var/data/baseline_output/baseline_$(date -u +%Y-%m-%d).parquet"
CHECKSUM=$(sha256sum "${SNAPSHOT_FILE}" | awk '{print $1}')
GIT_TAG="baseline/$(date -u +%Y-%m-%d)"

# Tag the snapshot in git for immutable audit trail
git tag -a "${GIT_TAG}" -m "Baseline snapshot ${GIT_TAG} — SHA256: ${CHECKSUM}"
git push origin "${GIT_TAG}"

# Upload to S3 with checksum metadata
aws s3 cp "${SNAPSHOT_FILE}" \
  "s3://${BASELINE_S3_BUCKET}/snapshots/${GIT_TAG}.parquet" \
  --metadata "sha256=${CHECKSUM},env=${BASELINE_ENV_TAG}" \
  --sse aws:kms

Verification Checklist

Run these steps after every initial baseline collection and after any re-baselining event (site migration, major structural change):

Row count sanity check — query SELECT COUNT(*), MIN(time), MAX(time) FROM public.health_metrics and verify the row count is within 10% of the expected URL count × collection days.
NULL rate audit — SELECT COUNT(*) FILTER (WHERE lcp_ms IS NULL) / COUNT(*)::float FROM public.health_metrics should be below 0.05 (5%). Higher NULL rates indicate a Lighthouse or CrUX API authentication failure.
env_tag distribution — SELECT env_tag, COUNT(*) FROM public.health_metrics GROUP BY env_tag — confirm 'field' and 'synthetic' rows are present in the expected ratio. A missing env_tag value means the environment variable injection is broken.
Quarantine review — check /var/data/quarantine/ for files from today's run. More than 2% of rows quarantined suggests the domain is in an unstable state; do not commit the baseline until the source of variance is identified.
Threshold sanity: verify p50 ≤ p90 ≤ p95 for every metric (ties are legitimate, for example when most CLS values are 0). Quantiles of a single series are always ordered this way, so an inversion means the thresholds were assembled from different segments or runs; recompute them from one filtered dataset.
Artifact checksum — compare the SHA-256 in the Git tag annotation against sha256sum <parquet file> locally. A mismatch means the file was modified post-upload and the artifact is tainted.
Prometheus alert dry-run — use promtool check rules prometheus_alerts.yml to validate syntax, then manually evaluate the LCP alert expression against a known-bad test metric to confirm the threshold fires correctly.
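
The threshold sanity item in the checklist can be automated with a small helper. This is a minimal sketch that assumes the dictionary keys returned by calculate_thresholds; the function name and messages are hypothetical.

```python
def validate_thresholds(thresholds: dict) -> list:
    """Return a list of problems; an empty list means the threshold set is sane."""
    problems = []
    # Ties are allowed: a metric that is mostly zero can have p50 == p90.
    if not thresholds["p50"] <= thresholds["p90"] <= thresholds["p95"]:
        problems.append("percentiles are not ordered p50 <= p90 <= p95")
    if not thresholds["p90"] <= thresholds["alert_warn"] <= thresholds["alert_critical"]:
        problems.append("alert bounds are not ordered p90 <= warn <= critical")
    if any(value < 0 for value in thresholds.values()):
        problems.append("negative threshold value")
    return problems


good = {"p50": 1900.0, "p90": 2800.0, "p95": 2900.0, "alert_warn": 3220.0, "alert_critical": 3640.0}
mixed_up = {"p50": 2900.0, "p90": 2800.0, "p95": 2900.0, "alert_warn": 3220.0, "alert_critical": 3640.0}

print(validate_thresholds(good))
print(validate_thresholds(mixed_up))
```

Running a check like this per metric and per segment before committing the baseline keeps an inverted set from reaching the alert rules.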

Troubleshooting

Spider exits immediately with zero items

Root cause: robots.txt is disallowing the bot's user-agent, or the sitemap URL is returning a non-200 status.

# Diagnose robots.txt compliance
curl -A "SiteHealthAuditBot/1.0" https://target-domain.com/robots.txt | grep -i "disallow"

# Check sitemap status
curl -o /dev/null -w "%{http_code}" https://target-domain.com/sitemap.xml

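Fix: add the bot to an allow rule in robots.txt. If the sitemap is the problem, no extra flag is needed: the errback on the sitemap request already routes the spider to fallback_to_homepage, so confirm that the "Sitemap not found" warning appears in the crawl log.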

TimescaleDB hypertable creation fails with "already a hypertable"

Root cause: The schema setup script was run twice. The if_not_exists => TRUE parameter prevents the error, but older scripts may omit it.

-- Verify hypertable status
SELECT hypertable_name, num_chunks
FROM timescaledb_information.hypertables
WHERE hypertable_name = 'health_metrics';

Fix: if the table exists and the hypertable row is present, skip the create_hypertable call entirely. The if_not_exists flag must be TRUE in the production script.

Z-score filter quarantines more than 10% of rows

Root cause: The collection window contains a sustained event (sale, viral content, outage) that pushes many consecutive rows far from the rolling mean before the window catches up.

# Inspect quarantine file to identify spike source
import pandas as pd
q = pd.read_parquet("/var/data/quarantine/lcp_ms_2026-06-21.parquet")
print(q.groupby(q.index.date)["lcp_ms"].describe())

Fix: extend the rolling window from 7 to 14 days to dilute the spike, or exclude the spike dates using a date exclusion list before normalization. Never widen z_thresh above 3.0 — this defeats outlier detection.
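
When tuning window and z_thresh together, it may help to keep one property of the population standard deviation (ddof=0) in mind: a single point in a window of n samples cannot score above sqrt(n - 1). The sketch below checks this numerically for the most extreme case; the helper name is hypothetical.

```python
import math
from statistics import mean, pstdev


def max_abs_z(n: int) -> float:
    """Z-score of one extreme outlier among n - 1 identical values (ddof=0)."""
    window = [0.0] * (n - 1) + [1.0]
    return (window[-1] - mean(window)) / pstdev(window)


for n in (7, 14, 30):
    print(f"n={n:<3} max |z| = {max_abs_z(n):.3f}  (sqrt(n-1) = {math.sqrt(n - 1):.3f})")
```

A window that holds only seven rows (for example one row per day) therefore cannot reach a threshold of 2.5 at all, so a filter that quarantines nothing is not necessarily a sign of clean data.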

Prometheus alert never fires during testing

Root cause: The baseline_p90_lcp_ms recording rule is missing or stale.

# Check recording rule value
curl -s "http://prometheus:9090/api/v1/query?query=baseline_p90_lcp_ms" | jq '.data.result'

Fix: confirm the recording rule is present in prometheus_rules.yml and that the last evaluation timestamp is recent. If the value is 0 or absent, re-run the baseline export script to repopulate the recording rule.

Parquet artifact checksum mismatch

Root cause: The file was modified after its checksum was recorded, or the upload failed and an older object is still stored under the same key.

# Recompute local checksum
sha256sum /var/data/baseline_output/baseline_2026-06-21.parquet

# Compare against S3 object metadata
aws s3api head-object \
  --bucket "${BASELINE_S3_BUCKET}" \
  --key "snapshots/baseline/2026-06-21.parquet" \
  --query 'Metadata.sha256'

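Fix: re-run the artifact export and upload, then repeat the head-object comparison. Note that the CLI's --expected-size option is only a size hint for streamed multipart uploads, not a validation step. Enable S3 versioning on the bucket so an overwritten object can be identified and rolled back.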
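
The local side of that comparison can also be scripted. The following is a minimal sketch using hashlib; the function names are hypothetical, and the recorded checksum is assumed to come from the Git tag annotation or the S3 object metadata shown above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large Parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path: Path, recorded_hex: str) -> bool:
    """Compare a local file against a checksum recorded at upload time."""
    return sha256_of(path) == recorded_hex.strip().lower()
```

A call such as matches_recorded(Path("baseline.parquet"), recorded) returns False for a tainted artifact, which can gate the rest of the verification run.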

Pipeline cron overlaps with a previous stalled run

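Root cause: The previous execution hung (network timeout, DB lock wait) and its process still holds the lock. A flock lock belongs to the open file descriptor, not to the file's presence on disk, so a leftover lock file by itself blocks nothing.

# Find the process that still holds the lock
lsof /var/lock/baseline.lock

# Stop the hung run; the kernel releases the lock when the process exits
kill <PID from lsof>

Fix: do not delete the lock file while a holder is alive, because a new run would then lock a fresh file and overlap with the old one. Bound the run time instead, for example by wrapping the pipeline in timeout(1), and use flock --timeout 120 if a waiting run should give up after two minutes rather than fail immediately.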
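
How flock ties a lock to an open descriptor rather than to the file on disk can be demonstrated with Python's fcntl module (Unix only). This is an illustrative sketch with a temporary lock path, not part of the pipeline; the helper name is hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Return an open file holding an exclusive flock, or None if it is already held."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


lock_path = os.path.join(tempfile.mkdtemp(), "baseline.lock")

first = try_lock(lock_path)    # acquires the lock
second = try_lock(lock_path)   # refused: the first descriptor still holds it
first.close()                  # closing the descriptor releases the lock
third = try_lock(lock_path)    # succeeds although the file was never deleted

print("second acquired:", second is not None)
print("third acquired:", third is not None)
third.close()
```

The lock file stays on disk the whole time; only the holder closing its descriptor (or exiting) frees the lock.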

FAQ

How many days of data do I need before a baseline is statistically reliable?

A minimum of 14 days covers two full weekly traffic cycles. 30 days is preferred for sites with strong weekend/weekday variance. Fewer than 7 days will produce thresholds that trigger excessive false-positive alerts in the first weeks of monitoring.

Should I separate synthetic Lighthouse data from Chrome UX Report field data in the baseline?

Yes. Tag each row with env_tag = 'synthetic' or env_tag = 'field' and aggregate them separately. Synthetic data comes from a controlled lab environment, which makes it useful for regression detection; field data reflects real-user variance and informs percentile thresholds. Mixing them without the tag inflates standard deviation and degrades Z-score reliability.

What happens if the crawler hits a Crawl-Delay directive?

Scrapy does not enforce Crawl-Delay by itself: ROBOTSTXT_OBEY covers Allow/Disallow rules only. Set DOWNLOAD_DELAY to at least the Crawl-Delay value as a floor. If the directive exceeds your pipeline timeout budget, configure a partial crawl scope using DEPTH_LIMIT and re-run in segments. Managing crawl budget and rate limiting at the domain level is covered in detail in its own guide.
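
One way to derive that floor is to read the directive with the standard library's robots.txt parser. The sketch below parses an example robots.txt body; the helper name and the 0.25 s default are assumptions that mirror the spider settings in Step 1.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; in practice it would be fetched from the target site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""


def effective_download_delay(robots_txt: str, user_agent: str, floor: float = 0.25) -> float:
    """Return the larger of the configured floor and the site's Crawl-delay."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawl_delay = parser.crawl_delay(user_agent)  # None when no directive applies
    return max(floor, float(crawl_delay or 0))


print(effective_download_delay(ROBOTS_TXT, "SiteHealthAuditBot/1.0"))
```

The returned value can then be used as the DOWNLOAD_DELAY for that domain's crawl.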

How do I handle a site migration during an active baseline window?

Close the current baseline window, Git-tag the snapshot as pre-migration, and start a fresh collection post-migration. Never splice pre- and post-migration data — redirect chain changes and URL structure differences make the two datasets non-comparable. The new baseline window must run for at least 14 days before alert thresholds are activated.

Related

Technical Audit Fundamentals & Scope Mapping: parent guide covering the full audit lifecycle
Defining Crawl Depth & Scope for Enterprise Sites: scope constraints that feed the spider's allowed domains and depth limits
Risk Scoring Frameworks for Technical Debt: maps HTTP status buckets and metric deviations to prioritised remediation queues
Storing & Versioning Crawl Artifacts in Cloud Storage: S3/GCS lifecycle policies and immutable artifact retention for baseline snapshots
Capturing a First-Crawl Baseline Snapshot: freeze a checksummed, reproducible baseline every later audit diffs against
Aligning Audit Goals with Business KPIs: translates baseline metric thresholds into business-impact SLAs
