5 min read

Establishing Baseline Health Metrics for New Domains

Establishing accurate baseline health metrics requires deterministic pipelines and strict scope validation. This workflow standardizes metric ingestion, normalization, and alerting for new domains. Align initial configurations with Technical Audit Fundamentals & Scope Mapping to prevent scope creep during early-stage data collection.

Phase 1: Domain Discovery & Crawl Configuration

Implementation Steps

  • Deploy headless crawler with configurable concurrency limits (default: 4 req/s).
  • Parse /robots.txt and validate against RFC 9309 directives.
  • Extract canonical URL sets and map redirect chains.
  • Initialize crawl queue with priority weighting for high-value templates.

Code Example

import scrapy
from scrapy.crawler import CrawlerProcess

class BaselineSpider(scrapy.Spider):
    name = "domain_baseline"
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 0.25,
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://target-domain.com/sitemap.xml",
            callback=self.parse_sitemap,
            priority=10
        )

    def parse_sitemap(self, response):
        for loc in response.xpath("//loc/text()").getall():
            yield scrapy.Request(url=loc, priority=5)

Common Mistakes

  • Ignoring crawl-delay directives causing IP blocks.
  • Failing to exclude parameterized URLs during baseline capture.
  • Hardcoding user-agent strings without fallback rotation.

Execution Notes

Before initiating metric collection, validate scope boundaries against Technical Audit Fundamentals & Scope Mapping. Configure the initial crawler to respect enterprise-scale constraints. Reference Defining Crawl Depth & Scope for Enterprise Sites for depth-limiting logic. This prevents unbounded recursion during early-stage discovery.

Phase 2: Raw Metric Ingestion & Schema Mapping

Implementation Steps

  • Ingest raw crawl logs into a time-series database (e.g., TimescaleDB).
  • Map HTTP status codes to categorical health buckets (2xx=Healthy, 4xx=Degraded, 5xx=Critical).
  • Extract LCP, CLS, and INP via Chrome UX Report API or synthetic Lighthouse runs.
  • Apply JSON schema validation to ensure consistent field naming across environments.

Code Example

-- dbt model: models/stg_health_metrics.sql
{{ config(materialized='table') }}

CREATE TABLE IF NOT EXISTS public.health_metrics (
 time TIMESTAMPTZ NOT NULL,
 url TEXT NOT NULL,
 status_code INT,
 lcp_ms FLOAT,
 cls_score FLOAT,
 inp_ms FLOAT,
 env_tag TEXT
);

SELECT create_hypertable('public.health_metrics', 'time');

INSERT INTO public.health_metrics (time, url, status_code, lcp_ms, cls_score, inp_ms, env_tag)
SELECT
 (payload->>'timestamp')::timestamptz AT TIME ZONE 'UTC',
 payload->>'url',
 (payload->>'status_code')::INT,
 (payload->>'lcp')::FLOAT / 1000.0,
 (payload->>'cls')::FLOAT,
 (payload->>'inp')::FLOAT / 1000.0,
 COALESCE(payload->>'environment', 'unknown')
FROM raw_crawl_logs
WHERE payload->>'status_code' IS NOT NULL;

Common Mistakes

  • Mixing synthetic and field data without environment tagging.
  • Storing raw HTML payloads instead of parsed DOM metrics.
  • Omitting timezone normalization for time-series aggregation.

Execution Notes

Normalize ingestion outputs using strict schema validation. Cross-reference severity thresholds with Risk Scoring Frameworks for Technical Debt. Ensure metric weighting aligns with operational impact before committing to the time-series database.

Phase 3: Statistical Normalization & Outlier Filtering

Implementation Steps

  • Apply Z-score filtering (threshold: ±2.5) to remove statistical outliers.
  • Segment metrics by device class, geographic region, and content type.
  • Implement sliding window averages (7-day, 30-day) to smooth volatility.
  • Flag and quarantine anomalous payloads for manual review before baseline commit.

Code Example

import pandas as pd
import numpy as np

def normalize_and_filter(df: pd.DataFrame, metric_col: str, window: int = 7, z_thresh: float = 2.5) -> pd.DataFrame:
 df = df.copy()
 df[f"{metric_col}_rolling_mean"] = df[metric_col].rolling(window=window, min_periods=1).mean()
 rolling_std = df[metric_col].rolling(window=window, min_periods=1).std(ddof=0)
 df["z_score"] = (df[metric_col] - df[f"{metric_col}_rolling_mean"]) / rolling_std.replace(0, np.nan)
 mask = df["z_score"].abs() <= z_thresh
 return df[mask].drop(columns=["z_score"])

Common Mistakes

  • Applying global normalization across heterogeneous page templates.
  • Ignoring seasonal traffic spikes during baseline windows.
  • Failing to cache normalization results, causing pipeline recomputation overhead.

Execution Notes

Execute normalization routines within isolated compute environments. Guarantee reproducibility by pinning dependency versions. Schedule recurring validation runs using Setting Up a Quarterly Technical Audit Schedule. Maintain metric integrity across deployment cycles.

Phase 4: Baseline Calculation & Threshold Definition

Implementation Steps

  • Calculate P50, P90, and P95 distributions for each normalized metric.
  • Derive composite health index using weighted geometric mean.
  • Set dynamic alert thresholds at ±15% deviation from baseline P90.
  • Version-control baseline snapshots using Git tags for audit trail compliance.

Code Example

# prometheus_alerts.yml
groups:
  - name: domain_health_baselines
    rules:
      - alert: HighLCPDeviation
        expr: |
          (avg_over_time(lcp_seconds[24h]) > baseline_p90_lcp * 1.15)
          and
          (avg_over_time(lcp_seconds[24h]) < baseline_p90_lcp * 0.85) == false
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "LCP deviates >15% from baseline P90"
          description: "Current LCP: {{ $value }}s. Baseline P90: {{ $labels.baseline_p90_lcp }}s."

Common Mistakes

  • Using arithmetic means instead of percentiles for skewed distributions.
  • Static thresholding that fails to adapt to content velocity.
  • Omitting baseline versioning, causing regression tracking failures.

Execution Notes

Lock baseline versions in the repository. Enforce change management protocols for all threshold adjustments. Document operational runbooks to maintain cross-functional alignment. Version control prevents regression tracking failures during incident response.

Phase 5: Automated Pipeline Orchestration & Alert Routing

Implementation Steps

  • Containerize metric pipeline using Docker with pinned base images.
  • Schedule execution via GitHub Actions or Airflow DAGs with retry logic.
  • Route alerts to PagerDuty/Slack based on severity classification.
  • Implement idempotent baseline updates to prevent duplicate metric ingestion.

Code Example

name: Baseline Recalculation Pipeline
on:
  schedule:
    - cron: '0 2 * * 1'
  workflow_dispatch:

jobs:
  compute-baseline:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    env:
      DB_HOST: ${{ secrets.DB_HOST }}
      DB_PASS: ${{ secrets.DB_PASS }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python pipeline/run_baseline.py --output-format parquet
      - uses: actions/upload-artifact@v4
        with:
          name: baseline-snapshot
          path: output/baseline_*.parquet

Common Mistakes

  • Missing timeout configurations causing zombie pipeline runs.
  • Hardcoded credentials instead of OIDC or secret manager integration.
  • Lack of pipeline observability (missing logs/metrics for the audit job itself).

Execution Notes

Deploy the orchestration layer with strict resource quotas. Implement automated rollback triggers for failed executions. Validate pipeline health alongside domain metrics. Ensure continuous monitoring reliability through idempotent artifact generation.

Automation & Reproducibility Controls

  • Execution Model: CI/CD triggered or cron-based scheduling.
  • Reproducibility Controls:
  • Pinned dependency manifests (requirements.txt, package-lock.json).
  • Deterministic random seeds for sampling routines.
  • Environment variable injection for secrets and endpoints.
  • Git-tagged baseline snapshots for immutable audit trails.
  • Output Formats: Parquet, JSON, Prometheus metrics endpoint.
  • Pipeline Validation: Pre-commit hooks for schema validation, integration tests for crawler middleware, and synthetic load testing for alert routing.