Storing & Versioning Crawl Artifacts in Cloud Storage

Without a disciplined storage layer, every crawl run becomes a disposable event. When a regression surfaces three releases later — a spike in LCP, a drop in internal link counts, a newly broken redirect — the only way to diagnose root cause is to diff the current crawl against an earlier one. If those earlier outputs were overwritten or never persisted, the investigation stalls. SREs, SEO engineers, and agency teams running automated crawling pipelines all face the same failure mode: ad-hoc local saves, timestamp collisions, and no canonical record of what the site looked like before the change.

This page covers every step from bucket provisioning through drift-detection queries, using AWS S3 and Google Cloud Storage as the reference implementations.

Prerequisites & Environment Setup

Requirement	Minimum version	Notes
`aws-cli`	2.15.x	`aws --version` must report v2
`boto3`	1.34.x	Pin in `requirements.txt`; use a virtualenv
`google-cloud-storage`	2.16.x	Pin alongside `google-auth>=2.29`
`python`	3.11.x	Match the version used in your CI runner image
`jq`	1.7.x	Required for manifest inspection in shell scripts

Required environment variables — export these before running any step below:

export AWS_REGION="us-east-1"
export CRAWL_BUCKET="crawl-artifacts-prod"
export CRAWL_ENV="production"          # or "staging"
export GOOGLE_APPLICATION_CREDENTIALS="/run/secrets/gcs-sa.json"
export PIPELINE_TOKEN="${PIPELINE_TOKEN:?PIPELINE_TOKEN must be set}"

Lock Python dependencies with an exact hash manifest so CI and local environments are byte-identical:

pip-compile --generate-hashes requirements.in \
  --output-file requirements.txt \
  --resolver=backtracking

Step 1 — Bucket Initialisation

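Provision the bucket with versioning, server-side encryption, and a hierarchical prefix layout. The key pattern {env}/{crawl_id}/{timestamp}/{resource_type}/ spreads writes across many prefixes, which reduces the risk of hot-partition throttling during concurrent high-throughput writes. S3 supports at least 3,500 PUT/COPY/POST/DELETE requests per second per partitioned prefix, and a flat layout funnels a large site crawl's writes toward a single prefix, where 503 SlowDown responses become likely.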

#!/usr/bin/env bash
set -euo pipefail

BUCKET="crawl-artifacts-prod"
REGION="us-east-1"
KMS_KEY_ARN="${KMS_KEY_ARN:?KMS_KEY_ARN required}"

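# Create bucket. us-east-1 must NOT include a LocationConstraint;
# every other region requires one.
if [[ "${REGION}" == "us-east-1" ]]; then
  aws s3api create-bucket \
    --bucket "${BUCKET}" \
    --region "${REGION}"
else
  aws s3api create-bucket \
    --bucket "${BUCKET}" \
    --region "${REGION}" \
    --create-bucket-configuration "LocationConstraint=${REGION}"
fi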

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket "${BUCKET}" \
  --versioning-configuration Status=Enabled

# Enforce server-side encryption with a customer-managed KMS key
aws s3api put-bucket-encryption \
  --bucket "${BUCKET}" \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "'"${KMS_KEY_ARN}"'"
      },
      "BucketKeyEnabled": true
    }]
  }'

# Block all public access
aws s3api put-public-access-block \
  --bucket "${BUCKET}" \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

echo "Bucket ${BUCKET} provisioned with versioning and KMS encryption."

The equivalent Google Cloud Storage setup, using gcloud:

#!/usr/bin/env bash
set -euo pipefail

PROJECT="my-audit-project"
BUCKET="crawl-artifacts-prod"
REGION="us-central1"
# The Cloud KMS key must live in the same location as the bucket
KMS_KEY="projects/${PROJECT}/locations/${REGION}/keyRings/crawl/cryptoKeys/prod"
SA="crawler-sa@${PROJECT}.iam.gserviceaccount.com"

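gcloud storage buckets create "gs://${BUCKET}" \
  --project="${PROJECT}" \
  --location="${REGION}" \
  --default-encryption-key="${KMS_KEY}" \
  --uniform-bucket-level-access

# Enable object versioning with an explicit update call
gcloud storage buckets update "gs://${BUCKET}" --versioning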

# Least-privilege IAM: write-only for the crawler service account
gcloud storage buckets add-iam-policy-binding "gs://${BUCKET}" \
  --member="serviceAccount:${SA}" \
  --role="roles/storage.objectCreator"

echo "GCS bucket ${BUCKET} ready."

Step 2 — Core Configuration

The table below lists every parameter that governs artifact storage behaviour. Set these in your config/storage.yaml and inject via environment variable at runtime.

Parameter	Type	Default	Purpose
`CRAWL_BUCKET`	string	—	Target bucket name; required, no default
`CRAWL_ENV`	string	`staging`	Key prefix guard — prevents staging writes hitting production
`RETENTION_HOT_DAYS`	int	`30`	Days before objects transition to Nearline/Intelligent-Tiering
`RETENTION_COLD_DAYS`	int	`90`	Days before non-baseline objects are deleted
`BASELINE_TAG`	string	`false`	Set `true` on objects that anchor regression comparisons — exempt from auto-delete
`CHECKSUM_ALGO`	string	`sha256`	Digest algorithm embedded in object metadata and key suffix
`UPLOAD_PART_SIZE_MB`	int	`8`	Multipart threshold; increase to `64` for HAR files over 1 GB
`MAX_UPLOAD_RETRIES`	int	`3`	Exponential-backoff retry count on transient 5xx responses
`CONCURRENCY_LOCK_TTL_S`	int	`300`	Redis SET NX TTL; prevents two crawl runs writing to the same prefix simultaneously
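
One way to load the parameters above in Python is a small frozen dataclass that reads the environment once and applies the table's defaults. This is a sketch, not the pipeline's actual loader: the `StorageConfig` class and `load_config` function are hypothetical names, and it assumes every value arrives as an environment variable. The check that the cold period exceeds the hot period is an added sanity check, not something the table requires.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical container for the storage parameters in the table above."""
    crawl_bucket: str
    crawl_env: str = "staging"
    retention_hot_days: int = 30
    retention_cold_days: int = 90
    baseline_tag: str = "false"
    checksum_algo: str = "sha256"
    upload_part_size_mb: int = 8
    max_upload_retries: int = 3
    concurrency_lock_ttl_s: int = 300


def load_config(env: Mapping[str, str] = os.environ) -> StorageConfig:
    """Build a StorageConfig from environment-style key/value pairs."""
    bucket = env.get("CRAWL_BUCKET")
    if not bucket:
        # CRAWL_BUCKET has no default, so fail fast rather than guess.
        raise ValueError("CRAWL_BUCKET must be set")
    cfg = StorageConfig(
        crawl_bucket=bucket,
        crawl_env=env.get("CRAWL_ENV", "staging"),
        retention_hot_days=int(env.get("RETENTION_HOT_DAYS", 30)),
        retention_cold_days=int(env.get("RETENTION_COLD_DAYS", 90)),
        baseline_tag=env.get("BASELINE_TAG", "false"),
        checksum_algo=env.get("CHECKSUM_ALGO", "sha256"),
        upload_part_size_mb=int(env.get("UPLOAD_PART_SIZE_MB", 8)),
        max_upload_retries=int(env.get("MAX_UPLOAD_RETRIES", 3)),
        concurrency_lock_ttl_s=int(env.get("CONCURRENCY_LOCK_TTL_S", 300)),
    )
    # Assumed sanity check: objects should be tiered before they are deleted.
    if cfg.retention_cold_days <= cfg.retention_hot_days:
        raise ValueError("RETENTION_COLD_DAYS must exceed RETENTION_HOT_DAYS")
    return cfg
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in unit tests.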

Apply lifecycle rules immediately after bucket creation. These rules tie back to RETENTION_HOT_DAYS and RETENTION_COLD_DAYS:

#!/usr/bin/env bash
set -euo pipefail

BUCKET="${CRAWL_BUCKET:?}"
HOT="${RETENTION_HOT_DAYS:-30}"
COLD="${RETENTION_COLD_DAYS:-90}"

aws s3api put-bucket-lifecycle-configuration \
  --bucket "${BUCKET}" \
  --lifecycle-configuration "$(cat <<JSON
{
  "Rules": [
    {
      "ID": "tier-to-intelligent-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": ${HOT}, "StorageClass": "INTELLIGENT_TIERING"}
      ]
    },
    {
      "ID": "expire-non-baseline-artifacts",
      "Status": "Enabled",
      "Filter": {
        "And": {
          "Prefix": "",
          "Tags": [{"Key": "baseline", "Value": "false"}]
        }
      },
      "Expiration": {"Days": ${COLD}},
      "NoncurrentVersionExpiration": {"NoncurrentDays": ${COLD}}
    }
  ]
}
JSON
)"

echo "Lifecycle rules applied: hot=${HOT}d, cold=${COLD}d."
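
The same rules can also be assembled in Python. The sketch below builds the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, mirroring the JSON in the shell script above. The helper name `build_lifecycle_config` is hypothetical, and the boto3 call appears only in a comment so the block runs without AWS credentials.

```python
def build_lifecycle_config(hot_days: int = 30, cold_days: int = 90) -> dict:
    """Return lifecycle rules equivalent to the JSON in the shell script above."""
    if hot_days <= 0 or cold_days <= 0:
        raise ValueError("retention periods must be positive")
    return {
        "Rules": [
            {
                # Move every object to Intelligent-Tiering after the hot period.
                "ID": "tier-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": hot_days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            },
            {
                # Expire only objects explicitly tagged baseline=false.
                "ID": "expire-non-baseline-artifacts",
                "Status": "Enabled",
                "Filter": {
                    "And": {
                        "Prefix": "",
                        "Tags": [{"Key": "baseline", "Value": "false"}],
                    }
                },
                "Expiration": {"Days": cold_days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": cold_days},
            },
        ]
    }


# With credentials configured, the dictionary would be applied like this:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket=bucket, LifecycleConfiguration=build_lifecycle_config(30, 90))
```

Building the structure in code makes it straightforward to assert on the rule contents in a unit test before anything touches the bucket.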

Step 3 — Execution: Upload with SHA-256 Manifest

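The upload function computes a SHA-256 digest before writing and embeds it in the key, so identical content always produces the same file name and changed content can never silently reuse an old one. Two caveats apply to the script as written. First, the key also contains a timestamp taken at upload time, so a retried upload of the same file lands on a new key instead of the original one. Second, put_object does not skip keys that already exist: it overwrites them, and with bucket versioning enabled each overwrite is stored as a new version. Both behaviours are safe for retries because nothing is lost, but neither is idempotent. Keep this in mind when managing crawl budget and rate limiting requires retry logic at the crawler layer.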

#!/usr/bin/env python3
"""upload_artifact.py — upload a single crawl output file to S3 with a SHA-256 manifest."""
import hashlib
import os
import time
from pathlib import Path

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError


BUCKET = os.environ["CRAWL_BUCKET"]
ENV    = os.environ.get("CRAWL_ENV", "staging")
REGION = os.environ.get("AWS_REGION", "us-east-1")


def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65_536), b""):
            h.update(chunk)
    return h.hexdigest()


def upload_artifact(
    file_path: str,
    crawl_id: str,
    resource_type: str,
    is_baseline: bool = False,
) -> dict:
    path    = Path(file_path)
    digest  = sha256_file(path)
    ts      = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    key     = f"{ENV}/{crawl_id}/{ts}/{resource_type}/{digest}.json"

    client = boto3.client(
        "s3",
        region_name=REGION,
        config=Config(retries={"max_attempts": int(os.environ.get("MAX_UPLOAD_RETRIES", 3)),
                                "mode": "adaptive"}),
    )

    tagging = (
        "baseline=true&crawl_id=" + crawl_id
        if is_baseline
        else "baseline=false&crawl_id=" + crawl_id
    )

    try:
        client.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=path.read_bytes(),
            Metadata={
                "x-checksum-sha256": digest,
                "x-crawl-id":        crawl_id,
                "x-resource-type":   resource_type,
                "x-env":             ENV,
            },
            Tagging=tagging,
        )
    except ClientError as exc:
        raise RuntimeError(f"S3 upload failed for {key}: {exc}") from exc

    return {"key": key, "digest": digest, "bucket": BUCKET}


if __name__ == "__main__":
    import argparse, json
    p = argparse.ArgumentParser()
    p.add_argument("file")
    p.add_argument("--crawl-id", required=True)
    p.add_argument("--type", dest="resource_type", default="crawl-output")
    p.add_argument("--baseline", action="store_true")
    args = p.parse_args()
    result = upload_artifact(args.file, args.crawl_id, args.resource_type, args.baseline)
    print(json.dumps(result, indent=2))
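
Because the key above embeds a timestamp taken at upload time, a retried upload of the same file lands on a different key. If retries should map to the same object, one option is to fix the timestamp once per crawl run and pass it in. The sketch below assumes a hypothetical `build_key` helper and a caller-supplied, timezone-aware `run_started` value; neither exists in upload_artifact.py.

```python
import hashlib
from datetime import datetime, timezone


def build_key(env: str, crawl_id: str, run_started: datetime,
              resource_type: str, payload: bytes, suffix: str = ".json") -> str:
    """Derive an object key only from values that stay fixed across retries."""
    if run_started.tzinfo is None:
        # A naive datetime would be interpreted as local time; reject it.
        raise ValueError("run_started must be timezone-aware")
    digest = hashlib.sha256(payload).hexdigest()
    # Format the run-level timestamp in UTC, matching the layout used above.
    ts = run_started.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{env}/{crawl_id}/{ts}/{resource_type}/{digest}{suffix}"
```

With this shape, two attempts to upload the same bytes for the same run resolve to one key, so a caller can check for the key first and skip the second write.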

Step 4 — CI/CD Sync & Downstream Trigger

Wire artifact sync into your deployment pipeline so every merge to main produces a versioned crawl snapshot. This feeds directly into integrating custom crawlers with CI/CD pipelines — the downstream validation gate reads from the versioned prefix written here.

# .github/workflows/crawl-artifact-sync.yml
name: crawl-artifact-sync

on:
  push:
    branches: [main]

concurrency:
  group: crawl-artifact-sync-${{ github.ref }}
  cancel-in-progress: false   # do not cancel; let in-flight uploads finish

env:
  AWS_REGION:   us-east-1
  CRAWL_BUCKET: ${{ secrets.CRAWL_BUCKET }}
  CRAWL_ENV:    production
  CRAWL_ID:     ${{ github.sha }}

jobs:
  upload-and-verify:
    runs-on: ubuntu-24.04
    permissions:
      id-token: write   # required for OIDC → AWS assume-role
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region:     ${{ env.AWS_REGION }}

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Restore crawl cache
        uses: actions/cache@v4
        with:
          path: ./crawl-output
          key:  crawl-${{ github.sha }}
          restore-keys: crawl-

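      - name: Upload artifacts
        run: |
          # Upload through upload_artifact.py rather than `aws s3 sync` so every
          # object gets the checksum metadata and tags that later steps rely on.
          # Assumes the script is stored at scripts/upload_artifact.py.
          find ./crawl-output -type f -print0 \
            | xargs -0 -r -n1 python3 scripts/upload_artifact.py \
                --crawl-id "${CRAWL_ID}"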

      - name: Verify checksums
        run: |
          python3 scripts/verify_manifest.py \
            --bucket "${CRAWL_BUCKET}" \
            --prefix  "${CRAWL_ENV}/${CRAWL_ID}/"

      - name: Trigger downstream analytics
        if: success()
        run: |
          curl --fail --silent --show-error \
            -X POST \
            -H "Authorization: Bearer ${{ secrets.PIPELINE_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{"crawl_id":"${{ env.CRAWL_ID }}","env":"${{ env.CRAWL_ENV }}"}' \
            https://api.internal/v1/jobs/parse

The verify_manifest.py script lists every object under the prefix, reads its x-checksum-sha256 metadata value, re-downloads the object body, and re-computes the digest locally. Any mismatch raises a non-zero exit and fails the workflow:

#!/usr/bin/env python3
"""verify_manifest.py — re-verify SHA-256 checksums of all objects under a prefix."""
import argparse, hashlib, sys
import boto3

def verify_prefix(bucket: str, prefix: str) -> int:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    failures = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key  = obj["Key"]
            meta = s3.head_object(Bucket=bucket, Key=key)["Metadata"]
            expected = meta.get("x-checksum-sha256", "")
            if not expected:
                print(f"WARN  no checksum metadata on {key}")
                continue
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            actual = hashlib.sha256(body).hexdigest()
            if actual != expected:
                print(f"FAIL  {key}: expected {expected}, got {actual}")
                failures += 1
            else:
                print(f"OK    {key}")
    return failures

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--bucket", required=True)
    p.add_argument("--prefix", required=True)
    args = p.parse_args()
    fails = verify_prefix(args.bucket, args.prefix)
    # Exit statuses wrap modulo 256, so map any failure count to 1
    sys.exit(1 if fails else 0)
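
verify_manifest.py reads each object fully into memory before hashing, which can be costly for multi-gigabyte HAR files. A possible variation hashes the body in chunks instead. The sketch assumes a hypothetical `sha256_stream` helper that accepts any object with a `read(size)` method; the `Body` returned by boto3's `get_object` offers one, and `io.BytesIO` stands in for it here.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_stream(body: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a readable stream chunk by chunk and return the hex digest."""
    h = hashlib.sha256()
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            # An empty read signals end of stream.
            break
        h.update(chunk)
    return h.hexdigest()


# Stand-in for an S3 response body, used only for illustration.
sample = io.BytesIO(b"example crawl output")
sample_digest = sha256_stream(sample, chunk_size=8)
```

Memory use is then bounded by `chunk_size` rather than by the size of the largest artifact.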

Verification Checklist

Run these steps after every pipeline execution to confirm the workflow completed correctly:

1. Confirm object count. Run `aws s3 ls --recursive "s3://${CRAWL_BUCKET}/${CRAWL_ENV}/${CRAWL_ID}/" | wc -l` and compare the result against the crawler's reported URL count; a mismatch signals a dropped write.
2. Spot-check metadata. Run `aws s3api head-object --bucket "${CRAWL_BUCKET}" --key <sample-key>` and verify that the x-checksum-sha256, x-crawl-id, and x-env metadata fields are present.
3. Run verify_manifest.py. Zero failures expected; any FAIL line is a corrupted artifact.
4. Inspect lifecycle tags. Run `aws s3api get-object-tagging --bucket "${CRAWL_BUCKET}" --key <key>` and confirm that the baseline and crawl_id tags are set.
5. Check versioning. Run `aws s3api list-object-versions --bucket "${CRAWL_BUCKET}" --prefix "${CRAWL_ENV}/${CRAWL_ID}/"`; every object should have exactly one version for a fresh crawl prefix.
6. Verify downstream trigger. Check the analytics API job queue to confirm the parse job was enqueued with the correct crawl_id.

Drift Detection Queries

Once versioned artifacts exist, normalise them into a queryable format such as JSON Lines or Parquet and load them into BigQuery or Athena. The SQL below uses BigQuery syntax (the backtick-quoted table name needs adjusting for Athena) and flags Core Web Vitals drift across successive crawl versions: LCP shifts of more than 500 ms in either direction, CLS above 0.1, and INP above 200 ms. This feeds directly into metric scoring and data normalisation workflows, where these deltas become inputs to health-score algorithms.

When working with headless browser configurations that produce JavaScript-rendered snapshots, correlating those render states with the immutable version ID ensures each baseline accurately represents the rendered DOM at that point in time.

-- Detect Core Web Vitals regressions between consecutive crawl versions
WITH versioned_metrics AS (
  SELECT
    crawl_id,
    artifact_version,
    url,
    status_code,
    lcp_ms,
    cls_score,
    inp_ms,
    wcag_violations,
    LAG(lcp_ms)    OVER (PARTITION BY url ORDER BY artifact_version) AS prev_lcp,
    LAG(cls_score) OVER (PARTITION BY url ORDER BY artifact_version) AS prev_cls,
    LAG(inp_ms)    OVER (PARTITION BY url ORDER BY artifact_version) AS prev_inp
  FROM `project.dataset.crawl_artifacts`
  WHERE artifact_version BETWEEN 'v2026.01' AND 'v2026.06'
)
SELECT
  url,
  artifact_version,
  lcp_ms,
  COALESCE(lcp_ms - prev_lcp, 0)                                         AS lcp_delta_ms,
  CASE WHEN cls_score > 0.1  THEN 'FAIL' ELSE 'PASS' END                 AS cls_status,
  CASE WHEN inp_ms   > 200   THEN 'DEGRADED' ELSE 'OPTIMAL' END          AS inp_status,
  CASE WHEN wcag_violations > 0 THEN wcag_violations ELSE 0 END          AS a11y_violations
FROM versioned_metrics
WHERE
  ABS(COALESCE(lcp_ms - prev_lcp, 0)) > 500
  OR cls_score > 0.1
  OR inp_ms    > 200
ORDER BY lcp_delta_ms DESC
LIMIT 200;
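
The thresholds in the query can be exercised locally before any data reaches a warehouse. The sketch below mirrors the LAG-based LCP delta and the CLS and INP checks on in-memory rows. The `flag_drift` function and the dictionary row shape are hypothetical, it assumes no NULL metric values, and it omits the version-range filter and the LIMIT.

```python
from collections import defaultdict
from typing import Iterable


def flag_drift(rows: Iterable[dict]) -> list[dict]:
    """Flag rows the way the query's WHERE clause does, ordered by LCP delta."""
    by_url: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_url[row["url"]].append(row)

    flagged = []
    for url, versions in by_url.items():
        # Equivalent of PARTITION BY url ORDER BY artifact_version.
        versions.sort(key=lambda r: r["artifact_version"])
        prev_lcp = None
        for row in versions:
            # Like COALESCE(lcp_ms - prev_lcp, 0): the first version has no delta.
            delta = 0 if prev_lcp is None else row["lcp_ms"] - prev_lcp
            if abs(delta) > 500 or row["cls_score"] > 0.1 or row["inp_ms"] > 200:
                flagged.append({
                    "url": url,
                    "artifact_version": row["artifact_version"],
                    "lcp_delta_ms": delta,
                    "cls_status": "FAIL" if row["cls_score"] > 0.1 else "PASS",
                    "inp_status": "DEGRADED" if row["inp_ms"] > 200 else "OPTIMAL",
                })
            prev_lcp = row["lcp_ms"]

    # Equivalent of ORDER BY lcp_delta_ms DESC.
    flagged.sort(key=lambda r: r["lcp_delta_ms"], reverse=True)
    return flagged
```

A handful of synthetic rows per URL is usually enough to confirm that a threshold change behaves as intended before editing the production query.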

Troubleshooting

Symptom: S3 PutObject fails with SlowDown (HTTP 503)

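Root cause: too many concurrent writes to the same key prefix. S3 supports at least 3,500 PUT/COPY/POST/DELETE requests per second per partitioned prefix and returns 503 SlowDown when a burst exceeds what the prefix can currently absorb.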

# Verify prefix distribution
aws s3 ls --recursive "s3://${CRAWL_BUCKET}/${CRAWL_ENV}/" \
  | awk '{print $4}' \
  | cut -d'/' -f1-4 \
  | sort | uniq -c | sort -rn | head -20

Fix: add a UUIDv4 shard segment to the key path, or introduce a jitter delay (shuf -i 1-5 -n 1 | xargs sleep) between concurrent upload workers.
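
A jittered delay can also live inside the upload worker rather than in the shell. The following is a sketch of exponential backoff with full jitter, not the retry logic botocore applies internally; `backoff_delays` and its parameters are hypothetical names chosen for illustration.

```python
import random
from typing import Iterator, Optional


def backoff_delays(retries: int = 3, base_s: float = 1.0, cap_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomised delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    for attempt in range(retries):
        # The ceiling doubles each attempt but never exceeds cap_s.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly between zero and the ceiling.
        yield rng.uniform(0.0, ceiling)
```

A worker would call `time.sleep(delay)` for each yielded value before retrying the failed write. Spreading retries randomly keeps concurrent workers from hitting the same prefix in lockstep.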

Symptom: Version IDs are not assigned; every object shows a VersionId of null

Root cause: bucket versioning was never enabled, or was suspended after provisioning.

aws s3api get-bucket-versioning --bucket "${CRAWL_BUCKET}"
# If output is empty or shows "Suspended", re-enable:
aws s3api put-bucket-versioning \
  --bucket "${CRAWL_BUCKET}" \
  --versioning-configuration Status=Enabled

Symptom: Lifecycle rules are not expiring objects

Root cause: the object's tags do not match the rule's baseline=false filter. Tag keys and values are case-sensitive, so a tag written at upload time as Baseline=false or baseline=False never matches.

# Inspect the active lifecycle configuration
aws s3api get-bucket-lifecycle-configuration --bucket "${CRAWL_BUCKET}" | jq .

# Verify a specific object's tags
aws s3api get-object-tagging \
  --bucket "${CRAWL_BUCKET}" \
  --key    "${CRAWL_ENV}/${CRAWL_ID}/20260101T000000Z/crawl-output/abc123.json" \
  | jq .TagSet

Fix: ensure the Tagging parameter in put_object uses baseline=false (lowercase) to match the lifecycle filter exactly.
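
A quick local comparison can catch this kind of mismatch before waiting days for a rule to fire. The sketch below checks a `TagSet` list, in the shape returned by get-object-tagging, against the tags a lifecycle filter requires. The `tags_match` helper is a hypothetical name; the comparison is deliberately case-sensitive.

```python
def tags_match(tag_set: list[dict], required: dict[str, str]) -> bool:
    """Return True when every required tag is present with the exact same value."""
    actual = {tag["Key"]: tag["Value"] for tag in tag_set}
    # Exact string equality, so "False" or "Baseline" will not match.
    return all(actual.get(key) == value for key, value in required.items())
```

For example, `tags_match(tag_set, {"baseline": "false"})` returns False for an object tagged `baseline=False`, which is the situation where the expiry rule never applies.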

Symptom: verify_manifest.py reports FAIL on checksums after a successful upload

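Root cause: verify_manifest.py compares the stored x-checksum-sha256 metadata against a fresh SHA-256 of the object body, so a FAIL means the uploaded bytes differ from the bytes that were hashed. In upload_artifact.py the file is read once for the digest and again for the upload, so a file that is still being written in between produces a mismatch; upload only after the crawler has closed the file. A second source of false alarms is any other check that compares a digest against the S3 ETag: for multipart uploads the ETag is not a hash of the whole object.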

# Check if an object was uploaded via multipart (ETag contains a hyphen)
aws s3api head-object \
  --bucket "${CRAWL_BUCKET}" \
  --key    "<suspect-key>" \
  | jq '.ETag'
# A "-" in the ETag string (e.g. "abc123-4") signals multipart

Fix: compute the SHA-256 digest from the raw file content (not the ETag) before upload, and store it in custom metadata — which verify_manifest.py already reads. Do not use S3 ETags as checksums for multipart objects.
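
The hyphen check can be done in Python as well, for instance on the `ETag` value returned by a head_object call, which arrives wrapped in double quotes. This is a small sketch; `multipart_part_count` is a hypothetical helper name.

```python
import re
from typing import Optional

# Hex characters followed by "-N", where N is the number of parts.
_MULTIPART_ETAG = re.compile(r"^[0-9a-fA-F]+-(\d+)$")


def multipart_part_count(etag: str) -> Optional[int]:
    """Return the part count for a multipart-style ETag, or None otherwise."""
    match = _MULTIPART_ETAG.match(etag.strip('"'))
    return int(match.group(1)) if match else None
```

A non-None result is a signal to rely on the SHA-256 stored in metadata rather than on the ETag when validating the object.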

Symptom: Two concurrent CI runs produce identical object keys and one overwrites the other

Root cause: crawl_id was set to the branch name rather than the commit SHA, so two runs on the same branch share a prefix.

# Correct: use the commit SHA as crawl_id
env:
  CRAWL_ID: ${{ github.sha }}   # not ${{ github.ref_name }}

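Additionally, add a concurrency block to the workflow (shown above) so that runs on the same branch execute one at a time instead of in parallel. GitHub keeps at most one pending run per concurrency group, so an older queued run is cancelled when a newer push arrives.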

Symptom: Lifecycle auto-delete removes baseline artifacts used for regression

Root cause: baseline objects were tagged baseline=false by default and fell under the auto-delete rule.

# Retroactively tag an object as a baseline to exempt it
aws s3api put-object-tagging \
  --bucket "${CRAWL_BUCKET}" \
  --key    "${CRAWL_ENV}/${CRAWL_ID}/20260101T000000Z/crawl-output/abc123.json" \
  --tagging '{"TagSet":[{"Key":"baseline","Value":"true"},{"Key":"crawl_id","Value":"'"${CRAWL_ID}"'"}]}'

Should I use bucket versioning or object-key versioning for crawl artifacts?

Use both. Enable native bucket versioning as a safety net against accidental overwrites, and embed a deterministic version string (crawl ID + SHA-256 digest) in the object key. Key-level versioning makes retrieval O(1) without listing all versions; bucket versioning protects against human error.

How long should raw HAR files be retained vs. normalised Parquet outputs?

Raw HAR/JSON artifacts serve forensic purposes and can be tiered to Nearline or Coldline after 30 days and deleted after 90. Normalised Parquet files used for trend queries should be retained for at least 12 months to keep year-over-year comparisons valid.

What happens if two crawler instances write to the same prefix simultaneously?

Without a distributed lock the last writer wins and you silently lose one crawl run. Prevent this by including a UUIDv4 in every object key and enforcing mutual exclusion at the orchestration layer — for example via a Redis SET NX lock or GitHub Actions concurrency group — before writes reach the bucket.
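
The locking behaviour can be illustrated without a Redis server. The sketch below is a single-process stand-in that mimics the semantics of a Redis SET with the NX and EX options: acquisition fails while an unexpired holder exists, and only the holder's token can release the lock. The `InMemoryLock` class is hypothetical and is not distributed; a real deployment would issue the equivalent command against Redis.

```python
import time
import uuid
from typing import Callable, Optional


class InMemoryLock:
    """Single-process illustration of set-if-absent locking with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Maps prefix -> (owner token, expiry time).
        self._held: dict[str, tuple[str, float]] = {}

    def acquire(self, prefix: str, ttl_s: int = 300) -> Optional[str]:
        """Return an owner token, or None if another run holds the prefix."""
        now = self._clock()
        holder = self._held.get(prefix)
        if holder is not None and holder[1] > now:
            return None
        token = uuid.uuid4().hex
        self._held[prefix] = (token, now + ttl_s)
        return token

    def release(self, prefix: str, token: str) -> bool:
        """Release only if the caller still owns the lock."""
        holder = self._held.get(prefix)
        if holder is not None and holder[0] == token:
            del self._held[prefix]
            return True
        return False
```

The TTL matters because a crashed crawl run never calls release; expiry lets the next run proceed, and the token check stops a slow, expired run from releasing a lock that now belongs to someone else.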

Can I use the same bucket for staging and production crawl artifacts?

Technically yes, but separate key prefixes (/staging/ vs /production/) and separate IAM roles are mandatory. A cleaner boundary is two buckets: it prevents lifecycle rules from accidentally deleting production baselines and keeps cost attribution clean.

Related

Automated Crawling & Pipeline Tooling — parent section covering the full pipeline from crawler configuration through artifact storage
Integrating Custom Crawlers with CI/CD Pipelines — wiring versioned artifacts into automated regression gates
Managing Crawl Budget & Rate Limiting — controlling request rates that determine artifact volume and storage growth
Expiring Old Crawl Artifacts with Lifecycle Rules — GCS and S3 lifecycle policies for tiering and expiry
Metric Scoring & Data Normalisation — consuming versioned artifacts to compute and track health scores across releases
