Configuring Spatial Metric Collection in Kubernetes

Configuring spatial metric collection in Kubernetes demands a deterministic approach to instrumentation, scrape topology, and alert routing. When geospatial pipelines degrade, the difference between a five-minute MTTR and a multi-hour outage hinges on whether your observability stack captures coordinate reference system (CRS) transformation failures, spatial index rebuild latency, and vector tile generation throughput at a resolution fine enough to localize the fault. The baseline architecture must treat spatial workloads as first-class observability citizens rather than generic HTTP microservices. Building on the Geospatial Observability Architecture & Fundamentals foundation keeps metric cardinality bounded while preserving the spatial context required for rapid triage.

1. OpenTelemetry Collector Deployment & Pipeline Configuration

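Deploy the OpenTelemetry Collector as a dedicated Deployment within your GIS namespace, or as a DaemonSet if node-level spatial workers require local telemetry aggregation. The collector must receive raw application telemetry from PostGIS exporters, GeoServer instances, or custom ETL workers, normalize it into OpenTelemetry metrics format, and expose it on an endpoint that Prometheus scrapes. The manifest below defines only the Deployment; the collector addresses used later (otel-collector-spatial.gis-platform on ports 4317 and 8889) assume a Service of the same name in front of these pods.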

flowchart LR
  EXP["PostGIS / GeoServer / ETL exporters"] --> COL["OTel Collector · Deployment or DaemonSet"]
  COL --> P["batch · filter · transform · attributes"]
  P --> PROM["Prometheus :8889"]
  PROM --> AL["PrometheusRule alerts"]
  AL --> PD["PagerDuty / Slack"]

Kubernetes Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-spatial
  namespace: gis-platform
  labels:
    app: otel-collector
    component: spatial-metrics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.95.0
        args: ["--config=/etc/otel/config.yaml"]
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http
        - containerPort: 8889
          name: prom-export
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
      volumes:
      - name: otel-config
        configMap:
          name: otel-collector-config-spatial

Collector Pipeline Configuration

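Apply the following collector configuration to enforce strict metric filtering, rename conventions, and spatial attribute extraction. Store it under the config.yaml key of the otel-collector-config-spatial ConfigMap so that it mounts at /etc/otel/config.yaml, the path passed to the collector in the Deployment above. This configuration strips non-spatial noise while preserving CRS lineage and grid-level context.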

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - spatial_query_duration_seconds
          - vector_tile_generation_errors_total
          - crs_transform_failures_total
          - spatial_index_rebuild_latency_ms
          - bbox_query_cache_hit_ratio
  transform:
    metric_statements:
      - context: metric
        statements:
          - set(name, "gis.spatial.query.duration") where name == "spatial_query_duration_seconds"
          - set(name, "gis.spatial.crs.transform.failures") where name == "crs_transform_failures_total"
  attributes:
    actions:
      - key: crs_source
        action: upsert
        from_attribute: spatial.source_epsg
      - key: crs_target
        action: upsert
        from_attribute: spatial.target_epsg
      - key: tile_grid_level
        action: upsert
        from_attribute: spatial.grid_level

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: gis_platform
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      # batch runs last so filtering and renaming happen before metrics are batched
      processors: [filter, transform, attributes, batch]
      exporters: [prometheus]

Note: The transform processor (contrib) renames metrics through OTTL statements. The metricstransform processor, also shipped in contrib builds, offers an equivalent rename action; if you prefer it, replace the transform block with the corresponding metricstransform block.
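
The alert expressions later in this guide use names such as gis_platform_gis_spatial_crs_transform_failures_total. The sketch below approximates how those names arise from the configuration above: the exporter namespace is prefixed, dots become underscores, and monotonic counters end in _total. It is a simplified model rather than the exporter's actual code (it ignores unit suffixes such as _seconds), and the helper name prometheus_name is hypothetical.

```python
import re


def prometheus_name(otel_name: str, namespace: str = "", is_counter: bool = False) -> str:
    """Approximate the series name exposed on the Prometheus endpoint (simplified)."""
    # Replace every character that is not valid in a Prometheus metric name.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # The exporter's `namespace` setting is prepended with an underscore.
    if namespace:
        name = f"{namespace}_{name}"
    # Monotonic sums end in _total unless the name already carries the suffix.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


# Renamed by the transform processor, then prefixed by the exporter.
print(prometheus_name("gis.spatial.crs.transform.failures", "gis_platform", is_counter=True))
# Not renamed, so only the namespace prefix is added.
print(prometheus_name("spatial_index_rebuild_latency_ms", "gis_platform"))
```

Metrics that the transform block does not rename keep their original name and receive only the gis_platform prefix, which matters when writing scrape filters and alert expressions.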

2. Instrumentation & Metric Taxonomy Alignment

When instrumenting ETL workers, tile servers, and spatial databases, align naming conventions with a standardized OpenTelemetry Integration for GIS Pipelines schema to prevent cardinality explosions during high-throughput tile generation. Avoid embedding dynamic bounding box coordinates or raw geometry hashes in metric labels. Instead, quantize spatial dimensions into discrete grid levels, CRS identifiers, and query complexity tiers.
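
As a minimal sketch of that quantization, the helpers below turn a longitude/latitude bounding box and a vertex count into a small, fixed set of label values. The function names, the query.complexity_tier key, the tier thresholds, and the 360/2^z tile-width rule are illustrative assumptions, not part of any schema referenced here.

```python
import math


def grid_level_for_bbox(min_lon: float, min_lat: float,
                        max_lon: float, max_lat: float,
                        max_level: int = 20) -> int:
    """Return the deepest zoom level whose tile width (360 / 2**z degrees) still covers the bbox."""
    span = max(max_lon - min_lon, max_lat - min_lat)
    if span <= 0:
        # Degenerate (point-like) boxes map to the deepest level.
        return max_level
    level = math.floor(math.log2(360.0 / span))
    return max(0, min(max_level, level))


def complexity_tier(vertex_count: int) -> str:
    """Bucket a geometry's vertex count into three tiers (thresholds are assumptions)."""
    if vertex_count < 100:
        return "simple"
    if vertex_count < 10_000:
        return "moderate"
    return "complex"


def spatial_labels(bbox: tuple, vertex_count: int, source_epsg: str, target_epsg: str) -> dict:
    """Build bounded-cardinality attributes instead of raw coordinates or geometry hashes."""
    return {
        "spatial.grid_level": str(grid_level_for_bbox(*bbox)),
        "spatial.source_epsg": source_epsg,
        "spatial.target_epsg": target_epsg,
        "query.complexity_tier": complexity_tier(vertex_count),  # hypothetical key
    }


print(spatial_labels((0.0, 0.0, 1.0, 1.0), 250, "EPSG:4326", "EPSG:3857"))
```

With at most 21 grid levels, three tiers, and a handful of EPSG codes, the number of label combinations stays bounded no matter how many distinct bounding boxes clients request.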

Python Instrumentation Example

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

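# Export metrics over OTLP/gRPC to the collector Service on a fixed interval.
reader = PeriodicExportingMetricReader(
    exporter=OTLPMetricExporter(
        endpoint="otel-collector-spatial.gis-platform:4317",
        insecure=True  # plaintext gRPC; assumes in-cluster traffic only
    )
)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("gis.tile.generator")

# The name must match an entry in the collector's filter include list.
tile_generation_counter = meter.create_counter(
    "vector_tile_generation_errors_total",
    description="Tracks vector tile generation failures by grid level and CRS",
    unit="1"
)

def emit_tile_error(grid_level: int, source_crs: str, target_crs: str,
                    failure_type: str = "projection_mismatch"):
    """Record one tile generation failure with bounded-cardinality attributes."""
    tile_generation_counter.add(
        1,
        attributes={
            # The collector's attributes processor copies the spatial.* keys
            # to crs_source, crs_target, and tile_grid_level.
            "spatial.grid_level": str(grid_level),
            "spatial.source_epsg": source_crs,
            "spatial.target_epsg": target_crs,
            "failure_type": failure_type
        }
    )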

3. Prometheus Scrape Topology & Alert Routing

Configure Prometheus to scrape the collector’s /metrics endpoint on port 8889 and define alerting rules over the exported spatial series. Use the official Prometheus Alerting Rules Documentation as a reference for syntax validation.

Prometheus Configuration Snippet

scrape_configs:
  - job_name: 'gis-otel-collector'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['otel-collector-spatial.gis-platform.svc.cluster.local:8889']
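    metric_relabel_configs:
      # __name__ exists only after the scrape, so name-based filtering belongs here.
      # The pattern covers every series under the exporter namespace, including
      # metrics that the collector does not rename.
      - source_labels: [__name__]
        regex: 'gis_platform_.*'
        action: keep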
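
Prometheus anchors relabel regexes at both ends, so a keep rule retains only series whose full name matches. The sketch below mirrors that behavior with re.fullmatch to check which series survive a given pattern. The helper kept_series is hypothetical, and the series names are the ones this guide's collector configuration is assumed to export.

```python
import re


def kept_series(names, keep_regex):
    """Return the series names that a fully anchored 'keep' rule would retain."""
    pattern = re.compile(keep_regex)
    return [name for name in names if pattern.fullmatch(name)]


# Assumed exported names: namespace prefix plus the collector-side metric name.
exported = [
    "gis_platform_gis_spatial_crs_transform_failures_total",
    "gis_platform_spatial_index_rebuild_latency_ms",
    "gis_platform_bbox_query_cache_hit_ratio",
]

# A pattern limited to the renamed gis.spatial.* metrics drops the other two series.
narrow = kept_series(exported, "gis_platform_gis_spatial_.*")
# A namespace-wide pattern keeps all three.
broad = kept_series(exported, "gis_platform_.*")

print(narrow)
print(broad)
```

Running this kind of check against the names your alerts reference is a cheap way to confirm that a keep rule does not silently drop a series an alert depends on.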

PromQL Alert Rules

Deploy these rules via PrometheusRule CRDs. They enforce strict thresholds for spatial pipeline degradation.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gis-spatial-alerts
  namespace: monitoring
spec:
  groups:
  - name: spatial-pipeline-reliability
    interval: 15s
    rules:
    - alert: HighCRSTransformFailureRate
      expr: |
        rate(gis_platform_gis_spatial_crs_transform_failures_total[5m]) > 0.05
      for: 2m
      labels:
        severity: critical
        team: gis-platform
      annotations:
        summary: "CRS transformation failure rate exceeds 0.05 failures/s over 5m window"
        runbook_url: "/runbooks/spatial-crs-drift"

    - alert: VectorTileGenerationLatencyDegradation
      expr: |
        histogram_quantile(0.99, sum by (le) (rate(gis_platform_gis_spatial_query_duration_seconds_bucket[5m]))) > 2.5
      for: 3m
      labels:
        severity: warning
        team: tile-ops
      annotations:
        summary: "P99 spatial query duration exceeds 2.5s over 5m window"

    - alert: SpatialIndexRebuildStall
      expr: |
        gis_platform_spatial_index_rebuild_latency_ms > 300000
      for: 1m
      labels:
        severity: critical
        team: db-sre
      annotations:
        summary: "Spatial index rebuild operation stalled beyond 5m threshold"
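
The HighCRSTransformFailureRate expression compares a per-second rate, not a percentage, so it helps to translate the threshold into a failure count. The sketch below does that arithmetic under a simplifying assumption: it averages the counter increase over the window and ignores the counter-reset handling and extrapolation that PromQL's rate() performs. Both function names are hypothetical.

```python
def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window (simplified rate())."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last_value - first_value) / window_seconds


def failures_at_threshold(threshold_per_second: float, window_seconds: float) -> float:
    """Counter increase over the window that corresponds to the alert threshold."""
    return threshold_per_second * window_seconds


# A threshold of 0.05 failures/s over a 5-minute (300 s) window is about 15 failures.
print(failures_at_threshold(0.05, 300))

# 18 failures in 300 s averages above the threshold; 12 failures does not.
print(per_second_rate(100, 118, 300) > 0.05)
print(per_second_rate(100, 112, 300) > 0.05)
```

To alert on a true failure percentage instead, the expression would need to divide by a counter of all transformation attempts, which the metric list in this guide does not include.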

4. Incident Playbook: Spatial Metric Degradation Triage

When alerts fire, follow this deterministic runbook to isolate failures without disrupting active spatial queries.

Phase 1: Isolate Coordinate Reference System Drift

  1. Query rate(gis_platform_gis_spatial_crs_transform_failures_total[5m]) grouped by crs_source and crs_target.
  2. Identify mismatched EPSG codes. Cross-reference against your spatial registry to detect unauthorized CRS overrides in upstream ETL jobs.
  3. Validate transformation matrices using PostGIS Performance and Tuning guidelines. If ST_Transform latency spikes or transformations fail, verify that the source and target SRIDs are defined in the database’s spatial_ref_sys table.
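
Steps 1 and 2 can be sketched as a small cross-reference: count failures per (crs_source, crs_target) pair and report the pairs that the spatial registry does not authorize. The function name, the input shape, and the sample registry contents are assumptions for illustration; in practice the pairs would come from the grouped query result.

```python
from collections import Counter


def unregistered_pairs(failures, allowed_pairs):
    """Count failures per (crs_source, crs_target) and keep pairs missing from the registry."""
    counts = Counter(failures)
    return {pair: n for pair, n in counts.items() if pair not in allowed_pairs}


# Hypothetical registry of authorized source/target combinations.
registry = {("EPSG:4326", "EPSG:3857"), ("EPSG:27700", "EPSG:4326")}

# Hypothetical failure samples, one tuple per failed transformation.
observed = [
    ("EPSG:4326", "EPSG:3857"),
    ("EPSG:4326", "EPSG:32633"),
    ("EPSG:4326", "EPSG:32633"),
]

suspects = unregistered_pairs(observed, registry)
print(suspects)
```

Pairs that appear in the result point at upstream jobs overriding the CRS, whereas failures confined to registered pairs suggest a problem in the transformation itself.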

Phase 2: Diagnose Tile Grid & Cache Degradation

  1. Inspect bbox_query_cache_hit_ratio. A drop below 0.65 typically indicates grid misalignment or aggressive cache eviction.
  2. Check tile_grid_level label distribution. Sudden shifts to higher zoom levels (e.g., z16 to z20) during peak traffic suggest client-side zoom abuse or broken tile request routing.
  3. Implement request throttling at the ingress layer for unbounded bounding box queries.
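
The first two checks in this phase can be combined into one triage helper, sketched below. The 0.65 hit-ratio floor comes from step 1; the z16 cut-off for "high zoom", the 0.5 ceiling on the share of high-zoom requests, and the function and finding names are assumptions to tune for your traffic.

```python
def high_zoom_share(requests_by_level: dict, min_level: int = 16) -> float:
    """Fraction of tile requests at or above min_level."""
    total = sum(requests_by_level.values())
    if total == 0:
        return 0.0
    high = sum(count for level, count in requests_by_level.items() if level >= min_level)
    return high / total


def triage_tile_cache(hit_ratio: float, requests_by_level: dict,
                      hit_floor: float = 0.65,
                      zoom_share_ceiling: float = 0.5) -> list:
    """Return the findings that apply to the current cache and zoom-level snapshot."""
    findings = []
    if hit_ratio < hit_floor:
        findings.append("cache_hit_ratio_low")
    if high_zoom_share(requests_by_level) > zoom_share_ceiling:
        findings.append("high_zoom_skew")
    return findings


# Hypothetical snapshot: request counts keyed by tile_grid_level.
print(triage_tile_cache(0.50, {12: 10, 17: 90}))
print(triage_tile_cache(0.90, {12: 90, 17: 10}))
```

When both findings appear together, throttling at the ingress layer (step 3) is the likelier remedy; a low hit ratio without zoom skew points back at grid alignment or eviction settings.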

Phase 3: Resolve Index Fragmentation & Rebuild Latency

  1. If spatial_index_rebuild_latency_ms triggers, query the underlying database for lock contention:
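    -- wait_event columns show whether a build is blocked on a lock
    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE (query ILIKE '%CREATE INDEX%' OR query ILIKE '%REINDEX%')
      AND pid <> pg_backend_pid();  -- exclude this diagnostic query itself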
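  2. Verify that maintenance_work_mem, which bounds the memory available to index builds, is sized appropriately for spatial GiST/BRIN index operations, and that work_mem is adequate for the sorts performed by concurrent spatial queries.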
  3. If latency persists, trigger a rolling restart of the spatial worker pods to clear in-memory geometry caches and force a clean index scan.

Phase 4: Compliance & Audit Logging

  1. Archive all metric snapshots and alert payloads to your immutable storage tier.
  2. Map CRS transformation failures to data lineage records. Regulatory frameworks often require proof that spatial projections were applied consistently across pipeline stages.
  3. Update the spatial observability dashboard to reflect the resolved state and adjust alert thresholds if the degradation was caused by legitimate traffic scaling.