Configuring Spatial Metric Collection in Kubernetes

Running the OpenTelemetry collector for a geospatial workload in Kubernetes is a narrow but high-stakes job: the pod has to survive a raster-tiling burst without OOM-killing itself, keep only the spatial metric families, and route alerts on coordinate reference system (CRS) transform failures, spatial index rebuild latency, and vector tile throughput with sub-second precision. Get the resource sizing, scrape topology, or sampling wrong and a green dashboard hides a corrupted pipeline. This page sits under OpenTelemetry integration for GIS pipelines — and, above it, the Geospatial Observability Architecture & Fundamentals guide — and covers the concrete cluster manifests, collector pipeline, and Prometheus wiring that make spatial telemetry a first-class citizen rather than a generic HTTP series.

Problem Framing

Spatial workloads emit signals that ordinary infrastructure monitoring cannot see. An ST_Transform call that silently loses sub-meter precision raises no exception and burns no extra CPU; a tile-pyramid build fans one input feature into hundreds of work units; a GiST index rebuild can stall for minutes behind a long-running transaction. The collector deployed in Kubernetes is the choke point where all of those signals either survive or vanish, so its configuration has to be deliberate about three things that a generic collector ignores.

First, cardinality. Embedding raw bounding-box coordinates or geometry hashes in metric labels detonates the time-series database during high-throughput tile generation. The metric names this collector accepts are defined by the Geospatial Metric Taxonomy for ETL, and spatial dimensions must be quantized into discrete grid levels, CRS identifiers, and complexity tiers before they ever reach a label. Second, scope — which features get full instrumentation versus sampled checks is governed by the observability scoping rules for vector data, so a point feed and a polygon feed do not carry identical collection cost. Third, backpressure isolation: geometry-validation workers must never block on the telemetry path, so the collector sheds load softly under memory pressure rather than stalling feature throughput. The signal that you have this wrong is usually a crs_transform_failures rate that climbs while CPU and request latency stay flat.
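
The quantization step described above can be sketched in a few lines. This is a minimal illustration, not the taxonomy's own implementation: the helper names, the vertex-count thresholds, and the spatial.complexity_tier attribute key are assumptions, and the grid level is derived from the longitude span alone.

```python
import math

MAX_ZOOM = 22  # assumed ceiling, matching the z0..z22 grid-level range used on this page


def grid_level_for_bbox(min_lon: float, min_lat: float, max_lon: float, max_lat: float) -> int:
    """Deepest zoom level whose tile width still covers the bbox's longitude span.

    Latitude is accepted for a uniform signature but ignored in this simplified sketch.
    """
    span = max(max_lon - min_lon, 1e-9)  # guard against zero-width boxes
    # One tile at zoom z spans 360 / 2**z degrees of longitude.
    level = math.floor(math.log2(360.0 / span))
    return max(0, min(MAX_ZOOM, level))


def complexity_tier(vertex_count: int) -> str:
    """Bucket a geometry by vertex count; the thresholds are illustrative only."""
    if vertex_count < 100:
        return "low"
    if vertex_count < 10_000:
        return "medium"
    return "high"


def spatial_labels(bbox, source_epsg: int, target_epsg: int, vertex_count: int) -> dict:
    """Build the bounded attribute set; the raw bbox itself is never returned."""
    return {
        "spatial.grid_level": str(grid_level_for_bbox(*bbox)),
        "spatial.source_epsg": f"EPSG:{source_epsg}",
        "spatial.target_epsg": f"EPSG:{target_epsg}",
        "spatial.complexity_tier": complexity_tier(vertex_count),  # hypothetical key
    }
```

Each value drawn from a small fixed set keeps the worst-case series count to the product of those set sizes, however many distinct extents the pipeline processes.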

Implementation

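Deploy the contrib collector build as a dedicated Deployment in the GIS namespace (use a DaemonSet only when node-local spatial workers aggregate telemetry locally). Two replicas give you rolling-restart headroom; the memory limit must clear the largest tile batch the pod will buffer. The producer and Prometheus snippets later on this page address the collector as otel-collector-spatial in gis-platform, so a Service with that name has to select these pods.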

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-spatial
  namespace: gis-platform
  labels:
    app: otel-collector
    component: spatial-metrics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.119.0  # contrib: filter + transform required
        args: ["--config=/etc/otel/config.yaml"]
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http
        - containerPort: 8889
          name: prom-export      # Prometheus scrapes this, not 4317/4318
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi           # size above the largest tile batch buffered in-flight
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
      volumes:
      - name: otel-config
        configMap:
          name: otel-collector-config-spatial

The collector config below strips non-spatial noise, renames the query-duration and CRS-failure metrics into the gis.spatial.* namespace, and copies CRS and grid context from the producer's data point attributes into shorter label names. Every block is annotated for the geospatial-specific reason it exists.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024         # amortize export cost under tiling bursts
  filter:
    metrics:
      include:
        match_type: strict        # allow-list only — keeps cardinality bounded
        metric_names:
          - spatial_query_duration_seconds
          - vector_tile_generation_errors_total
          - crs_transform_failures_total
          - spatial_index_rebuild_latency_ms
          - bbox_query_cache_hit_ratio
  transform:                      # contrib processor (OTTL); preferred over the older metricstransform
    metric_statements:
      - context: metric
        statements:
          - set(name, "gis.spatial.query.duration") where name == "spatial_query_duration_seconds"
          - set(name, "gis.spatial.crs.transform.failures") where name == "crs_transform_failures_total"
  attributes:                     # promote quantized spatial context to labels — never raw bbox
    actions:
      - key: crs_source
        action: upsert
        from_attribute: spatial.source_epsg
      - key: crs_target
        action: upsert
        from_attribute: spatial.target_epsg
      - key: tile_grid_level
        action: upsert
        from_attribute: spatial.grid_level

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: gis_platform
    resource_to_telemetry_conversion:
      enabled: true               # carry deployment.zone / dataset lineage into series

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter, transform, attributes, batch]   # batch last, so it only buffers series that survived the filter
      exporters: [prometheus]
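
To see which series names this config should produce on :8889, the exporter's naming can be approximated offline. The sketch below is a simplification under stated assumptions: expected_series_name and kept_by_regex are hypothetical helpers, unit suffixes such as _seconds that the exporter may append are ignored, and Python's re module stands in for Prometheus's fully anchored RE2 matching.

```python
import re


def expected_series_name(namespace: str, otel_name: str, monotonic_counter: bool = False) -> str:
    """Approximate the Prometheus name for an OTel metric (unit suffixes ignored)."""
    # Characters outside [a-zA-Z0-9_:] are not legal in classic Prometheus metric names.
    sanitized = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    name = f"{namespace}_{sanitized}" if namespace else sanitized
    # Monotonic sums are exposed as counters, which carry a _total suffix.
    if monotonic_counter and not name.endswith("_total"):
        name += "_total"
    return name


def kept_by_regex(keep_regex: str, series_name: str) -> bool:
    """Emulate a relabel 'keep' action, which matches the whole value, not a substring."""
    return re.fullmatch(keep_regex, series_name) is not None
```

Running every allow-listed metric name through helpers like these shows, before an image bump or config change, which series a given keep regex will retain and which it will drop.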

On the producer side, the SDK must emit quantized attributes — discrete grid level and EPSG codes, never a serialized geometry. This counter feeds vector_tile_generation_errors_total:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

reader = PeriodicExportingMetricReader(
    exporter=OTLPMetricExporter(
        endpoint="otel-collector-spatial.gis-platform:4317",  # in-cluster service DNS
        insecure=True,
    )
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("gis.tile.generator")

tile_generation_counter = meter.create_counter(
    "vector_tile_generation_errors_total",
    description="Vector tile generation failures by grid level and CRS",
    unit="1",
)

def emit_tile_error(grid_level: int, source_crs: str, target_crs: str):
    tile_generation_counter.add(
        1,
        attributes={
            "spatial.grid_level": str(grid_level),   # bounded: z0..z22
            "spatial.source_epsg": source_crs,        # e.g. "EPSG:4326"
            "spatial.target_epsg": target_crs,        # e.g. "EPSG:3857"
            "failure_type": "projection_mismatch",
        },
    )
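
A guard in front of the counter can catch an unbounded attribute before it is ever exported. This is a minimal sketch: check_spatial_attributes is a hypothetical helper, the allowed keys are taken from the producer example above, and the EPSG pattern is a loose heuristic rather than a registry lookup.

```python
import re

ALLOWED_KEYS = {
    "spatial.grid_level",
    "spatial.source_epsg",
    "spatial.target_epsg",
    "failure_type",
}
EPSG_PATTERN = re.compile(r"EPSG:[0-9]{4,6}")  # heuristic shape check only
GRID_LEVEL_PATTERN = re.compile(r"[0-9]{1,2}")


def check_spatial_attributes(attrs: dict) -> list:
    """Return a list of problems; an empty list means the attributes look bounded."""
    problems = []
    level = str(attrs.get("spatial.grid_level", ""))
    if not (GRID_LEVEL_PATTERN.fullmatch(level) and int(level) <= 22):
        problems.append(f"grid level out of range: {level!r}")
    for key in ("spatial.source_epsg", "spatial.target_epsg"):
        value = str(attrs.get(key, ""))
        if not EPSG_PATTERN.fullmatch(value):
            problems.append(f"{key} is not an EPSG code: {value!r}")
    unknown = sorted(set(attrs) - ALLOWED_KEYS)
    if unknown:
        # A raw bbox or geometry hash would show up here as an unexpected key.
        problems.append(f"unexpected attribute keys: {unknown}")
    return problems
```

Calling it inside emit_tile_error and skipping or logging when the list is non-empty keeps a stray bbox attribute out of the time-series database without blocking the tile worker.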

Point Prometheus at the collector’s :8889 endpoint and keep only the spatial families with a relabel rule, then deploy alert thresholds as a PrometheusRule CRD.

scrape_configs:
  - job_name: 'gis-otel-collector'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['otel-collector-spatial.gis-platform.svc.cluster.local:8889']
    metric_relabel_configs:        # __name__ exists only after the scrape, so filter here rather than in relabel_configs
      - source_labels: [__name__]
        regex: 'gis_platform_(gis_spatial_|spatial_index_rebuild_).*'   # renamed spatial families plus the gauge SpatialIndexRebuildStall reads
        action: keep

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gis-spatial-alerts
  namespace: monitoring
spec:
  groups:
  - name: spatial-pipeline-reliability
    interval: 15s
    rules:
    - alert: HighCRSTransformFailureRate
      expr: |
        rate(gis_platform_gis_spatial_crs_transform_failures_total[5m]) > 0.05
      for: 2m
      labels: { severity: critical, team: gis-platform }
      annotations:
        summary: "CRS transform failures exceed 0.05 per second, averaged over a 5m window"
        runbook_url: "/runbooks/spatial-crs-drift"
    - alert: VectorTileGenerationLatencyDegradation
      expr: |
        histogram_quantile(0.99, rate(gis_platform_gis_spatial_query_duration_seconds_bucket[5m])) > 2.5
      for: 3m
      labels: { severity: warning, team: tile-ops }
      annotations:
        summary: "P99 spatial query duration exceeds 2.5s"
    - alert: SpatialIndexRebuildStall
      expr: |
        gis_platform_spatial_index_rebuild_latency_ms > 300000
      for: 1m
      labels: { severity: critical, team: db-sre }
      annotations:
        summary: "Spatial index rebuild stalled beyond the 5m threshold"
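
The HighCRSTransformFailureRate expression compares a per-second counter rate against 0.05, which is a different quantity from a failure percentage. The arithmetic is sketched below with hypothetical helper names; it ignores the counter-reset handling and extrapolation that Prometheus's rate() performs.

```python
def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


def increase_at_threshold(threshold_per_second: float, window_seconds: float) -> float:
    """Counter increase over the window that sits exactly at the alert threshold."""
    return threshold_per_second * window_seconds
```

With a threshold of 0.05 per second and a 300-second window, the boundary sits at 15 failures in five minutes for a single series, regardless of how many transforms succeeded. A true percentage would need a second counter of attempted transforms as the denominator, which this page does not define.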

Verification & Testing

Confirm the path end to end before trusting any dashboard. First, check the collector is exporting the renamed series by port-forwarding and scraping it directly:

kubectl -n gis-platform port-forward deploy/otel-collector-spatial 8889:8889 &
curl -s localhost:8889/metrics | grep gis_platform_gis_spatial_crs_transform_failures
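
For a check that can run in CI rather than by eye, the same output can be parsed. A minimal sketch, assuming the text exposition format served on :8889; metric_families is a hypothetical helper and does not validate the format.

```python
def metric_families(exposition_text: str, prefix: str) -> set:
    """Collect sample names starting with prefix from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is either 'name{labels} value' or 'name value'.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names
```

Feeding it the curl output and asserting that the expected renamed name is in the returned set turns the grep above into a pass/fail test.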

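Then inject a synthetic failure to prove the producer, collector renaming, and alert rule all line up. The emit_tile_error helper only increments vector_tile_generation_errors_total, so drive this test with a counter named crs_transform_failures_total that carries the same spatial attributes. Increment it in a loop fast enough to hold the rate above 0.05 per second (one increment per second is ample), and assert the alert moves to pending, then firing: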

# expect ALERTS{alertname="HighCRSTransformFailureRate", alertstate="firing"} after the 2m "for"
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="HighCRSTransformFailureRate"}'
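
Because HighCRSTransformFailureRate watches the series renamed from crs_transform_failures_total, the synthetic load has to increment a counter with that raw name. A minimal sketch: drive_crs_failures is a hypothetical helper, and counter is assumed to come from meter.create_counter("crs_transform_failures_total") on the producer's meter.

```python
import time


def drive_crs_failures(counter, total: int, per_second: float, sleep=time.sleep) -> int:
    """Increment counter `total` times at roughly `per_second`; return the number sent."""
    attributes = {
        "spatial.grid_level": "18",
        "spatial.source_epsg": "EPSG:4326",
        "spatial.target_epsg": "EPSG:3857",
    }
    interval = 1.0 / per_second
    sent = 0
    for _ in range(total):
        counter.add(1, attributes=attributes)  # same call shape as the OTel Counter API
        sent += 1
        sleep(interval)  # injectable so tests can run without waiting
    return sent
```

At one increment per second, 600 increments run for about ten minutes, which spans the 5m rate window plus the 2m for clause with margin.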

Finally, validate cardinality stayed bounded — a passing pipeline keeps the spatial namespace small. If this count grows without bound, a raw coordinate has leaked into a label:

curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=count(count by (__name__,crs_source,crs_target,tile_grid_level)(gis_platform_gis_spatial_crs_transform_failures_total))'
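
The JSON that comes back can be checked against a budget instead of read by eye. A sketch under stated assumptions: series_count and within_budget are hypothetical helpers, and the default of four CRS pairs is an illustrative figure to replace with the number your pipeline actually uses.

```python
import json


def series_count(response_body: str) -> int:
    """Extract the value of a count(...) instant query from a Prometheus HTTP API response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return 0  # count() over no series yields an empty vector
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return int(float(result[0]["value"][1]))


def within_budget(count: int, grid_levels: int = 23, crs_pairs: int = 4) -> bool:
    """Ceiling is grid levels (z0..z22) times distinct source/target CRS pairs (assumed)."""
    return count <= grid_levels * crs_pairs
```

A count above the ceiling means some label is taking values outside the quantized sets, which is the signature of a leaked coordinate.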

Gotchas & Failure Modes

Raw bbox in a label detonates the TSDB. The single most common mistake is attaching an un-quantized bounding box or geometry hash as an attribute; the Prometheus exporter turns every data point attribute into a label, so every unique extent becomes a new series. Always quantize to a grid level before emitting, and keep the filter allow-list strict so a stray instrument cannot slip through.
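metricstransform vs transform drift. The config above uses the OTTL-based transform processor; older configs and examples use metricstransform, whose syntax is different. A collector build that lacks a processor named in the config refuses to start, which is loud. The quiet failure is a rename statement that no longer matches, for example after a config is ported between the two processors or the producer changes a metric name. In that case crs_transform_failures_total never becomes gis.spatial.crs.transform.failures, the keep rule discards it, and the alert rule watches a series that does not exist. Verify the renamed name appears at :8889 after any image bump or config change.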
Index-rebuild stalls misread as load. A firing SpatialIndexRebuildStall usually means lock contention, not CPU. Confirm against the database before scaling pods, which only adds workers that block on the same lock. The same correlation discipline carries over to multi-region setups documented in monitoring topology for multi-region GIS, where cross-AZ replication lag distorts the latency you attribute to the index.

-- Is the rebuild blocked, or genuinely slow? Check for lock waiters first.
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE query ILIKE '%CREATE INDEX%' OR query ILIKE '%REINDEX%';

If a rebuild is genuinely slow rather than blocked, confirm maintenance_work_mem is sized for the GiST/BRIN build before reaching for a rolling restart of the spatial worker pods.

FAQ

Deployment or DaemonSet for the spatial collector?

Use a Deployment for centralized metric aggregation, which is the right default for PostGIS, GeoServer, and ETL exporters that already ship over the network. Reach for a DaemonSet only when node-local spatial workers produce enough telemetry that a per-node aggregation hop meaningfully reduces cross-node traffic — for example LiDAR or raster-tiling workers pinned to specific nodes.

Why scrape port 8889 instead of the OTLP ports?

4317 and 4318 are ingest ports where the collector receives OTLP from your workers. The Prometheus exporter publishes the finished, filtered, renamed series on 8889. The OTLP ports do not serve a /metrics page, and the gis_platform_* names that the keep rule and the alerts match exist only on the exporter endpoint.

How do I stop spatial metrics from blowing up cardinality?

Keep the collector filter as a strict allow-list, quantize every spatial dimension (grid level, EPSG code, complexity tier) before it becomes an attribute, and never emit raw coordinates or geometry hashes as labels. The verification query above gives you a running cardinality count so a regression surfaces in testing rather than in production. The per-geometry baselines in the observability scoping rules for vector data tell you which feeds justify full instrumentation.

My CRS-failure alert never fires even though transforms are failing — why?

Almost always a rename mismatch. If the transform statement's where clause no longer matches the name the producer emits, or the rename was ported incorrectly between metricstransform and transform, the metric keeps its raw name, the keep rule drops it, and the alert watches an empty series. Scrape :8889 directly and confirm gis_platform_gis_spatial_crs_transform_failures is present before debugging thresholds.

Where do these spans originate, and who owns them?

Span and metric authority is set by the spatial data trust boundaries: raw WKB ingestion owns SRID-validation signals, post-join stages own topology-preservation signals. When a primary spatial service degrades and the pipeline routes to a cache, the fallback chains for spatial API failures must carry identical CRS metadata so a silent quality regression cannot hide behind the collector.

OpenTelemetry Integration for GIS Pipelines — the parent guide to collector topology, attributes, and sampling this deployment implements.
Geospatial Observability Architecture & Fundamentals — the end-to-end picture of instrumenting spatial pipelines.
Geospatial Metric Taxonomy for ETL — the canonical gis.spatial.* names this collector filters and renames.
Monitoring Topology for Multi-Region GIS — scaling the same collector pattern across regions without distorting freshness signals.