
# Resource Quotas

Per-pod CPU and memory recommendations. Values are marked TBD where unverified; the canonical values live in each service's deploy/ folder and should be cross-checked there before quoting in capacity planning.

## Sense

| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) | Notes |
|---|---|---|---|---|---|---|
| Sense API | 500m | 2 | 512Mi | 1Gi | 2 | Mostly request-shaped; memory is small. |
| Sense Worker (general) | 1 | 4 | 1Gi | 4Gi | 2 | Kafka consumers + Hangfire jobs. |
| Sense Worker (PMTiles run) | 2 | 4 | 4Gi | 8Gi | (job pod) | Tippecanoe is CPU+disk heavy. Use a separate Hangfire queue or a dedicated pod when running. |
| Sense Worker (Citylens migration) | 2 | 4 | 4Gi | 8Gi | (job pod) | ClickHouse insert pressure during bulk import. |
| Planner Web (Nginx static) | 50m | 200m | 64Mi | 256Mi | 2 | Just static + reverse proxy. |
| Migration job (Sense API) | 200m | 1 | 256Mi | 1Gi | (one-shot) | One-shot per release. |
| Migration job (Sense Worker) | 200m | 1 | 256Mi | 1Gi | (one-shot) | Includes Kafka topic creation. |
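A minimal sketch of how a table row maps onto a container spec, using the "Sense Worker (general)" row. The deployment name, labels, and image below are placeholders; the canonical manifests live in each service's deploy/ folder.

```yaml
# Sketch only: the "Sense Worker (general)" row expressed as a container
# resources stanza. Names and image are placeholders, not the real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sense-worker            # placeholder name
spec:
  replicas: 2                   # "Replicas (steady)" column
  selector:
    matchLabels:
      app: sense-worker
  template:
    metadata:
      labels:
        app: sense-worker
    spec:
      containers:
        - name: worker
          image: sense-worker:latest   # placeholder image
          resources:
            requests:
              cpu: "1"          # CPU request column
              memory: 1Gi       # Memory request column
            limits:
              cpu: "4"          # CPU limit column
              memory: 4Gi       # Memory limit column
```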

## Gen

| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) |
|---|---|---|---|---|---|
| Gen API | 500m | 2 | 512Mi | 1Gi | 2 |
| Gen Web (Next.js) | 500m | 2 | 1Gi | 2Gi | 2 |
| Migration job (Gen API) | 200m | 1 | 256Mi | 1Gi | (one-shot) |
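The "(one-shot)" rows are Kubernetes Jobs run once per release rather than long-lived Deployments. A sketch of that shape, assuming the Gen API migration row; the name, image, and entrypoint are placeholders, so check the service's deploy/ folder for the real manifest.

```yaml
# Sketch only: a one-shot migration Job sized per the "Migration job
# (Gen API)" row. Name, image, and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: gen-api-migrate          # placeholder name
spec:
  backoffLimit: 0                # fail fast; rerun via a new release
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: gen-api:latest  # placeholder image
          command: ["dotnet", "GenApi.Migrations.dll"]  # placeholder entrypoint
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
```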

## Platform infra (in-cluster)

| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas |
|---|---|---|---|---|---|
| OpenFGA | 250m | 1 | 256Mi | 512Mi | 2 (StatefulSet) |
| Valhalla | 500m | 2 | 2Gi | 4Gi | 1 (data is mounted) |
| ClickHouse (per replica) | 2 | 8 | 16Gi | 32Gi | 2 |
| ClickHouse Keeper | 100m | 500m | 256Mi | 1Gi | 3 |
| SigNoz / OTel collector | 500m | 2 | 1Gi | 2Gi | per chart defaults |
| ch-ui | 50m | 200m | 64Mi | 256Mi | 1 |
| Kafka UI | 100m | 500m | 256Mi | 512Mi | 1 |

The ClickHouse numbers are placeholders; verify them against the deployment-specific values.yaml. ClickHouse memory is workload-driven: increase it if you see merge backlog or cache thrashing.
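For illustration, a typical values.yaml override might look like the sketch below, assuming the chart exposes a standard `resources` block. The key path is an assumption and depends on the chart actually in use; confirm against the deployment-specific values.yaml before applying.

```yaml
# Sketch only: a Helm values.yaml override for the ClickHouse replica
# resources. The "clickhouse.resources" key path is an assumption; the
# real key path depends on the chart in use.
clickhouse:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi
    limits:
      cpu: "8"
      memory: 32Gi
```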

## Externally managed (out of scope here)

- PostgreSQL — sized by the customer / cloud provider. Recommend ≥ 4 vCPU / 16 GiB for steady state; scale with growth.
- Kafka — provider-managed. Provision based on partition count + retention.
- S3 / GCS / MinIO — pay-as-you-go (cloud) or sized per data volume (MinIO).

## Sizing logic

- API pods: limited by network and DB latency, not CPU. Scale horizontally on QPS, not vertically.
- Worker pods: heterogeneous. Most consumers are I/O-bound; the PMTiles and Citylens migration jobs are CPU-bound. Either run them on dedicated pods (separate Hangfire queue; see the sketch after this list) or right-size the general Worker for the heaviest job.
- ClickHouse: bigger merges → bigger working set. RAM is the lever; CPU matters during merges and federated reads.
- Mobile traffic: API request rate scales with the number of concurrent inspectors, not with frame count (frames go straight to S3). Two API replicas are enough for several hundred concurrent inspectors.
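A minimal sketch of the dedicated-pod option for the CPU-bound jobs, sized per the "Sense Worker (PMTiles run)" row. It assumes the Worker can be pointed at a specific Hangfire queue via configuration; the env var name, deployment name, and image are placeholders, since how the Worker binds to a queue is defined in the service itself.

```yaml
# Sketch only: a dedicated Worker deployment that processes just the
# heavy Hangfire queue, so CPU-bound runs don't starve the general
# consumers. Queue-selection env var and all names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sense-worker-pmtiles    # placeholder name
spec:
  replicas: 1                   # scale to 0 when no PMTiles run is scheduled
  selector:
    matchLabels:
      app: sense-worker-pmtiles
  template:
    metadata:
      labels:
        app: sense-worker-pmtiles
    spec:
      containers:
        - name: worker
          image: sense-worker:latest     # placeholder image
          env:
            - name: HANGFIRE_QUEUES      # hypothetical setting
              value: pmtiles
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```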

## Right-sizing

Run a load test against staging before adjusting prod quotas — see axion.sense.backend/docs/LoadTestPerfNotes.md for the baseline scenario and findings.