# Resource Quotas
Per-pod CPU and memory recommendations. Mark values TBD where unverified; the canonical values live in each service's deploy/ folder and should be cross-checked there before being quoted in capacity planning.
## Sense
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) | Notes |
|---|---|---|---|---|---|---|
| Sense API | 500m | 2 | 512Mi | 1Gi | 2 | Mostly request-shaped; memory is small. |
| Sense Worker (general) | 1 | 4 | 1Gi | 4Gi | 2 | Kafka consumers + Hangfire jobs. |
| Sense Worker (PMTiles run) | 2 | 4 | 4Gi | 8Gi | (job pod) | Tippecanoe is CPU+disk heavy. Use a separate Hangfire queue or a dedicated pod when running. |
| Sense Worker (Citylens migration) | 2 | 4 | 4Gi | 8Gi | (job pod) | ClickHouse insert pressure during bulk import. |
| Planner Web (Nginx static) | 50m | 200m | 64Mi | 256Mi | 2 | Just static + reverse proxy. |
| Migration job (Sense API) | 200m | 1 | 256Mi | 1Gi | (one-shot) | One-shot per release. |
| Migration job (Sense Worker) | 200m | 1 | 256Mi | 1Gi | (one-shot) | Includes Kafka topic creation. |
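As a reference for how a table row maps onto a manifest, below is a minimal sketch of a container `resources` block using the Sense API values above. The field layout in each service's deploy/ charts may differ, so treat this as illustrative rather than a copy-paste target.

```yaml
# Sketch: the Sense API row expressed as a Kubernetes container resources
# block. Values come from the table above; the surrounding manifest/chart
# structure is assumed, not copied from deploy/.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 1Gi
```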
## Gen
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) |
|---|---|---|---|---|---|
| Gen API | 500m | 2 | 512Mi | 1Gi | 2 |
| Gen Web (Next.js) | 500m | 2 | 1Gi | 2Gi | 2 |
| Migration job (Gen API) | 200m | 1 | 256Mi | 1Gi | (one-shot) |
## Platform infra (in-cluster)
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas |
|---|---|---|---|---|---|
| OpenFGA | 250m | 1 | 256Mi | 512Mi | 2 (StatefulSet) |
| Valhalla | 500m | 2 | 2Gi | 4Gi | 1 (data is mounted) |
| ClickHouse replica (per replica) | 2 | 8 | 16Gi | 32Gi | 2 |
| ClickHouse Keeper | 100m | 500m | 256Mi | 1Gi | 3 |
| SigNoz / OTel collector | 500m | 2 | 1Gi | 2Gi | per chart defaults |
| ch-ui | 50m | 200m | 64Mi | 256Mi | 1 |
| Kafka UI | 100m | 500m | 256Mi | 512Mi | 1 |
The ClickHouse numbers are placeholders; verify them against the deployment-specific values.yaml. ClickHouse memory is workload-driven: increase it if you see merge backlog or cache thrashing.
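If the ClickHouse deployment is Helm-managed, the override would typically sit under a resources key in the deployment-specific values.yaml. The key paths below are a hypothetical shape, not the actual file, so check the real chart before applying.

```yaml
# Hypothetical values.yaml override for one ClickHouse replica pod.
# Key names depend on the chart in use; verify against the actual
# deployment-specific values.yaml.
clickhouse:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi
    limits:
      cpu: "8"
      memory: 32Gi
```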
## Externally managed (out of scope here)
- PostgreSQL — sized by the customer / cloud provider. Recommend ≥ 4 vCPU / 16 GiB for steady; scale with growth.
- Kafka — provider-managed. Provision based on partition count + retention.
- S3 / GCS / MinIO — pay-as-you-go (cloud) or sized per data volume (MinIO).
## Sizing logic
- API pods: limited by network and DB latency, not CPU. Scale horizontally on QPS, not vertically (see the HPA sketch after this list).
- Worker pods: heterogeneous. Most consumers are I/O-bound; the PMTiles and Citylens migration jobs are CPU-bound. Either run them on dedicated pods with a separate Hangfire queue (see the dedicated-worker sketch after this list) or right-size the general Worker for the heaviest job.
- ClickHouse: bigger merges → bigger working set. RAM is the lever; CPU matters during merges and federated reads.
- Mobile traffic: API request rate scales with the number of concurrent inspectors, not with frame count (frames go straight to S3). 2 API replicas are enough for several hundred concurrent inspectors.
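For the API scaling point, here is a minimal HorizontalPodAutoscaler sketch. It scales on CPU utilization as a rough proxy; scaling directly on QPS would need custom metrics exposed to the autoscaler. The Deployment name and replica bounds are illustrative assumptions.

```yaml
# Minimal HPA sketch for an API Deployment (name and bounds are illustrative).
# CPU utilization is a proxy; true QPS-based scaling requires custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sense-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sense-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```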
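For the dedicated-pod option, one pattern is a second Worker Deployment that consumes only the heavy Hangfire queue, sized per the PMTiles / Citylens rows above. The Deployment name, image, and queue-selection environment variable below are assumptions for illustration; the Worker's actual configuration surface lives in its deploy/ folder.

```yaml
# Sketch of a dedicated heavy-job Worker Deployment. The name, image, and
# HANGFIRE_QUEUES variable are illustrative assumptions, not the Worker's
# actual configuration keys.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sense-worker-heavy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sense-worker-heavy
  template:
    metadata:
      labels:
        app: sense-worker-heavy
    spec:
      containers:
        - name: worker
          image: sense-worker:latest   # placeholder image/tag
          env:
            - name: HANGFIRE_QUEUES    # hypothetical queue selector
              value: "heavy"
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```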
## Right-sizing
Run a load test against staging before adjusting prod quotas — see axion.sense.backend/docs/LoadTestPerfNotes.md for the baseline scenario and findings.