# Resource Quotas
Per-pod CPU and memory recommendations. Mark values TBD where unverified; the canonical values live in each service's deploy/ folder and should be cross-checked there before being quoted in capacity planning.
## Sense
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) | Notes |
|---|---|---|---|---|---|---|
| Sense API | 500m | 2 | 512Mi | 1Gi | 2 | Mostly request-shaped; memory is small. |
| Sense Worker (general) | 1 | 4 | 1Gi | 4Gi | 2 | Kafka consumers + Hangfire jobs. |
| Sense Worker (PMTiles run) | 2 | 4 | 4Gi | 8Gi | (job pod) | Tippecanoe is CPU+disk heavy. Use a separate Hangfire queue or a dedicated pod when running. |
| Sense Worker (Citylens migration) | 2 | 4 | 4Gi | 8Gi | (job pod) | ClickHouse insert pressure during bulk import. |
| Planner Web (Nginx static) | 50m | 200m | 64Mi | 256Mi | 2 | Just static + reverse proxy. |
| Migration job (Sense API) | 200m | 1 | 256Mi | 1Gi | (one-shot) | One-shot per release. |
| Migration job (Sense Worker) | 200m | 1 | 256Mi | 1Gi | (one-shot) | Includes Kafka topic creation. |
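As a reference for how a table row maps onto a manifest, below is a minimal sketch of a container `resources` block using the Sense API values above. The field layout in each service's deploy/ charts may differ, so treat this as illustrative rather than a copy-paste target.

```yaml
# Sketch: the Sense API row expressed as a Kubernetes container resources
# block. Values come from the table above; the surrounding manifest/chart
# structure is assumed, not copied from deploy/.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 1Gi
```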
## Gen
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas (steady) |
|---|---|---|---|---|---|
| Gen API | 500m | 2 | 512Mi | 1Gi | 2 |
| Gen Web (Next.js) | 500m | 2 | 1Gi | 2Gi | 2 |
| Migration job (Gen API) | 200m | 1 | 256Mi | 1Gi | (one-shot) |
## Platform infra (in-cluster)
| Pod | CPU request | CPU limit | Memory request | Memory limit | Replicas |
|---|---|---|---|---|---|
| OpenFGA | 250m | 1 | 256Mi | 512Mi | 2 (StatefulSet) |
| Valhalla | 500m | 2 | 2Gi | 4Gi | 1 (data is mounted) |
| ClickHouse replica (per replica) | 2 | 8 | 16Gi | 32Gi | 2 |
| ClickHouse Keeper | 100m | 500m | 256Mi | 1Gi | 3 |
| SigNoz / OTel collector | 500m | 2 | 1Gi | 2Gi | per chart defaults |
| ch-ui | 50m | 200m | 64Mi | 256Mi | 1 |
| Kafka UI | 100m | 500m | 256Mi | 512Mi | 1 |
The ClickHouse numbers are placeholders; verify them against the deployment-specific values.yaml. ClickHouse memory is workload-driven: increase it if you see merge backlog or cache thrashing.
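If the ClickHouse deployment is Helm-managed, the override would typically sit under a resources key in the deployment-specific values.yaml. The key paths below are a hypothetical shape, not the actual file, so check the real chart before applying.

```yaml
# Hypothetical values.yaml override for one ClickHouse replica pod.
# Key names depend on the chart in use; verify against the actual
# deployment-specific values.yaml.
clickhouse:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi
    limits:
      cpu: "8"
      memory: 32Gi
```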
## Externally managed (out of scope here)
- PostgreSQL — sized by the customer / cloud provider. Recommend ≥ 4 vCPU / 16 GiB for steady; scale with growth.
- Kafka — provider-managed. Provision based on partition count + retention.
- S3 / GCS / MinIO — pay-as-you-go (cloud) or sized per data volume (MinIO).
## Sizing logic
- API pods: limited by network and DB latency, not CPU. Scale horizontally on QPS, not vertically (see the HPA sketch after this list).
- Worker pods: heterogeneous. Most consumers are I/O-bound; the PMTiles and Citylens migration jobs are CPU-bound. Either run them on dedicated pods with a separate Hangfire queue (see the dedicated-worker sketch after this list) or right-size the general Worker for the heaviest job.
- ClickHouse: bigger merges → bigger working set. RAM is the lever; CPU matters during merges and federated reads.
- Mobile traffic: API request rate scales with the number of concurrent inspectors, not with frame count (frames go straight to S3). 2 API replicas are enough for several hundred concurrent inspectors.
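For the API scaling point, here is a minimal HorizontalPodAutoscaler sketch. It scales on CPU utilization as a rough proxy; scaling directly on QPS would need custom metrics exposed to the autoscaler. The Deployment name and replica bounds are illustrative assumptions.

```yaml
# Minimal HPA sketch for an API Deployment (name and bounds are illustrative).
# CPU utilization is a proxy; true QPS-based scaling requires custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sense-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sense-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```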
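For the dedicated-pod option, one pattern is a second Worker Deployment that consumes only the heavy Hangfire queue, sized per the PMTiles / Citylens rows above. The Deployment name, image, and queue-selection environment variable below are assumptions for illustration; the Worker's actual configuration surface lives in its deploy/ folder.

```yaml
# Sketch of a dedicated heavy-job Worker Deployment. The name, image, and
# HANGFIRE_QUEUES variable are illustrative assumptions, not the Worker's
# actual configuration keys.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sense-worker-heavy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sense-worker-heavy
  template:
    metadata:
      labels:
        app: sense-worker-heavy
    spec:
      containers:
        - name: worker
          image: sense-worker:latest   # placeholder image/tag
          env:
            - name: HANGFIRE_QUEUES    # hypothetical queue selector
              value: "heavy"
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```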
## Right-sizing
Run a load test against staging before adjusting prod quotas — see axion.sense.backend/docs/LoadTestPerfNotes.md for the baseline scenario and findings.