
ClickHouse

The analytical and audit store. Sense writes; Gen reads (federated); Sense API reads for catalog-style queries.

At a glance

| Property | Value |
| --- | --- |
| Version | ClickHouse 26.3 |
| Operator | Altinity ClickHouse Operator 0.3.11 (in-cluster) |
| Topology | 1 shard × 2 replicas + ClickHouse Keeper |
| Service DNS | clickhouse-sense.axion.svc:8123 (HTTP), :9000 (TCP) |
| Database | axion_sense |
| Helm chart | axion.infra/services/clickhouse |
| Backup | Per-replica backups via clickhouse-backup (out-of-band cron) |
| Storage | PVC per replica |

Tables

All tables use plain ReplicatedMergeTree. Idempotency is handled client-side via lightweight DELETE (with mutations_sync = 1, allow_nondeterministic_mutations = 1) followed by INSERT.

| Table | Engine | Sort key | Partition | Purpose |
| --- | --- | --- | --- | --- |
| tracks | ReplicatedMergeTree | (id) | toYYYYMM(UUIDv7ToDateTime(id)) | Track metadata + map-matched route |
| frames | ReplicatedMergeTree | (track_id, captured_at, id) | toYYYYMM(UUIDv7ToDateTime(id)) | Per-frame metadata, GPS, sensors, H3 cells |
| detections | ReplicatedMergeTree | (track_id, frame_id, detector_id) | toYYYYMM(UUIDv7ToDateTime(track_id)) | ML detection results |
| audit_log | ReplicatedMergeTree (write-once) | (organization_id, created_at) | toYYYYMM(created_at) | Compliance audit trail |
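
For reference, a hedged DDL sketch for frames built from the sort key and partition expression above; the GPS/sensor columns, ZooKeeper path, and macros are illustrative (the real definitions live in init.sql, see Migrations).

```sql
-- Sketch only; the real columns are defined in deploy/clickhouse/init.sql.
CREATE TABLE IF NOT EXISTS axion_sense.frames
(
    id          UUID,           -- UUIDv7; encodes the capture timestamp
    track_id    UUID,
    captured_at DateTime64(3),
    lat         Float64,        -- illustrative GPS / sensor / H3 columns
    lon         Float64,
    h3_cell     UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/frames', '{replica}')
PARTITION BY toYYYYMM(UUIDv7ToDateTime(id))
ORDER BY (track_id, captured_at, id);
```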

Why plain ReplicatedMergeTree + client-side DELETE + INSERT

We consume from Kafka with at-least-once delivery. Re-deliveries on consumer restart, retries on transient ClickHouse errors, and re-runs of the Citylens migration all produce duplicate writes. We handle them on the writer side with a DELETE by primary key followed by an INSERT (using mutations_sync = 1, allow_nondeterministic_mutations = 1), so each row is replaced cleanly.
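
A minimal sketch of that replace step for a single track, with the settings mirrored from the description above; the real writers batch many keys per flush, and the column list here is illustrative.

```sql
-- Settings used by the write path (see above) so the DELETE is applied
-- synchronously before the INSERT that follows it.
SET mutations_sync = 1;
SET allow_nondeterministic_mutations = 1;

-- Remove any earlier delivery of this track, then write the fresh row.
DELETE FROM axion_sense.tracks
WHERE id = '01890a5d-ac96-774b-bcce-b302099a8057';

INSERT INTO axion_sense.tracks (id, organization_id, created_at)  -- columns illustrative
VALUES ('01890a5d-ac96-774b-bcce-b302099a8057',
        '0189f3a0-1234-7abc-8def-0123456789ab',
        now());
```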

ReplacingMergeTree(updated_at) was the original choice — it collapses duplicates by sort key during background merges, which sounds ideal. We moved away from it because querying it correctly required SELECT … FINAL, and FINAL cost killed read latency once the working set grew. Plain ReplicatedMergeTree keeps reads fast at the cost of a more involved write path.

audit_log follows the same pattern but the contract is stronger: every audit row is unique and must never be coalesced. Duplicates are kept out at the producer (Kafka idempotent producer + at-most-once acceptance criteria for audit batches).

Why UUIDv7ToDateTime for partitioning

Track and frame IDs are UUIDv7, which encodes a millisecond timestamp. Partitioning by toYYYYMM(UUIDv7ToDateTime(id)) gives time-ordered monthly partitions without needing a separate timestamp column in the partition key. The same applies to detections, which is partitioned by the timestamp encoded in track_id.
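
A quick illustration of the timestamp falling out of a v7 id (generateUUIDv7 is used only for the demo; production ids arrive from the worker):

```sql
SELECT
    generateUUIDv7()               AS id,
    UUIDv7ToDateTime(id)           AS encoded_ts,    -- millisecond capture time
    toYYYYMM(UUIDv7ToDateTime(id)) AS partition_key; -- e.g. 202506
```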

Operational benefits:

  • OPTIMIZE PARTITION works on a known time slice.
  • Old partitions can be dropped wholesale (ALTER TABLE ... DROP PARTITION '202501') for retention enforcement without DELETE overhead.
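
For example, an assumed monthly retention and compaction pass (table and partition values illustrative):

```sql
-- Drop a whole month for retention; no row-level DELETE cost.
ALTER TABLE axion_sense.frames DROP PARTITION 202501;

-- Compact a known month after a heavy backfill.
OPTIMIZE TABLE axion_sense.detections PARTITION 202506;
```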

Inserts

All inserts come from Sense Worker via the batch path:

  • TrackMetadataBatchMiddleware → tracks + frames.
  • RecognitionResponseBatchMiddleware → detections.
  • AuditEventBatchMiddleware → ClickHouseAuditBatcher → audit_log.
  • Citylens*BatchMiddleware → tracks / frames / detections (via mapping).

Insert pattern: bulk INSERT INTO ... VALUES (...) (...) ... with prepared columns. Batch size is 1000–5000 rows; the batcher flushes every few seconds or when the size threshold is hit.
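
A hedged sketch of one flush from the detections batcher; the column names and values are illustrative, not the real schema.

```sql
INSERT INTO axion_sense.detections
    (track_id, frame_id, detector_id, label, confidence)  -- columns illustrative
VALUES
    ('01890a5d-ac96-774b-bcce-b302099a8057', '01890a5d-b1c2-7e53-9df0-1a2b3c4d5e6f', 'signs-v3', 'speed_limit', 0.93),
    ('01890a5d-ac96-774b-bcce-b302099a8057', '01890a5d-b1c2-7e53-9df0-1a2b3c4d5e6f', 'signs-v3', 'crosswalk',   0.71);
```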

Migrations

Idempotent CREATE TABLE IF NOT EXISTS SQL in axion.sense.backend/deploy/clickhouse/init.sql. Migration runner applies the script at every startup — safe because it's all IF NOT EXISTS. Schema changes are additive only; column drops are deferred to a planned downtime window.
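
The same idempotency carries over to additive schema changes; a sketch of the style, with a hypothetical column name:

```sql
-- Safe to re-run at every startup: both statements are no-ops once applied.
CREATE DATABASE IF NOT EXISTS axion_sense;

ALTER TABLE axion_sense.frames
    ADD COLUMN IF NOT EXISTS speed_mps Nullable(Float32);  -- hypothetical additive column
```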

Reading

  • Sense API: catalog-style reads (e.g., recent tracks for an org) via the data-context's ClickHouse readers (see the sketch after this list).
  • Sense Worker: reads frames by H3 cell during PMTiles generation.
  • Gen API: federated read via the ClickHouse data-source connector. Gen has its own DB role with SELECT only.
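
Hedged sketches of the Sense API and Sense Worker read shapes; column names and query parameters are illustrative.

```sql
-- Sense API: recent tracks for an organization (catalog-style).
SELECT id, UUIDv7ToDateTime(id) AS started_at
FROM axion_sense.tracks
WHERE organization_id = {org_id:UUID}
ORDER BY UUIDv7ToDateTime(id) DESC
LIMIT 50;

-- Sense Worker: all frames inside one H3 cell during PMTiles generation.
SELECT id, track_id, lat, lon
FROM axion_sense.frames
WHERE h3_cell = {cell:UInt64};
```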

Operations

  • Hot upgrades: Altinity Operator handles rolling upgrade of replicas. Pin the operator + ClickHouse versions in Chart.yaml to control timing.
  • Replica failure: the surviving replica keeps serving reads and writes; the operator recreates the failed replica, which re-syncs its parts from the survivor.
  • Keeper failure: the 3-node Keeper ensemble needs quorum (2 of 3 nodes) for replication coordination; replicated tables go read-only for inserts without it. Verify the Keeper replica count stays at 3.
  • Disk pressure: monitor per-table size and part counts in system.parts; tune merge settings if the part count grows out of hand (see the query below).
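
A hedged starting point for that check (alert thresholds are left to the runbook):

```sql
-- Active parts and on-disk size per table and partition in axion_sense.
SELECT
    table,
    partition,
    count()                                 AS active_parts,
    formatReadableSize(sum(bytes_on_disk))  AS on_disk
FROM system.parts
WHERE active AND database = 'axion_sense'
GROUP BY table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
```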

Observability

  • ClickHouse exporter → OTel (query latencies, merge counts, replication lag).
  • Per-query traces via system.query_log, accessible through the ClickHouse UI (ch-ui exposed at https://axion-clickhouse-ui-staging.dev.axionx.ai).
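
When the UI is not handy, the same data can be pulled straight from system.query_log; an illustrative slow-query check:

```sql
-- Ten slowest finished queries in the last hour.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```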

Capacity

| Metric | Steady state | Peak (after coverage day) |
| --- | --- | --- |
| frames rows / sec | ~3 / s (long-run avg); ~5–10 / s during active hours | ~45–50 / s |
| detections rows / sec | ~2–3 / s (long-run avg) | ~40 / s (≈1,000 / s during reprocessing/backfill bursts) |
| tracks rows / sec | ~0.1 / s (long-run avg) | ~2.5 / s |
| Disk per replica | ~0.9–1.1 GiB / month growth (frames + detections + tracks) | +0.3–0.5 GiB extra during detection-reprocessing months |
| RAM per replica | 12–16 GiB recommended | |
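
Back-of-envelope check on the steady-state figures: roughly 5–6 rows / s across the three row tables is about 14–15 M rows per month, so ~0.9–1.1 GiB / month of per-replica growth works out to something like 60–90 bytes per row on disk.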