
ClickHouse

The analytical and audit store. Sense writes; Gen reads (federated); Sense API reads for catalog-style queries.

At a glance

| Property | Value |
| --- | --- |
| Version | ClickHouse 26.3 |
| Operator | Altinity ClickHouse Operator 0.3.11 (in-cluster) |
| Topology | 1 shard × 2 replicas + ClickHouse Keeper |
| Service DNS | clickhouse-sense.axion.svc:8123 (HTTP), :9000 (TCP) |
| Database | axion_sense |
| Helm chart | axion.infra/services/clickhouse |
| Backup | Per-replica backups via clickhouse-backup (out-of-band cron) |
| Storage | PVC per replica |

Tables

All tables use plain ReplicatedMergeTree. Idempotency is handled client-side via lightweight DELETE (with mutations_sync = 1, allow_nondeterministic_mutations = 1) followed by INSERT.

| Table | Engine | Sort key | Partition | Purpose |
| --- | --- | --- | --- | --- |
| tracks | ReplicatedMergeTree | (id) | toYYYYMM(UUIDv7ToDateTime(id)) | Track metadata + map-matched route |
| frames | ReplicatedMergeTree | (track_id, captured_at, id) | toYYYYMM(UUIDv7ToDateTime(id)) | Per-frame metadata, GPS, sensors, H3 cells |
| detections | ReplicatedMergeTree | (track_id, frame_id, detector_id) | toYYYYMM(UUIDv7ToDateTime(track_id)) | ML detection results |
| audit_log | ReplicatedMergeTree (write-once) | (organization_id, created_at) | toYYYYMM(created_at) | Compliance audit trail |
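
For reference, a hedged DDL sketch for frames built from the sort key and partition expression above; the GPS/sensor columns, ZooKeeper path, and macros are illustrative (the real definitions live in init.sql, see Migrations).

```sql
-- Sketch only; the real columns are defined in deploy/clickhouse/init.sql.
CREATE TABLE IF NOT EXISTS axion_sense.frames
(
    id          UUID,           -- UUIDv7; encodes the capture timestamp
    track_id    UUID,
    captured_at DateTime64(3),
    lat         Float64,        -- illustrative GPS / sensor / H3 columns
    lon         Float64,
    h3_cell     UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/frames', '{replica}')
PARTITION BY toYYYYMM(UUIDv7ToDateTime(id))
ORDER BY (track_id, captured_at, id);
```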

Why plain ReplicatedMergeTree + client-side DELETE + INSERT

We consume from Kafka with at-least-once delivery. Re-deliveries on consumer restart, retries on transient ClickHouse errors, and re-runs of the Citylens migration all produce duplicate writes. We handle them on the writer side with a DELETE by primary key followed by an INSERT (using mutations_sync = 1, allow_nondeterministic_mutations = 1), so each row is replaced cleanly.
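
A minimal sketch of that replace step for a single track, with the settings mirrored from the description above; the real writers batch many keys per flush, and the column list here is illustrative.

```sql
-- Settings used by the write path (see above) so the DELETE is applied
-- synchronously before the INSERT that follows it.
SET mutations_sync = 1;
SET allow_nondeterministic_mutations = 1;

-- Remove any earlier delivery of this track, then write the fresh row.
DELETE FROM axion_sense.tracks
WHERE id = '01890a5d-ac96-774b-bcce-b302099a8057';

INSERT INTO axion_sense.tracks (id, organization_id, created_at)  -- columns illustrative
VALUES ('01890a5d-ac96-774b-bcce-b302099a8057',
        '0189f3a0-1234-7abc-8def-0123456789ab',
        now());
```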

ReplacingMergeTree(updated_at) was the original choice — it collapses duplicates by sort key during background merges, which sounds ideal. We moved away from it because querying it correctly required SELECT … FINAL, and FINAL cost killed read latency once the working set grew. Plain ReplicatedMergeTree keeps reads fast at the cost of a more involved write path.

audit_log follows the same pattern but the contract is stronger: every audit row is unique and must never be coalesced. Duplicates are kept out at the producer (Kafka idempotent producer + at-most-once acceptance criteria for audit batches).

Why UUIDv7ToDateTime for partitioning

Track and frame IDs are UUIDv7, which encodes a millisecond timestamp. Partitioning by toYYYYMM(UUIDv7ToDateTime(id)) gives time-ordered monthly partitions without needing a separate timestamp column in the partition key. The same applies to detections, which is partitioned by the timestamp encoded in track_id.
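
A quick illustration of the timestamp falling out of a v7 id (generateUUIDv7 is used only for the demo; production ids arrive from the worker):

```sql
SELECT
    generateUUIDv7()               AS id,
    UUIDv7ToDateTime(id)           AS encoded_ts,    -- millisecond capture time
    toYYYYMM(UUIDv7ToDateTime(id)) AS partition_key; -- e.g. 202506
```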

Operational benefits:

  • OPTIMIZE PARTITION works on a known time slice.
  • Old partitions can be dropped wholesale (ALTER TABLE ... DROP PARTITION '202501') for retention enforcement without DELETE overhead.
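
For example, an assumed monthly retention and compaction pass (table and partition values illustrative):

```sql
-- Drop a whole month for retention; no row-level DELETE cost.
ALTER TABLE axion_sense.frames DROP PARTITION 202501;

-- Compact a known month after a heavy backfill.
OPTIMIZE TABLE axion_sense.detections PARTITION 202506;
```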

Inserts

All inserts come from Sense Worker via the batch path:

  • TrackMetadataBatchMiddleware → tracks + frames.
  • RecognitionResponseBatchMiddleware → detections.
  • AuditEventBatchMiddleware → ClickHouseAuditBatcher → audit_log.
  • Citylens*BatchMiddleware → tracks / frames / detections (via mapping).

Insert pattern: bulk INSERT INTO ... VALUES (...) (...) ... with prepared columns. Batch size is 1000–5000 rows; the batcher flushes every few seconds or when the size threshold is hit.
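
A hedged sketch of one flush from the detections batcher; the column names and values are illustrative, not the real schema.

```sql
INSERT INTO axion_sense.detections
    (track_id, frame_id, detector_id, label, confidence)  -- columns illustrative
VALUES
    ('01890a5d-ac96-774b-bcce-b302099a8057', '01890a5d-b1c2-7e53-9df0-1a2b3c4d5e6f', 'signs-v3', 'speed_limit', 0.93),
    ('01890a5d-ac96-774b-bcce-b302099a8057', '01890a5d-b1c2-7e53-9df0-1a2b3c4d5e6f', 'signs-v3', 'crosswalk',   0.71);
```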

Migrations

Idempotent CREATE TABLE IF NOT EXISTS SQL in axion.sense.backend/deploy/clickhouse/init.sql. Migration runner applies the script at every startup — safe because it's all IF NOT EXISTS. Schema changes are additive only; column drops are deferred to a planned downtime window.
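
The same idempotency carries over to additive schema changes; a sketch of the style, with a hypothetical column name:

```sql
-- Safe to re-run at every startup: both statements are no-ops once applied.
CREATE DATABASE IF NOT EXISTS axion_sense;

ALTER TABLE axion_sense.frames
    ADD COLUMN IF NOT EXISTS speed_mps Nullable(Float32);  -- hypothetical additive column
```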

Reading

  • Sense API: catalog-style reads (e.g., recent tracks for an org) via the data-context's ClickHouse readers (see the sketch after this list).
  • Sense Worker: reads frames by H3 cell during PMTiles generation.
  • Gen API: federated read via the ClickHouse data-source connector. Gen has its own DB role with SELECT only.
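
Hedged sketches of the Sense API and Sense Worker read shapes; column names and query parameters are illustrative.

```sql
-- Sense API: recent tracks for an organization (catalog-style).
SELECT id, UUIDv7ToDateTime(id) AS started_at
FROM axion_sense.tracks
WHERE organization_id = {org_id:UUID}
ORDER BY UUIDv7ToDateTime(id) DESC
LIMIT 50;

-- Sense Worker: all frames inside one H3 cell during PMTiles generation.
SELECT id, track_id, lat, lon
FROM axion_sense.frames
WHERE h3_cell = {cell:UInt64};
```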

Operations

  • Hot upgrades: Altinity Operator handles rolling upgrade of replicas. Pin the operator + ClickHouse versions in Chart.yaml to control timing.
  • Replica failure: the surviving replica keeps serving reads and writes; the operator recreates the failed replica, which re-syncs its parts from the survivor.
  • Keeper failure: the 3-node Keeper ensemble needs quorum (2 of 3 nodes) for replication coordination; replicated tables go read-only for inserts without it. Verify the Keeper replica count stays at 3.
  • Disk pressure: monitor per-table size and part counts in system.parts; tune merge settings if the part count grows out of hand (see the query below).
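
A hedged starting point for that check (alert thresholds are left to the runbook):

```sql
-- Active parts and on-disk size per table and partition in axion_sense.
SELECT
    table,
    partition,
    count()                                 AS active_parts,
    formatReadableSize(sum(bytes_on_disk))  AS on_disk
FROM system.parts
WHERE active AND database = 'axion_sense'
GROUP BY table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
```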

Observability

  • ClickHouse exporter → OTel (query latencies, merge counts, replication lag).
  • Per-query traces via system.query_log, accessible through the ClickHouse UI (ch-ui exposed at https://axion-clickhouse-ui-staging.dev.axionx.ai).
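
When the UI is not handy, the same data can be pulled straight from system.query_log; an illustrative slow-query check:

```sql
-- Ten slowest finished queries in the last hour.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```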

Capacity

| Metric | Steady state | Peak (after coverage day) |
| --- | --- | --- |
| frames rows / sec | ~3 / s (long-run avg); ~5–10 / s during active hours | ~45–50 / s |
| detections rows / sec | ~2–3 / s (long-run avg) | ~40 / s (≈1,000 / s during reprocessing/backfill bursts) |
| tracks rows / sec | ~0.1 / s (long-run avg) | ~2.5 / s |
| Disk per replica | ~0.9–1.1 GiB / month growth (frames + detections + tracks) | +0.3–0.5 GiB extra during detection-reprocessing months |
| RAM per replica | 12–16 GiB recommended | |
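
Back-of-envelope check on the steady-state figures: roughly 5–6 rows / s across the three row tables is about 14–15 M rows per month, so ~0.9–1.1 GiB / month of per-replica growth works out to something like 60–90 bytes per row on disk.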