
Observability

Two parallel observability lanes:

  1. SigNoz — infrastructure + application telemetry (traces, logs, metrics, alerts) for everything except the AI agent.
  2. Langfuse — LLM-specific telemetry for Gen Web's AI agent (prompts, completions, tool calls, evals).

SigNoz

| Property | Value |
| --- | --- |
| Version | SigNoz 0.119.0 |
| Helm chart | signoz/signoz 0.119.0 (in axion.infra/services/signoz) |
| Backed by | Its own ClickHouse cluster (separate from the Sense analytical cluster) |
| OTel collector | Bundled, exposed on 4317 (gRPC) and 4318 (HTTP) |
| UI | https://signoz-staging.dev.axionx.ai |

What sends to SigNoz

| Source | Signal types |
| --- | --- |
| Sense API | Traces, logs, metrics |
| Sense Worker (incl. Hangfire) | Traces, logs, metrics |
| Gen API | Traces, logs, metrics |
| Postgres exporter | Metrics |
| Kafka exporter | Metrics |
| ClickHouse exporter | Metrics |
| OpenFGA | Metrics, traces |
| Valhalla | Metrics (if exposed by chart) |

All app services use OpenTelemetry SDKs (OpenTelemetry.* NuGet packages on .NET, equivalents on the web BFF). They export via OTLP to the in-cluster collector.
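
The exporter wiring can be expressed with the standard `OTEL_EXPORTER_OTLP_*` environment variables. A minimal sketch, assuming the in-cluster collector service is named `signoz-otel-collector` (the service name and `OTEL_SERVICE_NAME` value are illustrative; ports 4317/4318 are the bundled collector ports above):

```typescript
type OtlpEnv = Record<string, string>;

// Build the OTLP exporter environment for one service.
// "grpc" and "http/protobuf" are the standard OTEL_EXPORTER_OTLP_PROTOCOL values.
function otlpEnv(useGrpc: boolean, host = "signoz-otel-collector"): OtlpEnv {
  return {
    OTEL_EXPORTER_OTLP_PROTOCOL: useGrpc ? "grpc" : "http/protobuf",
    OTEL_EXPORTER_OTLP_ENDPOINT: `http://${host}:${useGrpc ? 4317 : 4318}`,
    OTEL_SERVICE_NAME: "sense-api", // set per service; name is illustrative
  };
}
```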

Trace shape

A frame upload from the mobile app produces a single trace:

```
Span: rpc.RoadDataService/CreateTrack             [Sense API]
├── Span: db.postgres.insert track_logs           [Sense API]
└── Span: kafka.produce track.metadata            [Sense API]
    └── Span: kafka.consume track.metadata        [Sense Worker]
        ├── Span: db.clickhouse.insert frames     [Sense Worker]
        ├── Span: kafka.produce recognition_requests          [Sense Worker]
        │   └── Span: kafka.consume recognition_requests      [Vision Worker]
        │       └── Span: kafka.produce vision_frames_lifecycle (PredictionRequired)
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Quality]
        │           │   └── Span: grpc.triton.quality_check
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Worker, results]
        │           │   └── Span: db.clickhouse.insert detections (quality verdict)
        │           └── Span: kafka.consume vision_frames_lifecycle [Vision Worker, dispatch]
        │               └── Span: http.detector.post (per detector)
        │                   └── Span: http.detections_api.push  [External Detector → Vision Detections API]
        │                       └── Span: kafka.produce vision_frames_lifecycle (LocationEstimationRequired)
        │                           └── Span: kafka.consume vision_frames_lifecycle [Vision Clusterization]
        │                               └── Span: kafka.produce clusterization_requests
        │                                   └── Span: kafka.consume clusterization_requests
        │                                       └── Span: db.clickhouse.insert detections + objects
        └── Span: kafka.produce track.metadata (TrackMatchingRequest) [Sense Worker]
            └── Span: kafka.consume track.metadata               [Vision Matching]
                └── Span: http.valhalla.match
                    └── Span: kafka.produce track.metadata (TrackMatchingResult)
                        └── Span: kafka.consume track.metadata   [Sense Worker]
                            └── Span: db.clickhouse.update tracks/frames (is_map_matched=true)
```

The trace is held together by request_id propagation:

  • HTTP/gRPC ingress sets a request-id header (or echoes a client-supplied one).
  • KafkaFlow's RequestIdProducerMiddleware puts request-id on every produced message header.
  • KafkaFlow consumer middleware reads it back and seeds the W3C traceparent for the new span.
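
The three steps above can be sketched in plain TypeScript. The header names follow the W3C Trace Context format; the helper functions are illustrative, not the real KafkaFlow middleware API:

```typescript
import { randomBytes } from "node:crypto";

const hex = (n: number) => randomBytes(n).toString("hex");

// W3C traceparent: version-traceid-spanid-flags
function makeTraceparent(traceId = hex(16), spanId = hex(8)): string {
  return `00-${traceId}-${spanId}-01`;
}

function parseTraceparent(tp: string): { traceId: string; spanId: string } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(tp);
  return m ? { traceId: m[1], spanId: m[2] } : null;
}

// Producer side: stamp request-id and trace context onto every outgoing message.
function stampHeaders(requestId: string, traceparent: string): Record<string, string> {
  return { "request-id": requestId, traceparent };
}

// Consumer side: read the headers back and seed the new span from them.
function childContext(headers: Record<string, string>) {
  const parent = parseTraceparent(headers["traceparent"] ?? "");
  if (!parent) return null;
  // New span id, same trace id: the consume span joins the producer's trace.
  return { traceId: parent.traceId, spanId: hex(8), requestId: headers["request-id"] };
}
```

Because the trace id survives every produce/consume hop, the whole pipeline above renders as one trace in SigNoz.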

Span attributes (standard set)

Every app span carries:

  • org.id, user.id (when available)
  • request.id
  • track.id / frame.id (when relevant)
  • kafka.topic, kafka.partition, kafka.offset (Kafka spans)
  • db.statement (sanitized — never raw user input)
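
A minimal sketch of the `db.statement` sanitization rule: literal values are replaced with placeholders before the statement is attached to a span, so raw user input never reaches SigNoz. The regexes are illustrative, not the real sanitizer:

```typescript
// Replace string and numeric literals with "?" placeholders.
function sanitizeStatement(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?")   // string literals (incl. '' escapes)
    .replace(/\b\d+(\.\d+)?\b/g, "?"); // numeric literals
}
```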

Hangfire telemetry

HangfireTelemetryFilter subscribes to job state changes and emits OpenTelemetry spans for:

  • Job enqueue
  • Job execution start/end
  • Job failure with exception details

This makes scheduled jobs (PMTiles generation, Citylens migration) observable in the same trace UI as request-driven work.
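
A generic sketch of what such a filter does: translate job state changes into span-like records with timing and failure details. The shapes are illustrative; the real filter emits OpenTelemetry spans via the .NET SDK:

```typescript
interface JobSpan {
  name: string;
  startMs: number;
  endMs?: number;
  status: "running" | "ok" | "error";
  error?: string;
}

class JobTelemetry {
  private spans = new Map<string, JobSpan>();

  onEnqueued(jobId: string, jobName: string, now: number) {
    this.spans.set(jobId, { name: `hangfire.${jobName}`, startMs: now, status: "running" });
  }

  onSucceeded(jobId: string, now: number) {
    const s = this.spans.get(jobId);
    if (s) { s.endMs = now; s.status = "ok"; }
  }

  onFailed(jobId: string, now: number, error: string) {
    const s = this.spans.get(jobId);
    if (s) { s.endMs = now; s.status = "error"; s.error = error; }
  }

  get(jobId: string): JobSpan | undefined {
    return this.spans.get(jobId);
  }

  durationMs(jobId: string): number | undefined {
    const s = this.spans.get(jobId);
    return s?.endMs !== undefined ? s.endMs - s.startMs : undefined;
  }
}
```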

Logs

  • Structured JSON logs (Serilog).
  • Each log entry carries the active trace_id and span_id so you can pivot from a span to the related logs.
  • Levels: Information in prod, Debug in dev. Warning+ are alertable.
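
The span-to-logs pivot works because every entry carries the active trace context. A sketch of the entry shape, assuming the field names above (`trace_id`, `span_id`); the builder itself is illustrative:

```typescript
type Level = "Debug" | "Information" | "Warning" | "Error";

interface LogEntry {
  timestamp: string;
  level: Level;
  message: string;
  trace_id: string;
  span_id: string;
}

// Attach the active trace context to a structured log entry.
function logWithContext(
  level: Level,
  message: string,
  ctx: { traceId: string; spanId: string },
): LogEntry {
  return {
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  };
}
```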

Metrics

Application metrics are emitted via the OpenTelemetry Metrics API:

| Metric | Source | Use |
| --- | --- | --- |
| request_count | API | Throughput dashboards |
| request_duration_ms | API | Latency p50/p95/p99 |
| kafka_consume_lag | Worker | Backpressure alerts |
| hangfire_jobs_processed | Worker | Scheduler health |
| clickhouse_insert_rows | Worker | Audit + analytics ingest rate |
| worker_errors_total | Worker | Error budget |
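
For reference, the p50/p95/p99 readouts from `request_duration_ms` amount to a percentile over the sample window. A nearest-rank sketch (real dashboards compute this from histogram buckets on the SigNoz side):

```typescript
// Nearest-rank percentile: p in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```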

Alerts

Configured in SigNoz; canonical set:

  • API p99 latency > 1s sustained.
  • Kafka consumer lag > 5 minutes for any topic.
  • Hangfire job failure rate > 5% over 30 min.
  • ClickHouse insert error rate > 1%.
  • Postgres connection saturation > 80%.
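
The "sustained" qualifier in the first rule means the alert fires only when every sample in the lookback window breaches the threshold, not on a single spike. A sketch of that evaluation (the window length is an assumption; SigNoz evaluates this server-side):

```typescript
// True only if all samples inside [now - windowMs, now] exceed the threshold.
function sustainedBreach(
  samples: { tsMs: number; value: number }[],
  thresholdMs: number,
  windowMs: number,
  nowMs: number,
): boolean {
  const window = samples.filter((s) => nowMs - s.tsMs <= windowMs);
  return window.length > 0 && window.every((s) => s.value > thresholdMs);
}
```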

Langfuse (LLM observability)

| Property | Value |
| --- | --- |
| Service | Langfuse (managed or self-hosted) |
| Used by | Gen Web's AI agent only |
| What it captures | LLM prompts, completions, token counts, tool invocations, latencies, evaluation scores |

Why a separate observability lane

SigNoz is great for HTTP/gRPC/DB traces. It's not built around prompts and completions. Langfuse is — different audience (data/AI engineers), different UI (prompt versioning, eval dashboards).

The integration point is the Next.js BFF for the AI agent (/api/agent/stream). Each LLM call and each tool invocation gets exported to Langfuse with:

  • The user's question.
  • The system prompt + the chosen tool.
  • The completion (when applicable).
  • Token counts and cost.
  • The downstream tool call's input/output.
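
The exported record per LLM call can be sketched as below. The field names are illustrative, not the Langfuse SDK's actual schema, and the per-token cost model is an assumption:

```typescript
interface AgentObservation {
  question: string;
  systemPrompt: string;
  tool: string;
  completion?: string;
  tokens: { prompt: number; completion: number };
  costUsd: number;
  toolCall: { input: unknown; output: unknown };
}

// Assemble one observation from the pieces listed above.
function buildObservation(
  question: string,
  systemPrompt: string,
  tool: string,
  tokens: { prompt: number; completion: number },
  costPerTokenUsd: number,
  toolCall: { input: unknown; output: unknown },
  completion?: string,
): AgentObservation {
  return {
    question,
    systemPrompt,
    tool,
    completion,
    tokens,
    costUsd: (tokens.prompt + tokens.completion) * costPerTokenUsd,
    toolCall,
  };
}
```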

What does NOT go to Langfuse

  • The actual data returned from the database. Langfuse should never see analytical results — that's PII and possibly customer-confidential.
  • Anything from Sense or other surfaces — Langfuse is scoped to the AI agent.

Configuration

```json
{
  "LANGFUSE_PUBLIC_KEY": "...",
  "LANGFUSE_SECRET_KEY": "...",
  "LANGFUSE_HOST": "https://cloud.langfuse.com"
}
```

In dev / e2e tests, LLM_PROVIDER=mock short-circuits the LLM call entirely; nothing goes to Langfuse.
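
The short-circuit can be sketched as a provider gate in the BFF. The function shapes are illustrative, not the actual agent code:

```typescript
type Completion = { text: string; exported: boolean };

// When LLM_PROVIDER=mock, return a canned completion and skip Langfuse entirely.
function runAgent(
  prompt: string,
  env: Record<string, string | undefined>,
  realCall: (p: string) => string,
  exportToLangfuse: (p: string, c: string) => void,
): Completion {
  if (env["LLM_PROVIDER"] === "mock") {
    return { text: "[mock completion]", exported: false }; // nothing reaches Langfuse
  }
  const text = realCall(prompt);
  exportToLangfuse(prompt, text);
  return { text, exported: true };
}
```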

What's missing (deliberately)

  • No APM agent — we don't deploy Datadog/New Relic agents. OpenTelemetry covers the same ground at lower lock-in.
  • No log aggregator separate from SigNoz — SigNoz handles logs alongside traces; one less moving piece.
  • No custom RUM — frontend errors flow through Sentry on the web side (separate concern, not part of the platform observability backbone).