
Observability

Two parallel observability lanes:

  1. SigNoz — infrastructure + application telemetry (traces, logs, metrics, alerts) for everything except the AI agent.
  2. Langfuse — LLM-specific telemetry for Gen Web's AI agent (prompts, completions, tool calls, evals).

SigNoz

| Property | Value |
| --- | --- |
| Version | SigNoz 0.119.0 |
| Helm chart | signoz/signoz 0.119.0 (in axion.infra/services/signoz) |
| Backed by | Its own ClickHouse cluster (separate from the Sense analytical cluster) |
| OTel collector | Bundled, exposed on 4317 (gRPC) and 4318 (HTTP) |
| UI | https://signoz-staging.dev.axionx.ai |

What sends to SigNoz

| Source | Signal types |
| --- | --- |
| Sense API | Traces, logs, metrics |
| Sense Worker (incl. Hangfire) | Traces, logs, metrics |
| Gen API | Traces, logs, metrics |
| Postgres exporter | Metrics |
| Kafka exporter | Metrics |
| ClickHouse exporter | Metrics |
| OpenFGA | Metrics, traces |
| Valhalla | Metrics (if exposed by chart) |

All app services use OpenTelemetry SDKs (OpenTelemetry.* NuGet packages on .NET, equivalents on the web BFF). They export via OTLP to the in-cluster collector.
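
The exporter wiring can be expressed with the standard `OTEL_EXPORTER_OTLP_*` environment variables. A minimal sketch, assuming the in-cluster collector service is named `signoz-otel-collector` (the service name and `OTEL_SERVICE_NAME` value are illustrative; ports 4317/4318 are the bundled collector ports above):

```typescript
type OtlpEnv = Record<string, string>;

// Build the OTLP exporter environment for one service.
// "grpc" and "http/protobuf" are the standard OTEL_EXPORTER_OTLP_PROTOCOL values.
function otlpEnv(useGrpc: boolean, host = "signoz-otel-collector"): OtlpEnv {
  return {
    OTEL_EXPORTER_OTLP_PROTOCOL: useGrpc ? "grpc" : "http/protobuf",
    OTEL_EXPORTER_OTLP_ENDPOINT: `http://${host}:${useGrpc ? 4317 : 4318}`,
    OTEL_SERVICE_NAME: "sense-api", // set per service; name is illustrative
  };
}
```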

Trace shape

A frame upload from the mobile app produces a single trace:

```
Span: rpc.RoadDataService/CreateTrack             [Sense API]
├── Span: db.postgres.insert track_logs           [Sense API]
└── Span: kafka.produce track.metadata            [Sense API]
    └── Span: kafka.consume track.metadata        [Sense Worker]
        ├── Span: db.clickhouse.insert frames     [Sense Worker]
        ├── Span: kafka.produce recognition_requests          [Sense Worker]
        │   └── Span: kafka.consume recognition_requests      [Vision Worker]
        │       └── Span: kafka.produce vision_frames_lifecycle (PredictionRequired)
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Quality]
        │           │   └── Span: grpc.triton.quality_check
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Worker, results]
        │           │   └── Span: db.clickhouse.insert detections (quality verdict)
        │           └── Span: kafka.consume vision_frames_lifecycle [Vision Worker, dispatch]
        │               └── Span: http.detector.post (per detector)
        │                   └── Span: http.detections_api.push  [External Detector → Vision Detections API]
        │                       └── Span: kafka.produce vision_frames_lifecycle (LocationEstimationRequired)
        │                           └── Span: kafka.consume vision_frames_lifecycle [Vision Clusterization]
        │                               └── Span: kafka.produce clusterization_requests
        │                                   └── Span: kafka.consume clusterization_requests
        │                                       └── Span: db.clickhouse.insert detections + objects
        └── Span: kafka.produce track.metadata (TrackMatchingRequest) [Sense Worker]
            └── Span: kafka.consume track.metadata               [Vision Matching]
                └── Span: http.valhalla.match
                    └── Span: kafka.produce track.metadata (TrackMatchingResult)
                        └── Span: kafka.consume track.metadata   [Sense Worker]
                            └── Span: db.clickhouse.update tracks/frames (is_map_matched=true)
```

The trace is held together by request_id propagation:

  • HTTP/gRPC ingress sets a request-id header (or echoes a client-supplied one).
  • KafkaFlow's RequestIdProducerMiddleware puts request-id on every produced message header.
  • KafkaFlow consumer middleware reads it back and seeds the W3C traceparent for the new span.
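
The three steps above can be sketched in plain TypeScript. The header names follow the W3C Trace Context format; the helper functions are illustrative, not the real KafkaFlow middleware API:

```typescript
import { randomBytes } from "node:crypto";

const hex = (n: number) => randomBytes(n).toString("hex");

// W3C traceparent: version-traceid-spanid-flags
function makeTraceparent(traceId = hex(16), spanId = hex(8)): string {
  return `00-${traceId}-${spanId}-01`;
}

function parseTraceparent(tp: string): { traceId: string; spanId: string } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(tp);
  return m ? { traceId: m[1], spanId: m[2] } : null;
}

// Producer side: stamp request-id and trace context onto every outgoing message.
function stampHeaders(requestId: string, traceparent: string): Record<string, string> {
  return { "request-id": requestId, traceparent };
}

// Consumer side: read the headers back and seed the new span from them.
function childContext(headers: Record<string, string>) {
  const parent = parseTraceparent(headers["traceparent"] ?? "");
  if (!parent) return null;
  // New span id, same trace id: the consume span joins the producer's trace.
  return { traceId: parent.traceId, spanId: hex(8), requestId: headers["request-id"] };
}
```

Because the trace id survives every produce/consume hop, the whole pipeline above renders as one trace in SigNoz.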

Span attributes (standard set)

Every app span carries:

  • org.id, user.id (when available)
  • request.id
  • track.id / frame.id (when relevant)
  • kafka.topic, kafka.partition, kafka.offset (Kafka spans)
  • db.statement (sanitized — never raw user input)
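
A minimal sketch of the `db.statement` sanitization rule: literal values are replaced with placeholders before the statement is attached to a span, so raw user input never reaches SigNoz. The regexes are illustrative, not the real sanitizer:

```typescript
// Replace string and numeric literals with "?" placeholders.
function sanitizeStatement(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?")   // string literals (incl. '' escapes)
    .replace(/\b\d+(\.\d+)?\b/g, "?"); // numeric literals
}
```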

Hangfire telemetry

HangfireTelemetryFilter subscribes to job state changes and emits OpenTelemetry spans for:

  • Job enqueue
  • Job execution start/end
  • Job failure with exception details

This makes scheduled jobs (PMTiles generation, Citylens migration) observable in the same trace UI as request-driven work.
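
A generic sketch of what such a filter does: translate job state changes into span-like records with timing and failure details. The shapes are illustrative; the real filter emits OpenTelemetry spans via the .NET SDK:

```typescript
interface JobSpan {
  name: string;
  startMs: number;
  endMs?: number;
  status: "running" | "ok" | "error";
  error?: string;
}

class JobTelemetry {
  private spans = new Map<string, JobSpan>();

  onEnqueued(jobId: string, jobName: string, now: number) {
    this.spans.set(jobId, { name: `hangfire.${jobName}`, startMs: now, status: "running" });
  }

  onSucceeded(jobId: string, now: number) {
    const s = this.spans.get(jobId);
    if (s) { s.endMs = now; s.status = "ok"; }
  }

  onFailed(jobId: string, now: number, error: string) {
    const s = this.spans.get(jobId);
    if (s) { s.endMs = now; s.status = "error"; s.error = error; }
  }

  get(jobId: string): JobSpan | undefined {
    return this.spans.get(jobId);
  }

  durationMs(jobId: string): number | undefined {
    const s = this.spans.get(jobId);
    return s?.endMs !== undefined ? s.endMs - s.startMs : undefined;
  }
}
```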

Logs

  • Structured JSON logs (Serilog).
  • Each log entry carries the active trace_id and span_id so you can pivot from a span to the related logs.
  • Levels: Information in prod, Debug in dev. Warning+ are alertable.
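
The span-to-logs pivot works because every entry carries the active trace context. A sketch of the entry shape, assuming the field names above (`trace_id`, `span_id`); the builder itself is illustrative:

```typescript
type Level = "Debug" | "Information" | "Warning" | "Error";

interface LogEntry {
  timestamp: string;
  level: Level;
  message: string;
  trace_id: string;
  span_id: string;
}

// Attach the active trace context to a structured log entry.
function logWithContext(
  level: Level,
  message: string,
  ctx: { traceId: string; spanId: string },
): LogEntry {
  return {
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  };
}
```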

Metrics

Application metrics are emitted via the OpenTelemetry Metrics API:

| Metric | Source | Use |
| --- | --- | --- |
| request_count | API | Throughput dashboards |
| request_duration_ms | API | Latency p50/p95/p99 |
| kafka_consume_lag | Worker | Backpressure alerts |
| hangfire_jobs_processed | Worker | Scheduler health |
| clickhouse_insert_rows | Worker | Audit + analytics ingest rate |
| worker_errors_total | Worker | Error budget |
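
For reference, the p50/p95/p99 readouts from `request_duration_ms` amount to a percentile over the sample window. A nearest-rank sketch (real dashboards compute this from histogram buckets on the SigNoz side):

```typescript
// Nearest-rank percentile: p in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```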

Alerts

Configured in SigNoz; canonical set:

  • API p99 latency > 1s sustained.
  • Kafka consumer lag > 5 minutes for any topic.
  • Hangfire job failure rate > 5% over 30 min.
  • ClickHouse insert error rate > 1%.
  • Postgres connection saturation > 80%.
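
The "sustained" qualifier in the first rule means the alert fires only when every sample in the lookback window breaches the threshold, not on a single spike. A sketch of that evaluation (the window length is an assumption; SigNoz evaluates this server-side):

```typescript
// True only if all samples inside [now - windowMs, now] exceed the threshold.
function sustainedBreach(
  samples: { tsMs: number; value: number }[],
  thresholdMs: number,
  windowMs: number,
  nowMs: number,
): boolean {
  const window = samples.filter((s) => nowMs - s.tsMs <= windowMs);
  return window.length > 0 && window.every((s) => s.value > thresholdMs);
}
```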

Langfuse (LLM observability)

| Property | Value |
| --- | --- |
| Service | Langfuse (managed or self-hosted) |
| Used by | Gen Web's AI agent only |
| What it captures | LLM prompts, completions, token counts, tool invocations, latencies, evaluation scores |

Why a separate observability lane

SigNoz is great for HTTP/gRPC/DB traces. It's not built around prompts and completions. Langfuse is — different audience (data/AI engineers), different UI (prompt versioning, eval dashboards).

The integration point is the Next.js BFF for the AI agent (/api/agent/stream). Each LLM call and each tool invocation gets exported to Langfuse with:

  • The user's question.
  • The system prompt + the chosen tool.
  • The completion (when applicable).
  • Token counts and cost.
  • The downstream tool call's input/output.
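
The exported record per LLM call can be sketched as below. The field names are illustrative, not the Langfuse SDK's actual schema, and the per-token cost model is an assumption:

```typescript
interface AgentObservation {
  question: string;
  systemPrompt: string;
  tool: string;
  completion?: string;
  tokens: { prompt: number; completion: number };
  costUsd: number;
  toolCall: { input: unknown; output: unknown };
}

// Assemble one observation from the pieces listed above.
function buildObservation(
  question: string,
  systemPrompt: string,
  tool: string,
  tokens: { prompt: number; completion: number },
  costPerTokenUsd: number,
  toolCall: { input: unknown; output: unknown },
  completion?: string,
): AgentObservation {
  return {
    question,
    systemPrompt,
    tool,
    completion,
    tokens,
    costUsd: (tokens.prompt + tokens.completion) * costPerTokenUsd,
    toolCall,
  };
}
```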

What does NOT go to Langfuse

  • The actual data returned from the database. Langfuse should never see analytical results — that's PII and possibly customer-confidential.
  • Anything from Sense or other surfaces — Langfuse is scoped to the AI agent.

Configuration

```json
{
  "LANGFUSE_PUBLIC_KEY": "...",
  "LANGFUSE_SECRET_KEY": "...",
  "LANGFUSE_HOST": "https://cloud.langfuse.com"
}
```

In dev / e2e tests, LLM_PROVIDER=mock short-circuits the LLM call entirely; nothing goes to Langfuse.
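
The short-circuit can be sketched as a provider gate in the BFF. The function shapes are illustrative, not the actual agent code:

```typescript
type Completion = { text: string; exported: boolean };

// When LLM_PROVIDER=mock, return a canned completion and skip Langfuse entirely.
function runAgent(
  prompt: string,
  env: Record<string, string | undefined>,
  realCall: (p: string) => string,
  exportToLangfuse: (p: string, c: string) => void,
): Completion {
  if (env["LLM_PROVIDER"] === "mock") {
    return { text: "[mock completion]", exported: false }; // nothing reaches Langfuse
  }
  const text = realCall(prompt);
  exportToLangfuse(prompt, text);
  return { text, exported: true };
}
```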

What's missing (deliberately)

  • No APM agent — we don't deploy Datadog/New Relic agents. OpenTelemetry covers the same ground at lower lock-in.
  • No log aggregator separate from SigNoz — SigNoz handles logs alongside traces; one less moving piece.
  • No custom RUM — frontend errors flow through Sentry on the web side (separate concern, not part of the platform observability backbone).