Observability
Two parallel observability lanes:
- SigNoz — infrastructure + application telemetry (traces, logs, metrics, alerts) for everything except the AI agent.
- Langfuse — LLM-specific telemetry for Gen Web's AI agent (prompts, completions, tool calls, evals).
SigNoz
| Property | Value |
|---|---|
| Version | SigNoz 0.119.0 |
| Helm chart | signoz/signoz 0.119.0 (in axion.infra/services/signoz) |
| Backed by | Its own ClickHouse cluster (separate from the Sense analytical cluster) |
| OTel collector | Bundled, exposed on 4317 (gRPC) and 4318 (HTTP) |
| UI | https://signoz-staging.dev.axionx.ai |
What sends to SigNoz
| Source | Signal types |
|---|---|
| Sense API | Traces, logs, metrics |
| Sense Worker (incl. Hangfire) | Traces, logs, metrics |
| Gen API | Traces, logs, metrics |
| Postgres exporter | Metrics |
| Kafka exporter | Metrics |
| ClickHouse exporter | Metrics |
| OpenFGA | Metrics, traces |
| Valhalla | Metrics (if exposed by chart) |
All app services use OpenTelemetry SDKs (OpenTelemetry.* NuGet packages on .NET, equivalents on the web BFF). They export via OTLP to the in-cluster collector.
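As a rough sketch of what "export via OTLP to the in-cluster collector" means in terms of addresses: the ports come from the table above, and the per-signal paths are the OTLP/HTTP defaults. The collector host name below is hypothetical; this document does not state the real in-cluster service name.

```typescript
// Hypothetical in-cluster host name; substitute the real collector service.
const COLLECTOR_HOST = "signoz-otel-collector.observability.svc";

type OtlpSignal = "traces" | "logs" | "metrics";

interface OtlpTargets {
  grpc: string;                      // one gRPC endpoint serves every signal
  http: Record<OtlpSignal, string>;  // OTLP/HTTP uses one path per signal
}

// Build the OTLP endpoints for a collector listening on 4317 (gRPC) and 4318 (HTTP).
function otlpTargets(host: string): OtlpTargets {
  return {
    grpc: `http://${host}:4317`,
    http: {
      traces: `http://${host}:4318/v1/traces`,
      logs: `http://${host}:4318/v1/logs`,
      metrics: `http://${host}:4318/v1/metrics`,
    },
  };
}

const otlp = otlpTargets(COLLECTOR_HOST);
console.log(otlp.grpc, otlp.http.traces);
```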
Trace shape
A frame upload from the mobile app produces a single trace:
Span: rpc.RoadDataService/CreateTrack [Sense API]
├── Span: db.postgres.insert track_logs [Sense API]
└── Span: kafka.produce track.metadata [Sense API]
    └── Span: kafka.consume track.metadata [Sense Worker]
        ├── Span: db.clickhouse.insert frames [Sense Worker]
        ├── Span: kafka.produce recognition_requests [Sense Worker]
        │   └── Span: kafka.consume recognition_requests [Vision Worker]
        │       └── Span: kafka.produce vision_frames_lifecycle (PredictionRequired)
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Quality]
        │           │   └── Span: grpc.triton.quality_check
        │           ├── Span: kafka.consume vision_frames_lifecycle [Vision Worker, results]
        │           │   └── Span: db.clickhouse.insert detections (quality verdict)
        │           └── Span: kafka.consume vision_frames_lifecycle [Vision Worker, dispatch]
        │               └── Span: http.detector.post (per detector)
        │                   └── Span: http.detections_api.push [External Detector → Vision Detections API]
        │                       └── Span: kafka.produce vision_frames_lifecycle (LocationEstimationRequired)
        │                           └── Span: kafka.consume vision_frames_lifecycle [Vision Clusterization]
        │                               └── Span: kafka.produce clusterization_requests
        │                                   └── Span: kafka.consume clusterization_requests
        │                                       └── Span: db.clickhouse.insert detections + objects
        └── Span: kafka.produce track.metadata (TrackMatchingRequest) [Sense Worker]
            └── Span: kafka.consume track.metadata [Vision Matching]
                └── Span: http.valhalla.match
                    └── Span: kafka.produce track.metadata (TrackMatchingResult)
                        └── Span: kafka.consume track.metadata [Sense Worker]
                            └── Span: db.clickhouse.update tracks/frames (is_map_matched=true)
The trace is held together by request_id propagation:
- HTTP/gRPC ingress sets a request-id header (or echoes a client-supplied one).
- KafkaFlow's RequestIdProducerMiddleware puts request-id on every produced message header.
- KafkaFlow consumer middleware reads it back and seeds the W3C traceparent for the new span.
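The propagation steps above can be sketched in a few lines. This is an illustration only, not the KafkaFlow middleware itself: it assumes the request-id is a UUID, so that its 32 hex digits can serve directly as the W3C trace id. How the real consumer middleware derives the trace id is not described here, and the function names are hypothetical.

```typescript
const REQUEST_ID_HEADER = "request-id";

type MessageHeaders = Record<string, string>;

// Producer side: copy the current request id onto the outgoing message headers.
function withRequestId(headers: MessageHeaders, requestId: string): MessageHeaders {
  return { ...headers, [REQUEST_ID_HEADER]: requestId };
}

// Random lowercase hex string (not cryptographically strong; fine for a sketch).
function randomHex(digits: number): string {
  let out = "";
  for (let i = 0; i < digits; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Consumer side: build a W3C traceparent, "version-traceid-parentid-flags".
// ASSUMPTION: request-id is a UUID; stripping the hyphens yields the trace id.
function traceparentFrom(headers: MessageHeaders): string | undefined {
  const requestId = headers[REQUEST_ID_HEADER];
  if (requestId === undefined) return undefined;

  const traceId = requestId.replace(/-/g, "").toLowerCase();
  // The spec requires 32 hex digits and forbids an all-zero trace id.
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) return undefined;

  let spanId = randomHex(16);
  while (/^0+$/.test(spanId)) spanId = randomHex(16); // all-zero span ids are invalid

  return `00-${traceId}-${spanId}-01`; // version 00, flags 01 = sampled
}
```

A message produced with `withRequestId({}, "4bf92f35-77b3-4da6-a3ce-929d0e0e4736")` would yield a traceparent whose trace id is `4bf92f3577b34da6a3ce929d0e0e4736`, so every hop that sees the same request-id lands in the same trace.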
Span attributes (standard set)
Every app span carries:
- org.id, user.id (when available)
- request.id
- track.id / frame.id (when relevant)
- kafka.topic, kafka.partition, kafka.offset (Kafka spans)
- db.statement (sanitized; never raw user input)
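One way to keep the attribute set consistent is a single helper that every span goes through. The sketch below is hypothetical (the `SpanInfo` shape and `standardAttributes` name are not from the codebase) and only shows the "include when available" rule; it assumes the caller passes an already parameterized statement for db.statement.

```typescript
type AttributeValue = string | number;

interface SpanInfo {
  requestId: string;
  orgId?: string;
  userId?: string;
  trackId?: string;
  frameId?: string;
  kafka?: { topic: string; partition: number; offset: number };
  dbStatement?: string; // must already be parameterized, e.g. "INSERT INTO frames VALUES (?, ?)"
}

// Build the standard attribute map, omitting anything that is not available.
function standardAttributes(info: SpanInfo): Record<string, AttributeValue> {
  const attrs: Record<string, AttributeValue> = { "request.id": info.requestId };
  if (info.orgId !== undefined) attrs["org.id"] = info.orgId;
  if (info.userId !== undefined) attrs["user.id"] = info.userId;
  if (info.trackId !== undefined) attrs["track.id"] = info.trackId;
  if (info.frameId !== undefined) attrs["frame.id"] = info.frameId;
  if (info.kafka !== undefined) {
    attrs["kafka.topic"] = info.kafka.topic;
    attrs["kafka.partition"] = info.kafka.partition;
    attrs["kafka.offset"] = info.kafka.offset;
  }
  if (info.dbStatement !== undefined) attrs["db.statement"] = info.dbStatement;
  return attrs;
}
```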
Hangfire telemetry
HangfireTelemetryFilter subscribes to job state changes and emits OpenTelemetry spans for:
- Job enqueue
- Job execution start/end
- Job failure with exception details
This makes scheduled jobs (PMTiles generation, Citylens migration) observable in the same trace UI as request-driven work.
Logs
- Structured JSON logs (Serilog).
- Each log entry carries the active trace_id and span_id so you can pivot from a span to the related logs.
- Levels: Information in prod, Debug in dev. Warning and above are alertable.
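To make the correlation and level rules concrete, here is a minimal sketch of a log line that carries trace_id and span_id, plus the two level checks. The level names are Serilog's; the JSON field layout and the helper names are assumptions and will differ from what the actual Serilog formatter emits.

```typescript
type LogLevel = "Verbose" | "Debug" | "Information" | "Warning" | "Error" | "Fatal";

// Serilog's levels, lowest to highest severity.
const LEVEL_ORDER: LogLevel[] = ["Verbose", "Debug", "Information", "Warning", "Error", "Fatal"];

// True when `level` is at or above the configured minimum
// (Information in prod, Debug in dev).
function isEnabled(level: LogLevel, minimum: LogLevel): boolean {
  return LEVEL_ORDER.indexOf(level) >= LEVEL_ORDER.indexOf(minimum);
}

// Warning and above are alertable.
function isAlertable(level: LogLevel): boolean {
  return isEnabled(level, "Warning");
}

// Serialize one entry with the ids of the active span attached.
function formatLog(
  level: LogLevel,
  message: string,
  span: { traceId: string; spanId: string },
): string {
  return JSON.stringify({
    level,
    message,
    trace_id: span.traceId,
    span_id: span.spanId,
  });
}
```

Filtering logs by the trace_id shown on a span is then enough to find every entry written while that trace was active.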
Metrics
Application metrics are emitted via the OpenTelemetry Metrics API:
| Metric | Source | Use |
|---|---|---|
| request_count | API | Throughput dashboards |
| request_duration_ms | API | Latency p50/p95/p99 |
| kafka_consume_lag | Worker | Backpressure alerts |
| hangfire_jobs_processed | Worker | Scheduler health |
| clickhouse_insert_rows | Worker | Audit + analytics ingest rate |
| worker_errors_total | Worker | Error budget |
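For readers unfamiliar with the p50/p95/p99 notation, the sketch below computes nearest-rank percentiles over raw request_duration_ms samples. It is purely illustrative: a metrics backend typically derives percentiles from histogram buckets rather than from raw samples, and the `percentile` helper is hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");

  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

const durationsMs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(percentile(durationsMs, 50)); // 50
console.log(percentile(durationsMs, 95)); // 100
console.log(percentile(durationsMs, 99)); // 100
```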
Alerts
Configured in SigNoz; canonical set:
- API p99 latency > 1s sustained.
- Kafka consumer lag > 5 minutes for any topic.
- Hangfire job failure rate > 5% over 30 min.
- ClickHouse insert error rate > 1%.
- Postgres connection saturation > 80%.
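The canonical set can be read as a table of thresholds. The sketch below expresses it as data with a small evaluator; the reading keys (such as `api_p99_latency_ms`) are hypothetical names chosen for illustration, and the real rules live in SigNoz, not in application code. The "sustained" and "over 30 min" windows are assumed to be applied before the readings reach this function.

```typescript
interface AlertRule {
  label: string;
  reading: string;   // hypothetical key into the readings map
  threshold: number; // the rule fires when the reading is strictly greater
}

const CANONICAL_RULES: AlertRule[] = [
  { label: "API p99 latency > 1s", reading: "api_p99_latency_ms", threshold: 1000 },
  { label: "Kafka consumer lag > 5 min", reading: "kafka_consumer_lag_seconds", threshold: 300 },
  { label: "Hangfire job failure rate > 5%", reading: "hangfire_failure_ratio_30m", threshold: 0.05 },
  { label: "ClickHouse insert error rate > 1%", reading: "clickhouse_insert_error_ratio", threshold: 0.01 },
  { label: "Postgres connection saturation > 80%", reading: "postgres_connection_saturation", threshold: 0.8 },
];

// Return the labels of every rule whose reading exceeds its threshold.
// Rules with no reading are skipped rather than treated as firing.
function firingAlerts(rules: AlertRule[], readings: Record<string, number>): string[] {
  return rules
    .filter((rule) => {
      const value = readings[rule.reading];
      return value !== undefined && value > rule.threshold;
    })
    .map((rule) => rule.label);
}
```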
Langfuse (LLM observability)
| Property | Value |
|---|---|
| Service | Langfuse (managed or self-hosted) |
| Used by | Gen Web's AI agent only |
| What it captures | LLM prompts, completions, token counts, tool invocations, latencies, evaluation scores |
Why a separate observability lane
SigNoz handles HTTP/gRPC/DB traces well, but it is not built around prompts and completions. Langfuse is, and it serves a different audience (data/AI engineers) through a different UI (prompt versioning, eval dashboards).
The integration point is the Next.js BFF for the AI agent (/api/agent/stream). Each LLM call and each tool invocation gets exported to Langfuse with:
- The user's question.
- The system prompt + the chosen tool.
- The completion (when applicable).
- Token counts and cost.
- The downstream tool call's input/output.
What does NOT go to Langfuse
- The actual data returned from the database. Langfuse should never see analytical results — that's PII and possibly customer-confidential.
- Anything from Sense or other surfaces — Langfuse is scoped to the AI agent.
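One possible way to export a tool call while keeping database results out of Langfuse is to replace the returned rows with a summary before export. This is an assumption about how the rule could be enforced, not a description of the actual BFF code; the `ToolCall` shape and `redactForLangfuse` name are hypothetical, and no Langfuse SDK call is shown.

```typescript
interface ToolCall {
  tool: string;
  input: Record<string, unknown>;
  output: { rows: unknown[] };
}

interface ExportedToolCall {
  tool: string;
  input: Record<string, unknown>;
  output: { rowCount: number; redacted: true };
}

// Keep the tool name and input, but export only the size of the result set.
// The rows themselves never leave the BFF.
function redactForLangfuse(call: ToolCall): ExportedToolCall {
  return {
    tool: call.tool,
    input: call.input,
    output: { rowCount: call.output.rows.length, redacted: true },
  };
}
```

If tool inputs can themselves contain user data, the same treatment would need to be applied to `input`; that decision is outside what this page specifies.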
Configuration
{
"LANGFUSE_PUBLIC_KEY": "...",
"LANGFUSE_SECRET_KEY": "...",
"LANGFUSE_HOST": "https://cloud.langfuse.com"
}
In dev / e2e tests, LLM_PROVIDER=mock short-circuits the LLM call entirely; nothing goes to Langfuse.
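A minimal sketch of how the settings above and the mock short-circuit might be read at startup. The variable names come from this page; the `langfuseConfig` helper, the choice to disable export when keys are missing, and the default host are assumptions. The environment is passed in as a plain object so the function stays easy to test.

```typescript
interface LangfuseConfig {
  publicKey: string;
  secretKey: string;
  host: string;
}

type EnvVars = Record<string, string | undefined>;

// Returns null when nothing should be exported to Langfuse.
function langfuseConfig(env: EnvVars): LangfuseConfig | null {
  // The mock provider never calls an LLM, so there is nothing to export.
  if (env["LLM_PROVIDER"] === "mock") return null;

  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];
  // ASSUMPTION: missing credentials disable export instead of failing startup.
  if (!publicKey || !secretKey) return null;

  return {
    publicKey,
    secretKey,
    // ASSUMPTION: fall back to the cloud host shown in the configuration above.
    host: env["LANGFUSE_HOST"] ?? "https://cloud.langfuse.com",
  };
}
```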
What's missing (deliberately)
- No APM agent — we don't deploy Datadog/New Relic agents. OpenTelemetry covers the same ground at lower lock-in.
- No log aggregator separate from SigNoz — SigNoz handles logs alongside traces; one less moving piece.
- No custom RUM — frontend errors flow through Sentry on the web side (separate concern, not part of the platform observability backbone).