Observability — operator how-to

Sibling docs: observability.md is the architectural overview (pipeline, exporters, why-it-works). This doc is the operational how-to — what to type when something is on fire.

Every PromQL/LogQL example below is lifted verbatim from a panel in observability/grafana/provisioning/dashboards/*.json or from one of the perf documents (docs/audits/perf-promql-audit.md, docs/performance-test-report.md). They have been validated against the local stack. Where a query is not taken from a panel, the example is marked {{verify before customer share}}.

1. Where each signal lives

I want to see…	Tool	URL pattern	Example query
Request rate per service	Grafana → `ebit-perf-test`	`<grafana>/d/ebit-perf-test`	`sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))`
Request p95 latency per service	Grafana → `ebit-perf-test` or `ebit-service-overview`	same	`histogram_quantile(0.95, sum by (le, service_name) (rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])))`
Error rate per service	Grafana → `ebit-perf-test`	same	`sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])) / sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])) * 100`
Error logs from `api`	Loki via Grafana Explore	`<grafana>/explore`	`{service_name="ebit-api"} |= "ERROR"`
Logs by trace_id	Loki via Grafana Explore	`<grafana>/explore`	`{service_name="ebit-api"} |= "<trace_id>"`
Full trace for a request	Jaeger	`<jaeger>:16686` (`/search?service=ebit-api`)	search by traceID, or click a Grafana exemplar
Container CPU / mem / disk / net	Grafana → `ebit-perf-system`	`<grafana>/d/ebit-perf-system`	`100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))`
Postgres slowest tables	Grafana → `ebit-prisma-postgres`	`<grafana>/d/ebit-prisma-postgres`	`topk(5, histogram_quantile(0.95, sum by (le, db_sql_table) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))))`
Postgres hot queries (raw)	psql + `pg_stat_statements`	shell to db pod	`SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;`
Redis throughput by command	Grafana → `ebit-redis`	`<grafana>/d/ebit-redis`	`sum by (span_name) (rate(calls_total{span_kind="SPAN_KIND_CLIENT",span_name=~"(?i)(get|set|del|hget|hset|...)"}[1m]))` (full filter in dashboard JSON)
BullMQ queue depth	Grafana → `ebit-bullmq`	`<grafana>/d/ebit-bullmq`	`bullmq_queue_jobs{state="wait"}`
Logs ↔ trace pivot UI	Grafana → `ebit-logs-trace-pivot`	`<grafana>/d/ebit-logs-trace-pivot`	(mixed Loki + Tempo/Jaeger panels)

Local URLs: Grafana http://localhost:3003 (admin/grafana) · Jaeger http://localhost:16686 · Prometheus http://localhost:9090 · Loki http://localhost:3100. Production URLs come from the perf-stack Terraform outputs (terraform output -json | jq .grafana_url) — see docs/perf-run-checklist.md.

2. Find a trace from a user complaint

Customer says "betting was slow at 14:30". Walk-through:

Open Grafana → Dashboards → ebit · Performance Test (or ebit · Service Overview (RED) for prod-shape views).
Set the time range to a 30-min window straddling 14:30 (e.g., 14:15–14:45). Top-right time picker.
Look at "p95 Latency by Service". The line for service_name="api" (or bj, speed-roulette, etc.) should show a visible spike at 14:30.
Click the spike. Grafana renders an exemplar dot (small diamond) on the same time axis when traces have been sampled at that latency. Hover → "View Trace in Jaeger" link.
Jaeger opens with the slowest trace from that bucket. Inspect the waterfall: the longest span is the bottleneck. Common culprits documented in docs/audits/perf-trace-coverage-audit.md.
No exemplar dot visible? Tail-sampling drops ~90% of OK traces; only ERROR / SLOW samples are kept (see docs/audits/jaeger-storage-research.md). Click into Jaeger directly and search by service+operation+min-duration; you'll find a representative trace even without the exemplar link.
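The fallback search in the last step can also be built as a link and pasted into an incident channel. The sketch below is a minimal, unofficial helper. It assumes the Jaeger UI accepts the same query parameters it writes into its own search URLs (service, operation, minDuration, start and end in microseconds, lookback=custom); the operation name and the date are hypothetical. Copy a URL from your own Jaeger instance to confirm the parameter names before relying on it.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def jaeger_search_url(base_url, service, operation, min_duration, start, end, limit=20):
    """Build a Jaeger UI search link for one service/operation and time window."""
    params = {
        "service": service,
        "operation": operation,          # hypothetical operation name in the example below
        "minDuration": min_duration,     # e.g. "200ms"
        "start": int(start.timestamp() * 1_000_000),  # assumed unit: microseconds
        "end": int(end.timestamp() * 1_000_000),
        "limit": limit,
        "lookback": "custom",
    }
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


if __name__ == "__main__":
    # 30-minute window straddling 14:30 on a hypothetical date (UTC).
    print(jaeger_search_url(
        "http://localhost:16686", "ebit-api", "POST /bets", "200ms",
        datetime(2024, 1, 1, 14, 15, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 14, 45, tzinfo=timezone.utc),
    ))
```

Raising min_duration narrows the result to the slow traces that tail-sampling keeps, which is usually what the complaint is about.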

3. Find a log from a trace

Given a trace_id (e.g., the root traceID shown in Jaeger):

{service_name="ebit-api"} |= "<trace_id>"

This works because the OTel pino instrumentation injects trace_id / span_id / trace_flags into every log record (see observability.md §"Trace correlation"). The |= line filter does a case-sensitive substring match on the raw JSON line.

For span-level filtering, narrow further:

{service_name="ebit-api"} |= "<trace_id>" |= "<span_id>"

EvoLogger records (winston-backed, ~40 call sites) reach Loki via the filelog/docker receiver and don't carry a service_name resource attribute. Query with:

{source="docker_filelog"} |= "<trace_id>"

Combine both with or-style logic by running them in two Explore tabs side-by-side; LogQL has no native union.
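If two Explore tabs become unwieldy, the union can be done client-side. The sketch below is one possible approach, not part of the provisioned stack: it builds two calls to Loki's /loki/api/v1/query_range endpoint and merges the JSON responses by timestamp. The local Loki URL comes from this doc; the helper names are hypothetical.

```python
from urllib.parse import urlencode

LOKI = "http://localhost:3100"  # local Loki URL from this doc


def query_range_url(logql, start_ns, end_ns, limit=1000):
    """Build a query_range URL; start/end are Unix timestamps in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns,
              "limit": limit, "direction": "forward"}
    return f"{LOKI}/loki/api/v1/query_range?{urlencode(params)}"


def merge_streams(*responses):
    """Merge Loki responses into one deduplicated list of (ts_ns, line), oldest first."""
    seen = set()
    merged = []
    for resp in responses:
        for stream in resp["data"]["result"]:
            for ts, line in stream["values"]:
                if (ts, line) not in seen:
                    seen.add((ts, line))
                    merged.append((int(ts), line))
    merged.sort(key=lambda item: item[0])
    return merged


if __name__ == "__main__":
    trace_id = "<trace_id>"  # placeholder, as in the queries above
    for q in ('{service_name="ebit-api"} |= "%s"' % trace_id,
              '{source="docker_filelog"} |= "%s"' % trace_id):
        print(query_range_url(q, 0, 1))
```

Fetch each URL with any HTTP client, parse the bodies as JSON, and pass them to merge_streams to get a single timeline for the trace.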

4. Find a trace from a log

Inverse: you have a noisy ERROR log line and want the full trace.

In Grafana Explore with the Loki datasource, run a LogQL query that surfaces the line, e.g. {service_name="ebit-api"} |= "ERROR".
Click the line. The provisioned derivedFields config (in observability/grafana/provisioning/datasources/datasource.yaml) renders a "View trace" link on the right of any record carrying a trace_id field.
Click "View trace". Jaeger opens directly to the root span.

If the link doesn't appear: the log record didn't carry a trace_id (likely a startup record emitted before a span was active, or an EvoLogger record). Fall back to §5 below.
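To check by hand whether a given line would get a "View trace" link, it helps to see what a derived-field match amounts to: a regex with one capture group applied to the log line. The sketch below is illustrative only. The pattern is an assumption (the real one lives in datasource.yaml and may differ), and the sample record is made up.

```python
import re

# Hypothetical pattern: a JSON "trace_id" key holding a 32-hex-char OTel trace ID.
TRACE_ID_RE = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(line):
    """Return the trace ID embedded in a JSON log line, or None if absent."""
    match = TRACE_ID_RE.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    with_trace = ('{"level":50,"msg":"bet failed",'
                  '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",'
                  '"span_id":"00f067aa0ba902b7","trace_flags":"01"}')
    startup = '{"level":30,"msg":"Nest application successfully started"}'
    print(extract_trace_id(with_trace))  # a trace ID: link would render
    print(extract_trace_id(startup))     # None: no link
```

A line that returns None here is the "link doesn't appear" case described above.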

5. Cross-service tracing gotchas

Cross-service trace propagation breaks in three documented cases. Operators chasing a "missing parent span" symptom should check these before assuming a code bug.

Redis pub/sub RPC (ExternalControllerClient). Any operation that crosses services via Redis pub/sub does not propagate the OTel context — the consumer starts an orphan trace. Workaround: search Jaeger by service_name of the consumer + a unique correlation field (request body hash, user_id, etc.) and pivot on that. Documented in project_otel_microservice_transport_gap.md.
Websocket /events (rt service). The socket.io gateway has no server-side OTel instrumentation; messages enter and leave without spans. Logs in the rt service still carry trace_id if a parent context exists, but most ws-only operations have no parent. Workaround: trace by user_id ({service_name="ebit-rt"} |= "user_id=<id>").
BullMQ bet-settled consumer. traceparent is not stored in the job payload, so the consumer span is an orphan. Documented in docs/audits/perf-trace-coverage-audit.md. Workaround: pivot on bet_id field — the producer's bet-place trace and the consumer's bet-settle trace can be joined manually.
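The manual join in the BullMQ workaround can be scripted once the relevant log lines are exported from Loki. The sketch below is a rough illustration under assumptions: it expects JSON log lines that carry both a bet_id and a trace_id field (field names and sample values are hypothetical), and it simply groups the distinct trace IDs seen for each bet.

```python
import json
from collections import defaultdict


def join_traces_by_bet_id(lines):
    """Group JSON log lines by bet_id and list the distinct trace_ids seen for each."""
    by_bet = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. plain-text records)
        if not isinstance(record, dict):
            continue
        bet_id = record.get("bet_id")
        trace_id = record.get("trace_id")
        if bet_id and trace_id and trace_id not in by_bet[bet_id]:
            by_bet[bet_id].append(trace_id)
    return dict(by_bet)


if __name__ == "__main__":
    sample = [
        '{"msg":"bet placed","bet_id":"b-1","trace_id":"producer-trace"}',
        '{"msg":"bet settled","bet_id":"b-1","trace_id":"consumer-trace"}',
        'plain text line without JSON',
    ]
    for bet, traces in join_traces_by_bet_id(sample).items():
        print(bet, traces)
```

A bet with two trace IDs is the orphan pair: open both in Jaeger and read them as one logical request.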

6. Performance regression hunt

"p95 was 50 ms last week, now 200 ms".

Open ebit-service-overview in Grafana with the time range set to 7d.
Look at the p95-by-service panel. Identify the moment p95 stepped up.
Cross-reference deployments. If the panel has a deployment_version label series (currently {{TBD: deployment_version label not yet wired}} in the local stack — see docs/perf-run-checklist.md), the regression maps to a release. Otherwise, correlate against the git log --until of ebit-api for the suspect window.
Drill into spanmetrics by span_name. Switch to ebit-perf-test and look at the per-span panels (Prisma, Redis, BullMQ). One of them will show the same step-up — that's the layer where the regression landed.
Pull a representative trace for the regressed window (Jaeger search with the same service+operation+min-duration). Compare span breakdown vs a pre-regression trace from the same operation.
Confirm in source: git log the controller / service that owns the regressed span between the two timestamps.
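Spotting "the moment p95 stepped up" is normally done by eye, but on a noisy 7d panel a small script over exported samples can help. The sketch below is one simple heuristic, not the method used by any dashboard: it reports the first point where every one of the next few samples is at least `ratio` times every one of the previous few. The window size, ratio, and sample values are assumptions to tune.

```python
def find_step_up(samples, window=3, ratio=2.0):
    """Return the timestamp where a sustained step-up begins, or None.

    samples: list of (timestamp, p95_ms) pairs, oldest first.
    A step is reported at index i when min(next `window` values) is at
    least `ratio` times max(previous `window` values).
    """
    for i in range(window, len(samples) - window + 1):
        before = [value for _, value in samples[i - window:i]]
        after = [value for _, value in samples[i:i + window]]
        if max(before) > 0 and min(after) >= ratio * max(before):
            return samples[i][0]
    return None


if __name__ == "__main__":
    # Hypothetical daily p95 samples: about 50 ms, then about 200 ms.
    series = [(0, 50), (1, 52), (2, 49), (3, 51), (4, 200), (5, 205), (6, 198), (7, 201)]
    print(find_step_up(series))
```

Requiring the whole window to clear the threshold keeps a single latency spike from being reported as a regression.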

7. Capacity check

"We doubled our user count, do we have headroom?"

Open ebit-perf-system in Grafana for the 1h peak window.
Cross-reference five panels:
CPU utilization per VM — peaks below 70% mean CPU has runway.
Memory pressure — 1 - MemAvailable / MemTotal should stay below ~0.85.
Disk IO weighted time — proxy for queue depth; rising linearly with load is fine, super-linear is a saturation warning.
Network throughput per interface — compare against the instance's nominal bandwidth (c7g class details in terraform/perf/README.md).
Conntrack utilization — node_nf_conntrack_entries / node_nf_conntrack_entries_limit < 0.5 is healthy.
Pull the FD count (process_open_fds or node_filefd_allocated) — Node clusters at ~10k connections hit the default fd limit; raise via ulimit ahead of doubling.
Compare against a prior ramp. If the perf programme has a "1× user count" reference run captured in docs/performance-test-report.md, the deltas there give a linear extrapolation.
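The thresholds above can be turned into a quick go/no-go check. The sketch below is a back-of-the-envelope aid under a strong assumption: that each signal scales linearly with user count, which is the same linear extrapolation mentioned in the last step and will not hold for every resource. The peak values in the example are made up.

```python
# Thresholds taken from the checklist above (expressed as fractions of 1).
THRESHOLDS = {
    "cpu_util": 0.70,        # CPU utilization per VM
    "mem_pressure": 0.85,    # 1 - MemAvailable / MemTotal
    "conntrack_util": 0.50,  # conntrack entries / limit
}


def headroom_report(peaks, load_factor=2.0):
    """Project each observed peak to `load_factor` x load and flag threshold breaches."""
    report = {}
    for name, limit in THRESHOLDS.items():
        projected = peaks[name] * load_factor  # assumes linear scaling
        report[name] = {
            "current": peaks[name],
            "projected": projected,
            "ok_now": peaks[name] < limit,
            "ok_projected": projected < limit,
        }
    return report


if __name__ == "__main__":
    observed = {"cpu_util": 0.30, "mem_pressure": 0.50, "conntrack_util": 0.10}
    for signal, row in headroom_report(observed).items():
        print(signal, row)
```

Any signal with ok_projected set to False is the one to load-test first, since the linear estimate is only a starting point.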

8. Common confusions (pitfalls)

Spanmetrics names are unprefixed. The metrics emitted by the OTel spanmetrics connector are calls_total and duration_milliseconds_bucket — not traces_spanmetrics_calls_total. Old docs that reference the prefixed form are wrong (already corrected in perf-promql-audit.md).
Process-exporter node group collapses 5 NestJS apps. The default group rule lumps all node processes into groupname="node". To split per-app you need cmdline regex rules in process-exporter.yml. {{verify before customer share}} — exact patterns documented in the perf-system dashboard variables.
Jaeger v2 healthcheck path is /status, not /. v1 used /. Probes that haven't been updated will permanently report unhealthy.
Tail-sampling drops ~90% of OK traces. If you can't find a trace by min-duration=0 search, that's expected — the storage tier only keeps ERROR / SLOW / sampled. See docs/audits/jaeger-storage-research.md for the exact policy.
Two metric sources for HTTP latency. http_server_duration_milliseconds_bucket (OTel HTTP instrumentation, labels http_route / http_status_code) and duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"} (spanmetrics, labels span_name / status_code). Dashboards use the spanmetrics form; older runbooks use the HTTP-instrumentation form. They report the same fact but from different sources — see docs/audits/perf-promql-audit.md MISMATCH 1.
x-captcha-token: pass only works locally. This is a captcha bypass, not an observability one — but it bites operators who try to replay a prod trace from curl. See docs/api-reference/index.md §"Captcha bypass (local only)".
bj and speed-roulette have no Swagger. Their REST surface is internal only; you won't find them in the API reference. Operate on them via logs + traces.

9. Local-dev observability

The same stack runs locally via docker compose -f observability/docker-compose.yml up -d (or the project-level compose if observability is wired into the main file). URLs:

Tool	Local URL	Default credentials
Grafana	`http://localhost:3003`	`admin` / `grafana`
Jaeger UI	`http://localhost:16686`	none
Prometheus	`http://localhost:9090`	none
Loki HTTP	`http://localhost:3100`	none (curl/Grafana only)
OTel Collector OTLP gRPC	`localhost:4317`	none (clients)
OTel Collector OTLP HTTP	`localhost:4318`	none

Datasources are pre-provisioned (observability/grafana/provisioning/datasources/datasource.yaml); dashboards land automatically (observability/grafana/provisioning/dashboards/). The only setup steps for a fresh dev box are: have Docker, have the compose file, docker compose up. The observability network is pre-wired so app containers reach the collector at otel-collector:4317.

If a dashboard says "No data": check that the relevant app exporter is running (docker compose ps), then check that Prometheus is scraping it (<prom>/targets).
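The scrape check can also be run without the browser. The sketch below is a minimal example against Prometheus's /api/v1/targets endpoint: it lists every active target whose health is not "up", along with the last scrape error. The local Prometheus URL comes from the table above; the job names in the sample payload are hypothetical.

```python
import json
from urllib.request import urlopen

PROM = "http://localhost:9090"  # local Prometheus URL from this doc


def fetch_targets(base_url=PROM):
    """Fetch the targets payload from a running Prometheus (network call)."""
    with urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def unhealthy_targets(targets_response):
    """Return (job, scrapeUrl, lastError) for each active target that is not 'up'."""
    bad = []
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            bad.append((
                target.get("labels", {}).get("job"),
                target.get("scrapeUrl"),
                target.get("lastError", ""),
            ))
    return bad


if __name__ == "__main__":
    for job, url, error in unhealthy_targets(fetch_targets()):
        print(f"{job}: {url} -> {error}")
```

An empty result with a dashboard still showing "No data" points away from scraping and toward the app exporter or the panel's label filters.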

10. Useful saved queries

These are the queries operators run repeatedly. Each is verbatim from a dashboard panel or a perf doc.

# 1. Service request rate (server side)
sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))

# 2. Service p95 latency
histogram_quantile(0.95,
  sum by (le, service_name) (
    rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])
  )
)

# 3. Service error rate (5xx %)
sum by (service_name) (
  rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])
) / sum by (service_name) (
  rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])
) * 100

# 4. Prisma ops/s by query type
sum by (span_name) (rate(calls_total{span_name=~"prisma:.*"}[1m]))

# 5. DB query p95 across all tables
histogram_quantile(0.95,
  sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))
)

# 6. Top 5 slowest tables
topk(5, histogram_quantile(0.95,
  sum by (le, db_sql_table) (
    rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])
  )
))

# 7. BullMQ queue depth (jobs waiting)
bullmq_queue_jobs{state="wait"}

# 8. Container CPU per VM
100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# 9. Container memory pressure
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# 10. Conntrack utilization
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

# 11. k6 virtual users (during a perf run)
k6_vus{scenario=~"$scenario"}

# 12. k6 request p95 (client-observed, in ms)
histogram_quantile(0.95, rate(k6_http_req_duration_seconds_bucket{scenario=~"$scenario"}[1m])) * 1000

# 13. Errors from any service in the last 5 min
{service_name=~"ebit-.*"} |= "ERROR"

# 14. Trace pivot — every record for one trace
{service_name="ebit-api"} |= "<trace_id>"

# 15. EvoLogger-only records (no service_name resource)
{source="docker_filelog"} |= "EvoLogger"

Cross-links

observability.md — architectural overview (sibling, why-it-works).
../observability.md — top-level OTel/Loki/Jaeger pipeline + validation curls.
../architecture/tracing-flow.md — Mermaid flowchart of the OTel pipeline.
../audits/perf-promql-audit.md — cross-audit of every PromQL across the perf programme; cite when a query in this doc looks wrong.
../audits/perf-trace-coverage-audit.md — per-endpoint OTel span coverage and the documented blind spots.
../perf-run-checklist.md — pre-flight / in-flight / post-flight checklist for a perf run; uses many of these queries.
../performance-test-report.md — the bottleneck-hunting patterns table referenced in §6.
../e2e-trace-demo.md — runnable end-to-end trace walkthrough, complements the §2/§3/§4 pivots above.
../audits/jaeger-storage-research.md — tail-sampling policy, retention.
../adr/0002-spanmetrics-over-prisma-metrics.md, 0007-evologger-kept-not-migrated.md — relevant decisions.