
Observability — operator how-to

Sibling docs: observability.md is the architectural overview (pipeline, exporters, why-it-works). This doc is the operational how-to — what to type when something is on fire.

Every PromQL/LogQL example below is lifted verbatim from a panel in observability/grafana/provisioning/dashboards/*.json or from one of the perf documents (docs/audits/perf-promql-audit.md, docs/performance-test-report.md). They have been validated against the local stack. Where a query is not taken from a panel, the example is marked {{verify before customer share}}.

1. Where each signal lives

| I want to see… | Tool | URL pattern | Example query |
|---|---|---|---|
| Request rate per service | Grafana → ebit-perf-test | <grafana>/d/ebit-perf-test | sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])) |
| Request p95 latency per service | Grafana → ebit-perf-test or ebit-service-overview | same | histogram_quantile(0.95, sum by (le, service_name) (rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m]))) |
| Error rate per service | Grafana → ebit-perf-test | same | sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])) / sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])) * 100 |
| Error logs from api | Loki via Grafana Explore | <grafana>/explore | {service_name="ebit-api"} \|= "ERROR" |
| Logs by trace_id | Loki via Grafana Explore | <grafana>/explore | {service_name="ebit-api"} \|= "<trace_id>" |
| Full trace for a request | Jaeger | <jaeger>:16686 (/search?service=ebit-api) | search by traceID, or click a Grafana exemplar |
| Container CPU / mem / disk / net | Grafana → ebit-perf-system | <grafana>/d/ebit-perf-system | 100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) |
| Postgres slowest tables | Grafana → ebit-prisma-postgres | <grafana>/d/ebit-prisma-postgres | topk(5, histogram_quantile(0.95, sum by (le, db_sql_table) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])))) |
| Postgres hot queries (raw) | psql + pg_stat_statements | shell to db pod | SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20; |
| Redis throughput by command | Grafana → ebit-redis | <grafana>/d/ebit-redis | sum by (span_name) (rate(calls_total{span_kind="SPAN_KIND_CLIENT",span_name=~"(?i)(get\|set\|del\|hget\|hset\|...)"}[1m])) (full filter in dashboard JSON) |
| BullMQ queue depth | Grafana → ebit-bullmq | <grafana>/d/ebit-bullmq | bullmq_queue_jobs{state="wait"} |
| Logs ↔ trace pivot UI | Grafana → ebit-logs-trace-pivot | <grafana>/d/ebit-logs-trace-pivot | (mixed Loki + Tempo/Jaeger panels) |

Local URLs: Grafana http://localhost:3003 (admin/grafana) · Jaeger http://localhost:16686 · Prometheus http://localhost:9090 · Loki http://localhost:3100. Production URLs come from the perf-stack Terraform outputs (terraform output -json | jq .grafana_url) — see docs/perf-run-checklist.md.
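Any PromQL query from the table can also be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch that only builds the request URL; the helper name `instant_query_url` is illustrative and not part of the repository, and the base URL is the local Prometheus address listed above.

```python
from urllib.parse import urlencode

# Local Prometheus URL from this doc; swap in the Terraform output for prod.
PROMETHEUS_URL = "http://localhost:9090"

# Request-rate query copied from the table above.
REQUEST_RATE = (
    'sum by (service_name) '
    '(rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression.

    Hypothetical helper: it only URL-encodes the expression and appends it
    to the standard /api/v1/query endpoint.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Paste the printed URL into curl or a browser.
    print(instant_query_url(PROMETHEUS_URL, REQUEST_RATE))
```

The response is JSON with a `data.result` array, one entry per `service_name` series.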

2. Find a trace from a user complaint

Customer says "betting was slow at 14:30". Walk-through:

  1. Open Grafana → Dashboards → ebit · Performance Test (or ebit · Service Overview (RED) for prod-shape views).
  2. Set the time range to a 30-min window straddling 14:30 (e.g., 14:15–14:45). Top-right time picker.
  3. Look at "p95 Latency by Service". The line for service_name="api" (or bj, speed-roulette, etc.) should show a visible spike at 14:30.
  4. Click the spike. Grafana renders an exemplar dot (small diamond) on the same time axis when traces have been sampled at that latency. Hover → "View Trace in Jaeger" link.
  5. Jaeger opens with the slowest trace from that bucket. Inspect the waterfall: the longest span is the bottleneck. Common culprits documented in docs/audits/perf-trace-coverage-audit.md.
  6. No exemplar dot visible? Tail-sampling drops ~90% of OK traces; only ERROR / SLOW samples are kept (see docs/audits/jaeger-storage-research.md). Click into Jaeger directly and search by service+operation+min-duration; you'll find a representative trace even without the exemplar link.
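
Steps 1 and 2 can be collapsed into a single deep link, since Grafana dashboard URLs accept `from` and `to` as epoch milliseconds. A minimal sketch, assuming the local Grafana URL and the ebit-perf-test dashboard UID from section 1; the function name and the example date are illustrative.

```python
from datetime import datetime, timedelta, timezone


def grafana_window_url(base_url, dashboard_uid, incident, half_width_min=15):
    """Return a dashboard URL whose time range straddles `incident`.

    `incident` must be a timezone-aware datetime. Grafana reads `from`
    and `to` query parameters as epoch milliseconds.
    """
    start = incident - timedelta(minutes=half_width_min)
    end = incident + timedelta(minutes=half_width_min)
    from_ms = int(start.timestamp() * 1000)
    to_ms = int(end.timestamp() * 1000)
    return f"{base_url.rstrip('/')}/d/{dashboard_uid}?from={from_ms}&to={to_ms}"


if __name__ == "__main__":
    # Hypothetical complaint time: 14:30 UTC on an arbitrary day.
    complaint = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
    print(grafana_window_url("http://localhost:3003", "ebit-perf-test", complaint))
```

Confirm the customer's timezone before building the link; a complaint quoted in local time needs converting to the matching UTC instant first.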

3. Find a log from a trace

Given a trace_id (e.g., the root traceID shown in Jaeger):

{service_name="ebit-api"} |= "<trace_id>"

This works because the OTel pino instrumentation injects trace_id / span_id / trace_flags into every log record (see observability.md §"Trace correlation"). The |= line filter performs a substring match on the raw JSON line.

For span-level filtering, narrow further:

{service_name="ebit-api"} |= "<trace_id>" |= "<span_id>"

EvoLogger records (winston-backed, ~40 call sites) reach Loki via the filelog/docker receiver and do not carry the service_name resource attribute. Query with:

{source="docker_filelog"} |= "<trace_id>"

Combine both with or-style logic by running them in two Explore tabs side-by-side; LogQL has no native union.
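
When two Explore tabs are awkward, the union can be done client-side: run both queries against Loki's `/loki/api/v1/query_range` endpoint and merge the results by timestamp. The sketch below covers only the query construction and the merge step; the function names are illustrative, and the input is assumed to be the parsed JSON body of a `query_range` response for a log query (streams with `values` as `[timestamp_ns, line]` pairs).

```python
def trace_queries(trace_id):
    """Return the two LogQL queries from this section for one trace_id."""
    return [
        f'{{service_name="ebit-api"}} |= "{trace_id}"',
        f'{{source="docker_filelog"}} |= "{trace_id}"',
    ]


def merge_streams(*responses):
    """Flatten Loki query_range responses into one time-ordered list.

    Each response is the parsed JSON body; each stream carries `values`
    as [timestamp_ns_string, line] pairs. Exact duplicates (same
    timestamp and line from both queries) are collapsed.
    """
    entries = set()
    for response in responses:
        for stream in response["data"]["result"]:
            for ts_ns, line in stream["values"]:
                entries.add((int(ts_ns), line))
    return sorted(entries)


if __name__ == "__main__":
    # Tiny hand-written payloads, for illustration only.
    pino = {"data": {"result": [{"values": [["2", "pino line"]]}]}}
    evo = {"data": {"result": [{"values": [["1", "evo line"]]}]}}
    for ts, line in merge_streams(pino, evo):
        print(ts, line)
```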

4. Find a trace from a log

Inverse: you have a noisy ERROR log line and want the full trace.

  1. In Grafana Explore with the Loki datasource, run a LogQL query that surfaces the line, e.g. {service_name="ebit-api"} |= "ERROR".
  2. Click the line. The provisioned derivedFields config (in observability/grafana/provisioning/datasources/datasource.yaml) renders a "View trace" link on the right of any record carrying a trace_id field.
  3. Click "View trace". Jaeger opens directly to the root span.

If the link doesn't appear: the log record didn't carry a trace_id (likely a startup record before the span was active, or an EvoLogger record). Fall back to section 5 below.
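
Outside the Grafana UI (for example, when working from exported log lines), the same pivot can be approximated by hand. This is a sketch under two assumptions: the log line is a JSON object with a top-level trace_id field, as the pino instrumentation described in section 3 produces, and the Jaeger UI serves traces at `/trace/<trace_id>`. The function name is illustrative.

```python
import json


def jaeger_trace_url(jaeger_base, log_line):
    """Return the Jaeger UI URL for the trace referenced by a JSON log line.

    Returns None when the line is not a JSON object or has no trace_id,
    which mirrors the "no View trace link" case described above.
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    trace_id = record.get("trace_id")
    if not trace_id:
        return None
    return f"{jaeger_base.rstrip('/')}/trace/{trace_id}"


if __name__ == "__main__":
    # Hand-written sample record, not real output.
    line = '{"level":"error","msg":"boom","trace_id":"0af7651916cd43dd8448eb211c80319c"}'
    print(jaeger_trace_url("http://localhost:16686", line))
```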

5. Cross-service tracing gotchas

Cross-service trace propagation breaks in three documented cases. Operators chasing a "missing parent span" symptom should check these before assuming a code bug.

  1. Redis pub/sub RPC (ExternalControllerClient). Any operation that crosses services via Redis pub/sub does not propagate the OTel context — the consumer starts an orphan trace. Workaround: search Jaeger by service_name of the consumer + a unique correlation field (request body hash, user_id, etc.) and pivot on that. Documented in project_otel_microservice_transport_gap.md.
  2. Websocket /events (rt service). The socket.io gateway has no server-side OTel instrumentation; messages enter and leave without spans. Logs in the rt service still carry trace_id if a parent context exists, but most ws-only operations have no parent. Workaround: trace by user_id ({service_name="ebit-rt"} |= "user_id=<id>").
  3. BullMQ bet-settled consumer. traceparent is not stored in the job payload, so the consumer span is an orphan. Documented in docs/audits/perf-trace-coverage-audit.md. Workaround: pivot on bet_id field — the producer's bet-place trace and the consumer's bet-settle trace can be joined manually.
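
The manual join in case 3 amounts to matching producer and consumer spans on bet_id. A minimal sketch, assuming spans have already been exported from Jaeger into dicts with `bet_id` and `trace_id` keys; that record shape and the function name are hypothetical, not an existing export format in this stack.

```python
from collections import defaultdict


def join_by_bet_id(producer_spans, consumer_spans):
    """Pair bet-place traces with bet-settle traces that share a bet_id.

    Each span is a dict with at least `bet_id` and `trace_id`
    (hypothetical shape). Returns (bet_id, producer_trace_id,
    consumer_trace_id) tuples; unmatched spans are dropped.
    """
    consumers = defaultdict(list)
    for span in consumer_spans:
        consumers[span["bet_id"]].append(span["trace_id"])

    pairs = []
    for span in producer_spans:
        for consumer_trace in consumers.get(span["bet_id"], []):
            pairs.append((span["bet_id"], span["trace_id"], consumer_trace))
    return pairs


if __name__ == "__main__":
    # Illustrative records only.
    placed = [{"bet_id": "b1", "trace_id": "p1"}]
    settled = [{"bet_id": "b1", "trace_id": "c1"}]
    print(join_by_bet_id(placed, settled))
```

The same pattern applies to the other two cases with user_id or another correlation field in place of bet_id.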

6. Performance regression hunt

"p95 was 50 ms last week, now 200 ms".

  1. Open ebit-service-overview in Grafana with the time range set to 7d.
  2. Look at the p95-by-service panel. Identify the moment p95 stepped up.
  3. Cross-reference deployments. If the panel has a deployment_version label series (currently {{TBD: deployment_version label not yet wired}} in the local stack; see docs/perf-run-checklist.md), the regression maps to a release. Otherwise, correlate against the git log of ebit-api, bounded to the suspect window with --since/--until.
  4. Drill into spanmetrics by span_name. Switch to ebit-perf-test and look at the per-span panels (Prisma, Redis, BullMQ). One of them will show the same step-up — that's the layer where the regression landed.
  5. Pull a representative trace for the regressed window (Jaeger search with the same service+operation+min-duration). Compare span breakdown vs a pre-regression trace from the same operation.
  6. Confirm in source: git log the controller / service that owns the regressed span between the two timestamps.
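
Step 2 ("identify the moment p95 stepped up") is usually done by eye, but it can be scripted against a series pulled from Prometheus. The sketch below is one simple heuristic, not the method used by any dashboard here: it flags the first point from which several consecutive samples all sit well above the median of the preceding window. The function name, window size, and ratio are assumptions to tune.

```python
from statistics import median


def find_step_up(series, window=3, ratio=2.0):
    """Return the timestamp where a sustained step-up begins, or None.

    `series` is a time-ordered list of (timestamp, value) pairs. A step-up
    starts at index i when every one of the next `window` values is at
    least `ratio` times the median of the previous `window` values.
    Requiring all of them to be elevated filters out single spikes.
    """
    for i in range(window, len(series) - window + 1):
        before = median(v for _, v in series[i - window:i])
        after = min(v for _, v in series[i:i + window])
        if before > 0 and after >= ratio * before:
            return series[i][0]
    return None


if __name__ == "__main__":
    # Synthetic p95 series: 50 ms for six samples, then 200 ms.
    p95 = [(t, 50.0) for t in range(6)] + [(t, 200.0) for t in range(6, 12)]
    print(find_step_up(p95))
```

The returned timestamp gives the upper bound for the git log window in step 6.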

7. Capacity check

"We doubled our user count, do we have headroom?"

  1. Open ebit-perf-system in Grafana for the 1h peak window.
  2. Cross-reference five panels:
     - CPU utilization per VM: peaks below 70% mean CPU has runway.
     - Memory pressure: 1 - MemAvailable / MemTotal should stay below ~0.85.
     - Disk IO weighted time: a proxy for queue depth; rising linearly with load is fine, super-linear is a saturation warning.
     - Network throughput per interface: compare against the instance's nominal bandwidth (c7g class details in terraform/perf/README.md).
     - Conntrack utilization: node_nf_conntrack_entries / node_nf_conntrack_entries_limit < 0.5 is healthy.
  3. Pull the FD count (process_open_fds or node_filefd_allocated). Node clusters at ~10k connections hit the default fd limit; raise via ulimit ahead of doubling.
  4. Compare against a prior ramp. If the perf programme has a "1× user count" reference run captured in docs/performance-test-report.md, the deltas there give a linear extrapolation.
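
The numeric thresholds in this checklist can be applied to peak values read off the panels. A minimal sketch, using only the three limits stated above (70% CPU, 0.85 memory pressure, 0.5 conntrack utilization); the metric keys and function name are illustrative, and disk IO and network are left out because the text gives no fixed limit for them.

```python
# Limits taken from the checklist above; a value must stay BELOW its limit.
THRESHOLDS = {
    "cpu_utilization_pct": 70.0,
    "memory_pressure": 0.85,
    "conntrack_utilization": 0.5,
}


def headroom_report(peaks):
    """Compare observed peak values against the checklist limits.

    `peaks` maps metric name to the peak observed in the 1h window.
    Returns True (has headroom), False (at or over the limit), or
    None (no reading supplied) for each metric.
    """
    report = {}
    for name, limit in THRESHOLDS.items():
        value = peaks.get(name)
        report[name] = None if value is None else value < limit
    return report


if __name__ == "__main__":
    # Illustrative readings, not measured data.
    print(headroom_report({"cpu_utilization_pct": 62.0, "memory_pressure": 0.9}))
```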

8. Common confusions (pitfalls)

  1. Spanmetrics names are unprefixed. The metrics emitted by the OTel spanmetrics connector are calls_total and duration_milliseconds_bucket, not traces_spanmetrics_calls_total. Old docs that reference the prefixed form are wrong (already corrected in perf-promql-audit.md).
  2. Process-exporter node group collapses 5 NestJS apps. The default group rule lumps all node processes into groupname="node". To split per-app you need cmdline regex rules in process-exporter.yml. {{verify before customer share}} — exact patterns documented in the perf-system dashboard variables.
  3. Jaeger v2 healthcheck path is /status, not /. v1 used /. Probes that haven't been updated will permanently report unhealthy.
  4. Tail-sampling drops ~90% of OK traces. If you can't find a trace by min-duration=0 search, that's expected — the storage tier only keeps ERROR / SLOW / sampled. See docs/audits/jaeger-storage-research.md for the exact policy.
  5. Two metric sources for HTTP latency. http_server_duration_milliseconds_bucket (OTel HTTP instrumentation, labels http_route / http_status_code) and duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"} (spanmetrics, labels span_name / status_code). Dashboards use the spanmetrics form; older runbooks use the HTTP-instrumentation form. They report the same fact but from different sources — see docs/audits/perf-promql-audit.md MISMATCH 1.
  6. x-captcha-token: pass only works locally. This is a captcha bypass, not an observability one — but it bites operators who try to replay a prod trace from curl. See docs/api-reference/index.md §"Captcha bypass (local only)".
  7. bj and speed-roulette have no Swagger. Their REST surface is internal only; you won't find them in the API reference. Operate on them via logs + traces.

9. Local-dev observability

The same stack runs locally via docker compose -f observability/docker-compose.yml up -d (or the project-level compose if observability is wired into the main file). URLs:

Tool Local URL Default credentials
Grafana http://localhost:3003 admin / grafana
Jaeger UI http://localhost:16686 none
Prometheus http://localhost:9090 none
Loki HTTP http://localhost:3100 none (curl/Grafana only)
OTel Collector OTLP gRPC localhost:4317 none (clients)
OTel Collector OTLP HTTP localhost:4318 none

Datasources are pre-provisioned (observability/grafana/provisioning/datasources/datasource.yaml); dashboards land automatically (observability/grafana/provisioning/dashboards/). The only setup steps for a fresh dev box are: have Docker, have the compose file, docker compose up. The observability network is pre-wired so app containers reach the collector at otel-collector:4317.

If a dashboard says "No data": check that the relevant app exporter is running (docker compose ps), then check that Prometheus is scraping it (<prom>/targets).
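
The scrape check can also be scripted against the Prometheus targets API (`/api/v1/targets`), which returns the same information as the targets page in JSON. The sketch below only filters an already-parsed response; the function name is illustrative, and fetching the payload (for example with curl against the local Prometheus URL) is left to the reader.

```python
def unhealthy_targets(targets_payload):
    """List scrape targets whose health is not "up".

    `targets_payload` is the parsed JSON body of GET /api/v1/targets.
    Returns (job, scrape_url, last_error) tuples.
    """
    bad = []
    for target in targets_payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            bad.append((
                target.get("labels", {}).get("job"),
                target.get("scrapeUrl"),
                target.get("lastError", ""),
            ))
    return bad


if __name__ == "__main__":
    # Hand-written sample payload, for illustration only.
    sample = {"data": {"activeTargets": [
        {"labels": {"job": "otel-collector"}, "scrapeUrl": "http://otel-collector:8889/metrics",
         "health": "down", "lastError": "connection refused"},
    ]}}
    print(unhealthy_targets(sample))
```

An empty list with a dashboard still showing "No data" points away from scraping and toward the app exporter or the query itself.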

10. Useful saved queries

These are the queries operators run repeatedly. Each is verbatim from a dashboard panel or a perf doc.

# 1. Service request rate (server side)
sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))

# 2. Service p95 latency
histogram_quantile(0.95,
  sum by (le, service_name) (
    rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])
  )
)

# 3. Service error rate (5xx %)
sum by (service_name) (
  rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])
) / sum by (service_name) (
  rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])
) * 100

# 4. Prisma ops/s by query type
sum by (span_name) (rate(calls_total{span_name=~"prisma:.*"}[1m]))

# 5. DB query p95 across all tables
histogram_quantile(0.95,
  sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))
)

# 6. Top 5 slowest tables
topk(5, histogram_quantile(0.95,
  sum by (le, db_sql_table) (
    rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])
  )
))

# 7. BullMQ queue depth (jobs waiting)
bullmq_queue_jobs{state="wait"}

# 8. Container CPU per VM
100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# 9. Container memory pressure
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# 10. Conntrack utilization
node_nf_conntrack_entries / node_nf_conntrack_entries_limit

# 11. k6 virtual users (during a perf run)
k6_vus{scenario=~"$scenario"}

# 12. k6 request p95 (client-observed, in ms)
histogram_quantile(0.95, rate(k6_http_req_duration_seconds_bucket{scenario=~"$scenario"}[1m])) * 1000

# 13. Errors from any service in the last 5 min
{service_name=~"ebit-.*"} |= "ERROR"

# 14. Trace pivot — every record for one trace
{service_name="ebit-api"} |= "<trace_id>"

# 15. EvoLogger-only records (no service_name resource)
{source="docker_filelog"} |= "EvoLogger"