Observability: operator how-to
Sibling docs: `observability.md` is the architectural overview (pipeline, exporters, why-it-works). This doc is the operational how-to: what to type when something is on fire.
Every PromQL/LogQL example below is lifted verbatim from a panel in observability/grafana/provisioning/dashboards/*.json or from one of the perf documents (docs/audits/perf-promql-audit.md, docs/performance-test-report.md). They have been validated against the local stack. Where a query is not taken from a panel, the example is marked {{verify before customer share}}.
1. Where each signal lives
| I want to see… | Tool | URL pattern | Example query |
|---|---|---|---|
| Request rate per service | Grafana → `ebit-perf-test` | `<grafana>/d/ebit-perf-test` | `sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))` |
| Request p95 latency per service | Grafana → `ebit-perf-test` or `ebit-service-overview` | same | `histogram_quantile(0.95, sum by (le, service_name) (rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])))` |
| Error rate per service | Grafana → `ebit-perf-test` | same | `sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])) / sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])) * 100` |
| Error logs from api | Loki via Grafana Explore | `<grafana>/explore` | `{service_name="ebit-api"} \|= "ERROR"` |
| Logs by trace_id | Loki via Grafana Explore | `<grafana>/explore` | `{service_name="ebit-api"} \|= "<trace_id>"` |
| Full trace for a request | Jaeger | `<jaeger>:16686` (`/search?service=ebit-api`) | search by traceID, or click a Grafana exemplar |
| Container CPU / mem / disk / net | Grafana → `ebit-perf-system` | `<grafana>/d/ebit-perf-system` | `100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))` |
| Postgres slowest tables | Grafana → `ebit-prisma-postgres` | `<grafana>/d/ebit-prisma-postgres` | `topk(5, histogram_quantile(0.95, sum by (le, db_sql_table) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))))` |
| Postgres hot queries (raw) | psql + `pg_stat_statements` | shell to db pod | `SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;` |
| Redis throughput by command | Grafana → `ebit-redis` | `<grafana>/d/ebit-redis` | `sum by (span_name) (rate(calls_total{span_kind="SPAN_KIND_CLIENT",span_name=~"(?i)(get\|set\|del\|hget\|hset\|...)"}[1m]))` (full filter in dashboard JSON) |
| BullMQ queue depth | Grafana → `ebit-bullmq` | `<grafana>/d/ebit-bullmq` | `bullmq_queue_jobs{state="wait"}` |
| Logs ↔ trace pivot UI | Grafana → `ebit-logs-trace-pivot` | `<grafana>/d/ebit-logs-trace-pivot` | (mixed Loki + Tempo/Jaeger panels) |
Local URLs: Grafana http://localhost:3003 (admin/grafana) · Jaeger http://localhost:16686 · Prometheus http://localhost:9090 · Loki http://localhost:3100. Production URLs come from the perf-stack Terraform outputs (`terraform output -json | jq -r .grafana_url.value`, since the JSON form wraps each output in an object with a `value` key); see `docs/perf-run-checklist.md`.
2. Find a trace from a user complaint
Customer says "betting was slow at 14:30". Walk-through:
- Open Grafana → `Dashboards` → `ebit · Performance Test` (or `ebit · Service Overview (RED)` for prod-shape views).
- Set the time range to a 30-min window straddling 14:30 (e.g., `14:15–14:45`). Use the top-right time picker.
- Look at "p95 Latency by Service". The line for `service_name="api"` (or `bj`, `speed-roulette`, etc.) should show a visible spike at 14:30.
- Click the spike. Grafana renders an exemplar dot (small diamond) on the same time axis when traces have been sampled at that latency. Hover → "View Trace in Jaeger" link.
- Jaeger opens with the slowest trace from that bucket. Inspect the waterfall: the longest span is the bottleneck. Common culprits are documented in `docs/audits/perf-trace-coverage-audit.md`.
- No exemplar dot visible? Tail-sampling drops ~90% of OK traces; only ERROR / SLOW samples are kept (see `docs/audits/jaeger-storage-research.md`). Click into Jaeger directly and search by service + operation + min-duration; you'll find a representative trace even without the exemplar link.
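When there is no exemplar to click, the fallback Jaeger search can be prepared as a link. The sketch below is a minimal illustration, not part of the stack: it assumes the Jaeger UI accepts `start`/`end` as epoch microseconds plus `minDuration`, `operation`, and `limit` as search parameters, and the function name, the 500 ms threshold, and the complaint date are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def jaeger_search_url(base_url, service, complaint_time, min_duration="500ms",
                      half_window=timedelta(minutes=15), operation=None):
    """Build a Jaeger UI search link for a window straddling complaint_time."""
    start = complaint_time - half_window
    end = complaint_time + half_window
    params = {
        "service": service,
        # Assumption: the Jaeger UI reads start/end as epoch microseconds.
        "start": int(start.timestamp() * 1_000_000),
        "end": int(end.timestamp() * 1_000_000),
        "minDuration": min_duration,
        "limit": 20,
    }
    if operation:
        params["operation"] = operation
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


# Hypothetical complaint: "betting was slow at 14:30" (UTC, made-up date).
complaint = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
url = jaeger_search_url("http://localhost:16686", "ebit-api", complaint)
print(url)
```

The default half-window of 15 minutes reproduces the 14:15–14:45 range from the walk-through; pass `operation=` once the suspect endpoint is known.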
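3. Find a log from a trace
Given a trace_id (e.g., the root traceID shown in Jaeger), run the trace-pivot query in Grafana Explore with the Loki datasource: `{service_name="ebit-api"} |= "<trace_id>"` (saved query 14 in §10).
This works because the OTel pino instrumentation injects `trace_id` / `span_id` / `trace_flags` into every log record (see `observability.md` §"Trace correlation"). The `|=` line filter does a substring match on the JSON line.
For span-level filtering, narrow further by chaining a second line filter on the span ID, for example `{service_name="ebit-api"} |= "<trace_id>" |= "<span_id>"`.
EvoLogger records (winston-backed, ~40 call sites) reach Loki via the filelog/docker receiver, so they don't carry the `service_name` resource attribute. Query them with `{source="docker_filelog"} |= "EvoLogger"` (saved query 15 in §10).
To see both sets of records together, run the two queries in two Explore tabs side by side; LogQL has no native union.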
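Because LogQL cannot union the two selectors, one option is to build both queries in a script and merge the results client-side. The following is a rough sketch under stated assumptions: the helper names are hypothetical, results are modelled as `(timestamp_ns, line)` tuples, and the extra `needle` filter on EvoLogger records assumes those lines contain some correlation string you already know (this is not guaranteed by the source).

```python
def trace_pivot_query(trace_id, service="ebit-api", span_id=None):
    """LogQL that substring-matches a trace_id (and optionally a span_id)."""
    query = f'{{service_name="{service}"}} |= "{trace_id}"'
    if span_id:
        query += f' |= "{span_id}"'
    return query


def evologger_query(needle):
    """LogQL for EvoLogger records, which lack the service_name resource."""
    return f'{{source="docker_filelog"}} |= "EvoLogger" |= "{needle}"'


def merge_streams(*results):
    """Union several (timestamp_ns, line) lists, deduplicated and time-ordered."""
    return sorted(set().union(*results))


# Hypothetical results, as if fetched separately for each query.
pino = [(1700000000000000000, '{"level":50,"trace_id":"abc","msg":"bet failed"}')]
evo = [(1699999999000000000, "EvoLogger abc settle start")]
merged = merge_streams(pino, evo)
print(trace_pivot_query("abc", span_id="def"))
print(evologger_query("abc"))
```

Sorting on the timestamp gives a single interleaved timeline, which is usually what the side-by-side Explore tabs are being used to approximate.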
4. Find a trace from a log
Inverse: you have a noisy ERROR log line and want the full trace.
- In Grafana Explore with the Loki datasource, run a LogQL query that surfaces the line, e.g. `{service_name="ebit-api"} |= "ERROR"`.
- Click the line. The provisioned `derivedFields` config (in `observability/grafana/provisioning/datasources/datasource.yaml`) renders a "View trace" link on the right of any record carrying a `trace_id` field.
- Click "View trace". Jaeger opens directly to the root span.
If the link doesn't appear, the log record didn't carry a trace_id (likely a startup record emitted before the span was active, or an EvoLogger record). Fall back to the workarounds in §5 below.
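The same pivot can be done outside Grafana when you only have a raw log line. This is a minimal sketch, not the provisioned `derivedFields` rule: the regex is an assumption about how `trace_id` appears in the JSON line (the real matcher lives in `datasource.yaml` and is not reproduced here), and the function name is hypothetical. It relies on two things that are standard: OTel trace IDs are 32 lowercase hex characters, and the Jaeger UI serves a trace at `/trace/<trace_id>`.

```python
import re

# Assumed shape of the field in a pino JSON record: "trace_id":"<32 hex chars>"
TRACE_ID_RE = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def trace_link(log_line, jaeger_base="http://localhost:16686"):
    """Return a Jaeger trace URL for the line, or None if it has no trace_id."""
    match = TRACE_ID_RE.search(log_line)
    if match is None:
        return None
    return f"{jaeger_base}/trace/{match.group(1)}"


# Hypothetical log lines: one request-scoped record, one startup record.
error_line = ('{"level":50,"msg":"ERROR settle failed",'
              '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",'
              '"span_id":"00f067aa0ba902b7"}')
startup_line = '{"level":30,"msg":"Nest application successfully started"}'

print(trace_link(error_line))
print(trace_link(startup_line))
```

A `None` result corresponds to the "link doesn't appear" case described above.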
5. Cross-service tracing gotchas
Cross-service trace propagation breaks in three documented cases. Operators chasing a "missing parent span" symptom should check these before assuming a code bug.
- Redis pub/sub RPC (`ExternalControllerClient`). Any operation that crosses services via Redis pub/sub does not propagate the OTel context, so the consumer starts an orphan trace. Workaround: search Jaeger by `service_name` of the consumer + a unique correlation field (request body hash, user_id, etc.) and pivot on that. Documented in `project_otel_microservice_transport_gap.md`.
- Websocket `/events` (rt service). The socket.io gateway has no server-side OTel instrumentation; messages enter and leave without spans. Logs in the rt service still carry `trace_id` if a parent context exists, but most ws-only operations have no parent. Workaround: trace by user_id (`{service_name="ebit-rt"} |= "user_id=<id>"`).
- BullMQ bet-settled consumer. `traceparent` is not stored in the job payload, so the consumer span is an orphan. Documented in `docs/audits/perf-trace-coverage-audit.md`. Workaround: pivot on the `bet_id` field; the producer's bet-place trace and the consumer's bet-settle trace can be joined manually.
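The manual join on `bet_id` described for the BullMQ case can be sketched as follows. This is an illustration only: it assumes you have already parsed producer and consumer log records into dicts that each carry a `bet_id` and a `trace_id` key (both record shapes and the function name are hypothetical).

```python
from collections import defaultdict


def join_on_bet_id(producer_records, consumer_records):
    """Pair producer and consumer trace IDs that share a bet_id."""
    consumers = defaultdict(list)
    for rec in consumer_records:
        consumers[rec["bet_id"]].append(rec["trace_id"])
    pairs = []
    for rec in producer_records:
        for consumer_trace in consumers.get(rec["bet_id"], []):
            pairs.append((rec["bet_id"], rec["trace_id"], consumer_trace))
    return pairs


# Hypothetical records parsed from bet-place and bet-settle log lines.
producers = [
    {"bet_id": "b1", "trace_id": "p1"},
    {"bet_id": "b2", "trace_id": "p2"},
]
consumers = [{"bet_id": "b1", "trace_id": "c1"}]

pairs = join_on_bet_id(producers, consumers)
print(pairs)
```

Each tuple gives the two trace IDs to open side by side in Jaeger; a bet with no consumer record (here `b2`) simply produces no pair.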
6. Performance regression hunt
"p95 was 50 ms last week, now 200 ms".
- Open `ebit-service-overview` in Grafana with the time range set to 7d.
- Look at the p95-by-service panel. Identify the moment p95 stepped up.
- Cross-reference deployments. If the panel has a `deployment_version` label series (currently {{TBD: deployment_version label not yet wired}} in the local stack; see `docs/perf-run-checklist.md`), the regression maps to a release. Otherwise, correlate against `git log --until` of `ebit-api` for the suspect window.
- Drill into spanmetrics by `span_name`. Switch to `ebit-perf-test` and look at the per-span panels (Prisma, Redis, BullMQ). One of them will show the same step-up; that is the layer where the regression landed.
- Pull a representative trace for the regressed window (Jaeger search with the same service + operation + min-duration). Compare the span breakdown against a pre-regression trace from the same operation.
- Confirm in source: `git log` the controller / service that owns the regressed span between the two timestamps.
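"Identify the moment p95 stepped up" is usually done by eye, but it can be approximated from exported samples. The sketch below is one simple heuristic, not a method the dashboards use: it slides a boundary across the series and picks the point where the mean of the following window most exceeds the mean of the preceding window. The function name, window size, and sample data are hypothetical.

```python
from statistics import fmean


def find_step_up(samples, window=3):
    """Return (timestamp, mean_before, mean_after) at the largest upward step.

    samples is a time-ordered list of (timestamp, p95_ms) pairs.
    """
    best = None
    for i in range(window, len(samples) - window + 1):
        before = fmean(v for _, v in samples[i - window:i])
        after = fmean(v for _, v in samples[i:i + window])
        if best is None or after - before > best[2] - best[1]:
            best = (samples[i][0], before, after)
    return best


# Hypothetical p95 series (timestamp index, milliseconds): ~50 ms, then ~200 ms.
samples = [(0, 50), (1, 52), (2, 49), (3, 51), (4, 200), (5, 205), (6, 198), (7, 202)]
step = find_step_up(samples)
print(step)
```

The returned timestamp is the first sample after the jump, which is the value to feed into `git log --until` or a deployment lookup.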
7. Capacity check
"We doubled our user count, do we have headroom?"
- Open `ebit-perf-system` in Grafana for the 1h peak window.
- Cross-reference five panels:
  - `CPU utilization per VM`: peaks below 70% mean CPU has runway.
  - `Memory pressure`: `1 - MemAvailable / MemTotal` should stay below ~0.85.
  - `Disk IO weighted time`: proxy for queue depth; rising linearly with load is fine, super-linear is a saturation warning.
  - `Network throughput per interface`: compare against the instance's nominal bandwidth (c7g class details in `terraform/perf/README.md`).
  - `Conntrack utilization`: `node_nf_conntrack_entries / node_nf_conntrack_entries_limit` < 0.5 is healthy.
- Pull the FD count (`process_open_fds` or `node_filefd_allocated`). Node clusters at ~10k connections hit the default fd limit; raise it via `ulimit` ahead of doubling.
- Compare against a prior ramp. If the perf programme has a "1× user count" reference run captured in `docs/performance-test-report.md`, the deltas there give a linear extrapolation.
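The thresholds above can be turned into a quick what-if check. This is a back-of-the-envelope sketch, assuming utilization scales linearly with user count (the same linear extrapolation the last step mentions; disk IO in particular may not behave that way). Applying the "healthy" thresholds to the projected values is a choice made for this example, and the function name and peak figures are hypothetical.

```python
# Thresholds taken from the panel guidance above (as fractions, not percent).
THRESHOLDS = {
    "cpu_utilization": 0.70,
    "memory_pressure": 0.85,
    "conntrack_utilization": 0.50,
}


def headroom_after_scaling(peaks, factor=2.0):
    """Linearly project peak utilization and flag metrics that cross their threshold."""
    report = {}
    for name, limit in THRESHOLDS.items():
        projected = peaks[name] * factor
        report[name] = {"projected": projected, "limit": limit, "ok": projected < limit}
    return report


# Hypothetical 1h-peak readings from ebit-perf-system.
peaks = {"cpu_utilization": 0.30, "memory_pressure": 0.50, "conntrack_utilization": 0.10}
report = headroom_after_scaling(peaks)
for name, row in report.items():
    print(name, row)
```

In this made-up example CPU and conntrack keep their runway at 2× load, while memory would be the first resource to need attention.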
8. Common confusions (pitfalls)
- Spanmetrics names are unprefixed. The metrics emitted by the OTel `spanmetrics` connector are `calls_total` and `duration_milliseconds_bucket`, not `traces_spanmetrics_calls_total`. Old docs that reference the prefixed form are wrong (already corrected in `perf-promql-audit.md`).
- Process-exporter `node` group collapses 5 NestJS apps. The default group rule lumps all node processes into `groupname="node"`. To split per-app you need `cmdline` regex rules in `process-exporter.yml`. {{verify before customer share}}: exact patterns are documented in the perf-system dashboard variables.
- Jaeger v2 healthcheck path is `/status`, not `/`. v1 used `/`. Probes that haven't been updated will permanently report unhealthy.
- Tail-sampling drops ~90% of OK traces. If you can't find a trace by min-duration=0 search, that's expected: the storage tier only keeps ERROR / SLOW / sampled. See `docs/audits/jaeger-storage-research.md` for the exact policy.
- Two metric sources for HTTP latency. `http_server_duration_milliseconds_bucket` (OTel HTTP instrumentation, labels `http_route` / `http_status_code`) and `duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}` (spanmetrics, labels `span_name` / `status_code`). Dashboards use the spanmetrics form; older runbooks use the HTTP-instrumentation form. They report the same fact but from different sources; see `docs/audits/perf-promql-audit.md` MISMATCH 1.
- `x-captcha-token: pass` only works locally. This is a captcha bypass, not an observability one, but it bites operators who try to replay a prod trace from curl. See `docs/api-reference/index.md` §"Captcha bypass (local only)".
- `bj` and `speed-roulette` have no Swagger. Their REST surface is internal only; you won't find them in the API reference. Operate on them via logs + traces.
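For the first pitfall, queries copied from old docs can be normalised mechanically. The snippet below is a small sketch with a hypothetical helper name; it covers `calls_total` and the `duration_milliseconds` histogram series, where the `_sum` and `_count` suffixes are included on the assumption that the usual Prometheus histogram series accompany `_bucket`.

```python
import re

# Matches the old prefixed spanmetrics names and captures the unprefixed part.
PREFIXED = re.compile(
    r"\btraces_spanmetrics_(calls_total|duration_milliseconds_(?:bucket|sum|count))\b"
)


def unprefix_spanmetrics(query):
    """Rewrite traces_spanmetrics_* metric names to the unprefixed form."""
    return PREFIXED.sub(r"\1", query)


old = 'rate(traces_spanmetrics_calls_total{span_kind="SPAN_KIND_SERVER"}[1m])'
fixed = unprefix_spanmetrics(old)
print(fixed)
```

Queries that already use the unprefixed names pass through unchanged, so the helper is safe to run over a whole runbook.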
9. Local-dev observability
The same stack runs locally via docker compose -f observability/docker-compose.yml up -d (or the project-level compose if observability is wired into the main file). URLs:
| Tool | Local URL | Default credentials |
|---|---|---|
| Grafana | http://localhost:3003 | admin / grafana |
| Jaeger UI | http://localhost:16686 | none |
| Prometheus | http://localhost:9090 | none |
| Loki HTTP | http://localhost:3100 | none (curl/Grafana only) |
| OTel Collector OTLP gRPC | localhost:4317 | none (clients) |
| OTel Collector OTLP HTTP | localhost:4318 | none |
Datasources are pre-provisioned (observability/grafana/provisioning/datasources/datasource.yaml); dashboards land automatically (observability/grafana/provisioning/dashboards/). The only setup steps for a fresh dev box are: have Docker, have the compose file, docker compose up. The observability network is pre-wired so app containers reach the collector at otel-collector:4317.
If a dashboard says "No data": check that the relevant app exporter is running (docker compose ps), then check that Prometheus is scraping it (<prom>/targets).
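The `<prom>/targets` check can also be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint returns the active targets with a `health` field. The sketch below only parses such a payload; fetching it (for example with `curl http://localhost:9090/api/v1/targets`) is left to the operator. The function name, job names, and sample payload are hypothetical.

```python
import json


def unhealthy_targets(targets_payload):
    """List (job, scrapeUrl, lastError) for active targets whose health is not 'up'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Hypothetical, trimmed response from /api/v1/targets.
payload = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "otel-collector"}, "scrapeUrl": "http://otel-collector:8889/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "node-exporter"}, "scrapeUrl": "http://node-exporter:9100/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

down = unhealthy_targets(payload)
print(down)
```

An empty list means Prometheus is scraping everything it knows about, so a "No data" panel more likely points at a missing exporter or a label mismatch in the query.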
10. Useful saved queries
These are the queries operators run repeatedly. Each is verbatim from a dashboard panel or a perf doc.
# 1. Service request rate (server side)
sum by (service_name) (rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m]))
# 2. Service p95 latency
histogram_quantile(0.95,
sum by (le, service_name) (
rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])
)
)
# 3. Service error rate (5xx %)
sum by (service_name) (
rate(calls_total{span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_ERROR"}[1m])
) / sum by (service_name) (
rate(calls_total{span_kind="SPAN_KIND_SERVER"}[1m])
) * 100
# 4. Prisma ops/s by query type
sum by (span_name) (rate(calls_total{span_name=~"prisma:.*"}[1m]))
# 5. DB query p95 across all tables
histogram_quantile(0.95,
sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))
)
# 6. Top 5 slowest tables
topk(5, histogram_quantile(0.95,
sum by (le, db_sql_table) (
rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])
)
))
# 7. BullMQ queue depth (jobs waiting)
bullmq_queue_jobs{state="wait"}
# 8. Container CPU per VM
100 - 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
# 9. Container memory pressure
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# 10. Conntrack utilization
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
# 11. k6 virtual users (during a perf run)
k6_vus{scenario=~"$scenario"}
# 12. k6 request p95 (client-observed, in ms)
histogram_quantile(0.95, rate(k6_http_req_duration_seconds_bucket{scenario=~"$scenario"}[1m])) * 1000
# 13. Errors from any service in the last 5 min
{service_name=~"ebit-.*"} |= "ERROR"
# 14. Trace pivot — every record for one trace
{service_name="ebit-api"} |= "<trace_id>"
# 15. EvoLogger-only records (no service_name resource)
{source="docker_filelog"} |= "EvoLogger"
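Queries 11 and 12 contain `$scenario`, which is a Grafana template variable and will not resolve if the query is pasted into Prometheus directly. The sketch below shows one way to substitute it and build an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The helper names are hypothetical, and the plain string replacement is deliberately naive: it would also rewrite a longer variable name that starts with the same text.

```python
from urllib.parse import urlencode


def render_query(template, variables):
    """Replace Grafana-style $name placeholders with concrete values."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace(f"${name}", value)
    return rendered


def instant_query_url(prom_base, query):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{prom_base.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


# ".*" matches every scenario; use a concrete scenario name to narrow it.
query = render_query('k6_vus{scenario=~"$scenario"}', {"scenario": ".*"})
url = instant_query_url("http://localhost:9090", query)
print(query)
print(url)
```

The rendered query can equally be pasted into the Prometheus expression browser; the URL form is mainly useful for scripted checks during a perf run.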
Cross-links
- `observability.md`: architectural overview (sibling, why-it-works).
- `../observability.md`: top-level OTel/Loki/Jaeger pipeline + validation curls.
- `../architecture/tracing-flow.md`: Mermaid flowchart of the OTel pipeline.
- `../audits/perf-promql-audit.md`: cross-audit of every PromQL across the perf programme; cite when a query in this doc looks wrong.
- `../audits/perf-trace-coverage-audit.md`: per-endpoint OTel span coverage and the documented blind spots.
- `../perf-run-checklist.md`: pre-flight / in-flight / post-flight checklist for a perf run; uses many of these queries.
- `../performance-test-report.md`: the bottleneck-hunting patterns table referenced in §6.
- `../e2e-trace-demo.md`: runnable end-to-end trace walkthrough, complements the §2/§3/§4 pivots above.
- `../audits/jaeger-storage-research.md`: tail-sampling policy, retention.
- `../adr/0002-spanmetrics-over-prisma-metrics.md`, `0007-evologger-kept-not-migrated.md`: relevant decisions.