Tracing flow¶
How OTel data moves from instrumented processes through the Collector, into Jaeger / Prometheus / Loki, and finally into Grafana panels, and where the flow is blind.
Generated 2026-04-25. Source of truth for the pipeline shape:
observability/otel-collector.yml. Service-side bootstrap lives in ebit-api/libs/shared/src/basic/pre/pre-otel.main.ts (api/rt/sr) and is loaded via NODE_OPTIONS=--require .../register.js for bj/bo. admin-fe is now a Vite SPA (no SSR), so traces from the admin UI originate in the browser only.
Pipeline diagram¶
Two diagrams. A shows what emits telemetry and how it reaches the otel-collector (ingress). B shows the collector's internal pipeline and how processed data lands in each backend + Grafana. They share the otel-collector :4317/:4318 boundary — A ends where B begins.
A. Telemetry ingress — emitters → otel-collector¶
flowchart LR
subgraph emitters["Telemetry sources"]
direction TB
nest["All 5 Nest apps<br/><i>api · rt · bj · bo · sr</i>"]
ebit_fe["ebit-fe :3000<br/><i>SSR + browser RUM</i>"]
ebit_admin_fe["ebit-admin-fe<br/><i>browser RUM only<br/>(host network)</i>"]
loadgen["Playwright / k6<br/><i>tests-e2e + tests-perf</i>"]
end
docker_logs[("/var/lib/docker/containers<br/>JSON log files (stdout)")]
subgraph coll["otel-collector :4317/:4318"]
direction TB
rec_otlp["receivers.otlp<br/><i>http :4318 + grpc :4317<br/>CORS: localhost 3000/3001/3003</i>"]
rec_filelog["receivers.filelog/docker<br/><i>scrape JSON log files</i>"]
end
%% OTLP ingress
nest -- "OTLP/HTTP<br/>traces + logs" --> rec_otlp
ebit_fe -- "OTLP/HTTP<br/>SSR via bridge net<br/>browser via host :4318" --> rec_otlp
ebit_admin_fe -- "OTLP/HTTP<br/>host-network mode" --> rec_otlp
loadgen -- "browser RUM via @vercel/otel<br/>or k6 native" --> rec_otlp
%% Stdout / filelog path (winston/pino EvoLogger records)
nest -- "stdout JSON" --> docker_logs
docker_logs --> rec_filelog
The Nest apps are collapsed into one node because the edge type is identical for all five (OTLP/HTTP traces + logs over the bridge network, same OTEL_EXPORTER_OTLP_ENDPOINT). The frontends are split out because their transport differs (SSR via bridge vs browser via host, plus admin-fe runs in host-network mode).
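The transport split can be sketched as a small endpoint resolver. This is only an illustration: the helper names and both hostnames are assumptions, and only port 4318 comes from the diagram. The /v1/traces suffix is the standard OTLP/HTTP path for the traces signal.

```typescript
// Hypothetical helper, not taken from the repo: picks the OTLP/HTTP base URL
// depending on where the emitting code runs.
type Runtime = 'server' | 'browser';

// ASSUMPTION: both hostnames are illustrative. Only the port (4318) is from the doc.
const BRIDGE_ENDPOINT = 'http://otel-collector:4318'; // container-to-container (bridge network)
const HOST_ENDPOINT = 'http://localhost:4318'; // browser, via the port published on the host

function resolveOtlpBase(runtime: Runtime): string {
  // Browsers cannot resolve bridge-network container names, so they go via the host.
  return runtime === 'browser' ? HOST_ENDPOINT : BRIDGE_ENDPOINT;
}

function tracesUrl(base: string): string {
  // OTLP/HTTP appends a per-signal path to the base endpoint.
  return base.replace(/\/+$/, '') + '/v1/traces';
}

console.log(tracesUrl(resolveOtlpBase('server')));
console.log(tracesUrl(resolveOtlpBase('browser')));
```

Browser-origin requests also have to pass the collector's CORS allow-list shown on the receivers.otlp node.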
B. Collector pipeline → backends → Grafana¶
flowchart LR
subgraph coll["otel-collector internal pipeline"]
direction TB
rec["receivers<br/><i>otlp + filelog/docker</i>"]
proc["processors<br/><i>memory_limiter 512MiB<br/>batch 5s / 1024</i>"]
conn["connectors.spanmetrics<br/><i>buckets 5ms..5s<br/>dims: db.system, db.operation,<br/>db.sql.table, prisma.model, prisma.method</i>"]
exp_jaeger["exporters.otlphttp/jaeger"]
exp_prom["exporters.prometheus :8889"]
exp_loki["exporters.loki"]
rec --> proc
proc --> exp_jaeger
proc --> conn
conn --> exp_prom
proc --> exp_prom
proc --> exp_loki
end
subgraph backends["Telemetry backends"]
direction TB
jaeger[("Jaeger v2.17.0 :16686<br/>Badger /opt/jaeger-data")]
prom[("Prometheus 2.55 :9090")]
loki[("Loki 3.2 :3100")]
end
grafana["Grafana 11.3 :3003<br/><i>provisioned datasources</i>"]
exp_jaeger --> jaeger
exp_prom --> prom
exp_loki --> loki
jaeger --> grafana
prom --> grafana
loki --> grafana
The processor stage feeds both the direct exporters (otlphttp/jaeger, prometheus, loki) and the spanmetrics connector, which derives RED metrics (rate, errors, duration) from incoming spans and feeds the Prometheus exporter alongside the regular metrics pipeline. The two arrows into exp_prom are intentional: direct app metrics and derived spanmetrics share one Prometheus scrape endpoint.
Why the pipeline looks like this¶
Pino, not winston, is the logger that reaches Loki¶
Per memory project_evologger_trace_correlation.md, all five ebit-api
services run a two-logger setup:
- nestjs-pino is the Nest framework logger. It is wired via NestLoggerModule.forRoot({ serviceName }) from libs/shared/src/logger/pino-logger.module.ts, then app.useLogger(app.get(Logger)) in base.main.ts, so Nest lifecycle + HTTP request logs are pino JSON to stdout.
- @bebkovan/server-core's EvoLogger still backs the ~40 app-code call sites that use EvoLogger.log/debug/error(...). It is backed by winston.
@opentelemetry/instrumentation-pino 0.60.0 is registered explicitly in
pre-otel.main.ts with logKeys: { traceId: 'trace_id', spanId: 'span_id',
traceFlags: 'trace_flags' }. The default pino hook that
getNodeAutoInstrumentations would otherwise register is disabled
('@opentelemetry/instrumentation-pino': { enabled: false }) so only our
configured instance runs. Pino records are bridged into OTel's logs API and
shipped via OTLP.
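To make the effect of the logKeys mapping concrete, here is a minimal sketch of the field injection it configures. It does not call the real instrumentation: withTraceFields and SpanContextLike are hypothetical names, and rendering the flags as two-digit hex is an assumption about the output format.

```typescript
// Stand-in for the active span context; field names mirror the OTel SpanContext.
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
}

interface LogKeys {
  traceId: string;
  spanId: string;
  traceFlags: string;
}

// The key names configured in pre-otel.main.ts according to this page.
const LOG_KEYS: LogKeys = { traceId: 'trace_id', spanId: 'span_id', traceFlags: 'trace_flags' };

// Hypothetical helper: returns a copy of the log record with trace fields added.
function withTraceFields(
  record: Record<string, unknown>,
  ctx: SpanContextLike | undefined,
  keys: LogKeys = LOG_KEYS,
): Record<string, unknown> {
  // Outside an active span there is nothing to correlate, so the record is unchanged.
  if (!ctx) return record;
  return {
    ...record,
    [keys.traceId]: ctx.traceId,
    [keys.spanId]: ctx.spanId,
    // ASSUMPTION: flags are rendered as two-digit hex (sampled = "01").
    [keys.traceFlags]: ('0' + ctx.traceFlags.toString(16)).slice(-2),
  };
}

const enriched = withTraceFields(
  { level: 30, msg: 'bet placed' },
  { traceId: '8aaa902b3964af1d33dec7000bb36e02', spanId: '00f067aa0ba902b7', traceFlags: 1 },
);
console.log(JSON.stringify(enriched));
```

Records carrying these three fields are what make the correlate-by-trace_id lookups in Loki possible.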
WinstonInstrumentation stays enabled in getNodeAutoInstrumentations
defaults so EvoLogger records carry the same three trace fields, but those
records reach Loki only via the filelog receiver scraping
/var/lib/docker/containers/*/*.log. They show up tagged with
source: docker_filelog (vs OTLP-bridged pino records which carry the
proper service.name resource attribute). Per the same memory:
don't remove EvoLogger — migrating 40+ call sites for no signal gain
was explicitly rejected.
spanmetrics connector exists because Prisma + ioredis emit spans only¶
Per memory project_otel_spanmetrics_connector.md, several offenders emit
spans but never histogram metrics:
- @prisma/instrumentation 6.5.0 → prisma:client:operation, prisma:engine:db_query
- @opentelemetry/instrumentation-ioredis → per-command spans (SET/GET/EVALSHA etc.)
- @opentelemetry/instrumentation-bullmq → queue job spans
- All @opentelemetry/instrumentation-* DB-family instrumentations, by design
The OTel-idiomatic fix is the spanmetrics connector (see config in
observability/otel-collector.yml). It sits in the traces pipeline as an
exporter and in the metrics pipeline as a receiver, deriving
traces_spanmetrics_calls_total + traces_spanmetrics_duration_milliseconds_bucket/_sum/_count
into the Prometheus pipeline. Configured dimensions in this stack:
db.system, db.operation, db.sql.table, prisma.model, prisma.method.
Dashboard PromQL pattern (note: spanmetrics series, not Prisma-native):
# rate of prisma operations
rate(traces_spanmetrics_calls_total{span_name=~"prisma:.*"}[5m])
# p95 latency by span_name
histogram_quantile(0.95,
sum by (le, span_name) (
rate(traces_spanmetrics_duration_milliseconds_bucket{span_name=~"prisma:.*"}[5m])
)
)
If a dashboard JSON references prisma_* or db_client_* metrics directly,
it's wrong — rewrite to traces_spanmetrics_*.
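As a rough model of what the connector does with each finished span, the sketch below derives a call counter and a cumulative duration histogram per span name. It is a simplification, not the connector's implementation: the bucket boundaries between 5 ms and 5 s are assumed (this page only states the range), and dimensions other than the span name are left out.

```typescript
interface FinishedSpan {
  name: string;
  durationMs: number;
}

interface SeriesState {
  calls: number; // analogous to traces_spanmetrics_calls_total
  sumMs: number; // analogous to ..._duration_milliseconds_sum
  bucketCounts: number[]; // cumulative counts per "le" bound; last slot is +Inf
}

// ASSUMPTION: illustrative boundaries; only the 5ms..5s range is stated in the doc.
const BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

function deriveSpanMetrics(
  spans: FinishedSpan[],
  bounds: number[] = BOUNDS_MS,
): Map<string, SeriesState> {
  const out = new Map<string, SeriesState>();
  for (const span of spans) {
    const series: SeriesState = out.get(span.name) ?? {
      calls: 0,
      sumMs: 0,
      bucketCounts: new Array<number>(bounds.length + 1).fill(0),
    };
    out.set(span.name, series);
    series.calls += 1;
    series.sumMs += span.durationMs;
    // Prometheus histogram buckets are cumulative: a span counts toward every
    // bucket whose upper bound is >= its duration.
    for (let i = 0; i < bounds.length; i++) {
      if (span.durationMs <= bounds[i]) series.bucketCounts[i] += 1;
    }
    series.bucketCounts[bounds.length] += 1; // the +Inf bucket counts everything
  }
  return out;
}

const metrics = deriveSpanMetrics([
  { name: 'prisma:client:operation', durationMs: 3 },
  { name: 'prisma:client:operation', durationMs: 40 },
  { name: 'prisma:engine:db_query', durationMs: 8 },
]);
console.log(metrics.get('prisma:client:operation'));
```

Because the series are keyed by span name, dashboards can only filter on attributes that were configured as dimensions; anything else is lost at this step.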
tail_sampling is enabled in monitoring¶
The monitoring VM's user-data ships a Collector config with
tail_sampling enabled (see terraform/perf-monitoring/user-data.sh). The
local-stack collector here does not tail-sample — local dev keeps every
trace so the flow docs can pin specific trace IDs (e.g.
8aaa902b3964af1d33dec7000bb36e02 referenced in
../flows/dropbet-bet-place.md). In the
perf VM the sampler keeps slow-tail traces (latency-decision policy) so
storage doesn't explode under k6 load.
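The idea behind a latency-decision policy can be sketched as follows: group spans by trace, measure each trace from its earliest span start to its latest span end, and keep only the traces at or above a threshold. The 500 ms threshold and all names here are hypothetical; the actual values live in the perf VM's collector config.

```typescript
interface SpanTiming {
  traceId: string;
  startMs: number;
  endMs: number;
}

// ASSUMPTION: illustrative threshold, not read from user-data.sh.
const LATENCY_THRESHOLD_MS = 500;

// Returns the IDs of traces whose overall duration reaches the threshold.
function tracesToKeep(
  spans: SpanTiming[],
  thresholdMs: number = LATENCY_THRESHOLD_MS,
): Set<string> {
  const bounds = new Map<string, { start: number; end: number }>();
  for (const span of spans) {
    const b = bounds.get(span.traceId);
    if (!b) {
      bounds.set(span.traceId, { start: span.startMs, end: span.endMs });
    } else {
      // Trace duration spans the earliest start to the latest end,
      // regardless of gaps between individual spans.
      b.start = Math.min(b.start, span.startMs);
      b.end = Math.max(b.end, span.endMs);
    }
  }
  const keep = new Set<string>();
  bounds.forEach((b, traceId) => {
    if (b.end - b.start >= thresholdMs) keep.add(traceId);
  });
  return keep;
}

const kept = tracesToKeep([
  { traceId: 'slow', startMs: 0, endMs: 100 },
  { traceId: 'slow', startMs: 50, endMs: 650 },
  { traceId: 'fast', startMs: 0, endMs: 120 },
]);
console.log(Array.from(kept));
```

The decision needs the whole trace, which is why it has to happen in the collector after spans are buffered rather than in the SDK at span start.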
Trace blind spots¶
The four known propagation gaps. See
../audits/perf-trace-coverage-audit.md for the
endpoint-by-endpoint coverage matrix.
1. WebSocket /events — zero connection coverage¶
ClientGateway at apps/rt/src/gateway/client.gateway.ts:48 uses
@WebSocketGateway({ namespace: 'events', transports: ['websocket'] }).
Socket.io is not covered by HttpInstrumentation. There are:
- no auto-spans for connection (no WS upgrade span)
- no auto-spans for extractSocketAuthToken / authorizeClient
- no auto-spans for the auth RPC into ebit-api over Redis pub/sub
- no auto-spans for client → server messages
Only the send /events span (one per emit) appears. The fix would be a manual span around
ClientGateway.handleConnection() plus an interceptor-style wrapper around
emitEvent.
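A manual span wrapper of the kind described could look roughly like this. To stay self-contained, the sketch uses minimal stand-in interfaces (SpanLike, TracerLike) rather than the real @opentelemetry/api types, and the span name and attribute key are made up for illustration. In the gateway, the real tracer would take the place of RecordingTracer.

```typescript
// Minimal stand-ins for the parts of a tracer this pattern needs (hypothetical types).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): void;
  recordException(err: Error): void;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Runs fn inside a span, records a thrown error, and always ends the span.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  fn: (span: SpanLike) => Promise<T>,
): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    return await fn(span);
  } catch (err) {
    span.recordException(err instanceof Error ? err : new Error(String(err)));
    throw err; // rethrow so the caller's error handling is unchanged
  } finally {
    span.end();
  }
}

interface RecordedSpan {
  name: string;
  attributes: Record<string, unknown>;
  errors: string[];
}

// In-memory tracer used only to show the wrapper's behaviour.
class RecordingTracer implements TracerLike {
  readonly finished: RecordedSpan[] = [];

  startSpan(name: string): SpanLike {
    const rec: RecordedSpan = { name, attributes: {}, errors: [] };
    return {
      setAttribute: (key, value) => {
        rec.attributes[key] = value;
      },
      recordException: (err) => {
        rec.errors.push(err.message);
      },
      end: () => {
        this.finished.push(rec);
      },
    };
  }
}

const demoTracer = new RecordingTracer();
// ASSUMPTION: span name and attribute key are illustrative.
withSpan(demoTracer, 'ws.connect /events', async (span) => {
  span.setAttribute('ws.namespace', 'events');
  return 'authorized';
}).then(() => console.log(demoTracer.finished));
```

The same wrapper shape applies to any handler that currently runs without a span, since it changes neither the return value nor the thrown error.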
2. ExternalControllerClient — Redis pub/sub strips traceparent¶
Per memory project_otel_microservice_transport_gap.md, the speed-roulette
three-hop bet (api HTTP → sr microservice → walletClient back to api)
surfaces in Jaeger as three uncorrelated traces. Verified empirically
with 3a6e6edd6d57791d30ac82cfe7f6b774: 25 spans, services=["ebit-api"]
only, 76 ms gap between the last publish (RPC send) and the unlink
(PlaceBetLock release) — during which sr does its entire placeBet work
under its own root spans.
The @ExternalControllerClient decorator wires a Nest ClientProxy over
Redis pub/sub transport. The @nestjs/microservices auto-instrumentation
does not inject traceparent into the pub/sub envelope as of 2026-04. The
ioredis publish span terminates the propagation chain. Systemic — affects
any future apps/<x> app, not just speed-roulette.
When documenting any flow that crosses @ExternalControllerClient, expect
the Jaeger trace to show only the HTTP-entry service. Anchor E2E
assertions on the entry-point root span only.
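For reference, closing this gap would mean carrying a W3C traceparent value inside the pub/sub message itself. The sketch below shows only the string handling for that, under stated assumptions: the Envelope shape, the traceparent field, and the placeBet pattern name are hypothetical and do not describe what @nestjs/microservices sends today.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string; // 16 lowercase hex chars
  sampled: boolean;
}

// Hypothetical message envelope with an extra field for the propagated context.
interface Envelope<T> {
  pattern: string;
  data: T;
  traceparent?: string;
}

// W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>".
function formatTraceparent(ctx: TraceContext): string {
  return '00-' + ctx.traceId + '-' + ctx.spanId + '-' + (ctx.sampled ? '01' : '00');
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | undefined {
  const match = header ? TRACEPARENT_RE.exec(header) : null;
  if (!match) return undefined;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Publisher side: attach the current context before serializing.
function inject<T>(envelope: Envelope<T>, ctx: TraceContext): Envelope<T> {
  return { ...envelope, traceparent: formatTraceparent(ctx) };
}

// Subscriber side: recover the parent context, if any, after deserializing.
function extract<T>(envelope: Envelope<T>): TraceContext | undefined {
  return parseTraceparent(envelope.traceparent);
}

const parent: TraceContext = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true,
};
// Redis pub/sub carries strings, so the envelope round-trips through JSON.
const wire = JSON.stringify(inject({ pattern: 'placeBet', data: { amount: 1 } }, parent));
console.log(extract(JSON.parse(wire)));
```

With a recovered parent context the subscriber could start its span as a child instead of a new root, which is what would merge the three uncorrelated traces into one.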
3. BullMQ producer→consumer¶
Same shape as #2 but lives entirely inside one app. The EVALSHA enqueue
span is parented to the HTTP request, but BetQueueProcessor.process() (and
every other consumer) starts a new orphan trace. Affects:
- bet_settled_queue: bet side effects (live-bets emit, user stats, rakeback, leaderboard, affiliate, gamesService.handleBet)
- updateSessionQueue: session metadata write-back
- bot system queues
- promo / challenge / leaderboard / user-stats migration / skindeck deposits / both speed-roulette queues
Per the OTel transport-gap memory, this loss is expected and acceptable for queue-driven state transitions inside a single service, because they aren't meant to be trace-children of the enqueue call. The fix has been declined.
4. DiceService.play / BetService.createAndSettleBet have no service span¶
Per ../audits/perf-trace-coverage-audit.md,
neither DiceService.play() (apps/api/src/casino/games/house/dice/dice.service.ts:34)
nor BetService.createAndSettleBet() (apps/api/src/bet/bet.service.ts:559)
carry manual spans. The trace jumps from HTTP controller → Prisma
$transaction with no service-level attribution. Under load you see the
total request time and the DB time, but cannot distinguish RNG computation,
balance-check logic, or transaction-assembly cost. Recommended fix exists
in the audit doc; not yet applied.
What this means for query patterns¶
| You want to know… | Use… | Why |
|---|---|---|
| End-to-end latency of a sign-in | Jaeger trace search service=ebit-api op=POST /auth/sign-in | Browser → api hop is propagated since task #27 |
| End-to-end latency of a dice bet | Same; trace 8aaa902b3964af1d33dec7000bb36e02 is the worked example | Same |
| Per-Prisma-model RED metrics | PromQL traces_spanmetrics_* filtered on span_name=~"prisma:.*" | spanmetrics-derived; native metrics don't exist |
| Why is rt's UsersOnlineUpdated flapping | Loki {service_name="ebit-rt"} + correlate by trace_id to the api-side online-tracker.service.ts 10 s cron | Rt's emit is a single send /events span; no further auto-coverage |
| Did the bj orphan service do anything | Jaeger service=ebit-bj will be empty for in-repo traffic | Per project_ebit_bj_orphan.md no FE reaches it |
| Did FastTrack receive my bet | Don't ask Jaeger; check the RabbitMQ UI :15672 ft vhost (it'll be empty too) | Producer is stubbed (disabled=true), 11 silent drops |