
Tracing flow

How OTel data moves from instrumented processes through the Collector into Jaeger, Prometheus, and Loki, and finally into Grafana panels, and where that flow has blind spots.

Generated 2026-04-25. Source of truth for the pipeline shape: observability/otel-collector.yml. Service-side bootstrap lives in ebit-api/libs/shared/src/basic/pre/pre-otel.main.ts for api/rt/sr and is loaded via NODE_OPTIONS=--require .../register.js for bj/bo. admin-fe is now a Vite SPA (no SSR), so traces from the admin UI originate in the browser only.

Pipeline diagram

Two diagrams. A shows what emits telemetry and how it reaches the otel-collector (ingress). B shows the collector's internal pipeline and how processed data lands in each backend + Grafana. They share the otel-collector :4317/:4318 boundary — A ends where B begins.

A. Telemetry ingress — emitters → otel-collector

flowchart LR
    subgraph emitters["Telemetry sources"]
        direction TB
        nest["All 5 Nest apps<br/><i>api · rt · bj · bo · sr</i>"]
        ebit_fe["ebit-fe :3000<br/><i>SSR + browser RUM</i>"]
        ebit_admin_fe["ebit-admin-fe<br/><i>browser RUM only<br/>(host network)</i>"]
        loadgen["Playwright / k6<br/><i>tests-e2e + tests-perf</i>"]
    end

    docker_logs[("/var/lib/docker/containers<br/>JSON log files (stdout)")]

    subgraph coll["otel-collector :4317/:4318"]
        direction TB
        rec_otlp["receivers.otlp<br/><i>http :4318 + grpc :4317<br/>CORS: localhost 3000/3001/3003</i>"]
        rec_filelog["receivers.filelog/docker<br/><i>scrape JSON log files</i>"]
    end

    %% OTLP ingress
    nest          -- "OTLP/HTTP<br/>traces + logs"            --> rec_otlp
    ebit_fe       -- "OTLP/HTTP<br/>SSR via bridge net<br/>browser via host :4318" --> rec_otlp
    ebit_admin_fe -- "OTLP/HTTP<br/>host-network mode"        --> rec_otlp
    loadgen       -- "browser RUM via @vercel/otel<br/>or k6 native" --> rec_otlp

    %% Stdout / filelog path (winston/pino EvoLogger records)
    nest -- "stdout JSON" --> docker_logs
    docker_logs --> rec_filelog

The Nest apps are collapsed into one node because the edge type is identical for all five (OTLP/HTTP traces + logs over the bridge network, same OTEL_EXPORTER_OTLP_ENDPOINT). The frontends are split out because their transport differs: ebit-fe sends SSR telemetry via the bridge network and browser telemetry via the host port, while admin-fe runs in host-network mode.
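As a minimal sketch of the transport split described above, the helper below picks an OTLP/HTTP URL per runtime. Both base URLs are assumptions for illustration (the hostname `otel-collector` and `localhost` are placeholders; the real value comes from OTEL_EXPORTER_OTLP_ENDPOINT). Only the per-signal paths are standard OTLP/HTTP.

```typescript
// Signals accepted by the collector's OTLP/HTTP receiver.
type Signal = "traces" | "logs" | "metrics";

// ASSUMPTION: placeholder base URLs, not taken from the real compose files.
const BRIDGE_BASE = "http://otel-collector:4318"; // server-side code on the bridge network
const HOST_BASE = "http://localhost:4318";        // browser code hitting the published host port

function otlpHttpUrl(runtime: "server" | "browser", signal: Signal): string {
  const base = runtime === "server" ? BRIDGE_BASE : HOST_BASE;
  // OTLP/HTTP uses one path per signal: /v1/traces, /v1/logs, /v1/metrics.
  return `${base}/v1/${signal}`;
}
```

Browser-originated requests additionally depend on the receiver's CORS allow-list (localhost 3000/3001/3003 in diagram A); server-side requests do not.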

B. Collector pipeline → backends → Grafana

flowchart LR
    subgraph coll["otel-collector internal pipeline"]
        direction TB
        rec["receivers<br/><i>otlp + filelog/docker</i>"]
        proc["processors<br/><i>memory_limiter 512MiB<br/>batch 5s / 1024</i>"]
        conn["connectors.spanmetrics<br/><i>buckets 5ms..5s<br/>dims: db.system, db.operation,<br/>db.sql.table, prisma.model, prisma.method</i>"]
        exp_jaeger["exporters.otlphttp/jaeger"]
        exp_prom["exporters.prometheus :8889"]
        exp_loki["exporters.loki"]

        rec --> proc
        proc --> exp_jaeger
        proc --> conn
        conn --> exp_prom
        proc --> exp_prom
        proc --> exp_loki
    end

    subgraph backends["Telemetry backends"]
        direction TB
        jaeger[("Jaeger v2.17.0 :16686<br/>Badger /opt/jaeger-data")]
        prom[("Prometheus 2.55 :9090")]
        loki[("Loki 3.2 :3100")]
    end

    grafana["Grafana 11.3 :3003<br/><i>provisioned datasources</i>"]

    exp_jaeger --> jaeger
    exp_prom   --> prom
    exp_loki   --> loki

    jaeger --> grafana
    prom   --> grafana
    loki   --> grafana

The processor stage feeds both the direct exporters (otlphttp/jaeger, prometheus, loki) and the spanmetrics connector, which derives RED metrics (rate, errors, duration) from incoming spans and feeds the Prometheus exporter alongside the regular metrics pipeline. The two arrows into exp_prom are intentional: direct app metrics and derived spanmetrics share one Prometheus scrape endpoint.
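The wiring in diagram B can be modelled as plain data. The sketch below is an illustration only: the pipeline names and processor lists are assumptions inferred from the diagram, and observability/otel-collector.yml remains the source of truth. It also checks the one structural rule that matters here, namely that a connector has to appear as an exporter in one pipeline and as a receiver in another.

```typescript
interface Pipeline {
  receivers: string[];
  processors: string[];
  exporters: string[];
}
type Pipelines = Record<string, Pipeline>;

// ASSUMPTION: shape inferred from diagram B, not copied from the real config.
const pipelines: Pipelines = {
  traces: {
    receivers: ["otlp"],
    processors: ["memory_limiter", "batch"],
    exporters: ["otlphttp/jaeger", "spanmetrics"], // connector used as an exporter
  },
  metrics: {
    receivers: ["otlp", "spanmetrics"], // connector used as a receiver
    processors: ["memory_limiter", "batch"],
    exporters: ["prometheus"],
  },
  logs: {
    receivers: ["otlp", "filelog/docker"],
    processors: ["memory_limiter", "batch"],
    exporters: ["loki"],
  },
};

// Returns connectors that are not both exported to and received from.
function danglingConnectors(p: Pipelines, connectors: string[]): string[] {
  const all = Object.keys(p).map((name) => p[name]);
  return connectors.filter((c) => {
    const exported = all.some((x) => x.exporters.indexOf(c) !== -1);
    const received = all.some((x) => x.receivers.indexOf(c) !== -1);
    return !(exported && received);
  });
}
```

Calling `danglingConnectors(pipelines, ["spanmetrics"])` on the shape above returns an empty list; removing `spanmetrics` from the metrics receivers would make it report the connector.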

Why the pipeline looks like this

Pino, not winston, is the logger that reaches Loki

Per memory project_evologger_trace_correlation.md, all five ebit-api services run a two-logger setup:

  • nestjs-pino is the Nest framework logger. It is wired via NestLoggerModule.forRoot({ serviceName }) from libs/shared/src/logger/pino-logger.module.ts, then app.useLogger(app.get(Logger)) in base.main.ts, so Nest lifecycle + HTTP request logs are emitted as pino JSON on stdout.
  • @bebkovan/server-core's EvoLogger still serves the ~40 app-code call sites that use EvoLogger.log/debug/error(...). It is backed by winston.

@opentelemetry/instrumentation-pino 0.60.0 is registered explicitly in pre-otel.main.ts with logKeys: { traceId: 'trace_id', spanId: 'span_id', traceFlags: 'trace_flags' }. The default pino hook in the auto-instrumentation bundle is disabled ('@opentelemetry/instrumentation-pino': { enabled: false }) so only our configured instance runs. Pino records bridge into OTel's logs API and ship via OTLP.

WinstonInstrumentation stays enabled in the getNodeAutoInstrumentations defaults so EvoLogger records carry the same three trace fields, but those records reach Loki only via the filelog receiver scraping /var/lib/docker/containers/*/*.log. They show up tagged with source: docker_filelog, unlike OTLP-bridged pino records, which carry the proper service.name resource attribute. Per the same memory: don't remove EvoLogger. Migrating 40+ call sites for no signal gain was explicitly rejected.
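To make the filelog path concrete, here is a minimal sketch of recovering the three trace fields from one line of a Docker json-file log. It assumes the application wrote a single JSON object to stdout carrying `trace_id`, `span_id` and `trace_flags` as strings; the function name is hypothetical and this is not the collector's actual parsing logic.

```typescript
interface TraceFields {
  trace_id: string;
  span_id: string;
  trace_flags: string;
}

// Docker's json-file driver wraps each stdout line as
// {"log":"<line>\n","stream":"stdout","time":"..."}.
function traceFieldsFromDockerLine(line: string): TraceFields | null {
  let inner: unknown;
  try {
    const outer = JSON.parse(line) as { log?: unknown };
    if (typeof outer.log !== "string") return null;
    inner = JSON.parse(outer.log); // trailing "\n" is valid JSON whitespace
  } catch {
    return null; // not JSON, e.g. a plain-text startup banner
  }
  if (typeof inner !== "object" || inner === null) return null;
  const rec = inner as Record<string, unknown>;
  const { trace_id, span_id, trace_flags } = rec;
  if (
    typeof trace_id !== "string" ||
    typeof span_id !== "string" ||
    typeof trace_flags !== "string"
  ) {
    return null; // record was logged outside an active span
  }
  return { trace_id, span_id, trace_flags };
}
```

Records that return `null` here are the ones that cannot be joined to a Jaeger trace from Loki, whichever logger produced them.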

spanmetrics connector exists because Prisma + ioredis emit spans only

Per memory project_otel_spanmetrics_connector.md, several offenders emit spans but never histogram metrics:

  • @prisma/instrumentation 6.5.0 → prisma:client:operation, prisma:engine:db_query
  • @opentelemetry/instrumentation-ioredis → per-command spans (SET/GET/EVALSHA etc.)
  • @opentelemetry/instrumentation-bullmq → queue job spans
  • All @opentelemetry/instrumentation-* DB-family instrumentations, by design

The OTel-idiomatic fix is the spanmetrics connector (see config in observability/otel-collector.yml). It sits in the traces pipeline as an exporter and in the metrics pipeline as a receiver, deriving traces_spanmetrics_calls_total + traces_spanmetrics_duration_milliseconds_bucket/_sum/_count into the Prometheus pipeline. Configured dimensions in this stack: db.system, db.operation, db.sql.table, prisma.model, prisma.method.
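Conceptually, the connector turns each finished span into one counter increment and one histogram observation, keyed by span name plus the configured dimensions. The sketch below illustrates that idea only; it is not the connector's implementation. The intermediate bucket bounds are assumptions (the source states only the 5 ms to 5 s range), and the type and function names are hypothetical.

```typescript
interface FinishedSpan {
  name: string;
  durationMs: number;
  attributes: Record<string, string>;
}

// ASSUMPTION: only the 5 ms and 5 s endpoints come from the doc.
const BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000];
const DIMENSIONS = ["db.system", "db.operation", "db.sql.table", "prisma.model", "prisma.method"];

interface Series {
  calls: number;          // analogous to *_calls_total
  sumMs: number;          // analogous to *_duration_milliseconds_sum
  bucketCounts: number[]; // cumulative; the last slot is the +Inf bucket
}

function deriveSpanMetrics(spans: FinishedSpan[]): Record<string, Series> {
  const out: Record<string, Series> = {};
  for (const span of spans) {
    // Build the label set: span name plus whichever dimensions are present.
    const labels = [`span_name=${span.name}`];
    for (const dim of DIMENSIONS) {
      const value = span.attributes[dim];
      if (value !== undefined) labels.push(`${dim}=${value}`);
    }
    const key = labels.join(",");
    let series = out[key];
    if (!series) {
      const counts: number[] = [];
      for (let i = 0; i <= BUCKETS_MS.length; i++) counts.push(0);
      series = out[key] = { calls: 0, sumMs: 0, bucketCounts: counts };
    }
    series.calls += 1;
    series.sumMs += span.durationMs;
    // Prometheus buckets are cumulative: bump every bound that covers the duration.
    for (let i = 0; i < BUCKETS_MS.length; i++) {
      if (span.durationMs <= BUCKETS_MS[i]) series.bucketCounts[i] += 1;
    }
    series.bucketCounts[BUCKETS_MS.length] += 1; // +Inf always counts
  }
  return out;
}
```

Each added dimension multiplies the number of series, which is why the configured list is kept short.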

Dashboard PromQL pattern (note: spanmetrics series, not Prisma-native):

# rate of prisma operations
rate(traces_spanmetrics_calls_total{span_name=~"prisma:.*"}[5m])

# p95 latency by span_name
histogram_quantile(0.95,
  sum by (le, span_name) (
    rate(traces_spanmetrics_duration_milliseconds_bucket{span_name=~"prisma:.*"}[5m])
  )
)

If a dashboard JSON references prisma_* or db_client_* metrics directly, it's wrong: those native series do not exist in this stack. Rewrite the query against traces_spanmetrics_*.
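A check for that rule could look like the sketch below. It assumes the common Grafana dashboard layout in which each panel has a `targets` array with an `expr` string and rows may nest further `panels`; the function name and the output format are hypothetical.

```typescript
interface Target { expr?: string }
interface Panel { title?: string; targets?: Target[]; panels?: Panel[] }
interface Dashboard { panels?: Panel[] }

// Matches metric names that start with prisma_ or db_client_.
// The leading \b keeps names like traces_spanmetrics_* from matching.
const NATIVE_METRIC = /\b(prisma_|db_client_)\w+/g;

function findNativeMetricRefs(dashboard: Dashboard): string[] {
  const hits: string[] = [];
  const visit = (panels: Panel[] = []): void => {
    for (const panel of panels) {
      for (const target of panel.targets ?? []) {
        const matches = (target.expr ?? "").match(NATIVE_METRIC);
        if (matches) {
          for (const name of matches) {
            hits.push(`${panel.title ?? "(untitled)"}: ${name}`);
          }
        }
      }
      visit(panel.panels); // collapsed rows nest their panels
    }
  };
  visit(dashboard.panels);
  return hits;
}
```

An empty result means no panel queries a series that does not exist; any hit names the panel and the offending metric.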

tail_sampling is enabled on the monitoring VM, not locally

The monitoring VM's user-data ships a Collector config with tail_sampling enabled (see terraform/perf-monitoring/user-data.sh). The local-stack collector does not tail-sample: local dev keeps every trace so the flow docs can pin specific trace IDs (e.g. 8aaa902b3964af1d33dec7000bb36e02, referenced in ../flows/dropbet-bet-place.md). On the perf VM the sampler keeps slow-tail traces (latency-decision policy) so storage doesn't explode under k6 load.
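The idea behind a latency-decision policy can be sketched in a few lines: buffer the spans of a trace, measure from the earliest start to the latest end, and keep the trace when that duration reaches a threshold. The threshold value and all names below are hypothetical; the real policy settings live in terraform/perf-monitoring/user-data.sh.

```typescript
interface SpanTiming {
  traceId: string;
  startMs: number;
  endMs: number;
}

// Returns the trace IDs whose end-to-end duration is at or above thresholdMs.
function keepSlowTraces(spans: SpanTiming[], thresholdMs: number): string[] {
  const bounds: Record<string, { start: number; end: number }> = {};
  for (const span of spans) {
    const b = bounds[span.traceId];
    if (!b) {
      bounds[span.traceId] = { start: span.startMs, end: span.endMs };
    } else {
      // Widen the window to cover every span seen for this trace.
      b.start = Math.min(b.start, span.startMs);
      b.end = Math.max(b.end, span.endMs);
    }
  }
  return Object.keys(bounds).filter(
    (id) => bounds[id].end - bounds[id].start >= thresholdMs,
  );
}
```

Because the decision needs the whole trace, it can only be made after spans have been buffered, which is why this happens in the Collector rather than in the SDK.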

Trace blind spots

The four known gaps in trace coverage are listed below. See ../audits/perf-trace-coverage-audit.md for the endpoint-by-endpoint coverage matrix.

1. WebSocket /events — zero connection coverage

ClientGateway at apps/rt/src/gateway/client.gateway.ts:48 uses @WebSocketGateway({ namespace: 'events', transports: ['websocket'] }). Socket.io is not covered by HttpInstrumentation. There are:

  • no auto-spans for connection (no WS upgrade span)
  • no auto-spans for extractSocketAuthToken / authorizeClient
  • no auto-spans for the auth RPC into ebit-api over Redis pub/sub
  • no auto-spans for client → server messages

Only the send /events span (one per emit) appears. The fix would be a manual span around ClientGateway.handleConnection() plus an interceptor-style wrapper around emitEvent.
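A manual span of that kind usually follows one pattern: start the span, run the work, record any failure, and always end the span. The sketch below shows the pattern with no dependencies. `SpanLike` and `TracerLike` are stand-ins modelled on the `Span` and `Tracer` interfaces of @opentelemetry/api, and `withSpan`, the span name and the attribute key are hypothetical.

```typescript
// Dependency-free stand-ins modelled on @opentelemetry/api.
interface SpanLike {
  setAttribute(key: string, value: string): unknown;
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}
interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

const STATUS_ERROR = 2; // numeric value of SpanStatusCode.ERROR in @opentelemetry/api

async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: Record<string, string>,
  work: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    for (const key of Object.keys(attributes)) {
      span.setAttribute(key, attributes[key]);
    }
    try {
      return await work();
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      throw err; // keep the original failure visible to the caller
    } finally {
      span.end(); // always close the span, on success or failure
    }
  });
}
```

In the gateway this would wrap the body of handleConnection(), for example `withSpan(tracer, "ws /events connect", { "socket.id": client.id }, () => this.authorizeClient(client))`, where the tracer instance and the exact call shape are assumptions.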

2. ExternalControllerClient — Redis pub/sub strips traceparent

Per memory project_otel_microservice_transport_gap.md, the speed-roulette three-hop bet (api HTTP → sr microservice → walletClient back to api) surfaces in Jaeger as three uncorrelated traces. Verified empirically with 3a6e6edd6d57791d30ac82cfe7f6b774: 25 spans, services=["ebit-api"] only, 76 ms gap between the last publish (RPC send) and the unlink (PlaceBetLock release) — during which sr does its entire placeBet work under its own root spans.

The @ExternalControllerClient decorator wires a Nest ClientProxy over the Redis pub/sub transport. The @nestjs/microservices auto-instrumentation does not inject traceparent into the pub/sub envelope as of 2026-04, so the ioredis publish span terminates the propagation chain. The gap is systemic: it affects any future apps/<x> app, not just speed-roulette.

When documenting any flow that crosses @ExternalControllerClient, expect the Jaeger trace to show only the HTTP-entry service. Anchor E2E assertions on the entry-point root span only.
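For reference, this is what carrying context across such a hop involves: the publisher serializes its span context into a W3C `traceparent` string and the subscriber parses it back. The sketch below is illustrative only. The `Envelope` shape with a `headers` field is hypothetical and is not the packet format Nest uses; only the `traceparent` layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) follows the W3C Trace Context format.

```typescript
interface SpanContextLike {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

// ASSUMPTION: hypothetical message envelope, not Nest's real packet.
interface Envelope<T> {
  pattern: string;
  data: T;
  headers?: Record<string, string>;
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: SpanContextLike): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): SpanContextLike | null {
  const m = TRACEPARENT.exec(header);
  if (!m) return null;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(m[1]) || /^0+$/.test(m[2])) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

// Publisher side: attach the current context to the outgoing message.
function inject<T>(envelope: Envelope<T>, ctx: SpanContextLike): Envelope<T> {
  return { ...envelope, headers: { ...envelope.headers, traceparent: formatTraceparent(ctx) } };
}

// Subscriber side: recover the parent context, if any.
function extract<T>(envelope: Envelope<T>): SpanContextLike | null {
  const header = envelope.headers ? envelope.headers["traceparent"] : undefined;
  return header ? parseTraceparent(header) : null;
}
```

Without the inject step on the publisher, `extract` returns `null` on the subscriber and the consumer starts a new root span, which matches the three uncorrelated traces seen in Jaeger.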

3. BullMQ producer→consumer

This has the same shape as #2 but lives entirely inside one app. The EVALSHA enqueue span is parented to the HTTP request, but BetQueueProcessor.process() (and every other consumer) starts a new orphan trace. Affects:

  • bet_settled_queue — bet side effects (live-bets emit, user stats, rakeback, leaderboard, affiliate, gamesService.handleBet)
  • updateSessionQueue — session metadata write-back
  • bot system queues
  • promo / challenge / leaderboard / user-stats migration / skindeck deposits / both speed-roulette queues

Per the OTel transport-gap memory: this loss is expected and fine for queue-driven state transitions inside a single service — they aren't meant to be trace-children of the enqueue call. The fix has been declined.

4. DiceService.play / BetService.createAndSettleBet have no service span

Per ../audits/perf-trace-coverage-audit.md, neither DiceService.play() (apps/api/src/casino/games/house/dice/dice.service.ts:34) nor BetService.createAndSettleBet() (apps/api/src/bet/bet.service.ts:559) carries a manual span. The trace jumps from HTTP controller → Prisma $transaction with no service-level attribution. Under load you see the total request time and the DB time, but cannot distinguish RNG computation, balance-check logic, or transaction-assembly cost. A recommended fix exists in the audit doc but has not yet been applied.

What this means for query patterns

| You want to know… | Use… | Why |
| --- | --- | --- |
| End-to-end latency of a sign-in | Jaeger trace search service=ebit-api op=POST /auth/sign-in | Browser → api hop is propagated since task #27 |
| End-to-end latency of a dice bet | Same; trace 8aaa902b3964af1d33dec7000bb36e02 is the worked example | Same |
| Per-Prisma-model RED metrics | PromQL traces_spanmetrics_* filtered on span_name=~"prisma:.*" | spanmetrics-derived; native metrics don't exist |
| Why is rt's UsersOnlineUpdated flapping | Loki {service_name="ebit-rt"} + correlate by trace_id to the api-side online-tracker.service.ts 10 s cron | Rt's emit is a single send /events span; no further auto-coverage |
| Did the bj orphan service do anything | Jaeger service=ebit-bj will be empty for in-repo traffic | Per project_ebit_bj_orphan.md no FE reaches it |
| Did FastTrack receive my bet | Don't ask Jaeger; check the RabbitMQ UI :15672 ft vhost (it'll be empty too) | Producer is stubbed (disabled=true), 11 silent drops |