ADR-0005 — OTel traceparent not propagated on Redis pub/sub RPC

Status: Accepted
Date: 2026-04-16
Author(s): Platform engineering

Context

The five NestJS apps communicate via a Redis pub/sub microservice transport (@nestjs/microservices, Transport.REDIS). The @ExternalControllerClient decorator at libs/gateway/src/ms-controller/external-controller-client.decorator.ts:18 creates typed proxy clients that send EventMessage payloads through EventsGateway.sendEvent() at libs/gateway/src/events.gateway.ts:45. The GatewayClientFactory.createProxy() at libs/gateway/src/ms-controller/gateway-client.factory.ts:71 constructs messages with data and event fields but no W3C Trace Context.

Empirically verified: a speed-roulette bet POST produces three uncorrelated traces in Jaeger — one for ebit-api (HTTP entry), one for ebit-speed-roulette (placeBet processing), and one for the wallet RPC callback. The 76 ms gap between the last publish span and the @PlaceBetLock release is invisible in the caller's trace.

Decision

Accept the propagation gap as a known limitation. The Redis pub/sub transport does not carry traceparent, and we do not patch it at this time. Cross-service RPC calls via @ExternalControllerClient produce orphan root traces in the callee service.

The EventMessage DTO at libs/gateway/src/dto/base.dto.ts:49 has a traceId field, but RpcContextInterceptor at libs/gateway/src/interceptors/rpc-context.interceptor.ts:33 populates it from the JWT session ID or idempotencyKey — not from the OTel active span. This traceId is a business correlation ID, not a W3C trace ID.
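The distinction above can be made mechanical. The following is a minimal sketch, not code from the repository: `isW3cTraceId` is a hypothetical helper that checks whether a value has the shape the W3C Trace Context specification requires for a trace-id (32 lowercase hex characters, not all zeros). A session ID or idempotency key will usually fail this check, but a business ID that happens to be 32 hex characters would pass, so the check can only rule values out.

```typescript
// Hypothetical helper: true only if `value` has the shape of a W3C trace-id.
// A W3C trace-id is 32 lowercase hex characters and must not be all zeros.
function isW3cTraceId(value: string): boolean {
  return /^[0-9a-f]{32}$/.test(value) && !/^0{32}$/.test(value);
}

// Hypothetical guard for dashboards or alert rules that expect an OTel trace ID.
// It labels the value instead of silently treating EventMessage.traceId as a trace ID.
function describeTraceIdField(value: string): "w3c-shaped" | "business-id" {
  return isW3cTraceId(value) ? "w3c-shaped" : "business-id";
}
```

A guard like this could sit in front of any tooling that builds Jaeger links from a message field, so that business correlation IDs are never passed to a trace lookup.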

Affected call sites (all use @ExternalControllerClient):

File | Line | Target controller
apps/api/src/casino/house/speed-roulette-api/speed-roulette-api.service.ts | 25 | SpeedRouletteGatewayController
apps/speed-roulette/src/bet/bet.service.ts | 42 | EvoGamesWalletGatewayController
apps/bj/src/bets/bets.service.ts | 26 | EvoGamesWalletGatewayController
apps/api/src/casino/slots/providers/evogames/api/evogames-api.service.ts | 12 | SessionGatewayController

Alternatives considered

  1. Custom Nest interceptor to inject/extract OTel context into the message envelope. Evaluated but deferred: it would require modifying GatewayClientFactory.createProxy() to call propagation.inject(context.active(), envelope) before sendEvent, plus a matching server-side interceptor that calls propagation.extract() and runs the handler inside the restored context. The change touches a shared library used by all five apps and needs coordinated testing. This is the correct fix, but the risk is not justified by current observability needs.

  2. Switch from Redis pub/sub to HTTP for inter-service RPC. Rejected: the Redis transport provides fire-and-forget semantics, automatic retrying via retryAttempts: 20, and the existing RedisConverter serializer. HTTP would add connection pooling, retry logic, and service discovery concerns. The transport choice is architecturally sound; only the trace propagation is missing.

  3. Use BullMQ instead of Redis pub/sub for RPC. Rejected: BullMQ is designed for async job queues with persistence and retry, not synchronous request-response RPC. The @ExternalControllerClient pattern expects a response within timeoutMs: 5000 — a queue-based transport would add unnecessary latency and complexity.

  4. Correlate manually via Loki log queries. This is the current workaround. Both services log with trace_id from their respective OTel contexts. A Loki query filtering by timestamp and user ID can stitch the logical flow across the orphan traces. Not automated, but sufficient for debugging.
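To make alternative 1 concrete, the sketch below shows what would have to travel in the envelope. It is dependency-free on purpose and is not the deferred implementation: a real fix would use propagation.inject() and propagation.extract() from the OTel API as described above. The `Envelope` shape, the `traceparent` field, and both function names are assumptions for illustration; the ADR only states that messages carry data and event fields. The header layout (version 00, 32-hex trace-id, 16-hex parent-id, 2-hex flags) follows the W3C Trace Context format.

```typescript
// Hypothetical envelope: `traceparent` is NOT part of the real EventMessage today.
interface Envelope<T> {
  event: string;
  data: T;
  traceparent?: string;
}

// Identity of the caller's active span, as it would be read from the OTel context.
interface SpanIds {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context, version 00: 00-<trace-id>-<parent-id>-<flags>
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Client side (where createProxy() builds the message): copy the envelope and
// attach the caller's span identity as a traceparent string.
function injectTraceparent<T>(envelope: Envelope<T>, span: SpanIds): Envelope<T> {
  const flags = span.sampled ? "01" : "00";
  return { ...envelope, traceparent: `00-${span.traceId}-${span.spanId}-${flags}` };
}

// Server side (where a new interceptor would run): recover the caller's span
// identity, or undefined when the field is absent or malformed.
function extractTraceparent<T>(envelope: Envelope<T>): SpanIds | undefined {
  const match = TRACEPARENT_RE.exec(envelope.traceparent ?? "");
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

Returning undefined for a missing or malformed field keeps the callee's current behaviour as the fallback: it simply starts a new root trace, which is what happens for every call today.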

Consequences

  • Jaeger trace views for any flow crossing @ExternalControllerClient show only the HTTP-entry service's spans. The callee's work is a separate root trace, discoverable by service name + timestamp but not linked in the trace tree.
  • Latency analysis of end-to-end flows (e.g., speed-roulette bet placement) requires querying both traces manually or correlating via Loki.
  • The EventMessage.traceId field is a business ID, not an OTel trace ID. Do not confuse them in dashboards or alerts.
  • When the interceptor fix is eventually implemented, no schema or transport changes are needed — only the createProxy() call path and a new server-side interceptor. The EventMessage DTO already has space for the field.
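Until that fix exists, the manual correlation described above amounts to matching orphan root traces by user and time. The sketch below illustrates the heuristic only. The `RootTrace` record shape and the `stitchOrphanTraces` name are assumptions, not an existing API, and the default window reuses the 5000 ms RPC timeout cited in this ADR. Concurrent requests from the same user can produce false matches, so the result is a list of candidates, not proof of causality.

```typescript
// Hypothetical summary of a root trace, as it might be assembled from log lines
// that carry trace_id, a user ID, and a timestamp.
interface RootTrace {
  service: string;
  traceId: string;
  userId: string;
  startMs: number; // epoch milliseconds
}

// Find callee root traces that plausibly belong to the caller's request:
// same user, different service, started at or after the caller's trace and
// within `windowMs` of it. Results are ordered by start time.
function stitchOrphanTraces(
  caller: RootTrace,
  candidates: RootTrace[],
  windowMs: number = 5000, // assumption: reuse the RPC timeout as the window
): RootTrace[] {
  return candidates
    .filter(
      (c) =>
        c.service !== caller.service &&
        c.userId === caller.userId &&
        c.startMs >= caller.startMs &&
        c.startMs - caller.startMs <= windowMs,
    )
    .sort((a, b) => a.startMs - b.startMs);
}
```

For the speed-roulette flow, the caller would be the ebit-api trace and the candidates would be the root traces from the callee services in the same time range.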

References

  • libs/gateway/src/ms-controller/external-controller-client.decorator.ts:18 — decorator definition
  • libs/gateway/src/ms-controller/gateway-client.factory.ts:71-104 — proxy creation (no context injection)
  • libs/gateway/src/events.gateway.ts:45-76 — sendEvent (no traceparent in payload)
  • libs/gateway/src/dto/base.dto.ts:49-101 — EventMessage DTO
  • libs/gateway/src/interceptors/rpc-context.interceptor.ts:33 — traceId from JWT, not OTel
  • libs/gateway/src/const.ts:18-26 — Redis transport options with custom serializer
  • docs/architecture.md AF-2 — aggregated weakness entry