ADR-0005: OTel traceparent not propagated on Redis pub/sub RPC
Status: Accepted Date: 2026-04-16 Author(s): Platform engineering
Context
The five NestJS apps communicate via a Redis pub/sub microservice transport (@nestjs/microservices, Transport.REDIS). The @ExternalControllerClient decorator at libs/gateway/src/ms-controller/external-controller-client.decorator.ts:18 creates typed proxy clients that send EventMessage payloads through EventsGateway.sendEvent() at libs/gateway/src/events.gateway.ts:45. The GatewayClientFactory.createProxy() at libs/gateway/src/ms-controller/gateway-client.factory.ts:71 constructs messages with data and event fields but no W3C Trace Context.
Empirically verified: a speed-roulette bet POST produces three uncorrelated traces in Jaeger — one for ebit-api (HTTP entry), one for ebit-speed-roulette (placeBet processing), and one for the wallet RPC callback. The 76 ms gap between the last publish span and the @PlaceBetLock release is invisible in the caller's trace.
Decision
Accept the propagation gap as a known limitation. The Redis pub/sub transport does not carry traceparent, and we do not patch it at this time. Cross-service RPC calls via @ExternalControllerClient produce orphan root traces in the callee service.
The EventMessage DTO at libs/gateway/src/dto/base.dto.ts:49 has a traceId field, but RpcContextInterceptor at libs/gateway/src/interceptors/rpc-context.interceptor.ts:33 populates it from the JWT session ID or idempotencyKey — not from the OTel active span. This traceId is a business correlation ID, not a W3C trace ID.
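The distinction between the two kinds of ID can be checked mechanically. The following is a minimal sketch, not code from the repository: `isW3cTraceId` is a hypothetical helper, and both sample values are illustrative (the actual format of the session ID or idempotencyKey is not stated here and is assumed to be UUID-like). It relies only on the W3C Trace Context rule that a trace ID is 32 lowercase hex characters and not all zeros.

```typescript
// A W3C trace ID is exactly 32 lowercase hex characters and must not be all zeros.
const W3C_TRACE_ID = /^[0-9a-f]{32}$/;

// Hypothetical helper: true only for values usable as an OTel/W3C trace ID.
function isW3cTraceId(value: string): boolean {
  return W3C_TRACE_ID.test(value) && value !== "0".repeat(32);
}

// Sample values for illustration only; the UUID shape is an assumption.
const businessTraceId = "3f2b8c1e-9a4d-4e7b-8f63-2d1c5a7e9b04";
const otelTraceId = "4bf92f3577b34da6a3ce929d0e0e4736";

console.log(isW3cTraceId(businessTraceId)); // false
console.log(isW3cTraceId(otelTraceId));     // true
```

A guard like this could help keep `EventMessage.traceId` values out of any dashboard field that expects an OTel trace ID.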
Affected call sites (all use @ExternalControllerClient):
| File | Line | Target controller |
|---|---|---|
| apps/api/src/casino/house/speed-roulette-api/speed-roulette-api.service.ts | 25 | SpeedRouletteGatewayController |
| apps/speed-roulette/src/bet/bet.service.ts | 42 | EvoGamesWalletGatewayController |
| apps/bj/src/bets/bets.service.ts | 26 | EvoGamesWalletGatewayController |
| apps/api/src/casino/slots/providers/evogames/api/evogames-api.service.ts | 12 | SessionGatewayController |
Alternatives considered
- Custom Nest interceptor to inject/extract OTel context into the message envelope. Evaluated but deferred: would require modifying `GatewayClientFactory.createProxy()` to call `propagation.inject(context.active(), envelope)` before `sendEvent`, and a matching server-side interceptor to call `propagation.extract()` and wrap the handler in the restored context. The change touches a shared library used by all five apps and needs coordinated testing. Correct implementation but not worth the risk for the current observability needs.
- Switch from Redis pub/sub to HTTP for inter-service RPC. Rejected: the Redis transport provides fire-and-forget semantics, automatic retrying via `retryAttempts: 20`, and the existing `RedisConverter` serializer. HTTP would add connection pooling, retry logic, and service discovery concerns. The transport choice is architecturally sound; only the trace propagation is missing.
- Use BullMQ instead of Redis pub/sub for RPC. Rejected: BullMQ is designed for async job queues with persistence and retry, not synchronous request-response RPC. The `@ExternalControllerClient` pattern expects a response within `timeoutMs: 5000`; a queue-based transport would add unnecessary latency and complexity.
- Correlate manually via Loki log queries. This is the current workaround. Both services log with `trace_id` from their respective OTel contexts. A Loki query filtering by timestamp and user ID can stitch the logical flow across the orphan traces. Not automated, but sufficient for debugging.
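To make the deferred interceptor alternative concrete, here is a dependency-free sketch of what injecting and extracting a W3C `traceparent` value on a message envelope amounts to. It is an illustration under stated assumptions, not the repository's code: the `Envelope` and `SpanRef` types, the optional `traceparent` field, and both function names are hypothetical. A real implementation would more likely delegate to `propagation.inject()` and `propagation.extract()` from `@opentelemetry/api`, as the first alternative describes.

```typescript
// Hypothetical envelope: `event` and `data` are named in this ADR;
// the optional `traceparent` field is an assumption made for this sketch.
interface Envelope<T = unknown> {
  event: string;
  data: T;
  traceparent?: string;
}

// Minimal stand-in for an OTel span context (assumed shape).
interface SpanRef {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context, version 00: version-traceid-parentid-flags.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Client side: copy the caller's span identity into the outgoing envelope.
function injectTraceparent<T>(envelope: Envelope<T>, span: SpanRef): Envelope<T> {
  const flags = span.sampled ? "01" : "00";
  return { ...envelope, traceparent: `00-${span.traceId}-${span.spanId}-${flags}` };
}

// Server side: recover the parent span, or undefined when absent or malformed.
function extractTraceparent(envelope: Envelope): SpanRef | undefined {
  const match = TRACEPARENT.exec(envelope.traceparent ?? "");
  if (!match) return undefined;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

const outgoing = injectTraceparent(
  { event: "placeBet", data: { amount: 10 } },
  { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", spanId: "00f067aa0ba902b7", sampled: true },
);
console.log(outgoing.traceparent);
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
console.log(extractTraceparent({ event: "placeBet", data: {} })); // undefined
```

Returning `undefined` for a missing or malformed value means a callee that receives an un-instrumented message simply starts a new root trace, which matches today's behaviour.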
Consequences
- Jaeger trace views for any flow crossing `@ExternalControllerClient` show only the HTTP-entry service's spans. The callee's work is a separate root trace, discoverable by service name + timestamp but not linked in the trace tree.
- Latency analysis of end-to-end flows (e.g., speed-roulette bet placement) requires querying both traces manually or correlating via Loki.
- The `EventMessage.traceId` field is a business ID, not an OTel trace ID. Do not confuse them in dashboards or alerts.
- When the interceptor fix is eventually implemented, no schema or transport changes are needed, only changes to the `createProxy()` call path and a new server-side interceptor. The `EventMessage` DTO already has space for the field.
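The manual correlation that end-to-end latency analysis currently depends on can be pictured as a join on user ID within a time window. The sketch below is illustrative only: the `LogRecord` shape, its field names, and the `stitch` function are assumptions, not the services' actual log schema or tooling.

```typescript
// Hypothetical log record; field names are assumptions for this sketch.
interface LogRecord {
  service: string;
  traceId: string;     // OTel trace ID local to the emitting service
  userId: string;
  timestampMs: number; // epoch milliseconds
}

// Pair each caller record with callee records for the same user that
// occur at or after it, no more than `windowMs` later.
function stitch(
  caller: LogRecord[],
  callee: LogRecord[],
  windowMs: number,
): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (const a of caller) {
    for (const b of callee) {
      const delta = b.timestampMs - a.timestampMs;
      if (a.userId === b.userId && delta >= 0 && delta <= windowMs) {
        pairs.push([a.traceId, b.traceId]);
      }
    }
  }
  return pairs;
}

const apiLogs: LogRecord[] = [
  { service: "ebit-api", traceId: "aaa", userId: "u1", timestampMs: 1000 },
];
const rouletteLogs: LogRecord[] = [
  { service: "ebit-speed-roulette", traceId: "bbb", userId: "u1", timestampMs: 1040 },
  { service: "ebit-speed-roulette", traceId: "ccc", userId: "u2", timestampMs: 1050 },
];
console.log(stitch(apiLogs, rouletteLogs, 5000)); // [ [ 'aaa', 'bbb' ] ]
```

A heuristic join like this can produce false matches when the same user issues overlapping requests, which is one reason it is a debugging aid rather than a substitute for real context propagation.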
References
- `libs/gateway/src/ms-controller/external-controller-client.decorator.ts:18`: decorator definition
- `libs/gateway/src/ms-controller/gateway-client.factory.ts:71-104`: proxy creation (no context injection)
- `libs/gateway/src/events.gateway.ts:45-76`: `sendEvent` (no traceparent in payload)
- `libs/gateway/src/dto/base.dto.ts:49-101`: `EventMessage` DTO
- `libs/gateway/src/interceptors/rpc-context.interceptor.ts:33`: `traceId` from JWT, not OTel
- `libs/gateway/src/const.ts:18-26`: Redis transport options with custom serializer
- `docs/architecture.md` AF-2: aggregated weakness entry