Observability

Local stack for traces, metrics, and logs. Everything runs out of the root compose.yml alongside the apps.

Component       Port                      Role
otel-collector  4317 (gRPC), 4318 (HTTP)  OTLP ingress gateway
jaeger          16686                     Trace UI
prometheus      9090                      Metrics TSDB
loki            3100                      Log store
grafana         3003                      Unified UI (admin/grafana)

Config lives in observability/: otel-collector.yml, loki.yml, prometheus.yml, grafana/ (provisioned datasources + dashboards).

How traces are produced

All five NestJS apps (api, rt, bj, bo, speed-roulette) share libs/shared/src/basic/pre/pre-otel.main.ts, which is imported at the top of every main.ts before Nest bootstraps. It initializes @opentelemetry/sdk-node with:

  • getNodeAutoInstrumentations() — covers http, express, nestjs-core, ioredis, pg, bullmq, winston, and more. fs and dns are disabled to keep spans readable.
  • new PrismaInstrumentation() — Prisma spans (model + method as span attrs).

The SDK exports via OTLP HTTP to OTEL_EXPORTER_OTLP_ENDPOINT (set to http://otel-collector:4318 in compose). OTEL_SERVICE_NAME identifies the service in Jaeger.

ebit-fe (Next.js) exports browser + server spans via @vercel/otel with propagateContextUrls covering the ebit-api base URL so traceparent is forwarded across the FE→API hop.
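
For reference, the traceparent header forwarded across the FE→API hop follows the W3C Trace Context layout: version, 32-hex trace ID, 16-hex span ID, and 2-hex flags, joined by dashes. The sketch below is a minimal, dependency-free parser for that layout; parseTraceparent and TraceParent are illustrative names, not part of @vercel/otel or the OTel SDK, and only the four-field form used by version 00 is handled.

```typescript
// Parsed form of a W3C Trace Context `traceparent` header (illustrative type).
interface TraceParent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex), lowercase only.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const version = match[1];
  const traceId = match[2];
  const spanId = match[3];
  const flags = match[4];
  // Version "ff" is forbidden, and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

The traceId field of the result is the value to search for in Jaeger and to match against the trace_id field in log records.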

How traces correlate to logs

Every log record carries trace_id/span_id/trace_flags matching the active OTel span, so a Jaeger trace can pivot into Loki with {service_name="ebit-api"} |= "<trace_id>" and vice-versa.
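
The Loki side of that pivot can be sketched as a small helper that builds the LogQL string from a service name and trace ID. traceLogQuery is a hypothetical name used for illustration only; the check assumes trace IDs are the usual 32 lowercase hex digits, and the label quoting via JSON.stringify is adequate for simple service names rather than a full LogQL escaper.

```typescript
// Build the LogQL line-filter query used to pivot from a trace to its logs.
function traceLogQuery(serviceName: string, traceId: string): string {
  // Reject placeholders such as "<trace_id>" before they reach Loki.
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error(`not a 32-hex-digit trace ID: ${traceId}`);
  }
  // JSON.stringify yields a double-quoted literal with quotes and backslashes escaped.
  const label = JSON.stringify(serviceName);
  return `{service_name=${label}} |= "${traceId}"`;
}
```

Calling traceLogQuery('ebit-api', id) with a valid ID produces the same query shape shown above, ready to paste into Grafana Explore.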

Mixed stack: pino for framework, winston for app-code facade

All five ebit-api services run a two-logger setup:

  1. nestjs-pino is the Nest framework logger. It captures Nest lifecycle output and every HTTP request/response. Records are JSON on stdout, bridged into OTel's logs API by @opentelemetry/instrumentation-pino, and exported via OTLP to the collector. These are the records that land in Loki.
  2. @bebkovan/server-core's EvoLogger facade still backs the ~40 app-code call sites that already use EvoLogger.log/debug/error(...). It writes to winston. WinstonInstrumentation (enabled by default in getNodeAutoInstrumentations) injects the same trace_id/span_id/trace_flags at the winston transport layer. Those records are written to docker stdout and are not exported over OTLP; they reach Loki only through the filelog/docker receiver described under Known sharp edges.

We use pino for OTel log correlation because @opentelemetry/instrumentation-pino bridges records into the logs SDK, so OTLP export needs no extra wiring. We did not find an equivalent for winston in a stable form. We kept EvoLogger (winston) for the existing call sites because a mass rewrite would touch 40+ files for no gain: the records are still trace-tagged, they just stay off the OTLP path.

The wiring (NestJS)

Shared helper in libs/shared/src/logger/pino-logger.module.ts:

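import type { DynamicModule } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

// Thin wrapper so every app configures nestjs-pino the same way.
export class NestLoggerModule {
  static forRoot(opts: { serviceName: string; level?: string }): DynamicModule {
    return LoggerModule.forRoot({
      // name tags each record with the service; autoLogging logs every HTTP request/response.
      pinoHttp: { name: opts.serviceName, level: opts.level ?? 'info', autoLogging: true },
    });
  }
}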

Every app.module.ts imports it:

imports: [
  EnvConfigModule,
  NestLoggerModule.forRoot({ serviceName: `${Project.Name}-api` }),  // before EvoLoggerModule
  MetricsModule,
  EvoLoggerModule.forRoot({ winston: {...}, ... }),                   // unchanged, still registered
  ...
]

libs/shared/src/basic/base.main.ts swaps the Nest framework logger to pino after NestFactory.create(...):

import { Logger } from 'nestjs-pino';
app.useLogger(app.get(Logger));

libs/shared/src/basic/pre/pre-otel.main.ts registers pino instrumentation explicitly so the logKeys field names are under our control:

getNodeAutoInstrumentations({
  '@opentelemetry/instrumentation-pino': { enabled: false },  // we register one below
}),
new PinoInstrumentation({
  logKeys: { traceId: 'trace_id', spanId: 'span_id', traceFlags: 'trace_flags' },
}),

WinstonInstrumentation stays in the default-enabled set, so EvoLogger records still carry the same three fields on docker stdout.

The wiring (Next.js)

ebit-fe/src/lib/log.ts is a thin structured-logger wrapper:

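import { trace } from '@opentelemetry/api';

// Emit one JSON log line tagged with the active span's IDs.
// Outside a span the three trace fields are undefined and JSON.stringify drops them.
const emit = (level: string, msg: string, fields?: Record<string, unknown>) => {
  const ctx = trace.getActiveSpan()?.spanContext();
  const record = {
    time: new Date().toISOString(), level, msg,
    service: process.env.OTEL_SERVICE_NAME ?? 'ebit-fe',
    trace_id: ctx?.traceId, span_id: ctx?.spanId, trace_flags: ctx?.traceFlags,
    ...fields, // caller-supplied fields come last and win on key collisions
  };
  (level === 'error' ? console.error : console.log)(JSON.stringify(record));
};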

We wire this only at proof points (e.g., src/app/api/auth/cookies/route.ts), not repo-wide — Next.js's own request logs don't need trace_id for our current use cases.

Validating correlation

  1. Run the sign-in E2E (ebit-fe/e2e/), grab the FE traceparent from browser devtools or the Jaeger search UI.
  2. Find the trace in Jaeger: http://localhost:16686/search?service=ebit-api — click through to the trace detail. The root span's trace_id is the anchor.
  3. Query Loki for the same ID:
curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={service_name="ebit-api"} |= "<trace_id>"' \
  --data-urlencode 'start='$(date -d '10 minutes ago' +%s%N) \
  --data-urlencode 'end='$(date +%s%N) | jq '.data.result[].values'

Expect: at least one log line per service that participated in the trace.

  4. In Grafana Explore (http://localhost:3003, admin/grafana), choose the Loki datasource, paste the same LogQL, and click a matching line. The provisioned derivedFields config renders a "View trace" button that links back to Jaeger.
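
For scripted checks, the Loki request in step 3 can also be assembled in TypeScript. This is a sketch under stated assumptions: lokiQueryRangeUrl and msToNs are hypothetical helpers, the base URL is the local Loki port from the table at the top, and start/end are sent as nanosecond Unix timestamps exactly as the curl command does.

```typescript
// Convert epoch milliseconds to a nanosecond timestamp string.
// String concatenation avoids precision loss from multiplying large numbers.
const msToNs = (ms: number): string => `${Math.floor(ms)}000000`;

// Build a /loki/api/v1/query_range URL covering the last `windowMinutes` minutes.
function lokiQueryRangeUrl(
  baseUrl: string,
  logql: string,
  windowMinutes: number,
  nowMs: number = Date.now(),
): string {
  const startMs = nowMs - windowMinutes * 60 * 1000;
  const params = new URLSearchParams({
    query: logql,
    start: msToNs(startMs),
    end: msToNs(nowMs),
  });
  // Drop trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, '')}/loki/api/v1/query_range?${params.toString()}`;
}
```

Passing the result to fetch and reading data.result[].values from the JSON response mirrors the jq filter in the curl version. The nowMs parameter exists only to make the time window reproducible in tests.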

Dashboards (provisioned as code)

Dashboard JSON lives under observability/grafana/provisioning/dashboards/:

File                    Purpose
service-overview.json   RED per service (rate / error / duration) from the spanmetrics connector
bullmq.json             Queue depth + processing rate across all BullMQ queues (sessions, bets, bots, leaderboard, promo, rakeback, skindeck, SpeedRoulette*)
redis.json              ioredis ops + latency, split cache vs bot Redis
prisma-postgres.json    Prisma model/method heatmap + Postgres top slow queries + connection pool saturation
browser-rum.json        Web Vitals p75 / p95 from ebit-fe (@vercel/otel browser export)
logs-trace-pivot.json   Loki search with derivedFields "View trace" link back to Jaeger
perf-test.json          k6 custom metrics, threshold status, run-vs-baseline (during perf runs)
perf-system.json        Host metrics from node_exporter (CPU, mem, disk, net)

What produces what

Produces spans            Transport                                            Notes
Every NestJS app          OTLP HTTP → otel-collector :4318                     @opentelemetry/sdk-node + PrismaInstrumentation + auto-instrumentations (http/express/nestjs/ioredis/pg/bullmq/winston)
ebit-fe browser + server  OTLP via @vercel/otel                                Requires propagateContextUrls covering ebit-api base URL
ebit-admin-fe             none (Vite SPA, no SSR; browser-only traces if any)  Migrated from Next.js; AF-1 still applies for cookie/header propagation
Inter-Nest-app RPC        Redis pub/sub transport                              does NOT propagate traceparent; callee produces orphan traces (AF-2 in weaknesses-register.md)

Known sharp edges

  • EvoLogger.log(...) records now reach Loki via the filelog/docker receiver in observability/otel-collector.yml, which scrapes Docker container JSON logs from /var/lib/docker/containers/. These records carry a source=docker_filelog resource attribute so they are distinguishable from OTLP-bridged pino records. Query EvoLogger-only records in Loki: {source="docker_filelog"} |= "EvoLogger". Pino records still arrive via both OTLP (primary, with trace correlation) and filelog (secondary, without service.name resource); prefer the OTLP path ({service_name="ebit-api"}) for trace-correlated queries.
  • @opentelemetry/resources@1.30 exports Resource (class constructor), not resourceFromAttributes — that helper arrived in 2.x. pre-otel.main.ts uses the class form.
  • Next.js @vercel/otel needs propagateContextUrls set to the ebit-api base URL or traceparent won't propagate across the FE→API fetch boundary. See the FE's instrumentation.ts.
  • bj/bo bootstrap their own NestFactory.create in apps/bj/src/main.ts and apps/bo/src/main.ts instead of going through createNestApp — they must import '@app/shared/basic/pre-imports' as the first line (to boot the OTel SDK) and call app.useLogger(app.get(Logger)) (nestjs-pino) explicitly.