ADR-0012 — Tail-sampling policy: 100 % errors + 100 % slow + 10 % OK

Status: Accepted
Date: 2026-04-25 (codified at perf-stack provisioning)
Author(s): Platform engineering

Context

The OpenTelemetry collector ingests spans from all five ebit-api apps plus the FE RUM path. Two pressures shape the sampling decision:

  1. Volume. The perf-test ramp targets 1k → 10k VUs over 42 minutes with peak ~50 k spans / sec. Storing every span in Jaeger Badger (ADR-0009) at that rate would burn the 50 GB EBS volume in well under the 72 h TTL window and OOM the collector before that.
  2. Forensic value. OK / fast traces are mostly noise — they exhibit the steady state, which is what the spanmetrics-derived RED metrics already capture (ADR-0002). The traces operators need at incident time are the slow ones, the failed ones, and a representative slice of the rest.

The collector pipeline supports two sampling stages:

  • Head sampling: decide when the trace starts, before any outcome is known. Cheaper (CPU and bandwidth), but cannot inspect attributes that only exist later (status codes, latency).
  • Tail sampling: decide after the trace's spans have been collected. More expensive (the collector must buffer every span of a trace until the decision is made), but can inspect every attribute. Required if the policy depends on outcome.

This ADR codifies the choice between them and the specific policy chosen.

Decision

The collector applies the tail_sampling processor with three policies in OR composition (a trace is kept if any policy matches):

# excerpt from terraform/modules/monitoring/user-data.sh.tftpl:144-167
tail_sampling:
  decision_wait: 10s
  num_traces: 100000
  expected_new_traces_per_sec: 5000
  policies:
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: slow_traces
      type: latency
      latency:
        threshold_ms: 500
    - name: random_sample
      type: probabilistic
      probabilistic:
        sampling_percentage: 10

In plain language:

  • Every ERROR-status trace is kept (100 %).
  • Every trace whose end-to-end duration (earliest span start to latest span end) exceeds 500 ms is kept (100 %).
  • 10 % of all other traces are kept uniformly at random.
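
The OR composition can be sketched in a few lines of Python. This is an illustrative model only, not the collector's implementation: TraceSummary and keep_trace are hypothetical names, duration_ms is assumed to be the end-to-end trace duration, and the real probabilistic policy decides from the trace ID rather than from a random draw.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a buffered trace; not a collector type."""
    has_error: bool     # at least one span carries status ERROR
    duration_ms: float  # latest span end minus earliest span start


LATENCY_THRESHOLD_MS = 500  # mirrors threshold_ms in the excerpt
SAMPLING_PERCENTAGE = 10    # mirrors sampling_percentage in the excerpt


def keep_trace(trace: TraceSummary, rng: random.Random) -> bool:
    """Return True if any of the three policies matches (OR composition)."""
    if trace.has_error:
        return True
    # Behaviour at exactly 500 ms is not modelled precisely here.
    if trace.duration_ms > LATENCY_THRESHOLD_MS:
        return True
    # rng.random() is uniform on [0, 1), so roughly 10 % of draws pass.
    return rng.random() * 100 < SAMPLING_PERCENTAGE
```

Because the policies are OR-ed, the 10 % draw only matters for traces that are neither failed nor slow.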

The processor sits in the traces pipeline after memory_limiter, so the limiter can refuse data before the sampling buffer grows, and before batch, so only kept spans are batched for export:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/jaeger, spanmetrics]

enable_tail_sampling is a Terraform variable so the policy can be toggled per environment (e.g. local dev runs with sampling disabled).

Considered alternatives

A. Head sampling only (probabilistic_sampler processor)

Cheaper, simpler, no buffer. Rejected because the high-value traces (errors, slow outliers) are precisely the ones we need to not lose, and head sampling cannot tell them apart from steady-state OK traces. A 10 % head sample over 50 k spans/sec drops 90 % of the failures along with the noise — useless at incident time.

B. Tail sampling at 100 % (no probabilistic policy)

Keep every error and every slow trace; drop the rest entirely. Rejected because the random 10 % slice serves a valuable secondary purpose: representative steady-state traces for capacity planning, performance baselining, and exemplar dots in Grafana panels. Without it, panels lose exemplar density at low-traffic moments and operators can't pivot from a metric to a trace via the "click an exemplar" UX.

C. Tail sampling at 100 % + 100 % + 100 %

Keep everything. Rejected by the throughput math. At 50 k spans / sec (about 5 k traces / sec at ~10 spans per trace) and a compressed span size on Badger of ~300 bytes, 72 h of full-fidelity ingest is roughly:

50 000 × 300 × 60 × 60 × 72 ÷ 0.5 (compaction overhead) ≈ 7.8 TB

The 50 GB EBS volume cannot accept that. The 90 % drop on OK traces brings the volume back inside the budget.
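
The estimate can be reproduced with a short Python calculation. The constants restate figures quoted in this ADR; treating 50 GB as decimal gigabytes and the function name retained_bytes are assumptions made for illustration. The ingest rate is already expressed per span, so spans-per-trace does not enter the product.

```python
SPANS_PER_SEC = 50_000     # peak ingest rate quoted in Context
BYTES_PER_SPAN = 300       # compressed span size on Badger (ADR estimate)
RETENTION_HOURS = 72       # Badger TTL window
COMPACTION_FACTOR = 0.5    # dividing by 0.5 doubles the footprint
VOLUME_BYTES = 50 * 10**9  # 50 GB EBS volume, decimal GB assumed


def retained_bytes(spans_per_sec: float, hours: float) -> float:
    """Bytes on disk after `hours` of sustained ingest at `spans_per_sec`."""
    seconds = hours * 3600
    return spans_per_sec * BYTES_PER_SPAN * seconds / COMPACTION_FACTOR


full = retained_bytes(SPANS_PER_SEC, RETENTION_HOURS)
print(f"{full / 1e12:.2f} TB")                   # 7.78 TB
print(f"{full / VOLUME_BYTES:.0f}x the volume")  # 156x the volume
```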

D. Tail sampling 100 / 100 / 5 % (more aggressive on OK)

Cuts steady-state retention in half. Rejected marginally — at 5 % the exemplar density on Grafana panels degrades to the point where some low-traffic endpoints have no exemplar dots in the time window an operator is looking at. 10 % keeps exemplars usable across all but the lowest-traffic endpoints.
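
One way to see why 5 % hurts low-traffic endpoints is to compute the chance that a time window contains no sampled OK trace at all. The sketch below assumes each trace is sampled independently; the 6 requests / minute endpoint and the 5-minute window are hypothetical figures chosen for illustration, not measurements from the perf-test rig.

```python
def prob_no_exemplar(requests_in_window: int, sampling_percentage: float) -> float:
    """Probability that none of the OK traces in the window is sampled."""
    keep = sampling_percentage / 100.0
    return (1.0 - keep) ** requests_in_window


# Hypothetical low-traffic endpoint: 6 requests / minute viewed over 5 minutes.
requests = 6 * 5

for pct in (10, 5):
    p = prob_no_exemplar(requests, pct)
    print(f"{pct} % sampling: P(no exemplar) = {p:.3f}")
```

Under these assumptions the empty-window probability is about 4 % at a 10 % rate and about 21 % at a 5 % rate, which matches the qualitative argument above.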

E. Tail sampling 100 / 100 / 10 % with a higher latency threshold (1000 ms)

Slow-trace policy fires only at p99-shaped latencies. Rejected because the SLO targets in docs/performance-test-report.md are p95 < 200 ms for sign-in and < 100 ms for game endpoints; a 500 ms cutoff catches "slower than SLO" traces, which is the right operational signal. 1000 ms would only catch outright pathology, missing the SLO-budget-burn cases.

F. Per-service sampling rates (different policies per service)

Apply 100 % to bo (low volume, high-value admin operations) and 5 % to api (high volume). Considered, deferred. Per-service policies are supported by tail_sampling via the and policy type (its and_sub_policy list can combine a service.name match with a sampling rate), but the configuration complexity outweighs the benefit at our current scale. Revisit if bo traces start dominating the budget under aggressive admin use.

Consequences

What operators see

  • ~90 % of OK traces are dropped. When searching Jaeger by service + min-duration = 0, fewer traces appear than were actually served. Documented prominently in docs/engineering/observability-runbook.md §8 pitfall #4.
  • Exemplar density in Grafana panels is reduced for steady-state. At 10 k requests / minute on a route, that's ~17 OK traces / sec sampled — sufficient for exemplar dots in the perf-test dashboard (ebit-perf-test).
  • All errors and all slow traces are present. Click any error spike in Grafana → the trace is in Jaeger.

Buffering & memory

  • The processor buffers spans for decision_wait: 10s, holding up to num_traces: 100000 partial traces before evicting the oldest. At perf-test peak (about 5 k new traces / sec × 10 s) the buffer holds ~50 k traces simultaneously, comfortably under the cap.
  • expected_new_traces_per_sec: 5000 sizes the internal hash maps; misconfiguration surfaces as collector OOM, not data loss.
  • Net memory cost: ~200 MB at peak, within the collector's memory_limiter 80 % ceiling on the 8 GB VM.
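
The buffer figures above follow from the ingest rate and decision_wait. The Python sketch below restates that arithmetic; buffer_estimate is a hypothetical helper, and the per-span byte figure is only what the ~200 MB estimate implies, not a measurement.

```python
def buffer_estimate(spans_per_sec: float, spans_per_trace: float,
                    decision_wait_s: float) -> tuple[float, float]:
    """Return (traces in flight, spans in flight) during the decision window."""
    traces_per_sec = spans_per_sec / spans_per_trace
    traces_in_flight = traces_per_sec * decision_wait_s
    spans_in_flight = spans_per_sec * decision_wait_s
    return traces_in_flight, spans_in_flight


traces, spans = buffer_estimate(50_000, 10, 10)
print(traces)            # 50000.0 traces buffered at peak
print(100_000 / traces)  # 2.0, i.e. num_traces leaves 2x headroom
print(200e6 / spans)     # 400.0 bytes per buffered span implied by ~200 MB
```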

Trace propagation gotchas

  • Tail sampling decides per-trace, not per-span. All spans of a kept trace are kept (correct behaviour).
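  • The decision is made once decision_wait (10 s) has elapsed since the trace's first span arrived, not when the root span ends; every span buffered by then shares that decision, so child-span sampling is consistent with the root.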
  • Cross-service traces fragmented by the Redis pub/sub gap (ADR-0005) are evaluated independently because the gap creates separate root spans. A slow callee will be kept on its own root's latency; a fast caller may still be dropped. Reasonable behaviour given the underlying transport limitation.
  • OK traces from BetService.createAndSettleBet are subject to the 10 % sample — this means the BullMQ orphan trace (docs/audits/perf-trace-coverage-audit.md) is dropped 90 % of the time. Operators investigating bet-settled latency must look at spanmetrics aggregates first; pivot to Jaeger only for ERROR or > 500 ms cases.

Reporting

  • Perf-test reports must state that Jaeger data is sampled. Span counts for OK traces cited from Jaeger are ~10 % of actual, and latency distributions are skewed toward the slow end. Spanmetrics-derived metrics (Prometheus) are not sampled and remain authoritative for rate / error / duration aggregates; this holds only as long as the spanmetrics connector sees spans before tail_sampling drops them.
  • Add a banner to performance-test-report.md noting the sampling policy in effect.
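
When a report needs an approximate true trace count from Jaeger data, the sampled OK traces can be scaled back up while error and slow traces are counted as-is. This is a rough estimator under the policy above, not a replacement for the spanmetrics aggregates; estimate_total_traces and the example counts are hypothetical.

```python
def estimate_total_traces(error_or_slow: int, ok_fast: int,
                          sampling_percentage: float = 10.0) -> float:
    """Estimate the number of traces served from the counts seen in Jaeger.

    error_or_slow: traces that are ERROR or over 500 ms (kept at 100 %).
    ok_fast: remaining traces, which survive only via the random sample.
    """
    scale = 100.0 / sampling_percentage
    return error_or_slow + ok_fast * scale


# Hypothetical Jaeger counts for one perf-test window.
print(estimate_total_traces(error_or_slow=1_200, ok_fast=4_000))  # 41200.0
```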

Toggle

  • The enable_tail_sampling Terraform variable lets an operator disable sampling entirely for a focused investigation. Use case: forensic capture of a 5-minute window where every trace must survive. Be aware: sustained disabled sampling will fill 50 GB of Badger storage in roughly 2 hours at perf-test load, and in under 30 minutes at the sustained peak rate assumed in alternative C (50 k spans / sec × 300 bytes ÷ 0.5 ≈ 30 MB / sec); intentionally short windows only.
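
A rough fill-time check can be made from the peak figures used in alternative C. The sketch assumes sustained peak ingest with no sampling and decimal gigabytes; a ramping load takes longer to fill the volume, and minutes_to_fill is a hypothetical helper.

```python
def minutes_to_fill(volume_bytes: float, spans_per_sec: float,
                    bytes_per_span: float, compaction_factor: float) -> float:
    """Minutes until the volume is full at a constant unsampled ingest rate."""
    write_rate = spans_per_sec * bytes_per_span / compaction_factor  # bytes / sec
    return volume_bytes / write_rate / 60


# 50 GB volume, 50 k spans / sec, ~300 bytes per span, 0.5 compaction factor.
print(f"{minutes_to_fill(50e9, 50_000, 300, 0.5):.1f} min")  # 27.8 min
```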

Revisit triggers

Reopen this decision if:

  1. The probabilistic 10 % rate causes systemic exemplar-density gaps on production dashboards (not seen on perf-test rig).
  2. The budget for forensic storage grows substantially (e.g. larger EBS, or move to Tempo / S3) — at which point higher sampling becomes affordable.
  3. Customer compliance forces 100 % retention for audit purposes; sampling is incompatible with that requirement and would force a full pipeline rework.

References