ADR-0009: Jaeger v2 + Badger over Tempo / managed OpenSearch
Status: Accepted
Date: 2026-04-25
Author(s): Platform engineering (research in ../audits/jaeger-storage-research.md)
Context
The dev observability VM ran jaegertracing/all-in-one:1.57 with the default in-memory backend. On 2026-04-23 the host OOM-killed: Jaeger consumed 19.2 GB RAM under perf-test-shaped span volume and breached the 16 GB host limit. The default MEMORY_MAX_TRACES=0 means "unbounded" — this is documented behaviour, not a Jaeger bug.
Two pressures forced a decision:
- The OOM is recurring. The next perf-test ramp (1k → 10k VUs over 42 min, peak ~50k spans/sec) will reproduce the failure on any host smaller than ~32 GB, which is bigger than the budget for the observability VM (c7g.xlarge, 8 GB RAM, eu-north-1).
- Jaeger v1 reached EOL on 2025-12-31 (CNCF blog 2024-11-12, issue #6321). Migrating to v2 is mandatory regardless of the OOM. v2 also rebuilds Jaeger on the OpenTelemetry Collector core, simplifying the pipeline.
The team is two engineers; ops capacity is the dominant constraint. Forensic retention requirement: 24–72 h post-test trace replay for incident analysis. No multi-tenancy, no global search across regions.
Decision
- Switch to jaegertracing/jaeger:2.17 with the Badger backend on the observability VM.
- Mount Badger storage on a 50 GB gp3 EBS volume, paths /var/lib/jaeger/keys and /var/lib/jaeger/values.
- TTL the spans at 72h (extensions.jaeger_storage.backends.badger_main.badger.ttl.spans: 72h per terraform/modules/monitoring/jaeger-v2-config.yaml.tftpl).
- Cap RAM with GOMEMLIMIT=1500MiB (and a Docker mem_limit of 2 GB) so the LSM compactor never exceeds the host budget.
- Set ephemeral: false so storage survives container restarts.
- Tempo single-binary on local backend is the documented fallback, picked only if Badger sweats under high-cardinality writes (Jaeger #2987). The team is less familiar with TraceQL, so we accept the migration cost only if forced.
The Terraform user-data renders the v2 config from jaeger-v2-config.yaml.tftpl and starts the container with GOMEMLIMIT set; see terraform/modules/monitoring/main.tf for the docker_compose definition.
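The storage settings in the Decision list can be pictured as one nested mapping. The sketch below is illustrative only: it builds that mapping in Python and resolves the dotted TTL path quoted above. The backend name badger_main, the TTL, the ephemeral flag and the two paths come from this ADR; the directories.keys / directories.values key names are an assumption about the config layout and should be checked against jaeger-v2-config.yaml.tftpl.

```python
# Illustrative model of the Badger storage section as nested dicts.
# Assumption: the "directories" -> "keys"/"values" key names; everything
# else (backend name, paths, ttl.spans, ephemeral) is quoted from the ADR.
from functools import reduce

badger_storage_config = {
    "extensions": {
        "jaeger_storage": {
            "backends": {
                "badger_main": {
                    "badger": {
                        "directories": {
                            "keys": "/var/lib/jaeger/keys",
                            "values": "/var/lib/jaeger/values",
                        },
                        "ephemeral": False,       # keep data across container restarts
                        "ttl": {"spans": "72h"},  # purge spans after 72 h
                    }
                }
            }
        }
    }
}


def lookup(config: dict, dotted_path: str):
    """Resolve a dotted path such as 'a.b.c' against nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), config)


ttl = lookup(
    badger_storage_config,
    "extensions.jaeger_storage.backends.badger_main.badger.ttl.spans",
)
print(ttl)  # 72h
```

A lookup helper like this can double as a smoke test on the rendered template, failing with a KeyError if a path segment is renamed.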
Considered alternatives
A. Stay on Jaeger v1 in-memory, set MEMORY_MAX_TRACES
The minimum-effort path: cap the ring buffer at, e.g., 100 000 traces and avoid the OOM. Rejected because v1 is EOL and we need to migrate anyway; doing it now is cheaper than splitting the work.
B. Self-hosted Elasticsearch single-node
Mature backend, well-supported by Jaeger v2. Rejected because ES needs ≥ 4 GB JVM heap on its own: c7g.xlarge's 8 GB RAM cannot fit ES + the OTel collector + Prometheus + Grafana + Loki on the same box. Splitting onto a dedicated ES VM doubles the cost. JVM tuning, snapshot scripts, and version upgrades add operational cost that is not justified for a two-person team.
C. AWS managed OpenSearch (t3.medium.search)
Removes the ops burden. Jaeger v2 has first-class OpenSearch support. Rejected on cost: ~$60–80/month for the smallest viable instance plus EBS, against a $0 baseline for Badger on the same box. The forensic-only use case doesn't justify the spend.
D. Self-hosted single-node Cassandra
Jaeger's most mature backend. Rejected decisively: Cassandra has the highest ops cost of any option considered. JVM heap tuning, repair scheduling, and tombstone management are more than two engineers can operate, so it is out of scope.
E. AWS Keyspaces (managed Cassandra)
Removes the ops burden of self-hosted Cassandra. Rejected on cost: with per-request pricing at ~$1.45 per million WRUs, 1 B writes come to ≈ $1.5k for the perf-test ingest alone. Forensic use does not justify the spend.
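The Keyspaces estimate can be reproduced from the two figures quoted above. This is a minimal sketch, assuming one WRU per write (true only for rows of at most 1 KB, which this ADR does not state); the function name is hypothetical.

```python
# Reproduce the Keyspaces ingest-cost estimate from the figures quoted above.
PRICE_PER_MILLION_WRU_USD = 1.45  # on-demand write price quoted in this ADR
PERF_TEST_WRITES = 1_000_000_000  # ~1 B writes for the perf-test ingest


def keyspaces_write_cost(writes: int, price_per_million: float,
                         wru_per_write: int = 1) -> float:
    """Return the on-demand write cost in USD.

    wru_per_write=1 is an assumption: larger rows consume more WRUs
    and scale the cost linearly.
    """
    return writes * wru_per_write / 1_000_000 * price_per_million


cost = keyspaces_write_cost(PERF_TEST_WRITES, PRICE_PER_MILLION_WRU_USD)
print(f"${cost:,.0f}")  # $1,450
```

If spans average more than 1 KB per row, raising wru_per_write shows how quickly the bill grows beyond the ≈ $1.5k floor.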
F. Grafana Tempo single-binary, backend: local
Purpose-built for high-throughput trace ingest; Parquet block storage compresses well. Rejected for now, kept as fallback. Two reasons:
- Loses the Jaeger UI — search happens in Grafana via TraceQL, which the team is less familiar with than Jaeger's text-search.
- Badger is sufficient for our throughput; Tempo's higher-throughput strengths are wasted at our scale.
If Badger sweats under perf-test ingest, Tempo replaces it with the same config-only-change posture (still OTLP-on-4318 ingest from the OTel collector).
Consequences
Capacity & retention
- Disk: 50 GB gp3 EBS holds ~30–60 GB of compressed Badger blocks for 72 h of spans (depends on tail-sampling rate from ADR-0012). Cost: ~$4.64/month.
- RAM: capped at 1.5 GB Go heap + ~256 MB Badger working set; well under the host's 8 GB.
- TTL: spans automatically purge after 72 h. No retention dial available below that without code changes.
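The RAM figures in the list above can be checked with a few lines of arithmetic. The sketch below assumes the ADR's "GB" and "MB" values are binary units (GiB / MiB), which is how Docker and EC2 usually report them; the function name is hypothetical.

```python
# Check the memory budget: Go heap cap + Badger working set must fit under
# the Docker limit, which in turn must fit under the host's RAM.
# Assumption: "GB"/"MB" in the ADR are read as GiB/MiB.
GOMEMLIMIT_MIB = 1500          # GOMEMLIMIT=1500MiB
BADGER_WORKING_SET_MIB = 256   # ~256 MB Badger working set
DOCKER_MEM_LIMIT_MIB = 2 * 1024
HOST_RAM_MIB = 8 * 1024        # c7g.xlarge


def memory_budget(go_heap: int, working_set: int,
                  container_limit: int, host_ram: int) -> dict:
    """Summarise expected usage against the container and host limits (MiB)."""
    expected = go_heap + working_set
    return {
        "expected_mib": expected,
        "container_headroom_mib": container_limit - expected,
        "host_share": container_limit / host_ram,
        "fits": expected <= container_limit <= host_ram,
    }


budget = memory_budget(GOMEMLIMIT_MIB, BADGER_WORKING_SET_MIB,
                       DOCKER_MEM_LIMIT_MIB, HOST_RAM_MIB)
print(budget)
```

GOMEMLIMIT is a soft target for the Go runtime rather than a hard cap, so the Docker limit is what actually bounds the container; the headroom figure shows how much slack sits between the two.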
Operations
- Single-node only. No HA. If the observability VM dies, the stored traces (at most 72 h old) are lost, which is acceptable because they were going to expire anyway. Traces emitted during the outage window are lost as well. Mitigation: snapshot the EBS volume nightly via Terraform.
- Healthcheck path is /status in v2 (not / as in v1). Existing probes that point at / permanently report unhealthy. Already corrected in jaeger-v2-config.yaml.tftpl (healthcheckv2.use_v2: true); flagged in docs/engineering/observability-runbook.md §8.
- Badger compaction is asynchronous; sustained ingest can briefly raise disk usage above the steady-state mean. Mitigation: 50 GB volume gives ~3× headroom over peak.
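The healthcheck path change can be verified with a small probe. This is a minimal sketch using only the Python standard library; the host and the port 13133 (the OpenTelemetry Collector's usual health-check port) are assumptions, since this ADR does not state where the healthcheckv2 extension listens.

```python
# Minimal health probe for the v2 healthcheck path.
# Assumptions: host and port are placeholders; only the "/status" path
# comes from this ADR.
import urllib.error
import urllib.request


def health_url(host: str, port: int, path: str = "/status") -> str:
    """Build the probe URL; v2 serves health on /status, v1 used /."""
    return f"http://{host}:{port}{path}"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

Calling is_healthy(health_url("localhost", 13133)) against the container should return True, while the same call with path="/" is the case that keeps reporting unhealthy after the cutover.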
Observability impact
- The OTel collector pipeline is unchanged: still OTLP gRPC :4317 + HTTP :4318 → batch → spanmetrics + jaeger_storage_exporter.
- Spanmetrics-derived metrics (ADR-0002) continue to flow into Prometheus regardless of trace storage.
- Tail-sampling decisions (ADR-0012) are upstream of Jaeger, so Badger only ingests already-sampled traces.
Migration & rollback
- v1 → v2 cutover was a Terraform apply on 2026-04-25. Rollback to v1 not planned (v1 is EOL).
- If Badger underperforms during a perf-test, swap to Tempo by changing one Terraform variable; the rest of the pipeline stays unchanged.
References
- docs/audits/jaeger-storage-research.md: full deep-dive (351 lines, 19 sources) underpinning this decision.
- terraform/modules/monitoring/jaeger-v2-config.yaml.tftpl: Jaeger v2 Badger config.
- terraform/modules/monitoring/user-data.sh.tftpl: collector + Jaeger startup, including GOMEMLIMIT.
- CNCF blog: Jaeger v2 released (2024-11-12).
- Jaeger #6321: Jaeger v1 EOL.
- Jaeger #2987: Badger high-cardinality writes.
- BadgerDB GitHub.
- Prior incident: 2026-04-23 OOM on dev VM with jaegertracing/all-in-one:1.57.
- Sibling ADRs: 0001, 0002, 0005, 0012.
- Operator reference: docs/engineering/observability-runbook.md §8 (common confusions, including v2 healthcheck path).