
Playwright + agent context efficiency

Research brief for adding E2E tests to the ebit workspace (ebit-api :4000, ebit-fe :3000, ebit-admin-fe :5173 host / :3003 compose, Jaeger :16686). The goal: let agents drive Playwright against those services without the per-turn context budget collapsing under screenshots, full accessibility dumps, or verbose reporter output.

Versions referenced: Playwright 1.58.2 (latest stable image tag mcr.microsoft.com/playwright:v1.58.2-noble) and @playwright/mcp 0.0.70 (2026-04-01).


1. Three ways an agent can drive Playwright

(a) Playwright Test CLI (npx playwright test)

The agent shells out to npx playwright test … and reads the reporter output. What the agent sees is controlled entirely by the reporter and the flags; the tests themselves never leak raw page content into the model unless a test prints it to stdout or stderr.

Reporters ranked by per-turn cost (ascending) per the official table at https://playwright.dev/docs/test-reporters:

| Reporter | Output shape | Good for agent? |
|---|---|---|
| dot | one char per test (·/F/×); the CI default | ✅ smoke runs |
| line | single updating line + failures inline | ✅ daily driver |
| json | one JSON object at end (pipe to jq) | ✅ when we need structure |
| junit | XML; larger than JSON for same data | ⚠ CI only |
| list | one line per test; local default | ❌ noisy |
| html | interactive folder, self-opens on failure | ❌ never for agent |
| blob / github | shard merging / PR annotations | ❌ CI only |

The --quiet flag only suppresses stdout and stderr written by the tests themselves; it does not shrink the reporter's own output. For minimal reporter output, dot is the option to reach for.

Tradeoff: cheap, stateless, each invocation is a fresh process. The agent never holds a live browser between turns, so it cannot "explore" interactively.

(b) Official Playwright MCP server (@playwright/mcp v0.0.70)

Repo: https://github.com/microsoft/playwright-mcp. The server exposes tool calls for navigation, clicking, form filling, tab management, network mocking, storage, tracing, PDF, and assertion generation. It is explicitly built around accessibility snapshots, not pixel screenshots — "Uses Playwright's accessibility tree, not pixel-based input … No vision models needed, operates purely on structured data."

Positioning, per the project's own README: well suited to "persistent state, rich introspection, and iterative reasoning over page structure." The same README concedes CLI-based alternatives are "more token-efficient for high-throughput agents with limited context windows."

Tradeoff: every tool call returns a fresh a11y snapshot of the current page. Even with the YAML-ish structured form (see §1c), a full-page snapshot of the admin dashboard can be multiple KB per turn. Fine for one-shot exploration ("tell me what's on the login page") but wasteful for suite runs.

(c) page.accessibility.snapshot() vs screenshot vs locator.ariaSnapshot()

  • page.accessibility.snapshot(): deprecated; docs redirect to axe-core for a11y testing.
  • page.screenshot(): returns a PNG buffer. Irrelevant for an agent unless you also run vision. Disable by default.
  • locator.ariaSnapshot() + expect(locator).toMatchAriaSnapshot(): the modern replacement, documented at https://playwright.dev/docs/aria-snapshots. Emits compact YAML scoped to the locator:
- heading "Sign in" [level=1]
- textbox "Email"
- textbox "Password"
- button "Submit"

This is the right thing to capture in a test — the locator-scoped aria snapshot of the sign-in form (maybe 200 bytes) rather than the whole page (several KB).

Recommendation. Default to (a) with --reporter=line or --reporter=json | jq. Keep (b) available as an npx @playwright/mcp command the agent can spawn for one-off interactive debugging of a broken flow, but do not use it as the primary driver for the suite.


2. Per-turn context-reduction patterns

Defaults taken from https://playwright.dev/docs/api/class-testoptions: trace: 'off', video: 'off', screenshot: 'off' — already minimal. The failure modes below come from overriding those defaults or from agents reading the wrong artifacts.

Reporter & CLI flags.

  • --reporter=line for watchful runs; --reporter=dot when we only care about pass/fail.
  • --reporter=json piped to jq to extract what the agent actually needs:

npx playwright test --reporter=json 2>/dev/null \
  | jq '[.. | objects | select(has("specs")) | .specs[] | select(.ok == false) | {title, file, error: (.tests[0].results[0].error.message // "" | split("\n")[0])}]'

    Returns only failing specs, including those nested inside describe blocks (the JSON report nests suites), each with the first line of its error message. The result is far smaller than list output.
  • Use --fail-on-flaky-tests to force determinism instead of silent retries.
  • --workers=1 when debugging so log interleaving doesn't trick the model.
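When jq is not available, the same filtering can be done in a few lines of TypeScript. The sketch below assumes the JSON reporter's output has the nested suites/specs/tests/results shape that the jq filter relies on; the interface names and collectFailures are illustrative, not part of Playwright.

```typescript
// Assumed subset of the JSON reporter's shape; field names mirror the jq filter.
interface JsonResult { error?: { message?: string } }
interface JsonTest { results: JsonResult[] }
interface JsonSpec { title: string; file: string; ok: boolean; tests: JsonTest[] }
interface JsonSuite { specs?: JsonSpec[]; suites?: JsonSuite[] }
interface JsonReport { suites: JsonSuite[] }

interface Failure { title: string; file: string; error: string }

// Walk suites recursively so specs inside describe() blocks are included.
function collectFailures(report: JsonReport): Failure[] {
  const failures: Failure[] = [];
  const visit = (suite: JsonSuite): void => {
    for (const spec of suite.specs ?? []) {
      if (spec.ok) continue;
      const message = spec.tests[0]?.results[0]?.error?.message ?? '';
      // Keep only the first line so one failure costs one short record.
      failures.push({ title: spec.title, file: spec.file, error: message.split('\n')[0] ?? '' });
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  report.suites.forEach(visit);
  return failures;
}
```

Feeding the agent JSON.stringify(collectFailures(report)) instead of the raw report keeps the per-turn payload proportional to the number of failures rather than the size of the suite.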

Artifacts.

  • Keep trace: 'retain-on-failure' (docs: "Record trace for each test. When test run passes, remove the recorded trace"). Agents should never cat a trace zip, and npx playwright show-trace <path> is a human tool; the agent reads the .last-run.json summary instead.
  • video: 'off' unconditionally; re-enable per-project only during a flake investigation.
  • screenshot: 'only-on-failure' if we must, but prefer not taking them at all for agent-driven runs.
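As a rough sketch, the reporter and artifact choices above could be collected in one config object. AGENT_RUN is a hypothetical environment variable invented for this example (it is not a Playwright feature), and the baseURL assumes ebit-fe on the host as described in this brief.

```typescript
// Sketch for playwright.config.ts. AGENT_RUN is a hypothetical env var, not a Playwright feature.
const agentRun = process.env.AGENT_RUN === '1';

const config = {
  // One character per test for agents; a single updating line for humans.
  reporter: agentRun ? 'dot' : 'line',
  use: {
    baseURL: 'http://localhost:3000', // assumption: ebit-fe on the host
    trace: 'retain-on-failure',       // traces survive only for failed tests
    video: 'off',
    screenshot: 'off',
  },
} as const;

// In the real file this object would be the default export:
// export default config;
```

Keeping the switch in config rather than in each command line means an agent cannot accidentally fall back to the noisier local default by omitting a flag.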

Selector discipline (from https://playwright.dev/docs/locators, ranked priority):

  1. page.getByRole('button', { name: 'Sign in' }): "closest way to how users and assistive technology perceive the page."
  2. page.getByLabel('Email'): for form fields.
  3. page.getByText(...): non-interactive text (div/span/p).
  4. page.getByTestId('login-submit'): when the first three don't apply. Add data-testid to ebit-fe/ebit-admin-fe components rather than fighting the DOM.
  5. CSS / XPath: explicitly discouraged: "can break when the DOM structure changes."

Role locators double as self-documenting assertions and survive Tailwind/shadcn/NextUI reshuffles without test changes, which is the dominant source of maintenance cost in this codebase.

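Snapshot discipline. When a test needs to capture structure, scope toMatchAriaSnapshot to the smallest meaningful locator (for example page.getByRole('form') rather than page.locator('body')). A <form> element exposes the form role only when it has an accessible name, so the form may need an aria-label for that locator to match. Scoping this way keeps each snapshot small, typically ≤1 KB.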

Agent-read discipline. The agent reads test-results/.last-run.json and playwright-report/ summary files, never *.webm / trace.zip / screenshots. Enforce with a .claudeignore or equivalent.
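A small helper can turn the last-run file into a one-line status for the agent. This is a sketch that assumes .last-run.json holds a status string and a failedTests array of test IDs; that shape should be checked against a real file from the pinned Playwright version. The function names are illustrative.

```typescript
import { readFileSync } from 'node:fs';

// Assumed shape of test-results/.last-run.json; verify against a real file before relying on it.
interface LastRun { status: string; failedTests: string[] }

// Reduce the file contents to a single short line for the agent.
function summarizeLastRun(raw: string): string {
  const run = JSON.parse(raw) as Partial<LastRun>;
  const failed = run.failedTests ?? [];
  return `status=${run.status ?? 'unknown'} failed=${failed.length}`;
}

// Convenience wrapper; the default path assumes Playwright's default output directory.
function readLastRunSummary(path = 'test-results/.last-run.json'): string {
  return summarizeLastRun(readFileSync(path, 'utf8'));
}
```

The summary deliberately omits the test IDs; when the agent needs detail, a JSON-reporter run filtered to failing specs is the better source.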


3. Run Playwright from the host, not the compose network

Two options:

Option A — add a playwright service inside compose.yml. Uses mcr.microsoft.com/playwright:v1.58.2-noble. Tests address services by compose name: http://ebit-fe:3000, http://ebit-admin-fe:3001, http://ebit-api:4000, http://jaeger:16686. Needs --init, --ipc=host, and often --cap-add=SYS_ADMIN per https://playwright.dev/docs/docker.

Option B — run npx playwright test on the host VM. Tests target http://localhost:3000, http://localhost:3001, http://localhost:4000, http://localhost:16686 — the same URLs a developer sees in the browser.

Recommendation: Option B (host). Justification:

  1. URL parity with humans. localhost:3000 is what the README, the unified-compose docs, and every dev tool in memory already document. A test failing at http://ebit-fe:3000 forces the human reader to mentally translate. Playwright traces, Sentry events, and /swagger links all refer to localhost.
  2. Sign-in cookies, CORS, same-site. Next.js auth cookies are scoped per host. Writing tests against ebit-fe (compose DNS) exercises a different cookie origin than what real users hit. Since the stated goal (MEMORY.md) is "login must work on both FEs," tests must validate the human-facing URL.
  3. Browser binary weight. Adding the Playwright image costs ~2 GB and another ~1–2 GB at runtime — notable on a 16 GB VM already running 5 NestJS apps, 2 Next servers, Postgres, Redis, RabbitMQ, and Jaeger.
  4. Developer iteration. On the host, pnpm playwright test -g "sign in" works immediately; in compose it requires docker compose run playwright … and volume-mounting the repo.
  5. Cross-FE trace correlation. Both FEs talk to ebit-api on the host's published port; traces from the browser → localhost:4000 → OTel exporter → Jaeger stay in one span tree.

The one legitimate argument for Option A is CI reproducibility. Address that by pinning Playwright in a dedicated tests-e2e/ workspace (task #11) and running npx playwright test in CI with the same pinned version — you don't need the container for that.

Network caveat if we ever flip to Option A: the Playwright docs recommend docker run --add-host=hostmachine:host-gateway … to reach host-published ports from inside the container. But compose-internal DNS already makes that moot — it's the URL-drift that's the real cost, not the connectivity mechanism.


4. Capturing OTel trace IDs from Next.js in Playwright

Does Next.js put traceparent on outgoing responses? No. The official Next.js OpenTelemetry guide (https://nextjs.org/docs/pages/guides/open-telemetry) and the unresolved vercel/next.js#59321 discussion confirm: Next.js auto-instrumentation extracts incoming traceparent headers and injects them into outbound fetch() calls (that's what @vercel/otel's propagateContextUrls enables — already in memory), but it does not add traceparent to the HTTP response headers the browser (or Playwright) sees.

Two viable workarounds, in order of preference

(a) Inject traceparent in a Next.js middleware (preferred). Per https://opentelemetry.io/docs/languages/js/propagation/:

// middleware.ts (both ebit-fe and ebit-admin-fe)
import { NextResponse } from 'next/server';
import { context, propagation } from '@opentelemetry/api';

export function middleware() {
  const res = NextResponse.next();
  // Serialize the active trace context into headers (traceparent/tracestate with the W3C propagator).
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);
  // If no span is active or no propagator is registered, carrier stays empty and no header is added.
  for (const [k, v] of Object.entries(carrier)) res.headers.set(k, v);
  return res;
}

Then in the test:

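import { test, expect } from '@playwright/test';

test('login page exposes a trace ID', async ({ page }, testInfo) => {
  const response = await page.goto('/login');
  const traceparent = response?.headers()['traceparent']; // "00-<traceId>-<spanId>-<flags>"
  const traceId = traceparent?.split('-')[1];
  expect(traceId, 'traceparent response header is missing').toBeTruthy();
  // attach to test metadata for later lookup
  testInfo.annotations.push({ type: 'traceId', description: traceId });
});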

Caveat (noted in #59321): injecting a per-request header into responses that Next caches per URL can poison the cache, because every cached hit would replay the first request's traceparent. Gate the middleware to non-cached routes or set cache-control: no-store on instrumented responses. Auth-guarded pages (the ones we actually care about) are already non-cached.
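Splitting on '-' works for well-formed headers but silently accepts garbage. A stricter parser, sketched below against the W3C Trace Context layout (2-hex version, 32-hex trace ID, 16-hex parent ID, 2-hex flags), rejects malformed values before they reach a Jaeger lookup. parseTraceparent and TraceContext are illustrative names, not part of any library.

```typescript
// W3C Trace Context: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex).
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

interface TraceContext { traceId: string; spanId: string; sampled: boolean }

function parseTraceparent(header: string | undefined): TraceContext | undefined {
  const match = header ? TRACEPARENT.exec(header.trim()) : null;
  if (!match) return undefined;
  const version = match[1] ?? '';
  const traceId = match[2] ?? '';
  const spanId = match[3] ?? '';
  const flags = match[4] ?? '';
  // The spec reserves version ff and treats all-zero IDs as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

In a test, parseTraceparent(response?.headers()['traceparent'])?.traceId would replace the manual split, and an undefined result becomes a clear signal that the middleware did not run for that route.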

(b) Post-hoc Jaeger query by service + operation + timestamp (fallback). If middleware is out of reach, bracket the test in wall-clock time and query Jaeger:

const start = Date.now() * 1000; // Jaeger expects unix microseconds
await page.goto('/login');
await page.getByRole('button', { name: 'Sign in' }).click();
// Add slack: the server-side span can start after click() resolves.
const end = (Date.now() + 5_000) * 1000;

const url = `http://localhost:16686/api/traces?service=ebit-fe&operation=POST%20%2Fapi%2Fauth%2Fsignin&start=${start}&end=${end}&limit=5`;
const res = await page.request.get(url);
const { data } = await res.json();
// May be undefined until the span exporter has flushed; retry the query if so.
const traceId = data?.[0]?.traceID;

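The Jaeger Query HTTP API is listed at https://www.jaegertracing.io/docs/1.76/architecture/apis/, where the JSON endpoints are described as internal to the Jaeger UI, so treat the response shape as subject to change. Timestamps are unix microseconds (not milliseconds, which makes this a bug magnet).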
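Because the millisecond/microsecond mix-up is easy to make, the conversion can live in one helper. The sketch below builds the query URL from millisecond inputs such as Date.now(); jaegerTracesUrl and TraceQuery are illustrative names, and the parameter set is the one used in the query above.

```typescript
interface TraceQuery {
  service: string;
  operation: string;
  startMs: number; // epoch milliseconds, e.g. Date.now()
  endMs: number;   // epoch milliseconds
  limit?: number;
}

// Build a Jaeger /api/traces URL, converting milliseconds to the microseconds Jaeger expects.
function jaegerTracesUrl(baseUrl: string, q: TraceQuery): string {
  const params = new URLSearchParams({
    service: q.service,
    operation: q.operation, // URLSearchParams handles the encoding of spaces and slashes
    start: String(Math.floor(q.startMs) * 1000),
    end: String(Math.floor(q.endMs) * 1000),
    limit: String(q.limit ?? 5),
  });
  return `${baseUrl}/api/traces?${params.toString()}`;
}
```

A call such as jaegerTracesUrl('http://localhost:16686', { service: 'ebit-fe', operation: 'POST /api/auth/signin', startMs, endMs }) keeps call sites in milliseconds and confines the unit conversion to one place.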

Tradeoff: (b) is flaky in parallel CI — timestamps from two workers can overlap and return each other's traces. (a) is deterministic because the trace ID travels with the response. Implement (a); keep (b) documented for debugging ad-hoc incidents where middleware wasn't in place.


Summary recommendation

  1. Drive Playwright 1.58.2 from the host via the tests-e2e/ workspace (task #11), against localhost:3000 / :3001 / :4000.
  2. Default reporter: line for humans, json | jq for agent-driven turns. Never list / html.
  3. Artifacts off-by-default; trace: 'retain-on-failure' only.
  4. Locators: getByRole > getByLabel > getByText > getByTestId. Add data-testid to components when the first three are genuinely ambiguous.
  5. Capture trace IDs by adding propagation.inject middleware to both FEs and reading response.headers()['traceparent'] in the test.
  6. Reserve @playwright/mcp 0.0.70 for interactive one-off debugging, not as the suite driver.

Sources