feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer #108
aq17 wants to merge 4 commits into
…c verify converge together
Until now the explorer (evaluate.mjs) and the Playwright emitter (export.mjs)
were two disconnected stages: explorer converged on "the LLM can finish the
task," then export was a one-shot translation. The two objective functions
diverge — what unblocks the LLM agent doesn't always unblock a deterministic
replay. Demoing this against bizfile.sos.ca.gov surfaced 7+ classes of
mismatch (styled-label overlays, autocomplete keystroke interception,
transiently-disabled selects) that each cost a hand-fix in the emitted
script.
This PR unifies the loop:
Each iteration of `scripts/loop.mjs`:
1. evaluate.mjs → produces trace.json + summary.md
2. If trace passed, export.mjs --no-verify → emits Playwright script
3. npx tsx <task>.ts → actual deterministic replay
4. On Playwright fail, distill-failure.mjs summarizes the error via
Claude Haiku into strategy.md's "Recent Playwright Failures" section
5. Next iteration's evaluate reads the updated strategy.md and adapts
Convergence: Playwright passes 2 of last 3 iterations → graduate.
`strategy.md` is the shared intelligence layer between the LLM explorer and
the codegen. Three sections (documented in SKILL.md):
- Navigation Heuristics (LLM-facing)
- Codegen Hints (emitter-facing, per-task overrides)
- Recent Playwright Failures (auto-appended by distill-failure)
Also lifts the lessons from the bizfile demo into codegen defaults so future
tasks don't repeat the same hand-fixes:
- forceCheck : .check({ force: true }) for checkbox fill_sel ops
- forceClickRadio : .first().click({ force: true }) for radio click ops
(detected by selector pattern OR resolved node role)
- selectWithFallback: .selectOption() with a JS-enable + native-setter
fallback when the <select> is transiently disabled
- reactFill : helper for inputs where simulated keystrokes get
intercepted by autosuggest/autocomplete handlers
- clickButtonByText: eval-find-by-text in page context, avoids the
cross-step getByRole race on SPA wizards
Plus: select_dropdown ops with ref-shaped selectors (e.g. `[0-2005]`) now
route through the snapshot resolver instead of leaking as invalid CSS.
Files in this PR:
scripts/loop.mjs NEW — top-level orchestrator
scripts/export.mjs NEW — trace → Playwright codegen
scripts/lib/pick-run.mjs NEW — newest-passing-run selector
scripts/lib/parse-task.mjs NEW — task.md → Zod schema
scripts/lib/command-mapping.mjs NEW — browse trace → target-agnostic ops
scripts/lib/selector-resolver.mjs NEW — snapshot+ref → Playwright locators
scripts/lib/codegen-playwright.mjs NEW — ops → TS with helpers baked in
scripts/lib/verify.mjs NEW — npm install + tsx run + JSON parse
scripts/lib/distill-failure.mjs NEW — Playwright stderr → strategy.md addendum
scripts/evaluate.mjs MODIFIED — BROWSERBASE_CONTEXT_ID
passthrough + --max-turns flag
SKILL.md MODIFIED — documents export, loop,
sectioned strategy.md, and the
helper defaults baked into codegen
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed by loop validation
Loop validation today on bizfile (run-008 mined as the passing trace) reduced
the post-codegen hand-edits from yesterday's 15 down to 4 + 1 LLM-extract patch.
Each of the 4 navigation-level issues is now baked in as a codegen default, so
the next task we point loop.mjs at should start from a much smaller residual.
Fixes landed:
1. clickLinkWithFallback helper (codegen-playwright.mjs)
- For click_ref ops where the resolved node role is "link", emit
clickLinkWithFallback(page, <locator>) instead of plain .click().
- Helper reads the resolved .href property (not getAttribute, which
returns relative URLs). If the link exposes an absolute http(s) href,
prefer page.goto over .click — bypasses SPA tour overlays and
onClick preventDefault gates that block deterministic replay.
- Waits for networkidle after navigation (load fires too early on SPAs).
2. .first() default for ambiguous click_sel selectors
- Added isUniqueSelector() classifier: #id, [id=...], [data-testid=...].
- For unique selectors, emit .click() as before. For ambiguous ones
(e.g. `button[type=button]`), emit .first().click() to avoid
Playwright strict-mode violations.
3. exact: true for form-input getByRole emissions (selector-resolver.mjs)
- Added EXACT_NAME_ROLES set: textbox, searchbox, combobox, spinbutton,
listbox. nodeToLocators emits { name, exact: true } for these.
- Prevents "Limited Liability Company Name" from matching
"Confirm Limited Liability Company Name" (real bug from yesterday).
4. snapshot role "select" → ARIA role "combobox" (selector-resolver.mjs)
- Added SNAPSHOT_TO_ARIA_ROLE map and normalize at top of nodeToLocators.
- Browse-snapshot reports <select> with role "select" but Playwright's
ARIA role is "combobox". Without this mapping, the emitter produced
getByRole("select", ...) which is invalid.
- Also boost getByLabel above getByRole for select-likes (combobox/listbox)
since label-based locators tend to be more reliable for form selects.
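A minimal sketch of how fixes 2 to 4 could fit together. The names isUniqueSelector, EXACT_NAME_ROLES and SNAPSHOT_TO_ARIA_ROLE come from the notes above; the function bodies, the emitCssClick / emitGetByRole helpers, and the { role, name } node shape are assumptions for illustration, not the code in this commit.

```javascript
// Hypothetical sketch: names mirror the commit notes, bodies are assumptions.

// Snapshot roles that differ from the ARIA role Playwright expects.
const SNAPSHOT_TO_ARIA_ROLE = new Map([["select", "combobox"]]);

// Roles where a substring name match is likely to hit a sibling field.
const EXACT_NAME_ROLES = new Set([
  "textbox", "searchbox", "combobox", "spinbutton", "listbox",
]);

// True for selectors expected to match one element: #id, [id=...], [data-testid=...].
function isUniqueSelector(selector) {
  const s = selector.trim();
  return (
    /^#[\w-]+$/.test(s) ||
    /^\[id=(["']?)[^"'\]]+\1\]$/.test(s) ||
    /^\[data-testid=(["']?)[^"'\]]+\1\]$/.test(s)
  );
}

// Emit a click for a CSS selector; ambiguous selectors get .first()
// so Playwright's strict mode does not reject a multi-element match.
function emitCssClick(selector) {
  const locator = `page.locator(${JSON.stringify(selector)})`;
  return isUniqueSelector(selector)
    ? `await ${locator}.click();`
    : `await ${locator}.first().click();`;
}

// Emit a getByRole locator: normalize the snapshot role first,
// then add exact: true for form-input roles.
function emitGetByRole(node) {
  const role = SNAPSHOT_TO_ARIA_ROLE.get(node.role) ?? node.role;
  const opts = EXACT_NAME_ROLES.has(role)
    ? { name: node.name, exact: true }
    : { name: node.name };
  return `page.getByRole(${JSON.stringify(role)}, ${JSON.stringify(opts)})`;
}

console.log(emitCssClick("button[type=button]"));
console.log(emitGetByRole({ role: "select", name: "State" }));
```

Normalizing the role before the EXACT_NAME_ROLES check matters in this sketch: a snapshot "select" only picks up exact: true because it has already become "combobox".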
Validation:
Re-exported bizfile-ca-llc from run-008 with these defaults. The emitted
script navigates ALL 9 wizard steps without hand-edits (vs. yesterday's
hand-fixed playwright-baseline/ which required 7 categories of patches).
Only failure is in the LLM-generated extract block at the end (brittle
structural locators in result-shaping) — separate concern, tracked as a
follow-up. The architectural goal (loop + codegen produces a navigating
Playwright script) is met.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update (May 14): validated end-to-end on bizfile from scratch + landed 4 codegen fixes for the residual hand-edits.
Commit
Ready for review. Two named follow-ups in the PR description for when we pick this up after Friday's walkthrough.
🏗️ Architecture feedback: toward goal-driven codegen
Nice work on this PR: the pipeline is clean and the bizfile validation is solid. But I want to flag a longer-term architecture concern before we harden around the current design.
The core tension
The current pipeline is a trace compiler: mine the LLM's trace → resolve ARIA refs to Playwright locators → emit a deterministic script. This works, but it means the trace is the source of truth for the generated script, and that's where things get complicated.
The LLM's trace includes a lot of incidental decisions. It clicked a button because it saw it first. It used a CSS selector because the snapshot was long. It took an extra step because it got confused. The export pipeline faithfully converts all of this into Playwright, the noise alongside the signal. A human writing Playwright wouldn't replay the journey; they'd look at the goal (fill step 3 of this form) and write the simplest path to it.
This has three downstream consequences:
Proposed architecture: hybrid skeleton + agent codegen
Split the problem into what machines do well (structure) and what agents do well (judgment):
Phase 1: Mechanical skeleton extraction (keep most of what you have)
Mine the trace into a workflow skeleton, not a Playwright script:
The command-mapping and trace-walking code from this PR is great infrastructure for this. The change is: stop at the intent layer, don't go all the way to Playwright locators.
Phase 2: Agent writes Playwright from the skeleton + a live session
Give Claude the skeleton + strategy.md + a live Browserbase session. Claude writes Playwright for each step:
This eliminates the ARIA ref resolution pipeline, the selector ranking heuristics, and the hand-curated helper catalog. Claude is the codegen. The domain knowledge ("state portals use React controlled forms with styled label overlays") lives in strategy.md as prose, and the agent decides when to apply it, rather than encoding it as named functions.
Phase 3: Agent-driven verification
Instead of "2 of 3 passes," give Claude the script + the last N run results and ask: "Is this production-ready? What's still flaky?" The agent can identify that a timeout was 4900ms on a 5000ms limit (near-miss, not a real pass), or that a selector matched by coincidence. Graduation becomes a judgment call, not a counter.
What this means for this PR
Ship it as-is: the pipeline works, bizfile validates it, and the infrastructure (command-mapping, trace-walking, selectors.cache.json, distill-failure) is valuable regardless of architecture.
But I'd treat the current export pipeline as a stepping stone, not the final architecture:
The end state: the trace trains strategy.md (the existing autobrowse loop), and the agent writes Playwright from the task description + strategy, not from the trace. The trace is training data, not source code. That's the architecture that scales to 50 agencies without a new helper per portal.
Resolves merge conflicts with origin/main (took main's evaluate.mjs, which has the browser-trace integration this branch's loop.mjs / export.mjs sit on top of).
Bugbot findings addressed (PR #108 review comments on c918d2d):
1. loop.mjs runExport stdout pollution
stdio was ["ignore","inherit","inherit"], so export.mjs's JSON report leaked into loop.mjs's own structured JSON. Pipe stdout, log it to stderr instead, so consumers parsing our stdout get a single object.
2. lib/codegen-playwright.mjs dead helpers
reactFill / clickButtonByText were declared in every emitted script but emitOp never called them. Removed both; clickLinkWithFallback (which IS emitted) stays.
3. loop.mjs + lib/verify.mjs: JSON parse for pretty-printed output
`stdout.lastIndexOf("{")` locks onto the deepest inner `{` for nested output schemas (since emitted scripts use JSON.stringify(result, null, 2)). High-severity: the loop could never graduate for tasks with non-flat schemas. Replaced with a brace-balanced backward scan that ignores braces inside JSON strings. Helper factored into lib/pick-run.mjs as extractTrailingJsonObject for reuse.
4. loop.mjs unused `pickRun` import: removed.
5. lib/codegen-playwright.mjs unused `open` variable in checkBalance: removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
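For reviewers, a rough sketch of what a brace-balanced backward scan for finding 3 could look like. Only the name extractTrailingJsonObject and the general approach come from the commit message; the body below is an assumption and may differ from what landed in lib/pick-run.mjs.

```javascript
// Hypothetical sketch of the technique, not the committed implementation.
// Walks backward from the last "}" and stops at the "{" that balances it,
// skipping braces that sit inside JSON string literals.
function extractTrailingJsonObject(stdout) {
  const end = stdout.lastIndexOf("}");
  if (end === -1) return null;

  let depth = 0;
  let inString = false;
  for (let i = end; i >= 0; i--) {
    const ch = stdout[i];
    if (ch === '"') {
      // A quote is escaped when preceded by an odd number of backslashes.
      let backslashes = 0;
      for (let j = i - 1; j >= 0 && stdout[j] === "\\"; j--) backslashes++;
      if (backslashes % 2 === 0) inString = !inString;
      continue;
    }
    if (inString) continue;
    if (ch === "}") {
      depth++;
    } else if (ch === "{") {
      depth--;
      if (depth === 0) {
        // Let JSON.parse be the final judge of the candidate slice.
        try {
          return JSON.parse(stdout.slice(i, end + 1));
        } catch {
          return null;
        }
      }
    }
  }
  return null;
}

const sample =
  "replay log line\n" +
  JSON.stringify({ success: true, data: { name: "a{b}", steps: 9 } }, null, 2) +
  "\n";
console.log(extractTrailingJsonObject(sample));
```

With the sample above, `lastIndexOf("{")` would start parsing inside the `"a{b}"` string, which is the failure mode described in finding 3; the balanced scan walks past it and returns the outer object.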
… report
Bugbot caught this on 12e7f0a: the mid-loop graduation check used "2 of last 3 iterations passed", but the final report's status, the JSON `graduated` flag, and the process exit code all used the raw total pass-count across all iterations. Two passes separated by many failures could exit 0 and write "graduated" even though the last-3 rule was never satisfied.
Factor the criterion into a single `graduationReached(history)` helper. Use it in:
- the mid-loop break (was: inline last3 check)
- the report's "Final status" line (was: passedCount >= 2)
- the JSON `graduated` field (was: passedCount >= 2)
- the process exit code (was: passedCount >= 2)
passedCount is still computed and shown in the report for visibility, but it no longer drives any outcome flag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
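As an illustration of the single-criterion idea, here is a minimal sketch. The helper name graduationReached comes from the commit message; the history entry shape ({ playwrightPassed }), the summarizeRun wrapper, and the non-zero exit code of 1 are assumptions made for this example.

```javascript
// Hypothetical sketch: one predicate drives every outcome flag.
// Assumes each history entry looks like { playwrightPassed: boolean }.
function graduationReached(history, { window = 3, required = 2 } = {}) {
  const recent = history.slice(-window);
  return recent.filter((h) => h.playwrightPassed).length >= required;
}

// Assumed report shape; passedCount is informational only.
function summarizeRun(history) {
  const graduated = graduationReached(history);
  return {
    graduated,
    passedCount: history.filter((h) => h.playwrightPassed).length,
    exitCode: graduated ? 0 : 1,
  };
}

const P = { playwrightPassed: true };
const F = { playwrightPassed: false };

// Two passes separated by failures: passedCount is 2, but the last-3 rule fails.
console.log(summarizeRun([P, F, F, F, P]));
```

Under the old `passedCount >= 2` check the example history would have been reported as graduated; with the shared predicate the flag, status line, and exit code cannot drift apart.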
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 21f6405. Configure here.
    if (!exportOk) {
      log(`iter ${iter}: export failed; treating as Playwright fail`);
      hist.distillReason = "export script returned non-zero";
      continue;
Export failures skip strategy updates
Medium Severity
When export.mjs exits non-zero, the loop records a distill reason but never appends to strategy.md, unlike Playwright replay failures. Iterations after an export error get no codegen/export hints, so the co-evolution loop can repeat the same export failure.
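One possible shape for a fix, sketched as a pure text transform so it can be unit-tested without touching disk. The section name "Recent Playwright Failures" is from the PR; the "## " heading syntax, the bullet format, and the function name appendFailureNote are assumptions, and the real strategy.md layout may differ.

```javascript
// Hypothetical sketch: append a failure note under the failures section
// of strategy.md, creating the section if it is missing.
// Assumes sections are introduced by "## " markdown headings.
const FAILURES_HEADING = "## Recent Playwright Failures";

function appendFailureNote(strategyText, note) {
  const bullet = `- ${note.trim()}`;
  const lines = strategyText.split("\n");
  const start = lines.findIndex((l) => l.trim() === FAILURES_HEADING);

  if (start === -1) {
    // No section yet: add it at the end of the file.
    const base = strategyText.replace(/\s*$/, "");
    const prefix = base ? `${base}\n\n` : "";
    return `${prefix}${FAILURES_HEADING}\n${bullet}\n`;
  }

  // The section ends at the next "## " heading or at end of file.
  let end = lines.findIndex((l, i) => i > start && l.startsWith("## "));
  if (end === -1) end = lines.length;

  // Step back over trailing blank lines so the bullet follows the last entry.
  let insertAt = end;
  while (insertAt > start + 1 && lines[insertAt - 1].trim() === "") insertAt--;

  lines.splice(insertAt, 0, bullet);
  return lines.join("\n");
}

const before = "## Codegen Hints\n- prefer getByLabel\n";
console.log(appendFailureNote(before, "export.mjs exited non-zero"));
```

In the export-failure branch the loop could read strategy.md, pass it through a function like this with the export error text, and write it back before `continue`, so the next evaluate iteration sees the failure.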
    } else if (klass === "ref") {
      ops.push({ kind: "click_ref", ref: normalizeRef(target), ...base });
    }
    break;
Unknown click targets silently dropped
Medium Severity
walkTrace only emits click ops when classifySelector returns xpath, css, or ref. Clicks whose target is classified as unknown produce no op and no unhandled entry, so successful trace clicks can be omitted from the generated Playwright script without a TODO or warning.
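A sketch of one way to close this gap: give the click branch an explicit fallthrough that records the target instead of dropping it. The op kinds click_sel and click_ref and the class names are from this PR; pushClickOp, the shape of the unhandled entry, and the stand-in normalizeRef (assumed to strip the surrounding brackets) are hypothetical.

```javascript
// Stand-in for the real normalizeRef; assumed here to strip "[" and "]".
function normalizeRef(target) {
  return target.replace(/^\[|\]$/g, "");
}

// Hypothetical sketch: map one traced click to an op, or record it as
// unhandled so the emitter can surface a TODO instead of silently skipping.
function pushClickOp(ops, unhandled, klass, target, base) {
  if (klass === "xpath" || klass === "css") {
    ops.push({ kind: "click_sel", selector: target, ...base });
  } else if (klass === "ref") {
    ops.push({ kind: "click_ref", ref: normalizeRef(target), ...base });
  } else {
    unhandled.push({
      command: "click",
      target,
      reason: `unclassified selector (${klass})`,
      ...base,
    });
  }
}

const ops = [];
const unhandled = [];
pushClickOp(ops, unhandled, "ref", "[0-2005]", { step: 1 });
pushClickOp(ops, unhandled, "unknown", "the blue button", { step: 2 });
console.log(ops, unhandled);
```

Keeping the unhandled entries next to the op stream would let codegen emit a comment or warning at the matching step, which is the visibility the finding asks for.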


Headline
autobrowse can now emit a runnable, deterministic Playwright script from any passing trace, and iterate the explorer + emitter together until both halves converge on the same workflow.
Before this PR, autobrowse produced traces + strategy.md: durable artifacts, but the only way to re-run the task was to pay LLM inference per step. There was no path from a graduated task to a no-LLM-loop runnable script. This PR adds that path.
What's new
1. End-to-end Playwright export pipeline (entirely new)
The full mining → resolve → codegen → verify pipeline, none of which existed in autobrowse before:
- scripts/export.mjs: CLI: --task --target playwright --workspace --run --no-verify
- scripts/lib/pick-run.mjs: newest-passing-run selection from traces/<task>/run-NNN/
- scripts/lib/parse-task.mjs: task.md Output block → Zod schema for the emitted script
- scripts/lib/command-mapping.mjs: browse trace → target-agnostic op stream
- scripts/lib/selector-resolver.mjs: snapshot + session-scoped ARIA ref → ranked Playwright locator candidates (getByRole(name) → getByLabel → getByPlaceholder → getByText)
- scripts/lib/codegen-playwright.mjs: ops → runnable TypeScript with helper functions baked in (see #3 below)
- scripts/lib/verify.mjs: npm install + npx tsx + JSON output parse
- scripts/lib/distill-failure.mjs: Claude Haiku summary of Playwright failures into strategy.md
The emitted script connects to a fresh Browserbase session bound to BROWSERBASE_CONTEXT_ID (when set), so persistent-context auth survives between explorer training and Playwright replay.
2. scripts/loop.mjs: iterative co-evolution
Until now autobrowse converged on "the LLM can finish the task," then export would have been a one-shot translation at the end. Those are different objective functions: what unblocks the LLM agent doesn't always unblock a deterministic replay.
The loop bridges them:
strategy.md becomes a shared intelligence layer between the LLM explorer (next iteration) and the codegen. Three sections (documented in SKILL.md):
3. Codegen defaults that absorb the common state-portal pitfalls
Demoing the export pipeline end-to-end on bizfile.sos.ca.gov surfaced ~7 distinct classes of mismatch between what unblocks the LLM agent and what unblocks a deterministic replay. Each is now baked in as an auto-emitted helper or behavior — so the next task we point this at starts from a much smaller residual.
- forceCheck: page.locator('input[type=checkbox]').fill('true') (Playwright rejects) and overlay-intercepted .check()
- forceClickRadio: [type=radio] OR when resolved snapshot node role is radio
- selectWithFallback: .selectOption() with a JS-enable + React-native-setter fallback for transiently-disabled <select>
- reactFill: HTMLInputElement.prototype.value setter + synthetic input/change events
- clickButtonByText: getByRole race
- clickLinkWithFallback: .href property and prefers page.goto for absolute hrefs
- .first() default for ambiguous click_sel: button[type=button] matching 3 elements (Help / Save Draft / Next Step) → strict-mode violation
- exact: true for form-input getByRole
- "select" → ARIA "combobox": getByRole("select", ...) which is invalid in Playwright
- select_ref op routing: browse select [0-2005] CA resolves the ref via snapshot instead of leaking as invalid CSS
4. scripts/evaluate.mjs: additive patches
- BROWSERBASE_CONTEXT_ID env var; if set with --env remote, pre-creates one BB session bound to that context, transparently injects --connect <session-id> into every browse command from the agent, and releases the session at exit. Lets persistent-context auth flow through every iteration without per-run login flailing.
- --max-turns N CLI flag (previously hard-coded to 30). loop.mjs plumbs this through.
5. SKILL.md
New "Export to deterministic Playwright" and "Iterative Playwright loop" sections covering when to use loop.mjs vs evaluate.mjs, the sectioned strategy.md format, the codegen helper defaults, and pre-authed sessions via persistent context.
Validation (May 13–14, bizfile.sos.ca.gov LLC formation)
Phase 1 (May 13, customer_demos PR #33): ran the export pipeline by hand against run-004. The emitted script needed 15 hand-edits + an extract patch before it would replay cleanly. Those hand-edits became the source list for the codegen defaults above.
Phase 2 (May 14, this PR): ran the full loop.mjs from scratch.
Net result: the wizard-navigation half went from 15 hand-edits → 0. The LLM-extract block is the remaining gap.
Known limitations / follow-ups
- LLM-generated extract block remains brittle. The Haiku-generated result-shaping code at the end of every emitted script uses structural locators (page.locator('text="X"').evaluate(...)) that often match multiple elements. The wizard navigation succeeds end-to-end, then the extract throws and success: false is returned. Right fix: harden the extract prompt to insist on per-field try/catch + prefer getByLabel({..., exact: true}). ~30 LOC follow-up.
- No feedback when evaluate itself maxes out. The loop currently only distills Playwright failures into strategy.md. When evaluate hits max_turns, there's no addendum and the next iteration repeats whatever caused the flailing. Right fix: a second distillation pathway that reads evaluate's decision log when status is max_turns, identifies the longest-spent step, and writes a Codegen Hint.
- strategy.md's "Codegen Hints" section is human-readable only. The codegen doesn't yet parse it for per-task overrides at export time. The new helpers are baked in as defaults that fire on selector/role heuristics. Right fix: structured Codegen Hints DSL the emitter consumes.
- Validated on n=1 task. All evidence so far comes from bizfile. State-portal patterns we haven't exercised: date pickers, file uploads, multi-tab flows, iframed forms, captchas mid-flow, Symantec VIP / SAML auth, steppers without a "Next" button. Each may surface a new codegen default. Recommend running this against 1–2 more diverse portals (CA EDD + a DMV-style stepper) before any "generalizes to all 50 agencies" claim.
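To make the first follow-up concrete, here is a rough sketch of the per-field try/catch shape the hardened extract prompt could ask for. Everything in it is hypothetical: extractFields, the readers map, and the { data, errors } result are illustration only, and the readers are plain async functions so the example runs without a browser.

```javascript
// Hypothetical sketch: run each field reader in isolation so one failing
// locator yields a null field and an error note instead of failing the run.
async function extractFields(readers) {
  const data = {};
  const errors = {};
  for (const [field, read] of Object.entries(readers)) {
    try {
      data[field] = await read();
    } catch (err) {
      data[field] = null;
      errors[field] = err instanceof Error ? err.message : String(err);
    }
  }
  return { data, errors };
}

// Stub readers; in an emitted script a reader might instead be
//   () => page.getByLabel("Entity Name", { exact: true }).inputValue()
extractFields({
  entityName: async () => "Example LLC",
  filingNumber: async () => {
    throw new Error("strict mode violation: locator resolved to 3 elements");
  },
}).then((result) => console.log(result));
```

With this shape the script can still return the fields it did read and report which ones failed, which gives distill-failure something more specific to summarize than a single thrown exception.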
Try it
The loop graduates when Playwright passes in 2 of the last 3 iterations and writes a report to <workspace>/reports/loop-<task>-<timestamp>.md.
Sister PR with the bizfile demo workspace + the emitted-then-hand-fixed script: browserbase/customer_demos#33.
🤖 Generated with Claude Code
Note
Medium Risk
New pipeline runs generated Playwright via npx tsx, shells out to bb for Browserbase sessions, and calls Anthropic for extract/failure distillation; mis-generated scripts or env misconfiguration could cause failed replays or unexpected remote browser use, but scope is skill tooling rather than core auth/data paths.
Overview
Adds a deterministic Playwright path on top of autobrowse: passing traces can be turned into runnable TypeScript under tasks/<task>/playwright/, with optional npm install + tsx verification.
export.mjs picks a passing run, maps trace.json browse commands to ops (command-mapping.mjs), resolves session ARIA refs via snapshots into ranked Playwright locators (selector-resolver.mjs), infers a Zod schema from task.md (parse-task.mjs), and emits a script plus selectors.cache.json (codegen-playwright.mjs). The codegen layer includes portal-oriented helpers (forced radio/checkbox, selectWithFallback, link goto fallback, .first() for ambiguous CSS) and an optional Haiku-generated final extract step.
loop.mjs alternates evaluate.mjs → export → replay; Playwright failures are summarized into strategy.md (Recent Playwright Failures) and graduation requires 2 of the last 3 replay passes.
SKILL.md documents export, the iterative loop, strategy.md sections, and BROWSERBASE_CONTEXT_ID for persistent-context sessions on evaluate and exported scripts.