feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer by aq17 · Pull Request #108 · browserbase/skills

aq17 · 2026-05-14T21:22:58Z

Headline

autobrowse can now emit a runnable, deterministic Playwright script from any passing trace, and iterate the explorer + emitter together until both halves converge on the same workflow.

Before this PR, autobrowse produced traces + strategy.md — durable artifacts, but the only way to re-run the task was to pay LLM inference per step. There was no path from a graduated task to a no-LLM-loop runnable script. This PR adds that path.

What's new

1. End-to-end Playwright export pipeline (entirely new)

The full mining → resolve → codegen → verify pipeline, none of which existed in autobrowse before:

scripts/export.mjs — CLI: --task --target playwright --workspace --run --no-verify
scripts/lib/pick-run.mjs — newest-passing-run selection from traces/<task>/run-NNN/
scripts/lib/parse-task.mjs — task.md Output block → Zod schema for the emitted script
scripts/lib/command-mapping.mjs — browse trace → target-agnostic op stream
scripts/lib/selector-resolver.mjs — snapshot + session-scoped ARIA ref → ranked Playwright locator candidates (getByRole(name) → getByLabel → getByPlaceholder → getByText)
scripts/lib/codegen-playwright.mjs — ops → runnable TypeScript with helper functions baked in (see section 3 below)
scripts/lib/verify.mjs — npm install + npx tsx + JSON output parse
scripts/lib/distill-failure.mjs — Claude Haiku summary of Playwright failures into strategy.md
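The ranking used by the selector resolver (getByRole(name) → getByLabel → getByPlaceholder → getByText) can be sketched roughly as follows. This is a minimal illustration, not the code in selector-resolver.mjs: the function name `rankLocatorCandidates` and the node shape `{ role, name, label, placeholder, text }` are assumptions made for the example.

```javascript
// Hypothetical sketch: turn one resolved snapshot node into ranked
// Playwright locator expressions (as source strings), most preferred first.
// The node shape { role, name, label, placeholder, text } is an assumption.
function rankLocatorCandidates(node) {
  const q = (s) => JSON.stringify(s); // safe quoting for emitted source
  const candidates = [];
  if (node.role && node.name) {
    candidates.push(`page.getByRole(${q(node.role)}, { name: ${q(node.name)} })`);
  }
  if (node.label) candidates.push(`page.getByLabel(${q(node.label)})`);
  if (node.placeholder) candidates.push(`page.getByPlaceholder(${q(node.placeholder)})`);
  if (node.text) candidates.push(`page.getByText(${q(node.text)})`);
  return candidates;
}

console.log(rankLocatorCandidates({ role: "button", name: "Next Step", text: "Next Step" }));
```

An emitter could try the first candidate and keep the rest as fallbacks; whether the real resolver does that is not stated above.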

The emitted script connects to a fresh Browserbase session bound to BROWSERBASE_CONTEXT_ID (when set), so persistent-context auth survives between explorer training and Playwright replay.

2. `scripts/loop.mjs` — iterative co-evolution

Until now, autobrowse converged on "the LLM can finish the task," and export would have been a one-shot translation at the end. Those are different objective functions: what unblocks the LLM agent doesn't always unblock a deterministic replay.

The loop bridges them:

For each iteration (max --max-iterations):
  1. evaluate.mjs                   → trace.json + summary.md
  2. If trace passed:
       export.mjs --target playwright --no-verify  → emits script
       npx tsx <task>.ts                            → deterministic replay
       If replay passed → record pass
       Else → distill failure into strategy.md
  3. Next iteration's evaluate reads the updated strategy.md and adapts
  4. Graduate when Playwright passes in 2 of the last 3 iterations
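The graduation rule in step 4 could look something like the sketch below. The history entry shape (`{ playwrightPassed }`) is an assumption for illustration, as is the choice to let a run with fewer than three iterations graduate once it has two passes.

```javascript
// Sketch of "Playwright passes in 2 of the last 3 iterations".
// history: one entry per loop iteration, oldest first.
// Entry shape { playwrightPassed: boolean } is assumed for this example.
function graduationReached(history) {
  const lastThree = history.slice(-3);
  const passes = lastThree.filter((h) => h.playwrightPassed === true).length;
  return passes >= 2;
}

// Two passes separated by many failures do not graduate:
const scattered = [true, false, false, false, true].map((p) => ({ playwrightPassed: p }));
console.log(graduationReached(scattered)); // false

// Two passes inside the trailing window do:
const recent = [false, true, true].map((p) => ({ playwrightPassed: p }));
console.log(graduationReached(recent)); // true
```

Using one predicate for the mid-loop break, the report, and the exit code keeps those outcomes from disagreeing with each other.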

strategy.md becomes a shared intelligence layer between the LLM explorer (next iteration) and the codegen. Three sections (documented in SKILL.md):

Navigation Heuristics — LLM-facing prose
Codegen Hints — per-task overrides for the emitter
Recent Playwright Failures — auto-appended by the distiller
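As an illustration of the "auto-appended" behavior, here is a small sketch of adding a distilled failure under the Recent Playwright Failures section of a strategy.md string. The `## ` heading level, the bullet format, and the function name `appendPlaywrightFailure` are assumptions; the actual file layout is documented in SKILL.md.

```javascript
// Assumed heading text and level; the real strategy.md may differ.
const FAILURES_HEADING = "## Recent Playwright Failures";

// Append one summary bullet to the failures section, creating the
// section at the end of the file if it does not exist yet.
function appendPlaywrightFailure(strategyMd, summary) {
  const entry = `- ${summary.trim()}`;
  const lines = strategyMd.split("\n");
  const start = lines.findIndex((l) => l.trim() === FAILURES_HEADING);
  if (start === -1) {
    const base = strategyMd.replace(/\s+$/, "");
    return `${base}${base ? "\n\n" : ""}${FAILURES_HEADING}\n\n${entry}\n`;
  }
  // The section ends at the next "## " heading or at end of file.
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) { end = i; break; }
  }
  // Insert after the last non-blank line of the section.
  let insertAt = end;
  while (insertAt > start + 1 && lines[insertAt - 1].trim() === "") insertAt--;
  lines.splice(insertAt, 0, entry);
  return lines.join("\n");
}
```

Keeping the other two sections untouched means the explorer-facing prose and the emitter-facing hints are never rewritten by the distiller.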

3. Codegen defaults that absorb the common state-portal pitfalls

Demoing the export pipeline end-to-end on bizfile.sos.ca.gov surfaced ~7 distinct classes of mismatch between what unblocks the LLM agent and what unblocks a deterministic replay. Each is now baked in as an auto-emitted helper or behavior — so the next task we point this at starts from a much smaller residual.

| Helper / Behavior | Replaces / fixes |
| --- | --- |
| `forceCheck` | `page.locator('input[type=checkbox]').fill('true')` (Playwright rejects) and overlay-intercepted `.check()` |
| `forceClickRadio` | Radio clicks blocked by styled-label overlays — applied automatically when selector matches `[type=radio]` OR when resolved snapshot node role is `radio` |
| `selectWithFallback` | `.selectOption()` with a JS-enable + React-native-setter fallback for transiently-disabled `<select>` |
| `reactFill` | Inputs where keystroke handlers (autosuggest, autocomplete) drop chars — uses `HTMLInputElement.prototype.value` setter + synthetic `input`/`change` events |
| `clickButtonByText` | Wizard "Next Step" buttons across SPA page transitions — avoids `getByRole` race |
| `clickLinkWithFallback` | SPA link clicks intercepted by tour/onboarding overlays — reads resolved `.href` property and prefers `page.goto` for absolute hrefs |
| `.first()` default for ambiguous `click_sel` | `button[type=button]` matching 3 elements (Help / Save Draft / Next Step) → strict-mode violation |
| `exact: true` for form-input `getByRole` | "Limited Liability Company Name" matching "Confirm Limited Liability Company Name" |
| `Snapshot role "select" → ARIA "combobox"` | Resolver was emitting `getByRole("select", ...)` which is invalid in Playwright |
| `select_ref` op routing | `browse select [0-2005] CA` resolves the ref via snapshot instead of leaking as invalid CSS |
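To make the "applied automatically" rows concrete, the dispatch for click ops might be sketched like this. Only the two conditions named in the table (selector matches `[type=radio]` or resolved role is `radio`, and link-role clicks going through `clickLinkWithFallback`) come from this PR; the function name `chooseClickHelper`, its input shape, and the regex are assumptions.

```javascript
// Hypothetical dispatch: which emitted helper should a click op use?
// Input shape { selector, role } is assumed for illustration.
function chooseClickHelper({ selector = "", role = "" }) {
  // Matches input[type=radio], input[type="radio"], input[type='radio'].
  const looksLikeRadio = /\[type=["']?radio["']?\]/.test(selector);
  if (looksLikeRadio || role === "radio") return "forceClickRadio";
  if (role === "link") return "clickLinkWithFallback";
  return "click"; // plain locator.click()
}

console.log(chooseClickHelper({ selector: "input[type=radio]" }));   // forceClickRadio
console.log(chooseClickHelper({ role: "link" }));                     // clickLinkWithFallback
console.log(chooseClickHelper({ selector: "#next", role: "button" })); // click
```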

4. `scripts/evaluate.mjs` — additive patches

Reads BROWSERBASE_CONTEXT_ID env var; if set with --env remote, pre-creates one BB session bound to that context, transparently injects --connect <session-id> into every browse command from the agent, and releases the session at exit. Lets persistent-context auth flow through every iteration without per-run login flailing.
--max-turns N CLI flag (previously hard-coded to 30). loop.mjs plumbs this through.
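The "transparently injects --connect" behavior might be approximated as below. This is a sketch under assumptions: the function name `injectConnect`, the choice to append the flag at the end of the command, and the check for an existing `--connect` are illustrative and may not match what evaluate.mjs does.

```javascript
// Hypothetical sketch: add --connect <session-id> to agent-issued browse
// commands, leaving other commands (and already-connected ones) untouched.
function injectConnect(command, sessionId) {
  const trimmed = command.trim();
  if (!/^browse(\s|$)/.test(trimmed)) return command;      // not a browse command
  if (/\s--connect(\s|=|$)/.test(trimmed)) return command; // flag already present
  return `${trimmed} --connect ${sessionId}`;
}

console.log(injectConnect("browse open https://example.com", "sess_123"));
// browse open https://example.com --connect sess_123
```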

5. `SKILL.md`

New "Export to deterministic Playwright" and "Iterative Playwright loop" sections covering when to use loop.mjs vs evaluate.mjs, the sectioned strategy.md format, the codegen helper defaults, and pre-authed sessions via persistent context.

Validation (May 13–14, bizfile.sos.ca.gov LLC formation)

Phase 1 (May 13, customer_demos PR #33): ran the export pipeline by hand against run-004. The emitted script needed 15 hand-edits + an extract patch before it would replay cleanly. Those hand-edits became the source list for the codegen defaults above.

Phase 2 (May 14, this PR): ran the full loop.mjs from scratch.

| Run | Stage | Result |
| --- | --- | --- |
| Loop iter 1 | evaluate | ❌ max_turns at Step 7 (eval-flakiness on Confirm name field cost ~15 turns) |
| Loop iter 1 | Playwright | (skipped — no passing trace) |
| Loop iter 2 | evaluate | ✅ reached Review (Step 9 of 11, run-008) |
| Loop iter 2 | Playwright export | 88 ops, 18 cached, 25 ref_resolved, 8 ref_failed, LLM extract generated |
| Loop iter 2 | Playwright replay | ❌ failed on the issues since fixed below |
| Loop iter 2 | distill-failure | ✅ wrote LLM-summarized addendum to strategy.md |
| Post-loop regen (after this PR's codegen fixes) | Wizard navigation | ✅ all 9 steps, zero hand-edits |
| Post-loop regen | LLM-extract block | ❌ still brittle (tracked as follow-up #1) |

Net result: the wizard-navigation half went from 15 hand-edits → 0. The LLM-extract block is the remaining gap.

Known limitations / follow-ups

LLM-generated extract block remains brittle. The Haiku-generated result-shaping code at the end of every emitted script uses structural locators (page.locator('text="X"').evaluate(...)) that often match multiple elements. The wizard navigation succeeds end-to-end, then the extract throws and success: false is returned. Right fix: harden the extract prompt to insist on per-field try/catch + prefer getByLabel({..., exact: true}). ~30 LOC follow-up.
No feedback when evaluate itself maxes out. The loop currently only distills Playwright failures into strategy.md. When evaluate hits max_turns, there's no addendum and the next iteration repeats whatever caused the flailing. Right fix: a second distillation pathway that reads evaluate's decision log when status is max_turns, identifies the longest-spent step, and writes a Codegen Hint.
strategy.md's "Codegen Hints" section is human-readable only. The codegen doesn't yet parse it for per-task overrides at export time. The new helpers are baked in as defaults that fire on selector/role heuristics. Right fix: structured Codegen Hints DSL the emitter consumes.
Validated on n=1 task. All evidence so far comes from bizfile. State-portal patterns we haven't exercised: date pickers, file uploads, multi-tab flows, iframed forms, captchas mid-flow, Symantec VIP / SAML auth, steppers without a "Next" button. Each may surface a new codegen default. Recommend running this against 1–2 more diverse portals (CA EDD + a DMV-style stepper) before any "generalizes to all 50 agencies" claim.
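For the first limitation above, the "per-field try/catch" shape that a hardened extract prompt could ask for might look like this sketch. The helper name `extractFields` and the `{ result, errors }` return shape are assumptions; in an emitted script each reader would wrap a Playwright call, which is replaced here by plain async functions so the example runs on its own.

```javascript
// Hypothetical sketch: read each output field independently so one bad
// locator yields a null field plus an error note instead of failing the
// whole extract after navigation already succeeded.
async function extractFields(readers) {
  const result = {};
  const errors = {};
  for (const [field, read] of Object.entries(readers)) {
    try {
      result[field] = await read();
    } catch (err) {
      result[field] = null;
      errors[field] = err instanceof Error ? err.message : String(err);
    }
  }
  return { result, errors };
}

// Stand-in readers; a real script would call page locators here.
extractFields({
  entityName: async () => "Acme LLC",
  filingStatus: async () => { throw new Error("strict mode violation"); },
}).then(({ result, errors }) => console.log(result, errors));
```

Whether a partially filled result should count as success is a separate policy decision that this sketch leaves to the caller.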

Try it

cd <your-workspace>
export BROWSERBASE_CONTEXT_ID=<id-of-an-authed-context>
node ~/Desktop/skills/skills/autobrowse/scripts/loop.mjs \
  --task <task-name> \
  --env remote \
  --max-iterations 5 \
  --max-turns-per-iter 100

The loop graduates when Playwright passes in 2 of the last 3 iterations and writes a report to <workspace>/reports/loop-<task>-<timestamp>.md. Sister PR with the bizfile demo workspace + the emitted-then-hand-fixed script: browserbase/customer_demos#33.

🤖 Generated with Claude Code

Note

Medium Risk
New pipeline runs generated Playwright via npx tsx, shells out to bb for Browserbase sessions, and calls Anthropic for extract/failure distillation; mis-generated scripts or env misconfiguration could cause failed replays or unexpected remote browser use, but scope is skill tooling rather than core auth/data paths.

Overview
Adds a deterministic Playwright path on top of autobrowse: passing traces can be turned into runnable TypeScript under tasks/<task>/playwright/, with optional npm install + tsx verification.

export.mjs picks a passing run, maps trace.json browse commands to ops (command-mapping.mjs), resolves session ARIA refs via snapshots into ranked Playwright locators (selector-resolver.mjs), infers a Zod schema from task.md (parse-task.mjs), and emits a script plus selectors.cache.json (codegen-playwright.mjs). The codegen layer includes portal-oriented helpers (forced radio/checkbox, selectWithFallback, link goto fallback, .first() for ambiguous CSS) and an optional Haiku-generated final extract step. loop.mjs alternates evaluate.mjs → export → replay; Playwright failures are summarized into strategy.md (Recent Playwright Failures) and graduation requires 2 of the last 3 replay passes.

SKILL.md documents export, the iterative loop, strategy.md sections, and BROWSERBASE_CONTEXT_ID for persistent-context sessions on evaluate and exported scripts.

Reviewed by Cursor Bugbot for commit 21f6405.

…c verify converge together

Until now the explorer (evaluate.mjs) and the Playwright emitter (export.mjs) were two disconnected stages: explorer converged on "the LLM can finish the task," then export was a one-shot translation. The two objective functions diverge — what unblocks the LLM agent doesn't always unblock a deterministic replay. Demoing this against bizfile.sos.ca.gov surfaced 7+ classes of mismatch (styled-label overlays, autocomplete keystroke interception, transiently-disabled selects) that each cost a hand-fix in the emitted script.

This PR unifies the loop:

Each iteration of `scripts/loop.mjs`:

1. evaluate.mjs → produces trace.json + summary.md
2. If trace passed, export.mjs --no-verify → emits Playwright script
3. npx tsx <task>.ts → actual deterministic replay
4. On Playwright fail, distill-failure.mjs summarizes the error via Claude Haiku into strategy.md's "Recent Playwright Failures" section
5. Next iteration's evaluate reads the updated strategy.md and adapts

Convergence: Playwright passes 2 of last 3 iterations → graduate.

`strategy.md` is the shared intelligence layer between the LLM explorer and the codegen. Three sections (documented in SKILL.md):

- Navigation Heuristics (LLM-facing)
- Codegen Hints (emitter-facing, per-task overrides)
- Recent Playwright Failures (auto-appended by distill-failure)

Also lifts the lessons from the bizfile demo into codegen defaults so future tasks don't repeat the same hand-fixes:

- forceCheck: .check({ force: true }) for checkbox fill_sel ops
- forceClickRadio: .first().click({ force: true }) for radio click ops (detected by selector pattern OR resolved node role)
- selectWithFallback: .selectOption() with a JS-enable + native-setter fallback when the <select> is transiently disabled
- reactFill: helper for inputs where simulated keystrokes get intercepted by autosuggest/autocomplete handlers
- clickButtonByText: eval-find-by-text in page context, avoids the cross-step getByRole race on SPA wizards

Plus: select_dropdown ops with ref-shaped selectors (e.g. `[0-2005]`) now route through the snapshot resolver instead of leaking as invalid CSS.

Files in this PR:

- scripts/loop.mjs NEW — top-level orchestrator
- scripts/export.mjs NEW — trace → Playwright codegen
- scripts/lib/pick-run.mjs NEW — newest-passing-run selector
- scripts/lib/parse-task.mjs NEW — task.md → Zod schema
- scripts/lib/command-mapping.mjs NEW — browse trace → target-agnostic ops
- scripts/lib/selector-resolver.mjs NEW — snapshot+ref → Playwright locators
- scripts/lib/codegen-playwright.mjs NEW — ops → TS with helpers baked in
- scripts/lib/verify.mjs NEW — npm install + tsx run + JSON parse
- scripts/lib/distill-failure.mjs NEW — Playwright stderr → strategy.md addendum
- scripts/evaluate.mjs MODIFIED — BROWSERBASE_CONTEXT_ID passthrough + --max-turns flag
- SKILL.md MODIFIED — documents export, loop, sectioned strategy.md, and the helper defaults baked into codegen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed by loop validation

Loop validation today on bizfile (run-008 mined as the passing trace) reduced the post-codegen hand-edits from yesterday's 15 down to 4 + 1 LLM-extract patch. Each of the 4 navigation-level issues is now baked in as a codegen default, so the next task we point loop.mjs at should start from a much smaller residual.

Fixes landed:

1. clickLinkWithFallback helper (codegen-playwright.mjs)
   - For click_ref ops where the resolved node role is "link", emit clickLinkWithFallback(page, <locator>) instead of plain .click().
   - Helper reads the resolved .href property (not getAttribute, which returns relative URLs). If the link exposes an absolute http(s) href, prefer page.goto over .click — bypasses SPA tour overlays and onClick preventDefault gates that block deterministic replay.
   - Waits for networkidle after navigation (load fires too early on SPAs).
2. .first() default for ambiguous click_sel selectors
   - Added isUniqueSelector() classifier: #id, [id=...], [data-testid=...].
   - For unique selectors, emit .click() as before. For ambiguous ones (e.g. `button[type=button]`), emit .first().click() to avoid Playwright strict-mode violations.
3. exact: true for form-input getByRole emissions (selector-resolver.mjs)
   - Added EXACT_NAME_ROLES set: textbox, searchbox, combobox, spinbutton, listbox. nodeToLocators emits { name, exact: true } for these.
   - Prevents "Limited Liability Company Name" from matching "Confirm Limited Liability Company Name" (real bug from yesterday).
4. snapshot role "select" → ARIA role "combobox" (selector-resolver.mjs)
   - Added SNAPSHOT_TO_ARIA_ROLE map and normalize at top of nodeToLocators.
   - Browse-snapshot reports <select> with role "select" but Playwright's ARIA role is "combobox". Without this mapping, the emitter produced getByRole("select", ...) which is invalid.
   - Also boost getByLabel above getByRole for select-likes (combobox/listbox) since label-based locators tend to be more reliable for form selects.

Validation: Re-exported bizfile-ca-llc from run-008 with these defaults. The emitted script navigates ALL 9 wizard steps without hand-edits (vs. yesterday's hand-fixed playwright-baseline/ which required 7 categories of patches). Only failure is in the LLM-generated extract block at the end (brittle structural locators in result-shaping) — separate concern, tracked as a follow-up. The architectural goal (loop + codegen produces a navigating Playwright script) is met.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
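Fixes 2 to 4 in the commit message above are mostly small classification rules, which can be sketched as follows. The names `isUniqueSelector`, `EXACT_NAME_ROLES`, and `SNAPSHOT_TO_ARIA_ROLE` come from the commit message, but their bodies here are guesses: the regexes, the use of a `Map`, and the helper names `roleLocatorOptions` and `emitClick` are assumptions for illustration.

```javascript
// Assumed contents, based only on what the commit message lists.
const SNAPSHOT_TO_ARIA_ROLE = new Map([["select", "combobox"]]);
const EXACT_NAME_ROLES = new Set(["textbox", "searchbox", "combobox", "spinbutton", "listbox"]);

// Treat #id, [id=...] and [data-testid=...] as unique; everything else
// is considered potentially ambiguous. The exact patterns are a guess.
function isUniqueSelector(selector) {
  const s = selector.trim();
  return /^#[\w-]+$/.test(s) || /^\[(id|data-testid)=[^\]]+\]$/.test(s);
}

// Normalize the snapshot role, then decide whether to require exact names.
function roleLocatorOptions(snapshotRole, name) {
  const role = SNAPSHOT_TO_ARIA_ROLE.get(snapshotRole) ?? snapshotRole;
  const options = EXACT_NAME_ROLES.has(role) ? { name, exact: true } : { name };
  return { role, options };
}

// Emit .first() only when the selector might match several elements.
function emitClick(selector) {
  const locator = `page.locator(${JSON.stringify(selector)})`;
  return isUniqueSelector(selector)
    ? `await ${locator}.click();`
    : `await ${locator}.first().click();`;
}

console.log(emitClick("button[type=button]"));
console.log(roleLocatorOptions("select", "State"));
```

Note that `.first()` trades a strict-mode error for a silent choice of the first match, so it suits cases like the wizard buttons above only when document order is stable.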

aq17 · 2026-05-14T23:33:49Z

Update (May 14): validated end-to-end on bizfile from scratch + landed 4 codegen fixes for the residual hand-edits.

Loop converges through evaluate to a passing trace in 2 iterations (run-008 graduated)
Post-this-PR, the auto-emitted Playwright script navigates all 9 wizard steps with zero hand-edits
Only remaining failure is in the LLM-generated extract block at the end (brittle structural locators) — tracked as follow-up #1 in the updated PR description

Commit c918d2d adds:

clickLinkWithFallback helper for SPA links that don't navigate via .click() (the bizfile dashboard tour-overlay case)
.first() default for ambiguous click_sel selectors (the button[type=button] strict-mode case)
exact: true on form-input getByRole emissions (the "Limited Liability Company Name" matching "Confirm..." case)
Snapshot role "select" → ARIA role "combobox" mapping in the resolver (the getByRole("select") invalid case)

Ready for review. Two named follow-ups in the PR description for when we pick this up after Friday's walkthrough.

shubh24 · 2026-05-27T02:36:36Z

🏗️ Architecture feedback — toward goal-driven codegen

Nice work on this PR — the pipeline is clean and the bizfile validation is solid. But I want to flag a longer-term architecture concern before we harden around the current design.

The core tension

The current pipeline is a trace compiler: mine the LLM's trace → resolve ARIA refs to Playwright locators → emit a deterministic script. This works, but it means the trace is the source of truth for the generated script — and that's where things get complicated.

The LLM's trace includes a lot of incidental decisions. It clicked a button because it saw it first. It used a CSS selector because the snapshot was long. It took an extra step because it got confused. The export pipeline faithfully converts all of this into Playwright — the noise alongside the signal. A human writing Playwright wouldn't replay the journey; they'd look at the goal (fill step 3 of this form) and write the simplest path to it.

This has three downstream consequences:

The codegen helpers are a hand-curated catalog of workarounds. `forceCheck`, `selectWithFallback`, `reactFill`, `clickLinkWithFallback` — each one exists because you ran bizfile, hit a specific failure class, and wrote a helper. The PR body acknowledges this: "each [new site] may surface a new codegen default." That pattern won't scale to 50 state agencies.
The convergence criterion is mechanical, not intelligent. "Playwright passes in 2 of the last 3 iterations" doesn't understand why something passed. A test that passed because a longer timeout absorbed a race condition isn't the same as one that passed because the selectors are robust. An agent could make this judgment; a counter can't.
The feedback loop is indirect. When Playwright fails, the failure gets distilled into strategy.md, the LLM explorer adapts on the next iteration, and the trace gets re-exported. But the codegen doesn't read strategy.md's "Codegen Hints" section (acknowledged in the PR). So the explorer is learning, but the compiler isn't — you're optimizing the input to the compiler rather than the compiler itself.

Proposed architecture: hybrid skeleton + agent codegen

Split the problem into what machines do well (structure) and what agents do well (judgment):

Phase 1 — Mechanical skeleton extraction (keep most of what you have)

Mine the trace into a workflow skeleton, not a Playwright script:

Page-level navigation sequence (goto URL A → fill form → click Next → goto URL B → ...)
Per-page: which fields need filling, which buttons need clicking, what values to use
Don't resolve selectors. Just record the intent: "fill the Company Name field with 'Acme Corp'"

The command-mapping and trace-walking code from this PR is great infrastructure for this. The change is: stop at the intent layer, don't go all the way to Playwright locators.

Phase 2 — Agent writes Playwright from the skeleton + a live session

Give Claude the skeleton + strategy.md + a live Browserbase session. Claude writes Playwright for each step:

It can see the live page, pick its own selectors using its judgment
It runs each step interactively, sees what works
When something fails, it fixes it in-place — no roundtrip through strategy.md
It decides when to use `force: true`, when to use `evaluate()` for React inputs, when to add waits — from the DOM context, not from baked-in helpers

This eliminates the ARIA ref resolution pipeline, the selector ranking heuristics, and the hand-curated helper catalog. Claude is the codegen. The domain knowledge ("state portals use React controlled forms with styled label overlays") lives in strategy.md as prose, and the agent decides when to apply it — rather than encoding it as named functions.

Phase 3 — Agent-driven verification

Instead of "2 of 3 passes," give Claude the script + the last N run results and ask: "Is this production-ready? What's still flaky?" The agent can identify that a timeout was 4900ms on a 5000ms limit (near-miss, not a real pass), or that a selector matched by coincidence. Graduation becomes a judgment call, not a counter.

What this means for this PR

Ship it as-is — the pipeline works, bizfile validates it, and the infrastructure (command-mapping, trace-walking, selectors.cache.json, distill-failure) is valuable regardless of architecture. But I'd treat the current export pipeline as a stepping stone, not the final architecture:

The trace → ops → skeleton extraction is durable infrastructure. Keep investing here.
The ops → Playwright codegen (selector resolution, helper functions, emitOp) is the part that should eventually be replaced by agent-driven codegen from the skeleton.
The loop + distillation machinery is good, but the convergence check should move toward agent judgment.

The end state: the trace trains strategy.md (the existing autobrowse loop), and the agent writes Playwright from the task description + strategy — not from the trace. The trace is training data, not source code. That's the architecture that scales to 50 agencies without a new helper per portal.

Resolves merge conflicts with origin/main (took main's evaluate.mjs, which has the browser-trace integration this branch's loop.mjs / export.mjs sit on top of).

Bugbot findings addressed (PR #108 review comments on c918d2d):

1. loop.mjs runExport stdout pollution
   stdio was ["ignore","inherit","inherit"] — export.mjs's JSON report leaked into loop.mjs's own structured JSON. Pipe stdout, log it to stderr instead, so consumers parsing our stdout get a single object.
2. lib/codegen-playwright.mjs dead helpers
   reactFill / clickButtonByText were declared in every emitted script but emitOp never called them. Removed both; clickLinkWithFallback (which IS emitted) stays.
3. loop.mjs + lib/verify.mjs: JSON parse for pretty-printed output
   `stdout.lastIndexOf("{")` locks onto the deepest inner `{` for nested output schemas (since emitted scripts use JSON.stringify(result, null, 2)). High-severity — the loop could never graduate for tasks with non-flat schemas. Replaced with a brace-balanced backward scan that ignores braces inside JSON strings. Helper factored into lib/pick-run.mjs as extractTrailingJsonObject for reuse.
4. loop.mjs unused `pickRun` import — removed.
5. lib/codegen-playwright.mjs unused `open` variable in checkBalance — removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
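The brace-balanced backward scan described in finding 3 could be written along these lines. This is an independent sketch of the technique, not the code in lib/pick-run.mjs; it reuses the name `extractTrailingJsonObject` from the commit message, while the return-`null`-on-failure behavior is an assumption.

```javascript
// Walk backward from the last "}" and stop at the "{" that balances it,
// ignoring braces inside JSON strings. A quote preceded by an odd number
// of backslashes is an escaped quote inside a string, not a delimiter.
function extractTrailingJsonObject(stdout) {
  const end = stdout.lastIndexOf("}");
  if (end === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = end; i >= 0; i--) {
    const ch = stdout[i];
    if (ch === '"') {
      let backslashes = 0;
      for (let j = i - 1; j >= 0 && stdout[j] === "\\"; j--) backslashes++;
      if (backslashes % 2 === 0) inString = !inString;
      continue;
    }
    if (inString) continue;
    if (ch === "}") depth++;
    else if (ch === "{") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(stdout.slice(i, end + 1));
        } catch {
          return null; // balanced braces, but not valid JSON
        }
      }
    }
  }
  return null; // never found a balancing "{"
}

const out = "installing deps...\n" + JSON.stringify({ success: true, data: { id: 7 } }, null, 2) + "\n";
console.log(extractTrailingJsonObject(out)); // { success: true, data: { id: 7 } }
```

With the earlier `lastIndexOf("{")` approach, the same input would have started parsing at the inner `{ "id": 7 }` object and lost the outer `success` flag.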

… report

Bugbot caught this on 12e7f0a: the mid-loop graduation check used "2 of last 3 iterations passed", but the final report's status, the JSON `graduated` flag, and the process exit code all used the raw total pass-count across all iterations. Two passes separated by many failures could exit 0 and write "graduated" even though the last-3 rule was never satisfied.

Factor the criterion into a single `graduationReached(history)` helper. Use it in:

- the mid-loop break (was: inline last3 check)
- the report's "Final status" line (was: passedCount >= 2)
- the JSON `graduated` field (was: passedCount >= 2)
- the process exit code (was: passedCount >= 2)

passedCount is still computed and shown in the report for visibility, but it no longer drives any outcome flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.


cursor · 2026-06-03T22:57:25Z

+    if (!exportOk) {
+      log(`iter ${iter}: export failed; treating as Playwright fail`);
+      hist.distillReason = "export script returned non-zero";
+      continue;


Export failures skip strategy updates

Medium Severity

When export.mjs exits non-zero, the loop records a distill reason but never appends to strategy.md, unlike Playwright replay failures. Iterations after an export error get no codegen/export hints, so the co-evolution loop can repeat the same export failure.


cursor · 2026-06-03T22:57:26Z

+          } else if (klass === "ref") {
+            ops.push({ kind: "click_ref", ref: normalizeRef(target), ...base });
+          }
+          break;


Unknown click targets silently dropped

Medium Severity

walkTrace only emits click ops when classifySelector returns xpath, css, or ref. Clicks whose target is classified as unknown produce no op and no unhandled entry, so successful trace clicks can be omitted from the generated Playwright script without a TODO or warning.
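One way to address this finding is to route unclassified targets into an explicit unhandled list so the emitter can surface a TODO. The sketch below is hypothetical: `mapClick`, the `unhandled` entry shape, and the use of `click_sel` for both xpath and css targets are assumptions, and the classification result is passed in rather than computed.

```javascript
// Hypothetical sketch: never drop a click silently. Known classes become
// ops; anything else is recorded so codegen can emit a TODO or warning.
function mapClick(target, klass, base, ops, unhandled) {
  if (klass === "xpath" || klass === "css") {
    ops.push({ kind: "click_sel", selector: target, ...base });
  } else if (klass === "ref") {
    ops.push({ kind: "click_ref", ref: target, ...base });
  } else {
    unhandled.push({ command: "click", target, reason: `unclassified target (${klass})`, ...base });
  }
}

const ops = [];
const unhandled = [];
mapClick("[0-2005]", "ref", { step: 3 }, ops, unhandled);
mapClick("Next Step", "unknown", { step: 4 }, ops, unhandled);
console.log(ops.length, unhandled.length); // 1 1
```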


cursor Bot reviewed May 14, 2026

Comment thread skills/autobrowse/scripts/loop.mjs

Comment thread skills/autobrowse/scripts/lib/codegen-playwright.mjs

aq17 changed the title ~~feat(autobrowse): iterative Playwright loop + emitter co-evolved with explorer~~ feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer May 14, 2026

cursor Bot reviewed May 14, 2026

Comment thread skills/autobrowse/scripts/loop.mjs Outdated

Comment thread skills/autobrowse/scripts/loop.mjs Outdated

Comment thread skills/autobrowse/scripts/lib/codegen-playwright.mjs Outdated

aq17 requested review from rcbrowder, shubh24 and ziruihao May 14, 2026 23:46

cursor Bot reviewed Jun 3, 2026

Comment thread skills/autobrowse/scripts/loop.mjs Outdated

cursor Bot reviewed Jun 3, 2026

ziruihao closed this Jun 3, 2026

aq17 wants to merge 4 commits into main from aq/autobrowse-iterative-playwright-loop
