Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 93 additions & 0 deletions skills/autobrowse/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -272,6 +272,99 @@ Write the file `./autobrowse/reports/YYYY-MM-DD-HH-MM-<tasks>.md` with:

---

## Export to deterministic Playwright

Once a task has graduated, you can collapse the LLM-driven replay loop into a single deterministic TypeScript script via the `export` subcommand. The export mines the most recent passing run's `trace.json`, resolves session-scoped ARIA refs against the snapshots they came from, and emits a Playwright script that connects to a fresh Browserbase session (optionally bound to a persistent context).

```bash
# Default — generate and verify against the latest passing run
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name>

# Custom workspace / specific run / skip verification
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --workspace ./autobrowse
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --run run-022
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --no-verify
```

The export writes to `<workspace>/tasks/<task>/playwright/`:

- `<task>.ts` — runnable Playwright script. Connects to Browserbase via `chromium.connectOverCDP` when `BROWSERBASE_CONTEXT_ID` is set; falls back to local Chromium otherwise.
- `selectors.cache.json` — resolved locators + ranked fallbacks per action. Used by future self-healing tooling.
- `package.json`, `tsconfig.json` — minimal scaffold with `playwright`, `zod`, `tsx`, `dotenv`.

How refs are resolved: every `[X-Y]` ref in the trace is looked up against the most recent prior `browse snapshot` containing it. The matched node's role + accessible name are turned into a ranked list of Playwright locator candidates — `getByRole({ name })` first, then `getByLabel` / `getByPlaceholder` for form inputs, then `getByText`, then bare `getByRole`. The best candidate is emitted inline; lower-ranked candidates are saved to `selectors.cache.json` for self-healing.

The final extract step is generated with one Claude Haiku call at export time (requires `ANTHROPIC_API_KEY`). The LLM is given the final snapshot, the Zod schema parsed from `task.md`'s `## Output` block, and the agent's final reasoning. If the API key is missing the export still produces a script — the extract block is a TODO placeholder.

For a Stagehand-targeted export (LLM-driven replay via `stagehand.act`/`observe`), use the standalone `/stagehand-export` skill.

## Iterative Playwright loop (recommended for tasks that need a deterministic artifact)

When the end goal is a runnable Playwright script (cron, Browserbase Functions, etc.), prefer `loop.mjs` over manually orchestrating evaluate + export. The loop converges on a workflow that **both** the LLM explorer **and** the deterministic Playwright replay can complete — which is a strictly stronger guarantee than "the LLM agent's trace ends with success: true."

```bash
node ${CLAUDE_SKILL_DIR}/scripts/loop.mjs --task <task-name> --env remote \
--max-iterations 8 --max-turns-per-iter 60
```

What it does per iteration:

1. Runs `evaluate.mjs` (one LLM-driven exploration round).
2. If the trace passed (`success: true` in the final JSON), runs `export.mjs --target playwright --no-verify` to emit a fresh script.
3. Runs the emitted script (`npx tsx <task>.ts`) against a new BB session — the actual deterministic replay.
4. If the Playwright replay passed → records a pass. If it failed → distills the failure (Claude Haiku, ~$0.01) into a new entry under `strategy.md`'s "Recent Playwright Failures" section.
5. Next iteration's evaluate reads the updated strategy.md and adapts.

**Convergence**: graduates when the emitted script passes in 2 of the last 3 iterations.

### Strategy.md sections

The loop expects (and the distiller maintains) this structure:

```markdown
# <task> Navigation Strategy

## Navigation Heuristics
(prose for the LLM explorer — fast-path URLs, timing notes, step sequences)

## Codegen Hints
(per-task overrides for the Playwright emitter — e.g., "use force:true for all radios on this site")

## Recent Playwright Failures
### Iteration 3 — <one-line>
- **What failed**: ...
- **Likely cause**: ...
- **Fix to try next iteration**: ...
```

The emitter (`codegen-playwright.mjs`) bakes in baseline defaults for the most common state-portal patterns: `forceCheck` for checkbox `fill_sel` ops, `forceClickRadio` for radio click ops, `selectWithFallback` (JS-enable + native setter) for every `select_dropdown`, and a `reactFill` helper for inputs that need to bypass keystroke-by-keystroke event handling.

### When to use `loop.mjs` vs `evaluate.mjs` directly

- **Use `loop.mjs`** when you want a Playwright script as the deliverable. Costs more per iteration (each adds a script export + replay) but converges on something that actually replays in prod.
- **Use `evaluate.mjs`** when you want a `/<task>` skill that future Claude sessions invoke (the original autobrowse flow). Cheaper, doesn't generate a Playwright script.

---

### Pre-authed sessions via persistent context

For tasks that need authentication, create a Browserbase context once, log in interactively, and point autobrowse at it via the env var:

```bash
# One-time: create a context, log into the target site via live-view
bb contexts create --project-id $BROWSERBASE_PROJECT_ID --json

# Then, every autobrowse run for this task reuses the cached cookies/storage
export BROWSERBASE_CONTEXT_ID=<id-from-above>
node ${CLAUDE_SKILL_DIR}/scripts/evaluate.mjs --task <name> --env remote
```

When `BROWSERBASE_CONTEXT_ID` is set with `--env remote`, evaluate.mjs creates one BB session bound to that context before the agent loop, transparently injects `--connect <session-id>` into every browse command the agent issues, and releases the session at exit. The agent's `browse env` / `browse stop` / `browse status` calls become no-ops in this mode. Iterations skip the per-run login dance.

The same env var, when set at runtime for the exported script, makes it attach to the same persisted context.

---

## Rules

- **Only edit `strategy.md`** — never touch `task.md` (unless creating it from the template) or `evaluate.mjs`
Expand Down
88 changes: 82 additions & 6 deletions skills/autobrowse/scripts/evaluate.mjs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ const SKILL_DIR = path.resolve(__dirname, "..");
// ── Config ─────────────────────────────────────────────────────────

const DEFAULT_MODEL = "claude-sonnet-4-6";
const MAX_TURNS = 30;
const DEFAULT_MAX_TURNS = 30;
const MAX_TOKENS = 4096;
const EXEC_TIMEOUT_MS = 30_000;

Expand Down Expand Up @@ -161,6 +161,50 @@ function getNextRunNumber(tracesDir) {

const ALLOWED_COMMAND = "browse";

// ── Managed Browserbase session (BROWSERBASE_CONTEXT_ID passthrough) ──
//
// When BROWSERBASE_CONTEXT_ID is set and --env remote, autobrowse creates
// one persistent BB session before the agent loop and rewrites every
// browse command from the agent to attach to it. This lets the agent
// inherit a pre-authed state (cookies/storage) without having to log in
// during every training iteration. Set via env var (not a CLI flag) so
// callers can run autobrowse for the same task with or without the
// context attached.
let MANAGED_SESSION_ID = null;

function preCreateBrowserbaseSession(ctxId) {
console.error(`[autobrowse] BROWSERBASE_CONTEXT_ID detected — creating BB session bound to context ${ctxId.slice(0, 8)}…`);
try {
const stdout = execFileSync(
"bb",
["sessions", "create", "--context-id", ctxId, "--persist", "--advanced-stealth", "--solve-captchas"],
{ encoding: "utf-8" },
);
const session = JSON.parse(stdout);
if (!session.id) throw new Error(`bb output missing id: ${stdout.slice(0, 200)}`);
console.error(`[autobrowse] Created managed session ${session.id} (context ${ctxId.slice(0, 8)}…)`);
return session.id;
} catch (err) {
console.error(`[autobrowse] FATAL: could not create BB session — ${err.message || err}`);
process.exit(1);
}
}

function releaseManagedSession() {
if (!MANAGED_SESSION_ID) return;
const sid = MANAGED_SESSION_ID;
MANAGED_SESSION_ID = null; // guard against re-entry from signal handlers
try {
execFileSync("bb", ["sessions", "update", sid, "--status", "REQUEST_RELEASE"], { stdio: "ignore" });
console.error(`[autobrowse] Released managed session ${sid}`);
} catch (err) {
console.error(`[autobrowse] Warning: failed to release session ${sid}: ${err.message || err}`);
}
}
process.on("exit", releaseManagedSession);
process.on("SIGINT", () => { releaseManagedSession(); process.exit(130); });
process.on("SIGTERM", () => { releaseManagedSession(); process.exit(143); });

function parseCommand(command) {
const args = [];
let current = "";
Expand Down Expand Up @@ -243,6 +287,23 @@ function parseCommand(command) {
}

function executeCommand(command) {
// If a managed BB session is active, rewrite browse commands to attach to
// it. Session-lifecycle commands become no-ops so the agent's prompt-baked
// `browse env remote` / `browse stop` muscle memory doesn't fight us.
if (MANAGED_SESSION_ID) {
const trimmed = command.trim();
if (/^browse\s+(env|stop|status|start)(\s|$)/.test(trimmed)) {
return {
output: `[managed] no-op — session ${MANAGED_SESSION_ID.slice(0, 8)}… is pre-attached`,
error: false,
duration_ms: 0,
};
}
if (/^browse\s+/.test(trimmed) && !trimmed.includes("--connect")) {
command = trimmed.replace(/^browse\s+/, `browse --connect ${MANAGED_SESSION_ID} `);
}
}

// Security: only allow the browse CLI and execute it without a shell so
// metacharacters are treated as literal arguments instead of extra commands.
const parsed = parseCommand(command);
Expand Down Expand Up @@ -272,15 +333,17 @@ function executeCommand(command) {
}
}

function buildSystemPrompt(strategy, traceDir, browseEnv) {
const envDesc = browseEnv === "remote"
? `Use **remote mode** (Browserbase) — anti-bot stealth, CAPTCHA solving, residential proxies:
function buildSystemPrompt(strategy, traceDir, browseEnv, managedSessionId) {
const envDesc = managedSessionId
? `The browser is **pre-attached** to a managed Browserbase session (id starting ${managedSessionId.slice(0, 8)}…) with a persistent context — cookies/storage from prior sessions are loaded. **Skip session lifecycle commands** (\`browse env\`, \`browse stop\`, \`browse status\`) — they are silently no-ops. Begin with \`browse open <url>\`.`
: browseEnv === "remote"
? `Use **remote mode** (Browserbase) — anti-bot stealth, CAPTCHA solving, residential proxies:
\`\`\`
browse stop
browse env remote
\`\`\`
Always run \`browse stop\` first to kill any existing local session before switching to remote.`
: `Use **local mode** — runs on local Chrome:
: `Use **local mode** — runs on local Chrome:
\`\`\`
browse env local
\`\`\``;
Expand Down Expand Up @@ -391,6 +454,19 @@ async function main() {
}

const browseEnv = getArg("env", "local");
const maxTurnsArg = getArg("max-turns");
const MAX_TURNS = maxTurnsArg ? parseInt(maxTurnsArg, 10) : DEFAULT_MAX_TURNS;
if (!Number.isFinite(MAX_TURNS) || MAX_TURNS < 1) {
console.error(`ERROR: --max-turns must be a positive integer; got "${maxTurnsArg}".`);
process.exit(1);
}

// Pre-create a managed Browserbase session if BROWSERBASE_CONTEXT_ID is set.
// Falls back to the agent driving session setup itself when unset.
if (process.env.BROWSERBASE_CONTEXT_ID && browseEnv === "remote") {
MANAGED_SESSION_ID = preCreateBrowserbaseSession(process.env.BROWSERBASE_CONTEXT_ID);
}

const client = new Anthropic();
const runNumber = getNextRunNumber(tracesDir);
const runId = `run-${String(runNumber).padStart(3, "0")}`;
Expand All @@ -400,7 +476,7 @@ async function main() {

const strategy = fs.readFileSync(strategyFile, "utf-8");
const task = fs.readFileSync(taskFile, "utf-8");
const systemPrompt = buildSystemPrompt(strategy, traceDir, browseEnv);
const systemPrompt = buildSystemPrompt(strategy, traceDir, browseEnv, MANAGED_SESSION_ID);

console.error(`\n${"=".repeat(60)}`);
console.error(` AUTOBROWSE — ${taskName} — Run ${runNumber}`);
Expand Down
Loading