Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 93 additions & 0 deletions skills/autobrowse/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -349,6 +349,99 @@ Write the file `./autobrowse/reports/YYYY-MM-DD-HH-MM-<tasks>.md` with:

---

## Export to deterministic Playwright

Once a task has graduated, you can collapse the LLM-driven replay loop into a single deterministic TypeScript script via the `export` subcommand. The export mines the most recent passing run's `trace.json`, resolves session-scoped ARIA refs against the snapshots they came from, and emits a Playwright script that connects to a fresh Browserbase session (optionally bound to a persistent context).

```bash
# Default — generate and verify against the latest passing run
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name>

# Custom workspace / specific run / skip verification
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --workspace ./autobrowse
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --run run-022
node ${CLAUDE_SKILL_DIR}/scripts/export.mjs --task <task-name> --no-verify
```

The export writes to `<workspace>/tasks/<task>/playwright/`:

- `<task>.ts` — runnable Playwright script. Connects to Browserbase via `chromium.connectOverCDP` when `BROWSERBASE_CONTEXT_ID` is set; falls back to local Chromium otherwise.
- `selectors.cache.json` — resolved locators + ranked fallbacks per action. Used by future self-healing tooling.
- `package.json`, `tsconfig.json` — minimal scaffold with `playwright`, `zod`, `tsx`, `dotenv`.

How refs are resolved: every `[X-Y]` ref in the trace is looked up against the most recent prior `browse snapshot` containing it. The matched node's role + accessible name are turned into a ranked list of Playwright locator candidates — `getByRole({ name })` first, then `getByLabel` / `getByPlaceholder` for form inputs, then `getByText`, then bare `getByRole`. The best candidate is emitted inline; lower-ranked candidates are saved to `selectors.cache.json` for self-healing.

The final extract step is generated with one Claude Haiku call at export time (requires `ANTHROPIC_API_KEY`). The LLM is given the final snapshot, the Zod schema parsed from `task.md`'s `## Output` block, and the agent's final reasoning. If the API key is missing the export still produces a script — the extract block is a TODO placeholder.

For a Stagehand-targeted export (LLM-driven replay via `stagehand.act`/`observe`), use the standalone `/stagehand-export` skill.

## Iterative Playwright loop (recommended for tasks that need a deterministic artifact)

When the end goal is a runnable Playwright script (cron, Browserbase Functions, etc.), prefer `loop.mjs` over manually orchestrating evaluate + export. The loop converges on a workflow that **both** the LLM explorer **and** the deterministic Playwright replay can complete — which is a strictly stronger guarantee than "the LLM agent's trace ends with success: true."

```bash
node ${CLAUDE_SKILL_DIR}/scripts/loop.mjs --task <task-name> --env remote \
--max-iterations 8 --max-turns-per-iter 60
```

What it does per iteration:

1. Runs `evaluate.mjs` (one LLM-driven exploration round).
2. If the trace passed (`success: true` in the final JSON), runs `export.mjs --target playwright --no-verify` to emit a fresh script.
3. Runs the emitted script (`npx tsx <task>.ts`) against a new BB session — the actual deterministic replay.
4. If the Playwright replay passed → records a pass. If it failed → distills the failure (Claude Haiku, ~$0.01) into a new entry under `strategy.md`'s "Recent Playwright Failures" section.
5. Next iteration's evaluate reads the updated strategy.md and adapts.

**Convergence**: graduates when the emitted script passes in 2 of the last 3 iterations.

### Strategy.md sections

The loop expects (and the distiller maintains) this structure:

```markdown
# <task> Navigation Strategy

## Navigation Heuristics
(prose for the LLM explorer — fast-path URLs, timing notes, step sequences)

## Codegen Hints
(per-task overrides for the Playwright emitter — e.g., "use force:true for all radios on this site")

## Recent Playwright Failures
### Iteration 3 — <one-line>
- **What failed**: ...
- **Likely cause**: ...
- **Fix to try next iteration**: ...
```

The emitter (`codegen-playwright.mjs`) bakes in baseline defaults for the most common state-portal patterns: `forceCheck` for checkbox `fill_sel` ops, `forceClickRadio` for radio click ops, `selectWithFallback` (JS-enable + native setter) for every `select_dropdown`, and a `reactFill` helper for inputs that need to bypass keystroke-by-keystroke event handling.

### When to use `loop.mjs` vs `evaluate.mjs` directly

- **Use `loop.mjs`** when you want a Playwright script as the deliverable. Costs more per iteration (each adds a script export + replay) but converges on something that actually replays in prod.
- **Use `evaluate.mjs`** when you want a `/<task>` skill that future Claude sessions invoke (the original autobrowse flow). Cheaper, doesn't generate a Playwright script.

---

### Pre-authed sessions via persistent context

For tasks that need authentication, create a Browserbase context once, log in interactively, and point autobrowse at it via the env var:

```bash
# One-time: create a context, log into the target site via live-view
bb contexts create --project-id $BROWSERBASE_PROJECT_ID --json

# Then, every autobrowse run for this task reuses the cached cookies/storage
export BROWSERBASE_CONTEXT_ID=<id-from-above>
node ${CLAUDE_SKILL_DIR}/scripts/evaluate.mjs --task <name> --env remote
```

When `BROWSERBASE_CONTEXT_ID` is set with `--env remote`, evaluate.mjs creates one BB session bound to that context before the agent loop, transparently injects `--connect <session-id>` into every browse command the agent issues, and releases the session at exit. The agent's `browse env` / `browse stop` / `browse status` calls become no-ops in this mode. Iterations skip the per-run login dance.

The same env var, when set at runtime for the exported script, makes it attach to the same persisted context.

---

## Rules

- **Only edit `strategy.md`** — never touch `task.md` (unless creating it from the template) or `evaluate.mjs`
Expand Down
203 changes: 203 additions & 0 deletions skills/autobrowse/scripts/export.mjs
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
#!/usr/bin/env node

/**
* export.mjs — Translate a graduated autobrowse task into a deterministic
* runnable script.
*
* Currently supports --target playwright. The Stagehand variant lives in
* the standalone /stagehand-export skill; once Playwright is shipped and
* proven we can fold both targets behind this CLI.
*
* Usage:
* node scripts/export.mjs --task <name> --target playwright \\
* [--workspace ./autobrowse] [--run run-NNN] \\
* [--output <dir>] [--no-verify]
*/

import "dotenv/config";
import * as fs from "node:fs";
import * as path from "node:path";

import { pickRun, listRuns } from "./lib/pick-run.mjs";
import { taskToSchema, parseStrategySections } from "./lib/parse-task.mjs";
import { walkTrace } from "./lib/command-mapping.mjs";
import {
generatePlaywrightScript,
playwrightPackageJson,
playwrightTsconfig,
} from "./lib/codegen-playwright.mjs";
import { verifyGenerated } from "./lib/verify.mjs";

// ── CLI args ───────────────────────────────────────────────────────

function getArg(name, fallback) {
const i = process.argv.indexOf(`--${name}`);
return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : fallback;
}
const hasFlag = (n) => process.argv.includes(`--${n}`);

if (hasFlag("help") || hasFlag("h")) {
console.log(`autobrowse export — generate deterministic replay scripts from autobrowse traces

Usage: node scripts/export.mjs --task <name> [options]

Options:
--task <name> Task name — matches tasks/<name>/ (required)
--target <kind> playwright (default; stagehand lives in /stagehand-export)
--workspace <dir> Workspace root holding tasks/ and traces/ (default: ./autobrowse)
--run <id> Force a specific run (default: newest passing)
--output <dir> Output directory for generated files (default: <workspace>/tasks/<name>/<target>)
--no-verify Skip the npm install + tsx run verification step

Env:
ANTHROPIC_API_KEY Used for LLM-generated extract block. If unset, a TODO placeholder is emitted.
BROWSERBASE_* Pass through to the generated script at runtime.

Exit codes: 0 generated+verified, 2 generated but verify failed (or --no-verify), 1 generator error.`);
process.exit(0);
}

const TASK = getArg("task");
const TARGET = getArg("target", "playwright");
const WORKSPACE = path.resolve(getArg("workspace", "autobrowse"));
const FORCED_RUN = getArg("run");
const VERIFY = !hasFlag("no-verify");
const OUTPUT = getArg("output");

if (!TASK) {
console.error("ERROR: --task <name> is required");
console.error("Run with --help for usage.");
process.exit(1);
}
if (TARGET !== "playwright") {
console.error(`ERROR: --target=${TARGET} not yet supported here. Use the /stagehand-export skill for Stagehand output.`);
process.exit(1);
}

// ── Locate sources ────────────────────────────────────────────────

const taskDir = path.join(WORKSPACE, "tasks", TASK);
const tracesDir = path.join(WORKSPACE, "traces", TASK);
const outDir = OUTPUT ? path.resolve(OUTPUT) : path.join(taskDir, TARGET);

const taskFile = path.join(taskDir, "task.md");
const strategyFile = path.join(taskDir, "strategy.md");

for (const [label, file] of [["task.md", taskFile], ["strategy.md", strategyFile]]) {
if (!fs.existsSync(file)) {
console.error(`ERROR: ${label} not found at ${file} — run autobrowse first.`);
process.exit(1);
}
}
if (!fs.existsSync(tracesDir)) {
console.error(`ERROR: no traces at ${tracesDir} — run autobrowse first.`);
process.exit(1);
}

const runId = pickRun(tracesDir, FORCED_RUN);
if (!runId) {
console.error(`ERROR: no passing runs found in ${tracesDir}.`);
console.error("Graduate the task with autobrowse first, or pass --run <id> to force.");
console.error("Available runs:", listRuns(tracesDir).join(", ") || "(none)");
process.exit(1);
}

const runDir = path.join(tracesDir, runId);
const tracePath = path.join(runDir, "trace.json");
if (!fs.existsSync(tracePath)) {
console.error(`ERROR: trace.json missing at ${tracePath}`);
process.exit(1);
}

console.error(`[export] task=${TASK} target=${TARGET} run=${runId} workspace=${WORKSPACE}`);

const trace = JSON.parse(fs.readFileSync(tracePath, "utf-8"));
const taskMd = fs.readFileSync(taskFile, "utf-8");
const strategyMd = fs.readFileSync(strategyFile, "utf-8");

// ── Schema + sections ──────────────────────────────────────────────

const { outputShape, zodSchema, schemaFieldCount } = taskToSchema(taskMd);
const sections = parseStrategySections(strategyMd);
const ops = walkTrace(trace, sections);

// Find the agent's final natural-language summary (for LLM extract grounding).
let finalReasoning = "";
for (let i = trace.length - 1; i >= 0; i--) {
if (trace[i].role === "assistant" && trace[i].reasoning) {
finalReasoning = trace[i].reasoning;
break;
}
}

// ── Generate Playwright script ─────────────────────────────────────

const { scriptCode, cachedActions, stats, extract } = await generatePlaywrightScript({
task: TASK,
runId,
workspace: WORKSPACE,
trace,
ops,
zodSchema,
outputShape,
taskMd,
finalReasoning,
});

// ── Write outputs ──────────────────────────────────────────────────

fs.mkdirSync(outDir, { recursive: true });
const scriptPath = path.join(outDir, `${TASK}.ts`);
const cachePath = path.join(outDir, "selectors.cache.json");
const pkgPath = path.join(outDir, "package.json");
const tsconfigPath = path.join(outDir, "tsconfig.json");

fs.writeFileSync(scriptPath, scriptCode);
fs.writeFileSync(
cachePath,
JSON.stringify(
{
task: TASK,
target: TARGET,
generated_from: { workspace: WORKSPACE, run: runId },
stats,
extract,
actions: cachedActions,
},
null,
2,
),
);
if (!fs.existsSync(pkgPath)) {
fs.writeFileSync(pkgPath, JSON.stringify(playwrightPackageJson(TASK), null, 2));
}
if (!fs.existsSync(tsconfigPath)) {
fs.writeFileSync(tsconfigPath, JSON.stringify(playwrightTsconfig(), null, 2));
}

console.error(`[export] wrote ${path.relative(process.cwd(), scriptPath)}`);
console.error(`[export] ops: ${ops.length} | cached: ${stats.cached} | ref_resolved: ${stats.ref_resolved} | ref_failed: ${stats.ref_failed} | dropped: ${stats.dropped}`);
console.error(`[export] schema fields: ${schemaFieldCount} | extract: ${extract.generated ? "LLM-generated" : `fallback (${extract.reason})`}`);

// ── Verify ─────────────────────────────────────────────────────────

const baseReport = {
task: TASK,
target: TARGET,
run: runId,
script: scriptPath,
cache: cachePath,
stats,
schema_fields: schemaFieldCount,
extract: { generated: extract.generated, reason: extract.reason },
};

if (!VERIFY) {
console.log(JSON.stringify({ ...baseReport, verified: false }, null, 2));
process.exit(0);
}

const v = verifyGenerated(outDir, `${TASK}.ts`);
const report = { ...baseReport, verified: true, passed: v.passed, exit_code: v.exit_code, run_log: v.run_log, output: v.output };
console.log(JSON.stringify(report, null, 2));
process.exit(v.passed ? 0 : 2);
Loading