1 change: 1 addition & 0 deletions README.md
@@ -19,6 +19,7 @@ This plugin includes the following skills (see `skills/` for details):
| [fetch](skills/fetch/SKILL.md) | Fetch HTML or JSON from static pages without a browser session — inspect status codes, headers, follow redirects |
| [search](skills/search/SKILL.md) | Search the web and return structured results (titles, URLs, metadata) without a browser session |
| [ui-test](skills/ui-test/SKILL.md) | AI-powered adversarial UI testing — analyzes git diffs to test changes, or explores the full app to find bugs |
| [browsability](skills/browsability/SKILL.md) | Score how usable a website is by an AI browser agent — Access Resistance (how much stealth/proxy/captcha help is needed), Drivability (do controls survive the accessibility-tree prune, iframe/shadow-DOM traps), and Agent tax (steps over the human baseline); emits a graded report with concrete fixes |

## Installation

1 change: 1 addition & 0 deletions skills/browsability/.gitignore
@@ -0,0 +1 @@
browsability-out/
21 changes: 21 additions & 0 deletions skills/browsability/LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Browserbase, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
103 changes: 103 additions & 0 deletions skills/browsability/SKILL.md
@@ -0,0 +1,103 @@
---
name: browsability
description: "Score how usable a website is BY AN AI BROWSER AGENT — its Browsability Index. Measures how little infrastructure assistance an agent needs to operate the site (Access Resistance), whether the agent can perceive and drive the live DOM (Drivability — does each control survive the accessibility-tree prune, are there iframe/shadow-DOM/deep-DOM traps), and how many more steps the agent needs than a human (Agent tax). Grounded in what the open-source Stagehand framework treats as hard. Use when the user asks how browsable / agent-friendly / agent-ready a website or a specific web flow (signup, checkout, search) is for a BROWSER agent, to compare sites on browser-agent usability, or to produce a browsability report card with concrete fixes. Triggers: 'how browsable is <site>', 'is this site agent-friendly for a browser agent', 'grade this checkout/signup flow for agents', 'browser-agent friendliness', 'DOM friction', 'browsability of <url>'. NOT for SEO/AEO or content discoverability (a different layer), and NOT for docs/SDK onboarding DX (use the agent-experience skill for that)."
license: MIT
metadata:
  author: browserbase
  version: "0.1.0"
allowed-tools: Bash Read Write Edit Glob Grep Agent
compatibility: "Requires `bun` and the browse CLI (`npm install -g @browserbasehq/browse-cli`). Remote mode needs BROWSERBASE_API_KEY. The full agent-ladder pass additionally needs a model-driven reference agent (use the `browser` skill as the driver)."
---

# Browsability — how usable is a site for a browser agent?

Score how well an AI **browser** agent can *operate* a website. The opinion: *browsability is how
little help an agent needs to succeed, and how much harder the site is for an agent than for a human.*
This is the operability layer — not discoverability, so ignore `llms.txt`, sitemaps, SEO/AEO.

**Before scoring, read `references/rubric.md`** — the full code-grounded rubric (axes, signals, the
assistance ladder, the agent-vs-human delta, and remediation knowledge). The summary below is only the
operating procedure.

## The score (0–100)

| Axis | Pts | Source |
|---|---|---|
| **A · Access Resistance** | 30 | lowest assistance rung that completes the task (agent ladder) |
| **B1 · Reachability** | 25 | % of controls that survive the accessibility-tree prune (deterministic probe) |
| **B3 · Structural traps** | 15 | cross-origin iframes, shadow DOM, DOM depth/size (deterministic probe) |
| **C · Agent tax** | 20 | agent steps OVER the human baseline (the delta — not absolute click count) |
| **D · Recoverability** | 10 | self-heal / site errors / blocking overlays / step ceiling (agent run) |

The score counts only tasks that a verifier confirms actually completed. **Agent-native affordance** (an
API / deep-link / structured action path) is a *ceiling badge*, not a scored component: flag it, but do
not add it to the number, because this rubric measures operability of the UI.

## Workflow

### Step 1 — Drivability probe (always; deterministic, no model)

Run the probe on the target URL (a page, or the entry point of a flow):

```bash
cd skills/browsability
bun scripts/friction.ts <url> --out browsability-out
```

This loads the page through the browse CLI and reports **B1 reachability** + **B3 structural traps**
straight from the live DOM (40 of 100 points). It needs no model and finishes in seconds. Use remote
mode (`browse env remote`, needs `BROWSERBASE_API_KEY`) for bot-protected sites; local is fine
otherwise. This alone is a useful friction profile and is the right answer for a quick assessment.

### Step 2 — Agent ladder + tasks (for the full score)

Derive a small set of **canonical tasks** for the site (informational / navigational / transactional —
e.g. "find the price of the paid plan", "create an account", "submit the contact form"). For each
task, run a reference browser agent across the **Access Resistance ladder** and record results:

- **rung 0** vanilla headless — captcha-solving **off** (`solveCaptchas:false`), no proxy, no fingerprint
- **rung 1** default assist — captcha-solving on
- **rung 2** proxy + realistic fingerprint
- **rung 3** advanced stealth + persisted context
- **rung 4** maximum assistance

Stop climbing once a task succeeds; the lowest passing rung is its Access Resistance. Drive the agent
with the `browser` skill (the browse CLI) or Stagehand, and judge each run's `success` with a verifier
— do not trust the agent's self-report. Capture **real step counts** and a **`humanBaselineSteps`**
estimate per task so Agent tax is computed as the delta. Record into `tasks.json`:

```json
{ "url": "https://example.com",
  "tasks": [
    { "name": "Create an account", "type": "transactional", "humanBaselineSteps": 4,
      "runs": [ {"rung":0,"success":false,"steps":10,"model":"<model>","note":"signup CTA unlabeled"},
                {"rung":2,"success":true,"steps":7,"model":"<model>","note":""} ] } ] }
```

If no model-driven agent is available, act as the reference agent using the `browser` skill: execute
each task's browse steps, count the steps, and write the runs into `tasks.json` honestly (mark
single-model). This produces a real, if single-model, result.
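
As a minimal sketch of how the recorded runs map to a task's Access Resistance, the snippet below types the `tasks.json` shape shown above and picks the lowest rung whose run succeeded. The names `Run`, `Task`, and `lowestPassingRung` are illustrative assumptions, not part of `scripts/score.ts`, and `success` is assumed to already hold the verifier's verdict.

```typescript
// Shapes mirror the tasks.json example; the type names are hypothetical.
type Run = { rung: number; success: boolean; steps: number; model: string; note: string };
type Task = { name: string; type: string; humanBaselineSteps: number; runs: Run[] };

// Returns the lowest rung with a successful run, or null if no run passed.
// Assumes `success` is the verifier's verdict, not the agent's self-report.
function lowestPassingRung(task: Task): number | null {
  const passing = task.runs.filter((run) => run.success).map((run) => run.rung);
  return passing.length > 0 ? Math.min(...passing) : null;
}

const exampleTask: Task = {
  name: "Create an account",
  type: "transactional",
  humanBaselineSteps: 4,
  runs: [
    { rung: 0, success: false, steps: 10, model: "<model>", note: "signup CTA unlabeled" },
    { rung: 2, success: true, steps: 7, model: "<model>", note: "" },
  ],
};

console.log(lowestPassingRung(exampleTask)); // 2
```

A `null` result means the task never completed on any recorded rung, so it should not contribute friction or tax scores.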

### Step 3 — Composite score + report

```bash
bun scripts/score.ts --friction browsability-out/friction.json --tasks tasks.json --out browsability-out
```

Writes `browsability-out/browsability.json` with the 0–100 score, grade, and per-axis breakdown. When
`tasks.json` is absent it reports a **Drivability-only** score (B1 + B3, 40 max) and marks A/C/D
pending — still honest, just partial.

### Step 4 — Report to the user

Present a **profile, not just a number**: the grade, the per-axis breakdown, the lowest passing rung,
and — most usefully — a **ranked remediation list** drawn from the rubric's remediation table (e.g.
"signup CTA has no accessible name → add `aria-label`; estimated lift +X"). Cite the concrete signal
each finding came from.

## Notes & gotchas

- `solveCaptchas` defaults to **on** in Browserbase — an honest rung-0 must explicitly disable it, or rungs 0 and 1 collapse and captcha-walled sites get over-credited.
- The deterministic probe only approximates closed shadow DOM: it flags a page that has custom elements but zero open shadow hosts. Treat that as a hint and confirm during the agent run.
- Keep the human baseline honest — Agent tax is the *delta*, so a genuinely long workflow (10 steps for humans too) must not be penalized as un-browsable.
- The scripts call `browse stop` on exit; if a daemon hangs, run `pkill -f "browse.*daemon"`.
153 changes: 153 additions & 0 deletions skills/browsability/references/rubric.md
@@ -0,0 +1,153 @@
# The Browsability Rubric

A code-grounded, operational definition of how usable a website is **by an AI browser agent** —
and how to score it. Grounded in what the open-source [Stagehand](https://github.com/browserbase/stagehand)
browser-automation framework actually treats as hard, plus the public Browserbase session settings.

## The opinion, in one line

**Browsability is how little help an agent needs to succeed** — and, more precisely, **how much
harder the site is for an agent than for a motivated human.**

It is *not* discoverability. Forget `llms.txt`, sitemaps, token efficiency, and SEO/AEO — those
measure whether content can be *found and cited*. Browsability measures whether an agent can
*operate* the live site: perceive the controls, drive the DOM, and complete a real task.

It is measured **operationally** — by running real agent tasks and reading harness + session
telemetry (which controls survived the accessibility tree, how many steps a flow took, which errors
fired, how much stealth/proxy assistance was needed) — not by linting static HTML.

> **Scope note:** this rubric covers *UI operability* — driving a website in a browser. It is the
> sibling of, not a substitute for, auditing docs/SDK onboarding experience.

## The key reframe: score the agent-vs-human delta, not absolute effort

A 10-click checkout that also takes a human 10 clicks is *perfectly browsable* — that's just the
workflow. A 3-click task that takes the agent 10 because controls are unlabeled is *not browsable* —
those extra 7 clicks are the **agent tax**.

Scoring the **delta over the human baseline** mathematically subtracts out UX/design length (which
costs humans and agents equally) and isolates exactly the agent-specific penalty. This resolves the
"is click-count a UX problem or a browsability problem?" question: only the *excess* over the human
path counts.

Stagehand surfaces a piece of this directly — a native `<select>` is a **one-step** action; a custom
dropdown must be clicked open, re-snapshotted, then selected — a **two-step** action. That second
step *is* agent tax: incidental inflation, not essential workflow.

## The scored axes (+ one ceiling badge)

| Axis | What it measures | Weight | In score? |
|---|---|---|---|
| **A · Access Resistance** | How much infrastructure assistance the agent needs to operate at all (the ladder) | 30 | ✅ |
| **B1 · Reachability** | Can the agent perceive the controls (survive the accessibility-tree prune) | 25 | ✅ |
| **B3 · Structural traps** | iframes, shadow DOM, DOM depth/size | 15 | ✅ |
| **C · Agent tax** | Steps *above the human baseline* (incidental inflation only) | 20 | ✅ |
| **D · Recoverability** | What happens when something breaks (self-heal, site errors, blocking overlays, step ceiling) | 10 | ✅ |
| — Essential path length | Inherent workflow steps (humans pay too) | — | ❌ separate "Agent UX" lens |
| — Agent-native affordance | An API / deep-link / structured action path exists | — | ⭐ ceiling badge, not scored |

Agent-native affordance (offering a non-UI path so an agent need not drive the browser at all) is
noted as the *ceiling*, not a scored component — this rubric deliberately measures **operability of
the UI**, the realistic last mile for the large share of the web that is UI-only.

Gate everything on a **success verdict** per task (a verifier, not the agent's self-report): friction
and tax scores only count for tasks confirmed to have actually completed.

---

## Axis A — Access Resistance (the assistance ladder)

Browserbase exposes public session settings, each mitigating a specific site-side obstacle. Re-run
the *same task* climbing the ladder; the **lowest rung at which it succeeds** is the site's Access
Resistance. Lower = more browsable.

| Public setting | Mitigates |
|---|---|
| `solveCaptchas` | CAPTCHA challenges |
| `proxies` | IP blocks, rate limits, geo-gating (residential / geo-targeted) |
| `fingerprint` | headless-browser fingerprint detection |
| `advancedStealth` | advanced anti-bot detection |
| `context` (persist) | re-auth / re-consent walls; session continuity |

The ladder to re-run a task across:

- **L0 Vanilla headless** — captcha-solving **off**, no proxy, no fingerprint, fresh context. The agent looks like raw headless Chrome. *Passing here = maximally browsable.*
- **L1 Default assist** — captcha-solving on, still no proxy/fingerprint.
- **L2 Proxied + realistic fingerprint** — geo proxy + a realistic desktop fingerprint.
- **L3 Advanced stealth + persisted context** — advanced anti-bot mitigation on; cookies persisted.
- **L4 Maximum assistance** — top-tier anti-bot mitigation. *Needing this rung = barely browsable.*

> **Gotcha:** `solveCaptchas` defaults to **on** in Browserbase, so an honest rung-0 baseline must
> explicitly turn it off — otherwise L0 and L1 collapse and captcha-walled sites get over-credited.

**Score:** `A = 30 * (1 - minPassingRung / 4)`.
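
The formula can be sketched as a small function. This is an illustration of the arithmetic only; the function name and the input validation are assumptions, not the behavior of `scripts/score.ts`.

```typescript
const AXIS_A_POINTS = 30; // weight of Axis A in the rubric
const MAX_RUNG = 4;       // L4 is the top of the ladder

// A = 30 * (1 - minPassingRung / 4)
function accessResistanceScore(minPassingRung: number): number {
  if (!Number.isInteger(minPassingRung) || minPassingRung < 0 || minPassingRung > MAX_RUNG) {
    throw new RangeError(`rung must be an integer in 0..${MAX_RUNG}`);
  }
  return AXIS_A_POINTS * (1 - minPassingRung / MAX_RUNG);
}

console.log(accessResistanceScore(0)); // 30
console.log(accessResistanceScore(2)); // 15
console.log(accessResistanceScore(4)); // 0
```

Each rung climbed costs 7.5 points, so a site that only works at L4 earns nothing on this axis even though the task completed.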

---

## Axis B — Drivability (per-step technical difficulty)

### B1 · Element reachability — can the agent even *see* the control?

Stagehand builds an accessibility tree and **prunes any node that has none of the following**: an accessible
name, named children, or a non-structural role. An unlabeled `<div role="generic">` button is removed
*before the model ever sees it.* The survival rule, from the open-source accessibility snapshot:

```js
// keep a node iff:
const keep = !!(name && name.trim()) // it has an accessible name, OR
|| !!(childIds && childIds.length) // it has named children, OR
|| !isStructural(role); // it has a real role (not generic/none/inlinetextbox)
```

- **Signal:** reachable-control ratio = interactive controls that survive the prune ÷ all interactive controls.
- **Penalize:** icon-only buttons with no `aria-label`; `<div onclick>` controls; inputs with no associated `<label>`; closed-shadow custom components.
- **Reward:** native semantic elements (`button`, `a[href]`, `input`, `select`) with text/labels — they always survive.
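
To make the signal concrete, here is a hedged sketch that applies the survival rule to a flat list of nodes and computes the reachable-control ratio. The `AxNode` shape, the `interactive` flag, and the structural-role set are assumptions for illustration (the roles are taken from the comment in the snippet above); the real probe in `scripts/friction.ts` may classify controls differently.

```typescript
// Hypothetical node shape; `interactive` marks nodes the probe treats as controls.
type AxNode = { role: string; name?: string; childIds?: string[]; interactive: boolean };

// Assumption: the structural roles named in the survival-rule comment.
const STRUCTURAL_ROLES = new Set(["generic", "none", "inlinetextbox"]);

// Same three-way test as the survival rule: name, children, or a real role.
function survivesPrune(node: AxNode): boolean {
  const hasName = !!(node.name && node.name.trim());
  const hasChildren = !!(node.childIds && node.childIds.length);
  return hasName || hasChildren || !STRUCTURAL_ROLES.has(node.role.toLowerCase());
}

// Reachable-control ratio = surviving interactive controls / all interactive controls.
function reachableRatio(nodes: AxNode[]): number {
  const controls = nodes.filter((node) => node.interactive);
  if (controls.length === 0) return 1; // assumption: no controls means nothing is unreachable
  return controls.filter(survivesPrune).length / controls.length;
}

const sampleNodes: AxNode[] = [
  { role: "button", name: "Sign up", interactive: true }, // survives: name and real role
  { role: "generic", interactive: true },                 // pruned: unlabeled div "button"
  { role: "generic", name: "Menu", interactive: true },   // survives: has a name
  { role: "link", name: "Pricing", interactive: true },   // survives
  { role: "generic", interactive: false },                // ignored: not a control
];

console.log(reachableRatio(sampleNodes)); // 0.75
```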

### B3 · Structural traps — the hard walls

| Trap | Why it hurts an agent |
|---|---|
| Closed shadow DOM | roots closed before instrumentation are effectively invisible |
| Cross-origin iframes | short-lived, separately-managed frames that can drop out mid-operation |
| Deep DOM (>256 levels) | serialization stack limits force shallower, slower retries |
| Never-settling network | streaming / sub-second polling never reaches "network idle" → timeout every step |
| Virtualized lists | no automatic "scroll until found"; an observe→scroll→observe loop is required |
| Very large DOM | the serialized tree is truncated; elements past the cap become invisible |
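
Two of these traps, deep DOM and very large DOM, reduce to simple tree measurements. The sketch below measures maximum depth and node count over a generic tree; `TreeNode`, `measureTree`, and `isDeepDom` are hypothetical names, and only the 256-level threshold comes from the table above. The size cap for "very large DOM" is not stated in this rubric, so the sketch reports the node count without judging it.

```typescript
// Hypothetical stand-in for a DOM node: a tag plus child nodes.
type TreeNode = { tag: string; children: TreeNode[] };

const DEEP_DOM_LEVELS = 256; // threshold from the trap table

// Iterative traversal, so measuring a very deep tree cannot overflow the call stack.
function measureTree(root: TreeNode): { maxDepth: number; nodeCount: number } {
  let maxDepth = 0;
  let nodeCount = 0;
  const stack: Array<{ node: TreeNode; depth: number }> = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop()!;
    nodeCount++;
    if (depth > maxDepth) maxDepth = depth;
    for (const child of node.children) stack.push({ node: child, depth: depth + 1 });
  }
  return { maxDepth, nodeCount };
}

function isDeepDom(root: TreeNode): boolean {
  return measureTree(root).maxDepth > DEEP_DOM_LEVELS;
}

// Builds a single chain of nested nodes, `levels` deep, for demonstration.
function chain(levels: number): TreeNode {
  const root: TreeNode = { tag: "div", children: [] };
  let current = root;
  for (let i = 1; i < levels; i++) {
    const next: TreeNode = { tag: "div", children: [] };
    current.children.push(next);
    current = next;
  }
  return root;
}

console.log(measureTree(chain(300))); // { maxDepth: 300, nodeCount: 300 }
console.log(isDeepDom(chain(300)));   // true
```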

---

## Axis C — Agent tax (steps over the human baseline)

For each verifier-confirmed task: `agentTax = agentSteps - humanBaselineSteps`. Where a human baseline
is unavailable, approximate the incidental inflation from the **two-step ratio** (custom controls the
framework must expand-then-act on) plus needless modal steps. Only the *excess* counts; essential
workflow length is reported separately as "Agent UX," not scored as browsability.
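
A minimal sketch of the delta, using the two examples from the reframe section. The clamp at zero is an assumption on top of the stated formula, reflecting that only the excess counts; how `scripts/score.ts` converts the tax into the 20 available points is not specified here and is left out.

```typescript
// agentTax = agentSteps - humanBaselineSteps, for a verifier-confirmed run.
// Assumption: an agent that beats the human baseline is treated as zero tax.
function agentTax(agentSteps: number, humanBaselineSteps: number): number {
  return Math.max(0, agentSteps - humanBaselineSteps);
}

// A 10-click checkout that also takes a human 10 clicks: no tax.
console.log(agentTax(10, 10)); // 0

// A 3-click task that takes the agent 10 steps: 7 steps of tax.
console.log(agentTax(10, 3)); // 7
```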

---

## Axis D — Recoverability — what happens when something breaks

Stagehand's error taxonomy cleanly separates *site-caused* friction from agent-caused, and its
self-heal path is the tell: on a stale selector (the DOM mutated under the agent) it re-snapshots and
re-asks the model once. Frequent self-heal = an unstable, hostile DOM.

- **Site-caused errors (penalize):** element-not-visible, selector-resolution failures, element-not-found, captcha timeouts, navigation timeouts.
- **Blocking overlays (penalize):** cookie/consent walls, login walls, paywalls — not auto-dismissed; they eat steps or wall the flow entirely.
- **Max-steps blowout:** agent loops have a default step budget; tasks that exhaust it score as failures.
- **Signal:** self-heal count, site-caused-error count, overlay-encountered flag, whether the run hit the step ceiling.
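
The four signals can be tallied from a run log along these lines. The `RunEvent` shape, the event kinds, and the `maxSteps` parameter are all hypothetical; neither Stagehand nor the browse CLI is claimed to emit events in this format. The sketch only shows the bookkeeping, not how the signals convert into the 10 points.

```typescript
// Hypothetical event log for one agent run.
type RunEvent =
  | { kind: "step" }
  | { kind: "self-heal" }
  | { kind: "site-error"; error: string }
  | { kind: "overlay"; what: string };

type RecoverabilitySignals = {
  selfHealCount: number;
  siteErrorCount: number;
  overlayEncountered: boolean;
  hitStepCeiling: boolean;
};

// Tallies the Axis D signals; `maxSteps` is the agent loop's step budget.
function summarizeRecoverability(events: RunEvent[], maxSteps: number): RecoverabilitySignals {
  let steps = 0;
  let selfHealCount = 0;
  let siteErrorCount = 0;
  let overlayEncountered = false;
  for (const event of events) {
    switch (event.kind) {
      case "step": steps++; break;
      case "self-heal": selfHealCount++; break;
      case "site-error": siteErrorCount++; break;
      case "overlay": overlayEncountered = true; break;
    }
  }
  return { selfHealCount, siteErrorCount, overlayEncountered, hitStepCeiling: steps >= maxSteps };
}

const sampleRun: RunEvent[] = [
  { kind: "step" },
  { kind: "overlay", what: "cookie consent wall" },
  { kind: "step" },
  { kind: "self-heal" },
  { kind: "step" },
  { kind: "site-error", error: "element-not-found" },
];

console.log(summarizeRecoverability(sampleRun, 20));
```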

---

## Remediation knowledge (turn findings into fixes)

| Finding | Fix |
|---|---|
| Low reachable-ratio | add `aria-label` to icon-only controls; use semantic `<button>` / `<a>` |
| Many custom dropdowns | use native `<select>` where possible |
| Cross-origin iframes in the flow | same-origin embed, or a direct route |
| Closed shadow DOM | open shadow roots, or expose semantic fallbacks |
| Deep / very large DOM | flatten nesting, paginate, reduce node count |
| High Access Resistance | reduce hostile bot-walls on agent-relevant flows |
| High agent tax | collapse the funnel; remove needless modal steps |
| (ceiling) UI-only | offer an API / deep-link / structured action path for agents |