software factory: explore running it only using Claude Code + Skills + Boxel CLI #4843

Draft
jurgenwerk wants to merge 20 commits into main from cs-11149-create-a-set-of-skills-extracted-from-the-orchestrator

Conversation

Contributor

@jurgenwerk jurgenwerk commented May 15, 2026

Why we're doing this

Today, running the software factory means running pnpm factory:go, which spins up a Node process that drives the loop via the Claude Agent SDK. Like an interactive Claude Code session, it bills against your Claude subscription.

Starting June 15, 2026, Anthropic is carving Agent SDK usage out of the regular subscription pool — SDK and claude -p invocations on subscription plans will draw from a separate, smaller monthly Agent SDK credit, while interactive Claude Code sessions stay uncapped. So the cheapest, most predictable place to run long agentic loops going forward is inside an interactive session you already have open.

This PR teaches the software factory to run that way: you paste one prompt into Claude Code, and the agent does the whole run — bootstrap, implement each Issue, validate, mark done, repeat until the project is complete. No SDK process, no separate credit pool.

The existing SDK orchestrator (issue-loop.ts, factory-agent/, etc.) is not deleted in this PR. The two paths run side by side. We'll retire the orchestrator in a follow-up PR once the interactive path has been used in practice for a while.

How to try it

Quick one-time setup:

# Build the dev `boxel` CLI and put it on your PATH
cd /path/to/boxel/packages/boxel-cli
pnpm build
ln -sf "$(pwd)/bin/boxel.js" ~/.local/bin/boxel
boxel --version

# Build the host app (needed by \`boxel test\`)
pnpm --filter @cardstack/host build

# Install Playwright's headless Chromium (also \`boxel test\`)
npx playwright install chromium

Then run a factory:

unset ANTHROPIC_API_KEY        # in case it's set — ensures Claude Code uses your subscription
mise run dev-all               # in another terminal — realm server, host, etc.

cd /path/to/boxel/packages/software-factory
claude                         # open Claude Code from here so the skills are picked up

In the Claude Code session, paste:

> ❯ Run the software factory on http://localhost:4201/software-factory/Wiki/gradebook into a fresh target realm at http://localhost:4201/user/factory-test-gradebook-1/. Use ./factory-test-gradebook-1/ as the workspace. Follow docs/runbook.md.                                                                                                                                                          

The agent bootstraps the realm, works each implementation Issue end-to-end, runs all five validators per Issue (with bail-out limits so it doesn't spiral), writes audit-trail cards under `Validations/`, and flips the project to `completed`. Open the target realm in the Boxel host UI to see what got built.

Want to try a different brief? Substitute `sticky-note`, `recipe`, or any of the other briefs in the source realm for `gradebook` in the prompt.

The runbook (`packages/software-factory/docs/runbook.md`) is the canonical reference — the snippet above is just the TL;DR.

What's in the PR

3 new `boxel-cli` commands so the agent can run the validators from its Bash surface:

  • `boxel lint` — ESLint + Prettier via the realm's existing `_lint` endpoint
  • `boxel parse` — glint (`ember-tsc`) type-checking
  • `boxel test` — drives headless Chromium against the host app's compiled test bundle

The other two validators (`evaluate-module`, `instantiate-card`) and the schema fetcher (`get-card-type-schema`) were already host commands reachable via `boxel run-command`.

3 vendor-neutral skills the agent loads when it enters a software-factory session:

  • `software-factory-bootstrap` — read the brief, create the target realm + Project + IssueTracker + Knowledge Articles + implementation Issues
  • `software-factory-scheduling` — pick the next unblocked Issue, status-transition lifecycle, bail-out rules
  • `software-factory-operations` — write the card code + tests + sample instances + Catalog Spec, run validators, persist audit-trail cards, mark Issues done

A runbook at `packages/software-factory/docs/runbook.md` documenting the user-facing prompt, prerequisites, and what to expect in the target realm.

The agent owns the whole loop: bootstrap, pick next Issue, implement, validate, persist audit cards, mark done — repeat until the backlog is empty. The validator loop has bail-out limits (8 iterations per Issue, or 3 identical consecutive failures, or 5 distinct fix attempts without a pass) so it doesn't spiral on a brief it can't satisfy. If an Issue genuinely can't be made green, it ends up `blocked` with a comment explaining what was tried, and the agent moves on.

What's deliberately not in this PR

  • Retiring the SDK orchestrator. Follow-up PR once the interactive path has been used in practice.
  • `boxel parse` / `boxel test` outside the monorepo. Both need access to `packages/host/dist` (for the test bundle) and the monorepo's `packages/base` etc. (for type-checking paths). Anyone outside the monorepo can still use the other three validators. The proper fix is a realm-server `_parse` endpoint mirroring `_lint`; tracked as follow-up.
  • A `/factory-run` slash command wrapping the prompt so the user doesn't paste prose every time. Also follow-up.

Validated against

End-to-end clean runs (single-prompt, no manual intervention) against:

  • `software-factory/Wiki/sticky-note` — simple single-card brief
  • `software-factory/Wiki/recipe` — richer brief with nested fields and multiple sample instances

Both produced a populated target realm with the full set of expected artifacts including the `Validations/` folder.

How to review

The skill text under `packages/software-factory/.agents/skills/` is the biggest surface — those three Markdown files are what the agent reads to drive the loop, so they're worth a careful read. The runbook and the three new CLI commands are smaller and self-contained.

🤖 Generated with Claude Code

jurgenwerk and others added 8 commits May 15, 2026 10:09
Documents the prompt sequence a user runs in a subscription-billed
Claude Code session to drive a full factory run without the SDK
orchestrator. Tracks the gaps that must be closed first: missing
`boxel lint` / `parse` / `test` CLI commands, skill rewrites to drop
factory-MCP-tool references, a new scheduling skill encoding the
issue-pickup rules currently in `issue-scheduler.ts`. CS-11149.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lints every lintable (.gts/.gjs/.ts/.js) file in a realm via the realm
`_lint` endpoint, or a single file when a realm-relative path is
passed. Aggregates per-file violations into a single summary; exits
non-zero on any error-severity violation.

This is the realm-wide companion to the existing single-file
`boxel file lint <path>` command. Closes the first gap from the
Phase 1 runbook (CS-11149): the software factory's
`runLintInMemory` validator becomes reachable from an interactive
Claude Code session via Bash.

`software-factory` keeps its in-process `runLintInMemory` for now;
both coexist during the migration window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the factory's glint runner + JSON document validator from
`packages/software-factory/src/parse-execution.ts` into a top-level
`boxel parse` command in boxel-cli. Behavior matches the existing
factory tool:

- Without a path: discovers every `.gts` / `.gjs` / `.ts` in the
  realm plus every `.json` file linked as a `Spec.linkedExamples`,
  runs glint (`ember-tsc`) over the GTS batch in a temp dir with
  monorepo-aware tsconfig paths, and validates the document
  structure of each JSON example.
- With a path: parses just that single file (GTS → glint, JSON →
  document validation).

Path resolution is anchored on this file's `__dirname`, so the
command requires the Boxel monorepo layout — `packages/base`,
`packages/host`, `packages/boxel-ui`, and `@glint/ember-tsc` (added
as a boxel-cli devDependency) must all be resolvable. This is a
factory-developer tool, not an end-user CLI feature.

The factory keeps its own copy of `parse-execution.ts` for now;
both coexist during the CS-11149 migration window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the factory's in-memory QUnit runner (`runTestsInMemory`) from
`packages/software-factory/src/test-run-execution.ts` into a
top-level `boxel test` command in boxel-cli. The runner:

- Discovers every `*.test.gts` file in the realm.
- Locates the host app's compiled `dist/` (env override, sibling
  packages/host, or the root checkout when in a git worktree).
- Spins up a tiny HTTP server that serves the host's test bundles
  + a synthesized QUnit harness page with live-test enabled.
- Drives a headless Chromium against that page with the realm URL
  in the query string; injects the per-realm JWT (if the active
  profile has one) via `page.route()` so private realms can be
  reached.
- Collects per-test QUnit results via `QUnit.on('testEnd' / 'runEnd')`
  hooks and aggregates them into pass/fail/skip counts + per-failure
  details.
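
The aggregation in the last bullet could look roughly like the sketch below. It is an illustration only, not the code in `test-run-execution.ts`: the `TestEndEvent` shape is modelled on the fields QUnit reports for `testEnd` (`fullName`, `status`, `errors`) and trimmed to what the summary needs, while `summarize` and `TestSummary` are hypothetical names.

```typescript
// Assumed shape: a trimmed-down version of QUnit's `testEnd` event data.
interface TestEndEvent {
  fullName: string[];
  status: 'passed' | 'failed' | 'skipped' | 'todo';
  errors: { message?: string }[];
}

// Hypothetical summary type for this sketch.
interface TestSummary {
  passed: number;
  failed: number;
  skipped: number;
  failures: { test: string; messages: string[] }[];
}

function summarize(events: TestEndEvent[]): TestSummary {
  const summary: TestSummary = { passed: 0, failed: 0, skipped: 0, failures: [] };
  for (const event of events) {
    if (event.status === 'failed') {
      summary.failed += 1;
      // Keep enough detail for the agent to act on the failure.
      summary.failures.push({
        test: event.fullName.join(' > '),
        messages: event.errors.map((e) => e.message ?? '(no message)'),
      });
    } else if (event.status === 'skipped') {
      summary.skipped += 1;
    } else if (event.status === 'passed') {
      summary.passed += 1;
    }
    // 'todo' tests are left uncounted in this sketch.
  }
  return summary;
}
```

How `todo` tests should be counted is left open here; the real runner may treat them differently.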

Unlike the factory's `executeTestRunFromRealm`, this command does
NOT create or update a TestRun card — results are returned in-memory
only. Card persistence is the agent's responsibility in the new
Phase 1 flow.

`@playwright/test` is added as a boxel-cli devDependency. The
`findHostDistPackageDir` discovery helper is inlined from
`@cardstack/realm-test-harness/host-dist` to avoid pulling the
harness in as a dependency. Like `boxel parse`, this is a
monorepo-only command and not usable from the published CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three validator CLI commits (`boxel lint`, `boxel parse`,
`boxel test`) closed the first chunk of the runbook's gaps section.
This commit brings the runbook in line with what was actually built:

- Capability table no longer marks the three commands as missing.
- Gaps section is rewritten to document them as landed, with the
  monorepo-only caveats for parse and test (host dist + glint paths
  are resolved relative to the CLI), and the deferred realm-server
  `_parse` endpoint mentioned as a separate Phase 2 ticket.

The remaining gaps section (skill rewrites) is unchanged — that is
the next slice of work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New skill that captures the loop-control logic the orchestrator used
to own — picking the next unblocked Issue from a target realm,
transitioning its status across the lifecycle (`backlog →
in_progress → done | blocked`), pushing between flips, and bailing
out cleanly when the backlog is exhausted.

Lifts the rules from `src/issue-scheduler.ts` (eligibility filter +
ordering) and the status-flip choreography from `src/issue-loop.ts`
into agent-readable markdown. Also documents how to discover the
tracker module URL — previously injected into the agent's system
prompt by the orchestrator's `inferDarkfactoryModuleUrl` — and how
to assemble per-issue context from the issue's relationships
(`project`, `relatedKnowledge`), replacing
`factory-context-builder.ts`.

Companion to the existing `software-factory-bootstrap` and
`software-factory-operations` skills; covers "which issue do I work
next and how do I record progress" while those cover "what do I do
inside the issue I picked." CS-11149.
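
As a rough illustration of the lifecycle named above (`backlog → in_progress → done | blocked`), the allowed flips can be written as a small transition table. This is a sketch, not the skill's text: `canTransition` is a hypothetical helper, and treating `done` and `blocked` as terminal for the agent is an assumption.

```typescript
type IssueStatus = 'backlog' | 'in_progress' | 'done' | 'blocked';

// Assumption: `done` and `blocked` are terminal from the agent's point of view.
const ALLOWED_TRANSITIONS: Record<IssueStatus, IssueStatus[]> = {
  backlog: ['in_progress'],
  in_progress: ['done', 'blocked'],
  done: [],
  blocked: [],
};

// Hypothetical helper: true when the status flip follows the lifecycle.
function canTransition(from: IssueStatus, to: IssueStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

For example, `canTransition('backlog', 'done')` is `false` under this table, since an Issue has to pass through `in_progress` first.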

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the factory-MCP-tool references throughout with the
corresponding `boxel` CLI invocations now that the agent drives the
loop without an orchestrator:

- `run_lint` → `boxel lint [path] --realm <url>`
- `run_parse` → `boxel parse [path] --realm <url>`
- `run_evaluate` → `boxel run-command evaluate-module --realm <url> --input '{"path": "..."}'`
- `run_instantiate` → `boxel run-command instantiate-card --realm <url> --input '{"path": "..."}'`
- `run_tests` → `boxel test --realm <url>`
- `get_card_schema` → `boxel run-command get-card-type-schema --realm <url> --input '{"module":"...","name":"..."}'`

Drops the `Control Flow` section (`signal_done` /
`request_clarification` no longer exist as tools — they are replaced
by status flips on the Issue card, documented in the new
`software-factory-scheduling` skill).

Removes the "never set status to done" rule — the agent now owns
the full status lifecycle. The `description`-is-immutable rule and
the read-before-write rule stay (they're agent-side discipline, not
orchestrator-enforced).

Adds explicit "push the workspace before running validators"
guidance — the new CLI validators all read from the realm, so a
sync step is required between workspace writes and validation.

Updates the Required Flow section to reflect the new control loop:
write → push → validate → fix → push → mark done. Adds cross-links
to `software-factory-scheduling` for status transitions and to the
boxel-cli plugin skills (boxel-api, boxel-command, realm-sync) for
the underlying CLI surfaces. CS-11149.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the bootstrap skill to match the new interactive Claude
Code flow now that the agent drives bootstrap directly:

- Replace every `get_card_schema({module,name})` reference with
  the equivalent `boxel run-command get-card-type-schema` Bash
  invocation.
- Add a "Creating the target realm" section — previously the
  orchestrator created the realm via the entrypoint before the
  agent ran; the agent now does this itself via
  `boxel run-command create-realm` (or the realm-sync skill's
  sugar variant).
- Add a "Discover the tracker module URL" section — previously
  injected into the agent's system prompt as
  `darkfactoryModuleUrl`; the agent now constructs it from the
  target realm's origin (`<origin>/software-factory/darkfactory`)
  and confirms by introspecting one of its exports.
- Drop the `signal_done()` reference from the Completion section
  and replace with explicit `boxel push` + status flip on the
  bootstrap Issue, cross-linking to the
  `software-factory-scheduling` skill for the status-transition
  rules.
- Add cross-references to the new
  `software-factory-scheduling` and existing
  `software-factory-operations` skills so the agent knows where
  to go after bootstrap completes.
- Use `boxel push` / `boxel pull` explicitly for sync; reference
  the `realm-sync` skill for the canonical command. CS-11149.
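
The tracker-module derivation described above (`<origin>/software-factory/darkfactory`) is small enough to sketch. `trackerModuleUrl` is a hypothetical name; only the URL pattern comes from this commit, and the confirmation step (introspecting an export) is not shown.

```typescript
// Hypothetical helper: derive the tracker module URL from the target realm's origin.
function trackerModuleUrl(targetRealmUrl: string): string {
  const origin = new URL(targetRealmUrl).origin;
  return `${origin}/software-factory/darkfactory`;
}
```

With the target realm from the PR description, `trackerModuleUrl('http://localhost:4201/user/factory-test-gradebook-1/')` yields `http://localhost:4201/software-factory/darkfactory`.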

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29c7c40c24

"@cardstack/postgres": "workspace:*",
"@cardstack/runtime-common": "workspace:*",
"@glint/ember-tsc": "catalog:",
"@playwright/test": "catalog:",

[P1] Move Playwright to runtime deps or lazy-load the test command

`@playwright/test` is added under `devDependencies`, but the CLI eagerly imports `./commands/test` during startup, and that module has a top-level `import { chromium } from '@playwright/test'`. In normal npm/global installs, dev dependencies are not installed, so even unrelated commands (for example `boxel --help` or `boxel profile list`) will fail at process start with a module-resolution error before argument parsing. This needs either a runtime dependency or a deferred import inside the test command path.
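
One possible shape for the deferred-import option, as a sketch under assumptions: `lazyCommand` is a hypothetical helper, not something boxel-cli provides, and the stand-in loader below takes the place of a real `() => import('@playwright/test')` so the example stays self-contained.

```typescript
// Hypothetical helper: wraps a command so its heavy module loads only on first use.
function lazyCommand<M>(
  load: () => Promise<M>,
  run: (mod: M, args: string[]) => Promise<number>,
): (args: string[]) => Promise<number> {
  let cached: Promise<M> | undefined;
  return async (args: string[]) => {
    if (!cached) cached = load(); // the first invocation triggers the import
    return run(await cached, args);
  };
}

// Stand-in module for illustration; in the CLI the loader would be a
// dynamic import of the real dependency.
let loads = 0;
const testCommand = lazyCommand(
  async () => {
    loads += 1;
    return { launch: async () => 'browser' };
  },
  async (mod, args) => {
    await mod.launch();
    return args.length;
  },
);
```

Because nothing is loaded until `testCommand` is called, commands such as `boxel --help` would never touch the optional dependency.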

| Realm creation | `boxel run-command create-realm` (or `boxel realm create`) |
| Workspace pull / push | `boxel pull` / `boxel push` (realm-sync skill) |
| Federated search | `boxel search --realm <url> --query '<json>'` (boxel-api skill) |
| Card-type schema | `boxel run-command get-card-type-schema --realm <url> --input '{module,name}'` |

[P1] Replace invalid run-command shorthand in factory runbook

The documented invocation `boxel run-command get-card-type-schema ...` is not a valid `run-command` specifier for this CLI; `run-command` expects a full command module reference (for example `@cardstack/boxel-host/commands/get-card-type-schema/default`). Following the runbook as written will error immediately, which blocks the bootstrap and per-issue validation flows that depend on these calls. Update the runbook and skills to use valid specifiers (or a supported alternative command).

Contributor

github-actions Bot commented May 15, 2026

Host Test Results

    1 files      1 suites   1h 41m 11s ⏱️
2 659 tests 2 644 ✅ 15 💤 0 ❌
2 678 runs  2 663 ✅ 15 💤 0 ❌

Results for commit 2e31989.

Realm Server Test Results

    1 files  ±    0      1 suites  +1   9m 20s ⏱️ + 9m 20s
1 377 tests +1 377  1 377 ✅ +1 377  0 💤 ±0  0 ❌ ±0 
1 458 runs  +1 458  1 458 ✅ +1 458  0 💤 ±0  0 ❌ ±0 

Results for commit 2e31989. ± Comparison against earlier commit 622dc9f.

jurgenwerk and others added 2 commits May 15, 2026 11:25
Three issues surfaced when running the Phase 1 prompt sequence
against the sticky-note brief in a fresh Claude Code session.

1. **`get-card-type-schema` invocation form was wrong.** The skills
   said `boxel run-command get-card-type-schema --input '{"module":
   "...", "name": "..."}'`. The realm-server actually wants the
   fully-qualified command specifier
   `@cardstack/boxel-host/commands/get-card-type-schema/default`
   and a `codeRef` wrapper around the module+name:

       boxel run-command @cardstack/boxel-host/commands/get-card-type-schema/default \
         --realm <url> --input '{"codeRef":{"module":"...","name":"..."}}'

   Matches the canonical form documented in the `boxel-command`
   plugin skill. Updated everywhere the new skills + runbook show
   the command, but kept short prose references ("the schema you
   fetched with `get-card-type-schema`") since those are
   informational, not copy-pasteable.

2. **Project does not carry a `board` relationship.** The
   bootstrap skill instructed populating
   `relationships.board → ../Boards/<slug>` on the Project card.
   Live schema introspection shows Project has no `board` field;
   the link is one-way IssueTracker → Project via the board's own
   `project` relationship. Removed the Project-side link from the
   bootstrap guidance and added an explicit note about the one-way
   direction.

3. **Dev `boxel` CLI is not on PATH during Phase 1.** The test
   agent had to discover the dev binary at
   `<monorepo>/packages/boxel-cli/bin/boxel.js`, work around a
   stale `dist/`, and symlink onto PATH itself. That work
   shouldn't be left to improvisation. Added a "First: verify the
   `boxel` CLI works" section at the top of the bootstrap skill
   documenting the dev-binary location, the two safe ways to use
   it (direct `node` invocation or PATH symlink), and the
   stale-dist gotcha. Mirrored the setup advice in the runbook
   prerequisites. Phase 1 only — Phase 2 packaging will retire
   this wrinkle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the skills said "verify `boxel --version` works; if it
doesn't, fall back to the dev binary." That left the agent guessing
when the test session found `boxel` missing. Make the dev wiring
the default and unconditional path during Phase 1:

- `software-factory-scheduling` now opens with an idempotent
  wiring block that derives the monorepo root from
  `git rev-parse --show-toplevel`, renames any stale `dist/` so
  the bin shim falls back to ts-node, symlinks
  `packages/boxel-cli/bin/boxel.js` into `~/.local/bin/boxel`, and
  prepends `~/.local/bin` to PATH. Verified with `boxel --version`
  + a check that `boxel --help` lists `lint` / `parse` / `test`.
- `software-factory-bootstrap` carries the same wiring block (it
  is the first skill loaded when working a bootstrap issue
  directly, before scheduling). Also fixes the realm-creation
  example: native subcommand `boxel realm create <endpoint>
  "<display name>"`, not `boxel run-command create-realm`.
- `software-factory-operations` adds a brief prerequisite
  pointing back to scheduling/bootstrap for the wiring block,
  since every command in the body uses `boxel`.

Phase 1 workaround — when boxel-cli ships properly the wiring
disappears and `boxel` is just installed globally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk marked this pull request as draft May 15, 2026 09:37
Second-test-run fix-up. The skills said `boxel pull --realm <url>`
and `boxel push --realm <url>` as if those were top-level
commands; they aren't. The actual surface is under the `realm`
subcommand with positional arguments:

- `boxel realm pull <realm-url> <local-dir>`
- `boxel realm push <local-dir> <realm-url>`
- `boxel realm sync <local-dir> <realm-url>` (with optional
  --prefer-local / --prefer-remote / --prefer-newest for conflict
  resolution)

Updated every occurrence across the bootstrap, operations, and
scheduling skills plus the runbook capability table. Added a
concrete example next to each so the agent doesn't have to
reverse-engineer the argument order. Also updated the "see also"
references to point at the `realm-sync` plugin skill (which
documents the full surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk changed the title from "CS-11149: Phase 1 — validator CLIs + skills for interactive factory runs" to "software factory: explore running it from Claude Code as an orchestrator" May 15, 2026
jurgenwerk and others added 9 commits May 15, 2026 11:41
Third-test-run fix-up. The agent passed
`boxel realm create user/factory-test-stickynote-2 "..."` and got
`Error: realm name must contain only lowercase letters, numbers,
and hyphens` (regex `^[a-z0-9-]+$`). The legacy-paths warning that
also printed was unrelated noise — not the actual failure mode —
which the agent then chased on its next attempt.

The skill said `<realm-endpoint>` was a path segment like
`user/my-realm`. That's wrong: the argument is the realm's slug
only, and the user-namespace prefix is added by the realm server
automatically based on the active profile. Updated the
"Creating the target realm" section to:

- Call the argument `<realm-name>` (matching the CLI's own help
  text).
- State the regex constraint explicitly.
- Show that the agent derives the slug from the target realm
  URL's final path segment.
- Explicitly tell the agent to ignore the legacy-local-realm-dirs
  warning unless the command itself exits non-zero.
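
The slug derivation and the regex constraint from the bullets above can be sketched as follows. `realmNameFromUrl` is a hypothetical helper; the pattern `^[a-z0-9-]+$` is the one quoted in the CLI error message.

```typescript
// The constraint quoted in the CLI's error message.
const REALM_NAME_PATTERN = /^[a-z0-9-]+$/;

// Hypothetical helper: take the final path segment of the target realm URL
// and check it against the CLI's realm-name constraint.
function realmNameFromUrl(targetRealmUrl: string): string {
  const segments = new URL(targetRealmUrl).pathname
    .split('/')
    .filter((segment) => segment.length > 0);
  const name = segments[segments.length - 1];
  if (!name || !REALM_NAME_PATTERN.test(name)) {
    throw new Error(`cannot derive a valid realm name from ${targetRealmUrl}`);
  }
  return name;
}
```

For `http://localhost:4201/user/factory-test-stickynote-2/` this returns `factory-test-stickynote-2`, which is the value to pass as `<realm-name>` rather than `user/factory-test-stickynote-2`.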

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fourth-test-run fix-up. The agent ran the full prompt sequence and
flagged three real bugs in the skill text that I'd glossed over,
plus a gap in the runbook prerequisites.

1. **evaluate-module input shape** — skill said
   `--input '{"path": "..."}'`. The real fields are
   `{moduleIdentifier, realmIdentifier}`, both absolute URLs. Also
   the command spec must be fully qualified
   (`@cardstack/boxel-host/commands/evaluate-module/default`). Updated
   the operations skill's validator section with a worked example
   using `jq -nc` to build the input JSON; also updated the runbook
   capability table and the Required Flow shorthand.

2. **instantiate-card input shape** — skill said
   `--input '{"path": "..."}'`. The real fields are
   `{moduleIdentifier, cardName, realmIdentifier, instanceData?}`,
   and `instanceData` (when provided) must be a JSON string whose
   `data.meta.adoptsFrom.module` already matches `moduleIdentifier`
   exactly — relative paths like `../sticky-note` are rejected.
   Updated the validator section with a worked example that uses
   `jq` to rewrite the workspace JSON to absolute URLs before
   sending. Same updates to the runbook table and Required Flow.

3. **`loader.import` typing in `.test.gts`** — the example test
   used `let { StickyNote } = await loader.import(cardModuleUrl);`
   which destructures from a `{}` return type and makes
   `boxel parse` fail with a type error on every test file. Fixed
   the example to `loader.import<typeof import('./sticky-note')>(cardModuleUrl)`
   and added an explicit "Why" callout so future agents don't
   re-introduce the bug.

4. **Runbook prerequisites** — added the two one-time setup steps
   that have to happen before `boxel test` works: building the
   host app dist (`pnpm --filter @cardstack/host build`) and
   installing Playwright's chromium binary
   (`npx playwright install chromium`).
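
For reference, the two corrected input shapes from items 1 and 2 can be sketched in TypeScript. The skill itself uses `jq`; this is only an illustration of the same rewrite. The field names come from this commit, while the helper names, the `CardDocument` type, and resolving the relative module path against the instance URL are assumptions.

```typescript
// Assumed minimal shape of a card instance document in the workspace.
interface CardDocument {
  data: {
    attributes?: Record<string, unknown>;
    meta: { adoptsFrom: { module: string; name: string } };
  };
}

// Input for evaluate-module: both fields are absolute URLs.
function evaluateModuleInput(moduleUrl: string, realmUrl: string) {
  return { moduleIdentifier: moduleUrl, realmIdentifier: realmUrl };
}

// Input for instantiate-card: rewrite the relative adoptsFrom.module to an
// absolute URL so it matches moduleIdentifier exactly, then send the
// document as a JSON string.
function instantiateCardInput(doc: CardDocument, instanceUrl: string, realmUrl: string) {
  const adoptsFrom = doc.data.meta.adoptsFrom;
  // Assumption: relative module paths resolve against the instance's own URL.
  const moduleIdentifier = new URL(adoptsFrom.module, instanceUrl).href;
  const absoluteDoc: CardDocument = {
    data: {
      ...doc.data,
      meta: { adoptsFrom: { ...adoptsFrom, module: moduleIdentifier } },
    },
  };
  return {
    moduleIdentifier,
    cardName: adoptsFrom.name,
    realmIdentifier: realmUrl,
    instanceData: JSON.stringify(absoluteDoc),
  };
}
```

The returned object would be serialized and passed as the `--input` value of the corresponding `boxel run-command` call.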

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier I omitted `LintResult` / `ParseResult` / `EvalResult` /
`InstantiateResult` / `TestRun` cards from the per-issue
workflow, reasoning that the agent reads validator output inline so
nothing needs to persist. That was the wrong call — the audit
trail in the realm's `Validations/` folder is part of the
factory's value, not a nice-to-have. Without those cards the host
UI loses the sortable history the SDK orchestrator produced.

Updated `software-factory-operations`:

- Rewrote the "Validators (CLI commands)" preamble: the CLI itself
  doesn't persist anything; persisting the audit trail is the
  agent's responsibility. Always run validators with `--json` so
  the structured result is available to map into the card.
- Added a new "Validation artifact cards" section. Documents the
  five card types and their source-realm module URLs, the
  `Validations/<type>_<issue-slug>-<n>.json` filename convention,
  the sequence-number rule (find the highest existing `n` and use
  `n+1`), the schema-discovery flow (introspect via
  `get-card-type-schema` before writing), and the document
  envelope. Calls out that cards are written even on success and
  not overwritten on retry.
- Expanded the Required Flow numbering to interleave a "write the
  validation card" step between "run the validator" and "push" so
  the audit trail lands with the rest of the work.
- Updated the target-realm artifact-structure tree to include
  `Validations/` with the five card patterns.
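
The filename convention and sequence-number rule above can be sketched as a small helper. `nextValidationFilename` is a hypothetical name, the `type` values shown in the usage note are placeholders, and starting the sequence at 1 when no card exists yet is an assumption.

```typescript
// Hypothetical helper: next `Validations/<type>_<issue-slug>-<n>.json` path,
// where n is one more than the highest existing sequence number.
function nextValidationFilename(existing: string[], type: string, issueSlug: string): string {
  const prefix = `Validations/${type}_${issueSlug}-`;
  const suffix = '.json';
  let highest = 0; // assumption: numbering starts at 1
  for (const file of existing) {
    if (!file.startsWith(prefix) || !file.endsWith(suffix)) continue;
    const n = Number(file.slice(prefix.length, file.length - suffix.length));
    // Non-numeric remainders (e.g. a longer slug sharing the prefix) are skipped.
    if (Number.isInteger(n) && n > highest) highest = n;
  }
  return `${prefix}${highest + 1}${suffix}`;
}
```

Given existing cards `Validations/lint_sticky-note-1.json` and `Validations/lint_sticky-note-2.json`, the next lint card for that Issue would be `Validations/lint_sticky-note-3.json`; earlier cards are never overwritten.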

Updated the runbook:

- Replaced the "Things explicitly NOT in Phase 1 — Validation
  artifact cards" bullet with a positive "Phase 1 includes
  validation artifact cards" section pointing at the operations
  skill's new section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fix-ups from observing the recipe end-to-end run:

1. The runbook's three-prompt structure (Bootstrap / Work next /
   Continue or finish) was a defensive Phase 1 design — pause
   between each phase so we could verify each step. The recipe
   run proved a single prompt drives the whole loop end-to-end
   without intervention. Collapsed the three prompts into one
   that bootstraps + works every Issue + completes the project,
   and rewrote the bootstrap skill's Completion section: it no
   longer says "stop and report; don't start implementing." It
   now explicitly hands off to scheduling for in-session
   continuation. Also retired the "How Phase 2 builds on this"
   section since the single-prompt form was Phase 2's stated
   goal; replaced with a follow-up-work checklist (slash command
   wrapper, `_parse` realm-server endpoint, dev-CLI setup
   automation, orchestrator retirement).

2. The SDK orchestrator wrote `Issues/bootstrap-seed.json` as a
   visible "this is where the factory started" anchor in the
   realm UI. The new flow had been omitting it, leaving a gap in
   the visual parity. Added a "The bootstrap-seed Issue" section
   to the bootstrap skill documenting the attribute shape
   (issueId `<projectCode>-0`, status `done`, issueType
   `bootstrap`, priority `critical`, order `0`) and the
   relationships (project + relatedKnowledge links so a
   future-you reading the seed Issue has the same brief
   context any other Issue carries). Phase 1's documented
   validation plan now expects it.

Status section in the runbook bumped from "draft / Phase 1 spec"
to "working end-to-end on sticky-note and recipe as of 2026-05-15."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runbook is the user-facing guide for running the factory on
any brief; sticky-note was just my dev fixture during the CS-11149
back-and-forth. Removed:

- Status banner that named sticky-note / recipe as the briefs the
  runbook is "working against" — replaced with the existing
  introduction, which describes the system in general terms.
- Prerequisite example URL hardcoded to sticky-note — now uses a
  `<brief-slug>` placeholder so the reader inserts their own.
- "Validation plan" section that named sticky-note as the
  simplest exerciser and referenced the Playwright suite —
  replaced with an "Expected output" section that describes the
  artifacts of a successful run in generic terms. The Playwright
  suite still exists and tests sticky-note specifically, but
  that's a CI / contributor concern, not user-facing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "interactive" qualifier was distinguishing this from the SDK
orchestrator path during the CS-11149 migration. With the
orchestrator scheduled for retirement and this runbook being the
canonical (and shortly, only) way to drive the factory, the
qualifier is dead weight. No other files referenced the path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the runbook said "fix any failures (incrementing the
sequence number on retry)" and the operations skill said "Fix
anything that fails and re-run" — both correct but compressed
enough that an agent could read "re-run once and move on" instead
of "loop until every validator passes." That's a real bug class:
a partial-pass run would still mark the Issue done.

Both docs now lay out the loop explicitly:

- Run each validator, write its artifact card capturing the
  current result (passed OR failed — the audit trail is the
  point).
- If anything failed, fix the source, push, re-run the failing
  validators, write NEW artifact cards with the next sequence
  number (don't overwrite previous ones).
- Stop iterating only when every validator's most recent artifact
  card has `status: "passed"`.
- Then mark the Issue done. "Most failed but a few passed, good
  enough" is not the bar.
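
The stopping rule in the list above ("every validator's most recent artifact card has `status: "passed"`") can be expressed as a small check. This is a sketch: `ArtifactCard` is a simplified stand-in for the real card schema, and `readyToMarkDone` and the validator names are hypothetical.

```typescript
// Simplified stand-in for a validation artifact card.
interface ArtifactCard {
  validator: string;
  sequence: number;
  status: 'passed' | 'failed';
}

// Hypothetical helper: true only when the newest card of every required
// validator is a pass. Older failures do not matter; missing cards do.
function readyToMarkDone(cards: ArtifactCard[], requiredValidators: string[]): boolean {
  const latest = new Map<string, ArtifactCard>();
  for (const card of cards) {
    const seen = latest.get(card.validator);
    if (!seen || card.sequence > seen.sequence) latest.set(card.validator, card);
  }
  return requiredValidators.every((name) => latest.get(name)?.status === 'passed');
}
```

A validator that was never run has no latest card, so the check fails for it, which keeps a partial-pass run from marking the Issue done.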

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The validator loop now terminates on success OR on one of three
bail-out conditions, matching the kind of safety the SDK
orchestrator gave via `maxIterationsPerIssue`. Without these, an
agent could spiral on a flaky test or a brief requirement it
can't satisfy and burn context indefinitely.

Per-issue limits (operations skill, new "Bailing out" section):

- 8 total iterations per Issue (matches the orchestrator's old
  `maxIterationsPerIssue` default).
- 3 consecutive failures of the same validator with the same
  error message — the fix isn't working, stop.
- 5 distinct fix attempts on a single validator without a pass —
  problem is outside this Issue's scope.

When any limit hits, the agent: sets the Issue to `blocked`,
appends a comment naming the limit hit + the latest failure
message verbatim + an enumeration of what was tried (keyed to
artifact-card sequence numbers), pushes, and hands back to
scheduling for the next eligible Issue. The Project's
`projectStatus` is only flipped to `completed` if every Issue
ended `done` — `blocked` Issues are left in place so the human
can investigate.
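
The three per-issue limits can be sketched as a single check over the Issue's validator history. This is one reading of the rules, not the skill's text: `ValidatorRun` and `bailOutReason` are hypothetical, "8 total iterations" is read as bailing once eight iterations are used up, and "5 distinct fix attempts without a pass" is approximated as five failed runs of a validator that has never passed.

```typescript
// Hypothetical record of one validator run within an Issue.
interface ValidatorRun {
  validator: string;
  passed: boolean;
  errorMessage?: string;
}

const MAX_ITERATIONS = 8;
const MAX_IDENTICAL_FAILURES = 3;
const MAX_FIX_ATTEMPTS = 5;

type BailOut = 'iteration-limit' | 'identical-failures' | 'fix-attempts-exhausted';

function bailOutReason(iterations: number, runs: ValidatorRun[]): BailOut | null {
  if (iterations >= MAX_ITERATIONS) return 'iteration-limit';

  // Group the chronological run log by validator.
  const byValidator: Record<string, ValidatorRun[]> = {};
  for (const run of runs) {
    (byValidator[run.validator] = byValidator[run.validator] ?? []).push(run);
  }

  for (const name of Object.keys(byValidator)) {
    const history = byValidator[name];
    // Same validator, same message, three times in a row.
    const tail = history.slice(-MAX_IDENTICAL_FAILURES);
    if (
      tail.length === MAX_IDENTICAL_FAILURES &&
      tail.every((r) => !r.passed && r.errorMessage === tail[0].errorMessage)
    ) {
      return 'identical-failures';
    }
    // Approximation: five failures and no pass at all for this validator.
    const failures = history.filter((r) => !r.passed).length;
    if (failures >= MAX_FIX_ATTEMPTS && failures === history.length) {
      return 'fix-attempts-exhausted';
    }
  }
  return null;
}
```

A non-null result is the point at which the agent would set the Issue to `blocked`, write the comment, push, and hand back to scheduling.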

The runbook prompt now references the bail-out limits inline
(without spelling them out) and points at the operations skill
for the detail. The scheduling skill's status-transition table
cross-references the bail-out section so an agent looking at
"when do I flip to blocked?" finds the answer with one hop.

Outer-loop safety is implicit: once an Issue is `blocked`, the
scheduler skips it, so the loop terminates naturally once every
Issue is either `done` or `blocked`. No outer-cycle cap is
needed (the SDK orchestrator's 50-cycle cap was a defense
against scheduler bugs, not validator spirals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Gaps blocking Phase 1" section (CLI commands to add / skill
content to add / things NOT in Phase 1) was meta tracking from
when this doc was the Phase 1 spec — every line of it is now
done or stale:

- "CLI commands to add" → all three landed (boxel lint, parse, test)
- "Skill content to add / rewrite" → all three skills authored
- "Phase 1 includes validation artifact cards" → done; covered in
  the operations skill's "Validation artifact cards" section
- "Things explicitly NOT in Phase 1: iteration limits" → wrong as
  of the previous commit, which added bail-out limits

Spec history belongs in git/PR descriptions, not a user-facing
runbook. The section is gone. The follow-up-work list survives
(it points at genuine in-flight rough edges) and the orchestrator
retirement bullet's "Phase 1" framing is gone.

Also dropped the "Phase 1 dev setup note" framing in the
prerequisites — same setup steps, just no longer labeled as a
temporary Phase 1 workaround. (It still IS temporary; the
"Follow-up work" section calls that out under "Dev `boxel-cli`
setup automation.")

No more Phase 1 / Phase 2 references in the runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jurgenwerk jurgenwerk changed the title from "software factory: explore running it from Claude Code as an orchestrator" to "Run the software factory from Claude Code" May 15, 2026
@jurgenwerk jurgenwerk changed the title from "Run the software factory from Claude Code" to "software factory: explore running it just using Claude Code + Skills + Boxel CLI" May 15, 2026
@jurgenwerk jurgenwerk changed the title from "software factory: explore running it just using Claude Code + Skills + Boxel CLI" to "software factory: explore running it only using Claude Code + Skills + Boxel CLI" May 15, 2026