2 changes: 1 addition & 1 deletion pstack/.cursor-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "pstack",
"displayName": "pstack",
"version": "0.7.0",
"version": "0.8.0",
"description": "if you want to go fast, go deep first. pstack helps you write less, but higher quality code. rigorous agent workflows you can parallelize with confidence.",
"author": {
"name": "Lauren Tan"
3 changes: 3 additions & 0 deletions pstack/README.md
@@ -24,6 +24,8 @@ fork it. improve it. make it yours. PRs are welcome!

type `/automate-me`. it mines your recent transcripts, drafts a `<your-name>-mode` skill from how you've actually worked, and routes through pstack underneath. you keep pstack as the base and end up with your own routing skill alongside `poteto-mode`.

models are configurable too. type `/setup-pstack`. it detects the models you have access to and writes a small always-applied rule mapping each role (code, judgment, the review panels) to a model. every skill reads it and falls back to sensible defaults when the rule is absent, so you override only what you want.

## usage

use `/poteto-mode` at the start of a task. it reads your request, picks from a set of playbooks, and runs the other skills as the steps need them.
@@ -73,6 +75,7 @@ the rest are useful when you want to specifically invoke them:
| `/arena` | you want N parallel attempts at the same thing, then to grab the best parts of each. |
| `/interrogate` | you have a diff and want four different models to try to break it, including a strict code-quality lens. |
| `/automate-me` | you want your own `-mode` skill, drafted from how you've actually worked. |
| `/setup-pstack` | you want to pick which models pstack uses per role. detects your models and writes a config rule. |
| `/reflect` | a long task landed and you want the recipe captured as a skill edit. |
| `/tdd` | you're fixing a bug and there's a cheap local test path. write the failing test first, then the fix. |
| `/typescript-best-practices` | you're reading or editing typescript. grounds the type-system-discipline principle in syntax. |
2 changes: 1 addition & 1 deletion pstack/skills/architect/SKILL.md
@@ -30,7 +30,7 @@ Skip Phase A only when the work is genuinely greenfield with no surrounding syst

Run the **arena** skill with the design-sketch task and the Phase A grounding artifacts. Pass `references/runner-prompt.md` as each runner's prompt. Each candidate produces a design package shaped per `references/rationale-template.md`: the caller's usage written first, then the type sketch, function signatures, module map, and prose rationale derived from it.

Use these runner slugs: `claude-opus-4-8-thinking-xhigh`, `gpt-5.3-codex-high-fast`, `gpt-5.5-high-fast`, and `composer-2.5-fast`.
Use your configured architect runners (defaults `claude-opus-4-8-thinking-xhigh`, `gpt-5.5-high-fast`, `composer-2.5-fast`).

Architect arena runner mismatch

Medium Severity · Logic Bug

Phase B tells agents to use configured architect runners, but it invokes arena, whose Frame step picks configured arena runners instead. After /setup-pstack, those roles can differ, so design sketches may fan out with the wrong models despite the architect override.


Reviewed by Cursor Bugbot for commit df0b177.
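
The mismatch described in this comment can be illustrated with a minimal sketch. It assumes the rule written by `/setup-pstack` can be modeled as a mapping from role name to a list of model slugs; the PR does not show the rule's actual format, so the `architect` and `arena` keys and the `resolve_runners` helper are hypothetical.

```python
# Hypothetical model of the per-role rule; the real format is not shown in this PR.
DEFAULT_RUNNERS = [
    "claude-opus-4-8-thinking-xhigh",
    "gpt-5.5-high-fast",
    "composer-2.5-fast",
]


def resolve_runners(config: dict, role: str) -> list:
    """Return the configured runners for a role, or the defaults when unset."""
    return config.get(role) or DEFAULT_RUNNERS


# The user overrides only the architect role (assumed key name).
config = {"architect": ["gpt-5.5-high-fast"]}

# architect's Phase B names its own runners...
architect_runners = resolve_runners(config, "architect")
# ...but arena's Frame step resolves the arena role instead.
arena_runners = resolve_runners(config, "arena")

# True here: the design sketches would fan out on arena's list,
# not on the architect override.
mismatch = architect_runners != arena_runners
```

Under these assumptions, the two lists diverge as soon as either role is overridden on its own, which is the situation the comment flags.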


This is the **exhaust-the-design-space** principle skill made concrete. Whole-shape alternatives, not point fixes inside one shape.

2 changes: 1 addition & 1 deletion pstack/skills/architect/references/runner-prompt.md
@@ -16,4 +16,4 @@ Apply the following discipline. The orchestrator compares candidates on these ax
- Idempotent state transitions where applicable, per the **make-operations-idempotent** principle skill. Ask what happens if the operation runs twice or crashes halfway.
- Short call chains. If tracing the flow needs more than three files, flatten the hierarchy, per the **laziness-protocol** and **minimize-reader-load** principle skills.

You are one of four runners on different models. Produce the best design your model can make; don't hedge against the others. Differences between candidates are the signal used to pick a base and graft. Converging on a safe-looking middle defeats the exploration.
You are one of several runners, each on a different model. Produce the best design your model can make; don't hedge against the others. Differences between candidates are the signal used to pick a base and graft. Converging on a safe-looking middle defeats the exploration.
2 changes: 1 addition & 1 deletion pstack/skills/arena/SKILL.md
@@ -25,7 +25,7 @@ The N candidates will receive the same prompt, so the prompt is the contract. Ge

1. State the artifact each candidate is producing.
2. Derive the rubric. State what success looks like for *this* task, then turn it into 3-6 concrete gradeable criteria. Concrete: `Adds a --dry-run flag that skips writes`. Vague: `code is correct`. The rubric is the picker's tool in Phase D; candidates only see the task.
3. Pick the runners. Default 4: `claude-opus-4-8-thinking-xhigh`, `gpt-5.3-codex-high-fast`, `gpt-5.5-high-fast`, and `composer-2.5-fast`. Spawn more when the arena covers multiple design directions. Same model N times when the work is generation-bound rather than judgment-sensitive.
3. Pick the runners. Default runners are your configured arena list (defaults `claude-opus-4-8-thinking-xhigh`, `gpt-5.5-high-fast`, `composer-2.5-fast`). Spawn more when the arena covers multiple design directions. Same model N times when the work is generation-bound rather than judgment-sensitive.
4. Assign output paths. Each candidate writes to its own location (a git worktree where possible, otherwise `/tmp/arena-<slug>/candidate-<n>/`). N candidates writing to the same path is shared mutable state and fails the **separate-before-serializing-shared-state** principle skill test.
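
A minimal sketch of the fallback path assignment in step 4, for the case where a git worktree is not available. The `candidate_paths` helper and the choice to number candidates from 1 are assumptions made for illustration; only the `/tmp/arena-<slug>/candidate-<n>/` pattern comes from the skill text.

```python
from pathlib import Path


def candidate_paths(slug: str, n: int) -> list:
    """Build one distinct output directory per candidate (hypothetical helper)."""
    # Numbering from 1 is an assumption; the skill only gives the pattern.
    return [Path(f"/tmp/arena-{slug}/candidate-{i}") for i in range(1, n + 1)]


paths = candidate_paths("dry-run-flag", 3)
# No two candidates share a path, so there is no shared mutable state.
all_distinct = len(set(paths)) == len(paths)
```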

## Phase B: Fan out
16 changes: 5 additions & 11 deletions pstack/skills/how/SKILL.md
@@ -45,7 +45,7 @@ The right decomposition depends on the question. Use your judgment. Narrow quest
Spawn all explorers in a single message:

- `subagent_type`: `generalPurpose`
- `model`: `composer-2.5-fast`
- `model`: your configured how-explorer model (default `composer-2.5-fast`)
- `readonly`: `true`

Each explorer gets the same base prompt from `references/explorer-prompt.md` plus a specific exploration angle naming its slice. Each explorer should:
@@ -64,7 +64,7 @@ Then proceed to Step 3.
Spawn a single Task subagent that explores and explains in one pass:

- `subagent_type`: `generalPurpose`
- `model`: `claude-opus-4-8-thinking-xhigh`
- `model`: your configured how-explainer model (default `claude-opus-4-8-thinking-xhigh`)
- `readonly`: `true`

The agent does its own exploration (Glob, Grep, Read) and writes the explanation directly. Read `references/explainer-prompt.md` for the communication style and output format. Same structure, just no explorer findings as input.
@@ -76,7 +76,7 @@ Proceed to Step 4.
Once all explorers return, spawn a single Task subagent to synthesize their findings into one coherent explanation:

- `subagent_type`: `generalPurpose`
- `model`: `claude-opus-4-8-thinking-xhigh`
- `model`: your configured how-explainer model (default `claude-opus-4-8-thinking-xhigh`)
- `readonly`: `true`

The explainer gets all explorers' findings and writes the human-facing explanation (output format below). Read `references/explainer-prompt.md` for the full prompt template. The explainer reconciles overlapping findings, resolves contradictions, and weaves the slices into a unified picture.
@@ -109,17 +109,11 @@ Run the full explain flow above (Steps 1-4). You must understand the architectur

### Step 2. Spawn Critics

After the explanation is complete, spawn architectural critics. Launch all in a single message:

| Subagent | Model |
|----------|-------|
| Critic A | `claude-opus-4-8-thinking-xhigh` |
| Critic B | `gpt-5.3-codex-high-fast` |
| Critic C | `gpt-5.5-high-fast` |
After the explanation is complete, spawn one architectural critic per model in your configured how-critics list (defaults `claude-opus-4-8-thinking-xhigh`, `gpt-5.5-high-fast`, `composer-2.5-fast`), all in a single message.

For each critic:
- `subagent_type`: `generalPurpose`
- `model`: the model from the table. These are minimum reasoning levels. The lead should escalate any model when the architecture warrants deeper analysis.
- `model`: one model from the configured how-critics list. These are minimum reasoning levels. The lead should escalate any model when the architecture warrants deeper analysis.
- `readonly`: `true`

Read `references/critic-prompt.md` for the prompt template. Each critic gets:
28 changes: 9 additions & 19 deletions pstack/skills/interrogate/SKILL.md
@@ -1,12 +1,12 @@
---
name: interrogate
description: "Use for \"interrogate\", \"adversarial review\", \"multi-model review\", \"challenge this\", \"stress test this code\", \"find blind spots\", or \"tear this apart\". Four LLM reviewers challenge changes from independent angles."
description: "Use for \"interrogate\", \"adversarial review\", \"multi-model review\", \"challenge this\", \"stress test this code\", \"find blind spots\", or \"tear this apart\". Multiple LLM reviewers challenge changes from independent angles."
disable-model-invocation: true
---

# Interrogate

Spawn four reviewers on four different models to adversarially review code changes. Each model gets the same prompt and rubric. The adversarial signal comes from model diversity, not assigned personas. Models differ in blind spots, priors, and reasoning patterns. Agreement across models is high-confidence signal; lone-model findings are worth reading but lower confidence.
Spawn one reviewer per configured model to adversarially review code changes. Each model gets the same prompt and rubric. The adversarial signal comes from model diversity, not assigned personas. Models differ in blind spots, priors, and reasoning patterns. Agreement across models is high-confidence signal; lone-model findings are worth reading but lower confidence.

The deliverable is a synthesized verdict. Do NOT auto-apply changes.

@@ -33,37 +33,30 @@ Write one clear paragraph. Reviewers challenge whether the work achieves the int

## Step 3, Spawn Reviewers

Launch all four in a single message using the Task tool, each with a different model.

| Subagent | Model |
|----------|-------|
| Reviewer A | `claude-opus-4-8-thinking-xhigh` |
| Reviewer B | `gpt-5.3-codex-high-fast` |
| Reviewer C | `gpt-5.5-high-fast` |
| Reviewer D | `composer-2.5-fast` |
Launch one reviewer per model in your configured interrogate list (defaults `claude-opus-4-8-thinking-xhigh`, `gpt-5.5-high-fast`, `composer-2.5-fast`), all in a single message.

For each reviewer:
- `subagent_type`: `generalPurpose`
- `model`: the model from the table
- `model`: one model from the configured interrogate list
- `readonly`: `true`

If a model slug in the table is rejected as unresolvable when you try to spawn the subagent, check the valid slugs in the Task tool's error message, pick the closest equivalent (prefer the highest-reasoning tier of the same family), spawn with the valid slug, and open a separate PR to update this table. Do not block the review on the slug issue.
If a configured model slug is rejected as unresolvable when you try to spawn the subagent, check the valid slugs in the Task tool's error message, pick the closest equivalent (prefer the highest-reasoning tier of the same family), spawn with the valid slug, and open a separate PR to update the configured defaults. Do not block the review on the slug issue.
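
The slug fallback above can be sketched as follows. This is an illustration under stated assumptions, not the Task tool's behavior: it assumes a slug's family is its first hyphen-separated token and that reasoning tiers appear as the tokens `low`, `medium`, `high`, and `xhigh`. The `closest_equivalent` helper and the `gpt-5.5-medium` slug are hypothetical.

```python
# Assumed tier ordering; the source only says "highest-reasoning tier".
TIER_RANK = {"low": 0, "medium": 1, "high": 2, "xhigh": 3}


def family(slug: str) -> str:
    """Treat the first hyphen-separated token as the model family (assumption)."""
    return slug.split("-", 1)[0]


def tier(slug: str) -> int:
    """Highest reasoning tier named in the slug, or -1 if none is named."""
    return max((TIER_RANK[t] for t in slug.split("-") if t in TIER_RANK), default=-1)


def closest_equivalent(rejected: str, valid_slugs: list):
    """Pick the highest-tier valid slug from the rejected slug's family."""
    same_family = [s for s in valid_slugs if family(s) == family(rejected)]
    if not same_family:
        return None  # no same-family option; the lead has to choose by hand
    return max(same_family, key=tier)


# Valid slugs as they might appear in the error message (illustrative list).
valid = ["gpt-5.5-medium", "gpt-5.5-high-fast", "composer-2.5-fast"]
replacement = closest_equivalent("gpt-5.3-codex-high-fast", valid)
```

Returning `None` when no same-family slug exists keeps the choice with the lead rather than silently crossing families.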

Read `references/reviewer-prompt.md` and fill in the template with:
1. The stated intent
2. The diff or file contents
3. The review rubric from `references/rubric.md`
4. The code-quality lens from `references/code-quality-review.md`

The same filled template goes to all four reviewers, so every model applies the code-quality lens.
The same filled template goes to all reviewers, so every model applies the code-quality lens.

Each reviewer produces structured findings as described in the prompt template.

## Step 4, Synthesize

As results come back, build a unified picture:

1. **Parse all findings** from the four reviewers
1. **Parse all findings** from the reviewers
2. **Identify consensus**. Findings raised by 2+ models independently are highest signal.
3. **Identify lone-model findings**. Still worth reading, but weight accordingly.
4. **Deduplicate**. Different models may describe the same issue differently. Merge these and note which models raised it.
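
The consensus and lone-model split in the steps above can be sketched as a grouping over reviewer findings. This assumes each finding has already been reduced to a `(model, issue)` pair with a shared issue label; in practice the lead does that merging by judgment, since models describe the same issue differently. The `split_findings` helper and the sample labels are hypothetical.

```python
from collections import defaultdict


def split_findings(findings):
    """Group (model, issue) pairs by issue and split them by how many models agree."""
    models_by_issue = defaultdict(set)
    for model, issue in findings:
        models_by_issue[issue].add(model)  # a set dedupes repeats from one model

    # Raised by 2+ models independently: highest signal.
    consensus = {i: sorted(m) for i, m in models_by_issue.items() if len(m) >= 2}
    # Raised by a single model: still worth reading, weighted lower.
    lone = {i: sorted(m) for i, m in models_by_issue.items() if len(m) == 1}
    return consensus, lone


findings = [
    ("claude-opus-4-8-thinking-xhigh", "missing-null-check"),
    ("gpt-5.5-high-fast", "missing-null-check"),
    ("composer-2.5-fast", "unclear-naming"),
]
consensus, lone = split_findings(findings)
```

Each entry keeps the list of models that raised it, matching the instruction to note which models raised a merged finding.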
@@ -75,7 +68,7 @@ You are the lead reviewer, a pragmatic senior engineer, not a neutral aggregator

Read `references/lead-judgment.md` for the full framework. Reviewers only see a slice of the codebase. You have the full context (the goal, the constraints, the timeline, which tradeoffs were already considered). Use that context aggressively.

Categorize every finding into one of four buckets:
Categorize every finding using these buckets:

- **Act on**. Real issues affecting correctness, security, or maintainability given the actual goals. These would block a real PR.
- **Consider**. Legitimate points, but you're not sure they outweigh the cost of addressing them right now. Worth the user's attention.
@@ -95,10 +88,7 @@ Present the verdict in this structure:
> [The stated intent paragraph from Step 2]

### Reviewers
- Model A: [model name], [N findings]
- Model B: [model name], [N findings]
- Model C: [model name], [N findings]
- Model D: [model name], [N findings]
List each reviewer on its own line like `- <model name>: [N findings]`

### Act On
[Findings that should be addressed. For each: description, which models raised it, why it matters.]
2 changes: 1 addition & 1 deletion pstack/skills/interrogate/references/lead-judgment.md
@@ -1,6 +1,6 @@
# Lead Judgment Framework

You are the lead reviewer. The four model reviewers have produced their findings. Apply pragmatic engineering judgment. Don't aggregate; filter, contextualize, and decide.
You are the lead reviewer. The model reviewers have produced their findings. Apply pragmatic engineering judgment. Don't aggregate; filter, contextualize, and decide.

## Why This Step Matters

4 changes: 2 additions & 2 deletions pstack/skills/poteto-mode/SKILL.md
@@ -21,7 +21,7 @@ Remaining triggers:
- Before commit → the `deslop` skill from the `cursor-team-kit` plugin (`/deslop`).
- Shipping UI / IDE / CLI → the matching control skill. `cursor-team-kit` publishes `control-cli` (CLIs and TUIs) and `control-ui` (browser / Electron / web UIs). For bug fixes, reproduce first on the same surface yourself; hand to the user only under the narrow Bug fix step 1 exception.
- After opening a PR → Cursor's built-in **babysit** skill.
- Bugbot or the agentic security reviewer commented → skeptical posture. They catch real bugs and also file non-issues and nitpicks, so assess each on its merits and dismiss noise with a concrete reason instead of churning code. Triage fix / dismiss / ask via the built-in **babysit** skill.
- Bugbot or the agentic security review commented → skeptical posture. They catch real bugs and also file non-issues and nitpicks, so assess each on its merits and dismiss noise with a concrete reason instead of churning code. Triage fix / dismiss / ask via the built-in **babysit** skill.
- Broken skill mid-task → fix it in its own PR. Don't block. Don't silently work around it.
- Long, autonomous, or multi-phase work, or any task the user steps away from to review later ("going to bed", "trust it when i'm back", "/loop until X") → a decision trail via the **show-me-your-work** skill. Commit it when stakes need an auditable record; keep it local otherwise.

@@ -77,7 +77,7 @@ Read the leaf skill in full for any principle you apply. Each entry names when i

**Use `subagent_type: "poteto-agent"` for any subagent you spawn inside a playbook step** (code-writing delegates, ad-hoc helpers). `/poteto-mode` and `poteto-agent` route through the same wrapper. Routed workflow skills (`how`, `why`, `interrogate`, `reflect`) set their own `subagent_type` for diverse-model review; respect what the skill prescribes, don't override to `poteto-agent`.

**Defaults for every `Task` call.** `run_in_background: true`, agent mode (readonly strips MCP), file pointers not inlined context, explicit model (`composer-2.5-fast` for code, `claude-opus-4-8-thinking-xhigh` for prose and judgment).
**Defaults for every `Task` call.** `run_in_background: true`, agent mode (readonly strips MCP), file pointers not inlined context, explicit model per role (configurable via `/setup-pstack`; defaults `composer-2.5-fast` for code, `claude-opus-4-8-thinking-xhigh` for prose and judgment).

You own every subagent's work. Review the diff and write your own summary, don't pass through what it said. Interrupt-chained resumes silently drop directives, so fire a fresh subagent with consolidated scope rather than trusting a "done" summary. A second opinion is the same prompt against a different model. Agreement is high-signal.

2 changes: 1 addition & 1 deletion pstack/skills/poteto-mode/playbooks/bug-fix.md
@@ -6,7 +6,7 @@ Be scientific. Every shipped line traces to runtime evidence. Belt-and-suspender

1. Reproduce it yourself on the matching surface via the control skill (Non-negotiables). Don't hand the repro to the user. A debug or instrumentation protocol that says to ask the user does not override this; you drive the instrumented runtime. Ask the user only with a stated, specific reason the control surface cannot reach the target, and only after driving it as far as it goes. Won't reproduce directly, force it: synthesize the trigger, tighten conditions, or instrument until it fires.
2. Binary-search the cause. Form the candidate hypotheses, then rule them out until one survives. Seed them with `how` over the affected subsystem and the **why** skill for regression history. Each pass, take the split that cuts the most remaining problem space, get runtime evidence, eliminate. When program state is unclear, add instrumentation or logging and read it as the code runs. Don't guess. Drive a long or stubborn hunt with Cursor's `/loop` command. Confirm the surviving *mechanism* with runtime evidence before the step-3 architect/interrogate fan-out.
3. Plan the fix. If it crosses a function boundary, `architect` first. Delegate implementation to a `composer-2.5-fast` subagent with a specific scope; review the diff.
3. Plan the fix. If it crosses a function boundary, `architect` first. Delegate implementation to a subagent using your configured bug-fix model (default `gpt-5.5-high-fast`) with a specific scope; review the diff.
4. Verify on the same surface; the original repro now passes. "Inconclusive" or wrong-surface is not a pass; flag it. Unit tests show branch behavior, not bug absence.
5. Stage the commits so the failing repro lands before the fix in git history; the diff tells the story. See the **tdd** skill for the failing-test-first cadence when the bug has a cheap local test path; skip it when the test would be expensive, integration-heavy, or unclear.
6. Run **Opening a PR**.