perf(gptme-voice): skip context loading in fast subagent mode (#713)

TimeToBuildBob merged 1 commit into master
Conversation
Fast-mode subagents no longer pass --context files to gptme, removing the 20k+ token workspace context load that was the dominant latency source (~30-60s). Simple lookups now run in ~5-10s instead of ~1 minute. Smart mode is unchanged — it still loads full workspace context for queries that need it. Also streams stdout line-by-line so subagent_status can show the last action the running subagent performed (addresses feedback on #711).
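As a rough illustration of the conditional command construction this describes (not the PR's actual bridge code), here is a minimal sketch. The function name `build_gptme_command` and the `context_files` parameter are hypothetical; the flag shapes follow the sequence diagram below:

```python
def build_gptme_command(task: str, mode: str, context_files: list[str]) -> list[str]:
    """Illustrative only: build the gptme subagent invocation."""
    cmd = ["gptme", "--non-interactive"]
    if mode == "fast":
        # Fast mode: cheap model and, crucially, no --context flags,
        # so gptme never loads the 20k+ token workspace context.
        cmd += ["--model", "haiku"]
    else:
        # Smart mode is unchanged: it still loads full workspace context.
        for path in context_files:
            cmd += ["--context", path]
    cmd.append(task)
    return cmd
```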
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Greptile Summary

This PR improves fast-mode subagent latency by skipping the workspace context load. Confidence Score: 5/5. Safe to merge; both findings are P2 style/quality suggestions that do not block the feature. The core change (conditionally omitting `--context files`) is simple, correct, and well-tested. The streaming refactor is clean. The remaining comments are a weak test assertion and a zombie-process hardening suggestion; neither causes incorrect behavior on the happy path. No files require special attention beyond the P2 notes above.
Sequence Diagram

```mermaid
sequenceDiagram
    participant RT as Realtime Voice
    participant TB as GptmeToolBridge
    participant SA as gptme subagent
    RT->>TB: handle_function_call("subagent", {mode: "fast"})
    TB->>TB: assign task_id, create asyncio.Task
    TB-->>RT: {status: "dispatched", task_id}
    TB->>SA: gptme --non-interactive [--model haiku] task
    Note over TB,SA: fast mode: no --context files
    loop streaming stdout
        SA-->>TB: stdout line
        TB->>TB: pending.last_output = line
    end
    RT->>TB: handle_function_call("subagent_status", {})
    TB-->>RT: {pending: [{task_id, last_output, elapsed_seconds}]}
    SA-->>TB: exit 0
    TB->>TB: read response_file
    TB->>RT: on_result(response_text)
```
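In code terms, the streaming loop and status payload from the diagram might look like the following. This is a sketch under assumed names (`_run_subagent`, a per-task `pending` dict), not the bridge's actual implementation:

```python
import asyncio
import time

async def _run_subagent(pending: dict, cmd: list[str]) -> None:
    """Illustrative streaming runner matching the diagram above."""
    pending["started_at"] = time.monotonic()
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT
    )
    assert proc.stdout is not None
    # Stream line-by-line instead of communicate(), so subagent_status
    # can report the last action the subagent performed while running.
    async for raw in proc.stdout:
        pending["last_output"] = raw.decode(errors="replace").rstrip()
    await proc.wait()

def subagent_status(pending_tasks: dict[str, dict]) -> dict:
    """Shape of the status payload shown in the diagram."""
    return {
        "pending": [
            {
                "task_id": task_id,
                "last_output": info.get("last_output", ""),
                "elapsed_seconds": round(time.monotonic() - info["started_at"], 1),
            }
            for task_id, info in pending_tasks.items()
        ]
    }
```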
```python
        if entry.get("last_output"):
            assert "Found 3 active tasks" in entry["last_output"]
```
Weak assertion makes the test self-defeating
The central claim of this test is that `last_output` is populated once the subagent produces stdout lines, but the assertion is guarded by `if entry.get("last_output"):`. If `last_output` is always absent or empty — the exact regression being tested against — the assert on line 207 is simply never reached and the test passes silently.

Two `asyncio.sleep(0)` yields may not be enough for the async stdout reader to have fully iterated; if the reader hasn't run yet, `last_output` is `""` and the branch is skipped entirely. Consider promoting this to an unconditional assert:
```diff
-        if entry.get("last_output"):
-            assert "Found 3 active tasks" in entry["last_output"]
+        assert entry.get("last_output"), "last_output must be set once subagent produces stdout"
+        assert "Found 3 active tasks" in entry["last_output"]
```
See my comment here: https://github.com/ErikBjare/bob/issues/651#issuecomment-4283398928
…719)

Adds keywords and an anti-pattern section for the failure mode of writing latency-win claims ("should drop to ~5-10s") in PR descriptions or issue comments without a baseline number or cost-model comparison. The existing lesson covered code-level premature optimization but did not trigger on the more common failure of anchoring a projected speedup on one component (prompt size) instead of the dominant cost (turns × tok/s + startup).

Motivation: ErikBjare/bob#651 — shipped #713 claiming context-skip would drop voice subagent lookups to 5-10s without showing that prompt-eval was actually the dominant latency source. Correct response (measurement-first) was #718.
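For intuition, the cost model the lesson names (turns × tok/s + startup) can be written out directly. Every number below is an illustrative assumption, not a measurement from #713:

```python
# Back-of-the-envelope latency model: total latency is roughly
#   startup + turns * (prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s)
# All figures here are made-up illustrations, not measurements.
def subagent_latency(turns, prompt_tokens, prefill_tok_s, output_tokens, gen_tok_s, startup):
    per_turn = prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s
    return startup + turns * per_turn

with_context = subagent_latency(
    turns=3, prompt_tokens=22_000, prefill_tok_s=2_000,
    output_tokens=300, gen_tok_s=50, startup=2.0,
)  # ≈ 2 + 3 * (11 + 6) = 53s
without_context = subagent_latency(
    turns=3, prompt_tokens=2_000, prefill_tok_s=2_000,
    output_tokens=300, gen_tok_s=50, startup=2.0,
)  # ≈ 2 + 3 * (1 + 6) = 23s
# Dropping 20k prompt tokens helps, but turns and generation still dominate,
# so "~5-10s" does not follow without a baseline measurement.
```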
Summary
- `mode=fast` no longer passes `--context files` to gptme, removing the 20k+ token workspace context load that was the dominant latency source. Simple lookups should drop from ~60s to ~5-10s.
- Switched from `communicate()` to streaming stdout line-by-line so `subagent_status` can show `last_output`, the last line the running subagent produced. Addresses feedback from feat(gptme-voice): add subagent_status and subagent_cancel tools (#711).

Root cause of the 1-minute latency
`--context files` triggers gptme to load all files listed in `gptme.toml` (ABOUT.md, GOALS.md, ARCHITECTURE.md, people/, etc.), easily 20k+ tokens. For a fast lookup like "what are my active tasks?", that context is unnecessary overhead. Without it, the subagent starts answering almost immediately.

Test plan
- New and updated tests: `test_execute_fast_mode_skips_context_loading`, `test_execute_smart_mode_keeps_context_loading`, `test_subagent_status_shows_last_output`, and updated mocks for streaming
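For illustration, the fast-vs-smart assertions could take this shape against the hypothetical `build_gptme_command` sketch near the top of this page; the PR's actual tests exercise the real bridge code:

```python
def test_fast_mode_skips_context_loading_sketch():
    # Hypothetical shape of the first test named above.
    cmd = build_gptme_command("what are my active tasks?", mode="fast",
                              context_files=["ABOUT.md", "GOALS.md"])
    assert "--context" not in cmd  # fast mode must not load workspace context

def test_smart_mode_keeps_context_loading_sketch():
    cmd = build_gptme_command("plan the refactor", mode="smart",
                              context_files=["ABOUT.md", "GOALS.md"])
    assert cmd.count("--context") == 2  # smart mode keeps the full context
```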