Skip to content

perf(gptme-voice): skip context loading in fast subagent mode#713

Merged
TimeToBuildBob merged 1 commit intomasterfrom
fix/voice-fast-subagent-latency
Apr 20, 2026
Merged

perf(gptme-voice): skip context loading in fast subagent mode#713
TimeToBuildBob merged 1 commit intomasterfrom
fix/voice-fast-subagent-latency

Conversation

@TimeToBuildBob
Copy link
Copy Markdown
Member

@TimeToBuildBob TimeToBuildBob commented Apr 20, 2026

Summary

  • Fast subagent latency fix: mode=fast no longer passes --context files to gptme, removing the 20k+ token workspace context load that was the dominant latency source. Simple lookups should drop from ~60s to ~5-10s.
  • Real-time status: Switched from communicate() to streaming stdout line-by-line so subagent_status can show last_output — the last line the running subagent produced. Addresses feedback from feat(gptme-voice): add subagent_status and subagent_cancel tools #711.
  • Smart mode is unchanged — still loads full workspace context for queries that need it.

Root cause of the 1-minute latency

--context files triggers gptme to load all files listed in gptme.toml (ABOUT.md, GOALS.md, ARCHITECTURE.md, people/, etc.) — easily 20k+ tokens. For a fast lookup like "what are my active tasks?", that context is unnecessary overhead. Without it, the subagent starts answering almost immediately.

Test plan

  • 27/27 tests pass (5 new tests: test_execute_fast_mode_skips_context_loading, test_execute_smart_mode_keeps_context_loading, test_subagent_status_shows_last_output, and updated mocks for streaming)
  • Call and verify fast lookups complete in <15s

Fast-mode subagents no longer pass --context files to gptme, removing the
20k+ token workspace context load that was the dominant latency source (~30-60s).
Simple lookups now run in ~5-10s instead of ~1 minute.

Smart mode is unchanged — it still loads full workspace context for queries
that need it.

Also streams stdout line-by-line so subagent_status can show the last
action the running subagent performed (addresses feedback on #711).
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR improves gptme-voice subagent latency by skipping --context files in fast mode (avoiding 20k+ token workspace loading) and switches from communicate() to line-by-line stdout streaming so subagent_status can surface last_output in real time. The logic change is small and well-targeted; five new tests cover the fast/smart mode CLI differences and the streaming status feature.

Confidence Score: 5/5

Safe to merge; both findings are P2 style/quality suggestions that do not block the feature.

The core change (conditional --context files omission) is simple, correct, and well-tested. The streaming refactor is clean. Remaining comments are a weak test assertion and a zombie-process hardening suggestion — neither causes incorrect behavior on the happy path.

No files require special attention beyond the P2 notes above.

Important Files Changed

Filename Overview
packages/gptme-voice/src/gptme_voice/realtime/tool_bridge.py Adds on_progress streaming callback and skips --context files for fast-mode subagents; minor zombie-process risk on timeout introduced by replacing communicate() with the new streaming reader.
packages/gptme-voice/tests/test_tool_bridge.py Good coverage of new fast/smart mode CLI argument assertions; test_subagent_status_shows_last_output has a conditionally-gated assertion that can pass trivially if last_output is never populated.

Sequence Diagram

sequenceDiagram
    participant RT as Realtime Voice
    participant TB as GptmeToolBridge
    participant SA as gptme subagent

    RT->>TB: handle_function_call("subagent", {mode: "fast"})
    TB->>TB: assign task_id, create asyncio.Task
    TB-->>RT: {status: "dispatched", task_id}

    TB->>SA: gptme --non-interactive [--model haiku] task
    Note over TB,SA: fast mode: no --context files

    loop streaming stdout
        SA-->>TB: stdout line
        TB->>TB: pending.last_output = line
    end

    RT->>TB: handle_function_call("subagent_status", {})
    TB-->>RT: {pending: [{task_id, last_output, elapsed_seconds}]}

    SA-->>TB: exit 0
    TB->>TB: read response_file
    TB->>RT: on_result(response_text)
Loading

Comments Outside Diff (1)

  1. packages/gptme-voice/src/gptme_voice/realtime/tool_bridge.py, line 208-214 (link)

    P2 Zombie process left behind on timeout

    process.kill() sends SIGKILL but process.wait() is never called after it, so the kernel keeps the child entry as a zombie until the parent Python process exits. The previous communicate() approach would also have this risk when cancelled, but the new streaming path makes it more visible: the finally block only cleans up the response file, not the process.

    A safe pattern after kill is to shield a short wait() call so the OS entry is reaped:

    except asyncio.TimeoutError:
        process.kill()
        with contextlib.suppress(Exception):
            await asyncio.shield(process.wait())
        return ToolResult(
            success=False,
            output="",
            error=f"Subagent timed out after {self.timeout}s",
        )

    The same applies to the CancelledError branch on line 216.

Reviews (1): Last reviewed commit: "perf(gptme-voice): skip context loading ..." | Re-trigger Greptile

Comment on lines +206 to +207
if entry.get("last_output"):
assert "Found 3 active tasks" in entry["last_output"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Weak assertion makes the test self-defeating

The central claim of this test is that last_output is populated once the subagent produces stdout lines, but the assertion is guarded by if entry.get("last_output"):. If last_output is always absent or empty — the exact regression being tested against — the assert on line 207 is simply never reached and the test passes silently.

Two asyncio.sleep(0) yields may not be enough for the async stdout reader to have fully iterated; if the reader hasn't run yet, last_output is "" and the branch is skipped entirely. Consider promoting this to an unconditional assert:

Suggested change
if entry.get("last_output"):
assert "Found 3 active tasks" in entry["last_output"]
assert entry.get("last_output"), "last_output must be set once subagent produces stdout"
assert "Found 3 active tasks" in entry["last_output"]

@ErikBjare
Copy link
Copy Markdown
Member

@TimeToBuildBob TimeToBuildBob merged commit d153938 into master Apr 20, 2026
14 checks passed
@TimeToBuildBob TimeToBuildBob deleted the fix/voice-fast-subagent-latency branch April 20, 2026 18:50
@TimeToBuildBob TimeToBuildBob changed the title perf(gptme-voice): skip context loading in fast subagent mode fix(gptme-voice): separate fast and smart subagent models Apr 20, 2026
@TimeToBuildBob TimeToBuildBob changed the title fix(gptme-voice): separate fast and smart subagent models perf(gptme-voice): skip context loading in fast subagent mode Apr 20, 2026
TimeToBuildBob added a commit that referenced this pull request Apr 21, 2026
…719)

Adds keywords and an anti-pattern section for the failure mode of writing
latency-win claims ("should drop to ~5-10s") in PR descriptions or issue
comments without a baseline number or cost-model comparison. The existing
lesson covered code-level premature optimization but did not trigger on
the more common failure of anchoring a projected speedup on one component
(prompt size) instead of the dominant cost (turns × tok/s + startup).

Motivation: ErikBjare/bob#651 — shipped #713 claiming
context-skip would drop voice subagent lookups to 5-10s without showing
that prompt-eval was actually the dominant latency source. Correct
response (measurement-first) was #718.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants