feat(core)!: redesign auto-compaction thresholds with three-tier ladder#4345
feat(core)!: redesign auto-compaction thresholds with three-tier ladder#4345LaZzyMan wants to merge 14 commits into
Conversation
Replaces the single 70% proportional threshold with a three-tier ladder (warn/auto/hard) that combines proportional fallback with absolute reservation. Large-window models (>=128K) now reserve ~33K instead of 30% of the window, freeing tens of thousands of context tokens that the old formula wasted. Other improvements bundled in the same redesign: - Compression sideQuery now disables thinking and caps maxOutputTokens at 20K, matching claude-code so the buffer math is predictable across providers (Anthropic/OpenAI/Gemini handle thinking budgets inconsistently) - Failure handling upgraded from one-shot permanent lock to a 3-strike circuit breaker; reactive overflow still latches immediately - New estimatePromptTokens helper closes the lag-by-one-turn and first-send-is-0 gaps in lastPromptTokenCount - Hard-tier rescue pulls reactive overflow recovery forward to before the API call, saving an oversized round-trip - /context command displays the three-tier ladder + current tier - tipRegistry's context-* tips track the new thresholds instead of fixed 50/80/95 percentages BREAKING CHANGE: chatCompression.contextPercentageThreshold setting is removed. Settings files containing the field log a one-line deprecation warning at startup and the value is ignored; behaviour is now controlled by built-in thresholds via the new computeThresholds() function. Design: docs/design/auto-compaction-threshold-redesign.md Plan: docs/plans/2026-05-14-auto-compaction-threshold-redesign.md
…ss test A pre-existing test case at chatCompressionService.test.ts:678 still passed `hasFailedCompressionAttempt: false` in the CompressOptions shape; rebasing onto current main surfaced this as a typecheck error because the field was renamed to `consecutiveFailures` (Task 7 of the three-tier ladder migration). Update to `consecutiveFailures: 0` — semantically equivalent, the test asserts the side-query is called when `force: true`, no other behaviour change.
Adds a defensive guard in ChatCompressionService.compress() that detects when the side-query summary hit COMPACT_MAX_OUTPUT_TOKENS (20K). In that case the summary is likely truncated mid-content, so we drop it and return NOOP rather than persist a half-summary. The next send re-tries; reactive overflow still catches the catastrophic case where the API rejects the next request as too large. Documented in the design doc as risk #2; the bot reviewer on PR #4168 correctly pushed for it to land alongside the threshold redesign rather than as a follow-up since the new 20K cap is what makes truncation likely in the first place.
The Task 11 redesign updated the non-interactive text formatter
(formatContextUsageText) but left ContextUsage.tsx — the interactive
React component that real /context users see — unchanged. As a result
the TUI still showed the old single "Autocompact buffer" line and none
of the new warn/auto/hard ladder.
Adds a "Compaction thresholds" section after the per-category breakdown:
- Effective window
- Warn / Auto / Hard threshold rows with a ▶ marker on the row the
current usage has crossed
- Current tier label coloured by severity (safe→green, warn/auto→
yellow, hard→red)
The existing progress bar legend (Used / Free / Autocompact buffer)
is preserved because it's tied to the three-segment progress bar
visualisation; the new section adds the absolute numbers + tier badge
on top of that.
Caught by the tmux e2e test (PR #4168 ci-monitor follow-up). Pre-fix
the assertion 'Compaction thresholds' missed completely from the TUI;
post-fix the new section renders correctly for fresh and live sessions
on 1M / 200K / 128K windows.
Behavior fixes: - MAX_TOKENS truncation guard now returns COMPRESSION_FAILED_EMPTY_SUMMARY instead of NOOP so the consecutive-failure breaker actually trips after repeated max-length summaries (R1.1). - Reactive overflow failure increments consecutiveFailures by 1 instead of latching to MAX in one shot, so a transient network blip doesn't permanently disable auto-compaction. The hard-tier rescue resets the counter, which remains the designated recovery path (R1.2). - /context current-tier classification uses rawOverhead (system + tools + memory + skills) as the tier input when API data is not yet available, rather than 0 — large inherited contexts no longer silently show 'safe' (R2.2). Performance: - sendMessageStream computes effectiveTokens ONCE and passes it through TryCompressOptions.precomputedEffectiveTokens, so the cheap-gate inside service.compress doesn't redo the estimation. Also fixes the imageTokenEstimate inconsistency between the rescue and cheap-gate paths (R1.3 + R1.4). - Steady-state path (lastPromptTokenCount > 0) skips the costly getHistory(true) clone — estimatePromptTokens only needs the user message in that branch. Code hygiene: - BYTES_PER_TOKEN → CHARS_PER_TOKEN (inputs are char counts, not byte counts; CJK text would mislead under the old name) (R3.1). - Drop dead getContextUsagePercent helper + index re-export — no callers in source after the threshold rewire (R1.5). - Add a comment on estimatePromptTokens' first-send fallback documenting the ~15-20K under-estimate (system prompt + tools + skills) and that reactive overflow is the safety net (R3.3). Tests: - New CLI ContextUsage.test.tsx exercises the React renderer for the three-tier section: section presence, ▶ marker placement per tier, current-tier label coloring (R1.6). - New chatCompressionService.test.ts case pins that a stale contextPercentageThreshold: 0 value in user settings no longer short-circuits compaction (R2.1). - New tokenEstimation.test.ts case covers functionResponse (distinct nested-parts branch from functionCall) (R3.5). - New geminiChat.test.ts integration test exercises the real ChatCompressionService — not a mock — for the first-send-after- inherited-history scenario where lastPromptTokenCount=0 and only the full-history estimate can cross the auto threshold (R3.4). Declined: R3.2 (change `>=` to `>` on the MAX_TOKENS guard). The current operator catches the at-cap case as suspicious, which is intentional — landing exactly at the output cap is far more likely truncation than clean stop given p99.99 ≈ 17K. With R1.1 in place, persistent truncations trip the breaker after MAX_CONSECUTIVE_FAILURES so the worst case is bounded.
- R5.1: tighten /context tier comment + TODO. The rawOverhead-based fix doesn't cover `--continue` restores with many history messages (since rawOverhead excludes messagesTokens). UI may still show 'safe' for one render until the first send. Documented inline and added a TODO to plumb chat history into collectContextData for same-source-of-truth as the cheap-gate. - R5.2a: add TODO(finish_reason) at the truncation guard. The `>= cap` heuristic false-positives on legitimate at-cap summaries; the proper signal is finish_reason which runSideQuery doesn't surface today. - R5.2b: split telemetry — new CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED enum value. Distinct from EMPTY_SUMMARY so logs/telemetry can tell prompt-quality failures (tune prompt / splitter) from capacity failures (raise cap / shrink splitter input). isCompressionFailureStatus() treats both as failures so the breaker behavior is unchanged. - R5.3: expand consecutiveFailures JSDoc to clarify it tracks "non-force, non-hard-rescue consecutive failures" — hard-rescue resets the counter and force=true skips increments, so the counter is the "regular path" health signal only; reactive overflow is the real safety net for the force-only paths. - R5.4: document the CompressOptions field rename (hasFailedCompressionAttempt: boolean → consecutiveFailures: number) as an SDK breaking change in the design doc with migration guide.
📋 Review SummaryThis PR redesigns auto-compaction triggering from a single proportional threshold to a three-tier ladder (warn/auto/hard) combining proportional floors with absolute-byte reservations. The implementation is well-documented, thoroughly tested, and addresses real issues from the previous PR #4168. Overall assessment: solid engineering with minor areas for improvement. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Self-review (dual reviewer / pr-triage round 1) caught a correctness
regression in the hard-rescue path:
`sendMessageStream` calls `tryCompress(force=true)` from inside the
pre-push window when `effectiveTokens >= hard`. The service's
orphan-strip predicate at `chatCompressionService.ts:426-429` gated on
`force` alone, which conflated two distinct call shapes:
- manual `/compress` (force=true, trigger='manual'): user-initiated
between turns; trailing model funcCall IS orphaned because no
funcResponse is coming
- hard-rescue (force=true, trigger='auto'): automatic mid-turn;
trailing model funcCall is ACTIVE because its matching funcResponse
is sitting in the pending `userContent` waiting to be pushed
The strip fired for both, so a hard-rescue triggered mid tool-use loop
would drop the active funcCall. After compression returned and
`userContent` (the funcResponse) was pushed, the next API request
carried tool_result with no matching tool_use → provider validation
error.
The in-code comment at L422-424 already documented this exact
constraint for the auto-compress case (`force=false`), but reusing
`force=true` for hard-rescue silently violated the same constraint.
Fix:
- Gate `hasOrphanedFuncCall` on `compactTrigger === 'manual'` instead
of `force`. The trigger field already disambiguates intent.
- `sendMessageStream` hard-rescue now passes `trigger: 'auto'`
explicitly (without it, `force=true` defaults to `trigger='manual'`
via the `?? (force ? 'manual' : 'auto')` resolver).
Sibling audit for "force=true non-manual callsites":
- `GeminiClient.tryCompressChat` (manual /compress): correct — manual
- `sendMessageStream` hard-rescue: fixed in this commit
- `sendMessageStream` reactive overflow catch: already passes
trigger='auto'; runs AFTER API call (userContent in history), so if
it observes a trailing funcCall it IS orphaned but findCompressSplitPoint
handles the case without needing the strip
RED-first regression test added:
`preserves trailing model+funcCall under hard-rescue (force=true + trigger=auto)`
in `chatCompressionService.test.ts`. Failed against pre-fix code (the
strip dropped the funcCall); passes against the fix.
Adjacent fixes from the same triage round:
- `docs/users/configuration/settings.md`: the
`chatCompression.contextPercentageThreshold` row still said "use 0
to disable compression entirely" — code has ignored the value since
the removal commit. Marked the row REMOVED with migration guidance
pointing at the design doc.
- `packages/core/src/config/config.ts`: the deprecation warning now
tells users how to silence it (remove the key) and where to read
current behavior, instead of just announcing the removal.
- `docs/design/auto-compaction-threshold-redesign.md`: closed Open
Question 2 (small-window hard/auto collapse) — decision is to NOT
annotate `/context`, with rationale on file.
Tests: 2395 core tests passing, typecheck clean.
Self-review pass — pre-emptive dual-reviewer triageRan this PR through the dual-reviewer pipeline (Claude + Codex) before more bot reviews pile in. 6 findings surfaced; applied pr-triage's round-weighted bar + re-grade table. Outcomes: ✅ Fixed in 50bac97C1 [critical, high · very high]: hard-rescue + force-orphan-strip conflation. The in-code comment at C2 [medium · very high]: docs row contradicted code. M1 [low · high]: deprecation warning not actionable. The startup warning now tells users to remove the key from M2a [low · very high]: design doc Open Question 2 closed. Decided the small-window ⏭ Deferred (intentionally not fixed in this PR)Three findings re-graded to M2b [low · very high]: silent UX on small-window M3 [low · very high]: M4 [low · medium]: Triage discipline notes
Net change: +111 / -11 (6 files). All 2395 core tests pass, typecheck clean. |
Self-review on the 50bac97 commit caught a direction error in the M2a Open Question 2 closure note: said `currentTier` skips `'hard'` and goes to `'auto'` on collapsed windows, which is backwards. `contextCommand.ts:43-44` checks `tokens >= thresholds.hard` first (no `hard > auto` guard — that fix lives in a separate follow-up), so when `hard === auto` the `'hard'` branch matches first and the `'auto'` band is the empty one. Updated the rationale to describe the actual collapse direction and cite the source-of-truth file:line. Conclusion of the open question (don't annotate `/context`) is unchanged — only the explanation is corrected.
…ion tests The auto-compress and hard-rescue tests for "trailing funcCall is active, not orphaned" shared a byte-identical 4-message history and mock setup. Pull both into setupInFlightFuncCallFixture() inside the describe block so each test only contains the scenario name, the compress() call shape, and its own assertions. Net -29 LOC, no behavior change.
There was a problem hiding this comment.
Pull request overview
Implements the core/CLI pieces of the auto-compaction threshold redesign: replacing the legacy single 70% threshold with a three-tier warn/auto/hard ladder, adding a hard-tier “rescue” compaction before oversized sends, and updating supporting UX (tips + /context) and config deprecations accordingly.
Changes:
- Add
computeThresholds(window)and wire it into compaction gating (cheap-gate) and GeminiChat hard-tier rescue. - Replace the one-shot compaction failure latch with a
consecutiveFailurescircuit breaker, and cap/standardize compaction sideQuery output (maxOutputTokens=20K, thinking disabled). - Add local token estimation helpers and update CLI context display/tips to reflect warn/auto/hard tiers; deprecate
chatCompression.contextPercentageThreshold.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/services/tokenEstimation.ts | Adds char-based token estimation helpers for cheap-gate / hard-tier sizing. |
| packages/core/src/services/tokenEstimation.test.ts | Unit tests for token estimation helpers. |
| packages/core/src/services/chatCompressionService.ts | Introduces threshold ladder constants + computeThresholds, updates gating, caps sideQuery output, adds truncated-output failure. |
| packages/core/src/services/chatCompressionService.test.ts | Expands coverage for ladder math, breaker behavior, sideQuery config, and estimation-based gating. |
| packages/core/src/index.ts | Re-exports computeThresholds / CompactionThresholds for CLI consumption. |
| packages/core/src/core/turn.ts | Adds COMPRESSION_FAILED_OUTPUT_TRUNCATED to CompressionStatus. |
| packages/core/src/core/geminiChat.ts | Wires hard-tier rescue, estimation, and consecutiveFailures circuit breaker into send flow. |
| packages/core/src/core/geminiChat.test.ts | Updates/adds tests for breaker behavior, first-turn estimation wiring, and hard-tier rescue. |
| packages/core/src/core/client.ts | Removes unused export of the old compression threshold constant. |
| packages/core/src/core/client.test.ts | Updates terminology in comments to match consecutiveFailures. |
| packages/core/src/config/config.ts | Deprecates chatCompression.contextPercentageThreshold with a startup warning and ignores it logically. |
| packages/core/src/config/config.test.ts | Adds tests validating the deprecation warning behavior. |
| packages/cli/src/ui/types.ts | Adds context tier + thresholds types to carry ladder info to UI. |
| packages/cli/src/ui/hooks/useContextualTips.ts | Injects computed thresholds into tip context. |
| packages/cli/src/ui/components/views/ContextUsage.tsx | Renders the new “Compaction thresholds” section with tier marker/badge. |
| packages/cli/src/ui/components/views/ContextUsage.test.tsx | Tests the new thresholds section rendering and marker placement. |
| packages/cli/src/ui/components/Tips.test.ts | Updates tip-trigger fixtures to use warn/auto/hard thresholds. |
| packages/cli/src/ui/commands/contextCommand.ts | Updates /context computations and text output to show effective window + warn/auto/hard tiers. |
| packages/cli/src/ui/commands/contextCommand.test.ts | Adds tests for three-tier threshold display and tier classification. |
| packages/cli/src/services/tips/tipRegistry.ts | Rewrites context-* tips to use warn/auto/hard comparisons (vs percentages). |
| packages/cli/src/services/tips/tipRegistry.test.ts | Adds tests ensuring tip relevance aligns with the new ladder. |
| packages/cli/src/services/tips/index.ts | Removes getContextUsagePercent export (no longer used). |
| docs/users/configuration/settings.md | Documents removal/deprecation of contextPercentageThreshold and new ladder behavior. |
| docs/plans/2026-05-14-auto-compaction-threshold-redesign.md | Adds implementation plan / task breakdown for the redesign. |
| docs/design/auto-compaction-threshold-redesign.md | Adds design doc describing the ladder, rationale, and migration notes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review
- geminiChat: remove pre-call consecutiveFailures reset in hard-rescue. force=true already bypasses the breaker check in chatCompressionService; the pre-reset was redundant on success (post-call L614 already handles it) and *broke* the breaker on failure paths — hard-rescue failures don't increment via tryCompress (force=true skips that branch), only the reactive overflow path at L992 explicitly increments. With the pre-reset the counter oscillated 0↔1 every send and MAX_CONSECUTIVE_FAILURES=3 was unreachable. Wrote a RED test asserting the forwarded counter is the latched value, not zero; the test failed against the old code and passes with the reset removed. - geminiChat: log hard-tier-rescue triggers via debugLogger.warn including effectiveTokens, hard, and the current consecutiveFailures so operators debugging "compaction stopped working" have a breadcrumb. - chatCompressionService: clamp effectiveWindow to >= 0 in computeThresholds so the value surfaced in /context stays meaningful for tiny windows (window < SUMMARY_RESERVE). auto/warn/hard outputs are unaffected because each is Math.max(proportional, absolute) and the proportional branch dominates whenever the absolute branch goes negative. - turn.ts: rewrite COMPRESSION_FAILED_OUTPUT_TRUNCATED docstring. Drop the misleading "compression succeeded" framing (the summary is dropped and isCompressionFailureStatus returns true) and reference the full enum name COMPRESSION_FAILED_EMPTY_SUMMARY instead of the abbreviation. - contextCommand.test.ts: reword the no-API-data-session test comment. collectContextData classifies estimated sessions against rawOverhead; with default fixtures rawOverhead lands in `safe`, but heavy system-prompt / skill / MCP loads can push it into warn/auto/hard. - design doc Background: prepend a blockquote clarifying the section describes pre-redesign behavior and that the inline file:line references point at code before PR #4345 (which removes them). - ui/types: replace the duplicated ContextThresholds interface with a type alias to the core's CompactionThresholds. Field-by-field copy in contextCommand.ts becomes a direct spread. ContextUsage.tsx keeps its CompactionThresholds React component name — the alias avoids the collision a direct import would have caused. - contextCommand: interpolate the actual reserve value into the "(window − 20K reserve)" annotation so SUMMARY_RESERVE retuning doesn't leave the text stale.
…istoryShallow) Main landed #4286 (replace structuredClone with shallow copy) which: - Reverted #4186's heap-pressure auto-compaction safety net (#4286 removed HEAP_PRESSURE_COMPRESSION_RATIO because the underlying OOM cause was fixed by the shallow-copy refactor) - Reverted #4168's consecutiveFailures ladder back to single-shot hasFailedCompressionAttempt - Introduced getHistoryShallow() / peekLastHistoryEntry() to replace structuredClone-based history access - Added a Chinese-language design doc draft for this exact redesign Resolution strategy: - Take OUR redesign everywhere it conflicts: three-tier threshold ladder, consecutiveFailures circuit breaker, hard-rescue, token estimator, hard-rescue debug log, CompressOptions plumbing for pendingUserMessage / precomputedEffectiveTokens / trigger. - DROP all bypassTokenThreshold / heapPressureCompressionCooldownUntil / HEAP_PRESSURE_* / mockGetHeapStatistics / mockHeapPressure code (heap-pressure mechanism is gone on main; we're not reviving it). - Use main's new getHistoryShallow(true) in chatCompressionService and in the hard-tier rescue estimator path (was getHistory(true) before main's refactor; the shallow path is what other compaction call sites now use). - For chatCompressionService.test.ts inline mockChat objects, alias getHistoryShallow to the same vi.fn() as getHistory so existing .mockReturnValue() calls drive both methods. - For the design doc, keep our resolved Open Question 2 closure rationale and prepend the round-2 blockquote clarifying the Background section describes pre-redesign behavior; take main's slightly more thorough SUMMARY_RESERVE paragraph where it explains both with/without-thinking cases. - Replace the round-2 test that asserted "hard-rescue forwards consecutiveFailures=3" with a test compatible with the post-merge history-access shape (now using getHistoryShallow). 346 core tests passing; CLI typecheck clean for affected files. Pre-existing provider-config typecheck errors from main's #4287 refactor are unrelated to this PR and not touched here.
Round-2 review-feedback follow-ups + main-mergeFix commit
Merge commit Main landed #4286 between the previous fix push and now. #4286 removed #4186's heap-pressure mechanism (because the underlying OOM cause was fixed by the shallow-copy refactor) and introduced
346 core tests passing; pre-existing typecheck errors from #4287's provider refactor on main are unrelated to this PR. |
Local maintainer validation — all PR-relevant gates green ✅Reviewed at head Environment
Results
Triage of the 32 core test failures (NOT caused by PR 4345)
Combined gitDiff + crawler isolated re-run: 103/103 passed in 64.85s (durations ~1.7–3.3 s each, well under the 5 s testTimeout when the box isn't busy). End-to-end proof (against built
|
| Window | warn | auto | hard | effective | Dominant branch | hard collapse? |
|---|---|---|---|---|---|---|
| 32K | 19,200 | 22,400 | 22,400 | 12,000 | proportional | yes (hard=auto) |
| 64K | 38,400 | 44,800 | 44,800 | 44,000 | proportional | yes (hard=auto) |
| 128K | 76,800 | 95,000 | 105,000 | 108,000 | mixed | no |
| 200K | 147,000 | 167,000 | 177,000 | 180,000 | absolute | no |
| 1M | 947,000 | 967,000 | 977,000 | 980,000 | absolute | no |
Additional verified properties:
- Spec constants exposed correctly (SUMMARY_RESERVE=20K, COMPACT_MAX_OUTPUT_TOKENS=20K, AUTOCOMPACT_BUFFER=13K, WARN_BUFFER=20K, HARD_BUFFER=3K, DEFAULT_PCT=0.7, WARN_PCT_OFFSET=0.1, CHARS_PER_TOKEN=4)
- Wasted reservation cap claim holds:
window − hardis 23K at 128K / 200K / 1M (vs naive 30%×window of 38K / 60K / 300K — saves 15K / 37K / 277K respectively) - Invariants
warn ≤ auto ≤ hard ≤ effectiveWindowhold for mid/large windows; small-window collapse (hard = auto > effectiveWindow) is by design and is correctly identified - Edge cases:
window=0→ all zeros;window=20K(SUMMARY_RESERVE) → effective=0, auto=14000, warn=12000, hard=14000 estimateContentTokens: char/4 estimator matches char-count input (400 chars → 100 tokens; 200+800 chars → 250 tokens)estimatePromptTokens: additive path returnslastPromptTokenCount + estimate(userMessage); first-send fallback (lastPromptTokenCount === 0) returnsestimate(history + userMessage)as documented- Tier-transition walkthrough on 128K window confirms boundaries are inclusive: 76,799 → below, 76,800 → warn, 94,999 → warn, 95,000 → auto, 104,999 → auto, 105,000 → hard
Scope / risk / breaking-change notes
- Diff: +3,986 / −222 across 25 files (3 docs + 5 CLI source + 4 CLI tests + 5 core source + 4 core tests + 1 new core file + 1 settings doc + 1 cli types + 1 tipRegistry source/test pair). PR author's own commits: 11 (with 1 round of post-review fixes).
- Breaking change (signaled by
!in title and called out in PR body):chatCompression.contextPercentageThresholdremoved — silently ignored with deprecation warningCompressOptions.hasFailedCompressionAttempt: boolean→consecutiveFailures: number(SDK-level)
- Thinking disabled on compression-side query (
includeThoughts: false) — addresses inconsistent per-provider semantics - Smaller windows (32K / 64K) auto-fall back to the proportional branch — verified above
- Larger windows (≥128K) capped wasted reservation at ~23K instead of 30%×window — verified above
Reviewer recommendation
Safe to merge with awareness of the documented breaking changes. The threshold ladder is mathematically correct, fully matches the published spec across all tested window sizes, gracefully degrades for tiny windows, and the PR-touched test surface (495 tests) is fully green. The 32 full-suite failures are entirely environmental:
- 3 reproduce identically on the PR base (path-shape fixture fragility, Claude-Code UA injection)
- 29 are 5-second-timeout races between parallel test suites and pass cleanly in isolation
CI on GitHub (the 3 platform test matrices) was still running at validation time and will provide an independent confirmation on clean macOS / Ubuntu / Windows runners.
— Maintainer local validation, run on 7c7c2d20 from upstream pull/4345/head.
R3-1: rewrite the stale "Hard-tier rescue resets the counter" comment in
the reactive-overflow path. The R2 commit removed the pre-call reset
from hard-rescue; the only counter-reset path is now the post-call
COMPRESSED branch in tryCompress. Two contradicting comments in the
same file would mislead a future maintainer tracing the lifecycle.
R3-2: rewrite the JSDoc on CompactionThresholds.hard. The "(resets
failure counter)" phrasing was true under the pre-R2 design; after R2
the hard threshold force-triggers compaction and bypasses the breaker,
but does not reset the counter (which only happens on COMPRESSED
success via the post-call branch). The type is consumed by both
geminiChat and the CLI UI (via ContextThresholds alias), so the
authoritative description had to match the actual contract.
R3-3: add a Step 3 to the hard-rescue regression test. The test title
claims "success recovers via the post-call branch" but the original
Steps 1-2 only verified the latched counter was forwarded INTO the
call. Step 3 follows up with a below-hard send and asserts the
forwarded counter is 0 — proving geminiChat.ts:614 ran on the
COMPRESSED result.
R3-4: assert effectiveWindow === 0 on the existing extreme-small-window
test and add a separate zero-window edge case. The Math.max(0, ...)
clamp from R2 was previously unasserted; a regression that removed
the clamp would go undetected.
R4-1: forward originalTokenCount on the breaker-NOOP path in
chatCompressionService.compress() to match the adjacent
threshold-NOOP path (L368-369). Returning {originalTokenCount: 0,
newTokenCount: 0} masked "breaker tripped at N tokens" as
"empty session" in telemetry dashboards.
R4-2a: add debugLogger.warn at the two consecutiveFailures increment
sites (cheap-gate path L586 and reactive-overflow path L955) when
the counter reaches MAX_CONSECUTIVE_FAILURES. The breaker is one of
the PR's headline safety features but, prior to this round, had zero
observability when it tripped. Required importing MAX_CONSECUTIVE_FAILURES
into geminiChat.ts.
R4-3: programmatically link tokenEstimation.ts's CHARS_PER_TOKEN to
compactionInputSlimming.ts's TOKEN_TO_CHAR_RATIO. Both are 4 today
and represent the same generic char/token conversion. Exporting from
compactionInputSlimming and aliasing in tokenEstimation eliminates
the silent-drift hazard the JSDoc already warned about.
Declined (round-weighted bar at round 4):
- R3-5: debugLogger test for hard-rescue trigger — observability test
coverage is overthinking at round 3+; the log is informational.
- R4-2b: expose breaker state in /context — new feature; out of scope.
- R4-4: render test for auto-tier marker — test coverage gap on
working code, defer to follow-up PR per round-weighted bar.
- R4-5a: extract makeFakeChat/makeFakeConfig shared factory — pure
test refactor at round 4, not a fix.
- R4-5b: direct unit test for precomputedEffectiveTokens — exercised
indirectly via hard-rescue path tests in geminiChat.test.ts.
- R4-6: truncation-guard fallback test for missing candidatesTokenCount
— code already has a TODO acknowledging the heuristic is imperfect
(chatCompressionService.ts:549-553); defer.
Round 3 + Round 4 follow-ups (bundled in
|
R5-1: assert breaker-NOOP forwards originalTokenCount. R4-1 changed the
breaker-NOOP return from `{0, 0}` to `{originalTokenCount, originalTokenCount}`
so telemetry can distinguish "breaker tripped at N tokens" from
"empty session", but the existing test only checked compressionStatus
and newHistory. Now seeds a non-zero originalTokenCount (120K) and
asserts both fields forward it.
R5-2: forward originalTokenCount on the empty-history NOOP. This was
sibling drift on R4-1 — I fixed the cited breaker-NOOP site but missed
the empty-history NOOP. Of 5 NOOP return sites in chatCompressionService,
4 now forward originalTokenCount (breaker, threshold-gate, post-split,
min-compression-fraction) and 1 (this one) was still returning `{0, 0}`,
breaking the project-wide invariant. Now consistent.
R5-3: replace 10 stale line-number references with semantic anchors.
After the R3+R4 push, the line refs in my R2/R3 comments (`geminiChat.ts:614`,
`chatCompressionService.ts:339`, `line 992`, `L627`, `line 944`) no longer
pointed at their original targets — `geminiChat.ts:614` now points at
`setSystemInstruction`'s body, completely unrelated to compaction. The
pattern itself is fragile; semantic phrasing ("the post-call reset in
tryCompress's COMPRESSED handler") doesn't drift when lines shift.
347/347 affected core tests passing locally; typecheck clean.
Round 5 follow-ups (bundled in
|
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] CompressionMessage.tsx (not in this PR's diff) has no case for the new COMPRESSION_FAILED_OUTPUT_TRUNCATED status in its getCompressionText() switch — it falls through to default: return '', producing a blank message. Consider adding a handler in a follow-up to avoid silent UI when compression hits the output cap.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
R6-1: rewrite the stale JSDoc bullet on `consecutiveFailures` (the "Hard-tier rescue failures" bullet). The old wording said "the counter is reset to 0 BEFORE the rescue call" — that contradicted R5 which explicitly removed the pre-call reset. Now the bullet matches the actual behavior: counter is NOT pre-reset, force=true bypasses the breaker, post-call COMPRESSED handler resets on success, reactive overflow is the explicit-increment safety net. My R5 stale-comment sweep only grep'd inline `//` comments; this JSDoc on the field declaration slipped through. Re-audited "reset to 0 BEFORE" / "pre-reset" across both packages — single site remaining. R6-7: assert `passedOpts.trigger === 'auto'` in the hard-rescue test. This field is the orphan-strip safety wire added by the C1 fix (the service's `compactTrigger === 'manual'` check would otherwise strip the trailing active funcCall mid tool-loop). The test asserted force and pendingUserMessage but not the trigger; a refactor dropping the 'auto' from `trigger: shouldForceFromHard ? 'auto' : undefined` would silently break orphan-strip safety. Now regression-guarded with a single-line expect. 164/164 affected core tests passing locally. Declined per round-weighted bar (round 6 defaults Suggestion / Test coverage / Style to overthinking): - R6-2/3/6: test-coverage gaps on working code — defer to follow-up - R6-4: redundant truthy guard on always-set fields — style nit - R6-5: text-vs-UI inconsistency on /context — existing test enforces current behavior; treat as design decision (offer follow-up if reviewer escalates) - R6-8 (tipRegistry small-window context-high): explicitly closed in design doc's Open Question 2 — small windows have empty context-high band by design; UI work is out-of-scope for this PR - R6-9: wasted clone on rare fallback path — Suggestion-level perf - R6-10 (CompressionMessage missing case): file not in this PR's diff; reviewer themselves proposed it as follow-up
Round 6 follow-ups (bundled in
|
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. Thorough manual review of all 26 changed files (+4076/-227) found zero new issues at commit a81438e. All prior Critical/Suggestion items from rounds 1-6 have been addressed. tsc and eslint clean on changed files. 371 tests pass (core: 347, cli: 24). Build clean. Will upgrade to Approve once CI passes. — qwen-latest-series-invite-beta-v36 via Qwen Code /review
Summary
Redesigns auto-compaction triggering from a single proportional threshold (70% of window) to a three-tier ladder (warn / auto / hard) that combines a proportional floor with absolute-byte reservations near the window edge. Pairs with a hard-tier rescue that forces compaction one send before the API would reject an oversized prompt.
auto = max(0.7 × window, effectiveWindow − 13K)— proactive cheap-gatewarn = max(0.6 × window, auto − 20K)— UX hint tierhard = max(effectiveWindow − 3K, auto)— force-compaction triggereffectiveWindow = window − SUMMARY_RESERVE (20K)— input + summary budgetCOMPACT_MAX_OUTPUT_TOKENS = 20K— pinned summary cap (matches SUMMARY_RESERVE)Small windows (32K / 64K) automatically fall back to the proportional branch — large windows are dominated by the absolute branch, capping wasted reservation at ~33K instead of 30% of the window.
Migration / breaking
chatCompression.contextPercentageThresholdremoved. The new ladder is computed internally from the context window; the field is silently ignored with a one-line deprecation warning at startup.CompressOptions.hasFailedCompressionAttempt: boolean→consecutiveFailures: number(SDK breaking change documented in design doc with migration guide).includeThoughts: false) on compression side query — per-provider thinking-budget semantics are inconsistent.Why a new PR
PR #4168 ran 12 review rounds and accumulated 6303 LOC of which ~2700 was review-driven scope creep (observability throttles, hostile-provider hardening, new escape-hatch feature, i18n baselines, cross-file constant unification, etc.) — much of it self-inflicted regression chains. The PR became un-reviewable.
This PR ships only the original spec + early-discovered real bugs (R1-R5) — 4288 LOC, 6 clean commits. The load-bearing review-driven refinements from R6-R12 are deferred to focused follow-up issues so each can be reviewed in isolation:
hardRescueFailureCount) — prevents infinite rescue retries on un-compressible history<state_snapshot>envelope extraction — data-retention fix for scratchpad contentchatCompression.disabledescape hatch — opt-out for compliance / audit sessionslastCandidatesTokenCountplumbing — closes one-response under-estimate in steady-stateOld PR's full archaeology preserved at tag
pr-4168-archive-pre-revert.Design doc
docs/design/auto-compaction-threshold-redesign.md(included in this PR).Test plan
npx vitest run --root packages/core src/services/chatCompressionService.test.ts src/core/geminiChat.test.ts src/services/tokenEstimation.test.ts→ 172/172 passingnpm run typecheck --workspace=packages/corecleannpm run build --workspace=packages/coreclean