diff --git a/docs/design/auto-compaction-threshold-redesign.md b/docs/design/auto-compaction-threshold-redesign.md new file mode 100644 index 0000000000..79bd6a8afc --- /dev/null +++ b/docs/design/auto-compaction-threshold-redesign.md @@ -0,0 +1,418 @@ +# Auto-Compaction Threshold Redesign + +**Status:** Draft · 2026-05-14 + +## 背景 + +当前 qwen-code 的自动压缩仅使用单一比例阈值 `COMPRESSION_TOKEN_THRESHOLD = 0.7`(`chatCompressionService.ts:33`),所有窗口大小共用同一比例。对比 claude-code 的「绝对 token 梯子」(autoCompact.ts:62-65),qwen-code 存在三个具体问题: + +1. **大窗口下预留过多**:1M 模型 70% 阈值在 700K 触发,剩余 300K 远超摘要 + 输出实际所需的 ~33K +2. **失败 1 次永久锁**:`hasFailedCompressionAttempt = true` 之后整个 session 不再尝试 auto-compact(geminiChat.ts:504),比 claude-code 的「连续 3 次熔断」更严苛 +3. **tip 系统与 auto 阈值脱钩**:`tipRegistry.ts` 里的三条 `context-*` tip 使用固定的 50/80/95 百分比,与 auto-compact 阈值(70%)完全独立。这意味着在「auto 正常工作」的主路径上 80% / 95% tip 极少触发,而在「auto 失败 / 反应式兜底」的边缘路径上又缺乏与阈值对齐的语义 +4. **压缩调用本身没有输出预算控制**:[chatCompressionService.ts:374-376](packages/core/src/services/chatCompressionService.ts:374) 显式开启 `thinkingConfig.includeThoughts = true`(注释:「Compression quality drives every subsequent main turn」),同时 sideQuery 调用未设 `maxOutputTokens` 上限。代码注释([:436-437](packages/core/src/services/chatCompressionService.ts:436))也承认 `compressionOutputTokenCount may include non-persisted tokens (thoughts)`。在压缩接近窗口顶时,总输出可能膨胀,使 buffer 预留缺乏可预测上限。

更糟糕的是跨 provider 行为不一致:Anthropic 的 thinking budget 与 max_tokens 完全独立;OpenAI 的 reasoning tokens 不受 max_completion_tokens 限制;Gemini 的行为又因模型版本而异。这意味着「单靠加 maxOutputTokens 就能控制总输出」在 qwen-code 这种多 provider 项目里不成立 + +5. **阈值判断使用的 `lastPromptTokenCount` 系统性下偏。** [geminiChat.ts:1217-1232](packages/core/src/core/geminiChat.ts:1217) 表明这个数来自上一轮 API response 的 `usageMetadata.totalTokenCount`。两个 gap:(a) 不包含本轮即将加入的 user message,每次 cheap-gate 判断都比真实 prompt 小一段;(b) 首轮初始值是 0,`--continue` 恢复巨大 session / sub-agent 继承大量历史时第一次 send 永远绕过所有阈值。对比 claude-code 的 `tokenCountWithEstimation`([query.ts:638](src/query.ts:638))走「最后一条 assistant API usage + 之后新增 message 估算」的双轨制能闭合这两个 gap + +## 设计目标 + +- 引入「比例 + 绝对」混合阈值,让大窗口模型由绝对值接管,小窗口仍走比例兜底 +- 新增 warn / hard 两层(auto 保留为主触发点),形成三层梯子 +- 把 tip 系统重写为跟随新阈值的触发条件 +- 失败处理从「1 次永久锁」升级为「3 次熔断 + 自动恢复」 +- **压缩调用关闭 thinking 并加 `maxOutputTokens` 上限**:与 claude-code 对齐,让总输出受单一参数约束、buffer 预算可预测;接受压缩质量可能下降的代价 +- **加 token 估算补偿**:消除 `lastPromptTokenCount` 的「滞后一轮」和「首轮为 0」两个系统性下偏,让阈值判断更贴近真实 prompt 大小 +- 删除 settings 里的 `contextPercentageThreshold` 配置入口(内部 PCT 常量保留) +- **不引入** env 覆盖通道、**不**新增显式 enabled 开关 + +## 三层阈值梯子 + +``` + window (raw context window) + │ + │ ← SUMMARY_RESERVE = 20K + ▼ + effectiveWindow + │ + │ ← HARD_BUFFER = 3K + ▼ + hard_threshold = effectiveWindow - 3K + │ + │ ← (AUTOCOMPACT_BUFFER - HARD_BUFFER) = 10K + ▼ +auto_threshold = max(PCT * window, effectiveWindow - AUTOCOMPACT_BUFFER) + │ + │ ← WARN_BUFFER = 20K + ▼ +warn_threshold = max((PCT - WARN_OFFSET) * window, auto_threshold - WARN_BUFFER) + │ + ▼ + 0 +``` + +### 三层语义 + +| 层 | 触发条件 | 行为 | +| -------- | ------------------------------ | -------------------------------------------------------- | +| **warn** | `tokenCount >= warn_threshold` | UI 提示「距自动压缩还剩 X tokens」,不改变 send 行为 | +| **auto** | `tokenCount >= auto_threshold` | 在 send 前 `tryCompress(force=false)`,正常压缩流程 | +| **hard** | `tokenCount >= hard_threshold` | 在 send 前 `tryCompress(force=true)`,重置失败锁强制压缩 | + +`hard` 层等同于把现有 reactive overflow(geminiChat.ts:711)的兜底逻辑提前到 send 前,避免一次失败的 oversized request round-trip。 + +## 内部常量 + +```ts +// chatCompressionService.ts +const DEFAULT_PCT = 0.7; // auto 比例兜底 +const WARN_PCT_OFFSET = 0.1; // warn 比例 = PCT - WARN_OFFSET = 0.6 +const COMPACT_MAX_OUTPUT_TOKENS = 20_000; // 压缩 sideQuery 输出硬上限(thinking + summary 合计) +const SUMMARY_RESERVE = 20_000; // 阈值梯子从窗口顶减去的输出预留 = maxOutput +const AUTOCOMPACT_BUFFER = 13_000; // auto 与 effectiveWindow 间距 +const WARN_BUFFER = 20_000; // warn 与 auto 间距 +const HARD_BUFFER = 3_000; // hard 与 effectiveWindow 间距 +const MAX_CONSECUTIVE_FAILURES = 3; // 失败熔断阈值 +``` + +数值来源:全部沿用 claude-code 的实测值([autoCompact.ts:30,62-65](src/services/compact/autoCompact.ts:30))。 + +`SUMMARY_RESERVE = COMPACT_MAX_OUTPUT_TOKENS` 是关键关系:模型受 `maxOutputTokens` 硬限制约束,输出不可能超出 20K,因此 reserve 不需要额外 safety margin。注意:本设计关闭 thinking 后该等式成立(output budget 全部给 summary);若保留 thinking,`thinking + summary` 共享预算(Gemini SDK / 多数 provider 的 `maxOutputTokens` 语义),模型自行在两者间分配,此时 summary 的实际可用空间小于 20K(见「风险与注意事项」第 1、2 条)。 + +## 计算函数 + +```ts +export interface CompactionThresholds { + warn: number; + auto: number; + hard: number; // 当 hard < auto 时等于 auto(小窗口退化) + effectiveWindow: number; +} + +export function computeThresholds(window: number): CompactionThresholds { + const effectiveWindow = window - SUMMARY_RESERVE; + + const absAuto = effectiveWindow - AUTOCOMPACT_BUFFER; + const auto = Math.max(DEFAULT_PCT * window, absAuto); + + const absWarn = auto - WARN_BUFFER; + const warn = Math.max((DEFAULT_PCT - WARN_PCT_OFFSET) * window, absWarn); + + const rawHard = effectiveWindow - HARD_BUFFER; + const hard = Math.max(rawHard, auto); // 小窗口下退化为 auto + + return { warn, auto, hard, effectiveWindow }; +} +``` + +### 实测数据 + +| 窗口 | warn | auto | hard | 备注 | +| ---- | ----------- | ----------- | ------------ | ------------------------------- | +| 32K | 19.2K (pct) | 22.4K (pct) | 22.4K (退化) | 比例兜底 | +| 64K | 38.4K (pct) | 44.8K (pct) | 44.8K (退化) | 比例兜底 | +| 128K | 76.8K (pct) | 95K (abs) | 105K (abs) | 混合(warn=pct, auto/hard=abs) | +| 200K | 147K (abs) | 167K (abs) | 177K (abs) | 绝对接管 | +| 256K | 203K (abs) | 223K (abs) | 233K (abs) | 绝对接管 | +| 1M | 947K (abs) | 967K (abs) | 977K (abs) | 全绝对 | + +`(pct)` 表示该层由比例公式决定,`(abs)` 表示由绝对值公式决定。 + +## 用户配置 + +### ChatCompressionSettings 变更 + +```ts +// packages/core/src/config/config.ts:217 +export interface ChatCompressionSettings { + /** 保留(与本设计无关,由 compactionInputSlimming 使用) */ + imageTokenEstimate?: number; +} +``` + +**删除:** `contextPercentageThreshold` 字段。理由: + +1. 新公式下,对主流窗口(>= 128K)该字段几乎无影响——绝对值接管 +2. 小窗口下用户配置反而可能让阈值"更早"压缩,与节省 token 直觉相反 +3. claude-code 没有暴露此字段,无类似的用户面配置先例 + +### Breaking change 处理 + +启动时 `Config` 加载发现 `chatCompression.contextPercentageThreshold` 存在: + +- 写入 stderr 一行警告:`"chatCompression.contextPercentageThreshold has been removed and is now controlled by built-in thresholds."` +- **不**报错、**不**阻塞启动 +- 字段值被忽略 + +## Token 估算补偿 + +qwen-code 的 `lastPromptTokenCount` 来自上一轮 API response 的 `usageMetadata.totalTokenCount`([geminiChat.ts:1217-1232](packages/core/src/core/geminiChat.ts:1217))。这导致: + +1. **滞后一轮**:cheap-gate 用 `lastPromptTokenCount` 判断,但本次 send 实际 prompt = 它 + 本轮 user message。少算的部分可能让阈值判断 false-negative +2. **首轮为 0**:初始值是 0,第一次 send 时无论历史多大都不会触发任何阈值(含 `--continue` 恢复 / sub-agent 继承场景) + +引入轻量本地估算函数 `estimatePromptTokens`,在 send 前 cheap-gate / hard 判断时补足这两段缺失: + +```ts +// chatCompressionService.ts(或新文件 packages/core/src/services/tokenEstimation.ts) + +const BYTES_PER_TOKEN = 4; // 通用 char/4 估算(claude-code 同此) +const BYTES_PER_TOKEN_JSON = 2; // JSON / tool_call input 更密集 + +/** + * 估算一组 Content 的 token 数,用于补偿 API usage metadata 的滞后。 + * 对 image / document 复用现有 imageTokenEstimate(默认 1600)。 + */ +export function estimateContentTokens( + contents: Content[], + imageTokenEstimate = DEFAULT_IMAGE_TOKEN_ESTIMATE, +): number { + // 复用 estimateContentChars(compactionInputSlimming.ts),再除以 bytesPerToken + // 内部对 functionCall / functionResponse 用 BYTES_PER_TOKEN_JSON + // ... +} + +/** + * cheap-gate 与 hard 判断的统一入口。 + * 主路径:lastPromptTokenCount 准 + 本轮 user message 估算 + * 首轮路径:full history 估算 + */ +export function estimatePromptTokens( + history: Content[], + userMessage: Content, + lastPromptTokenCount: number, +): number { + if (lastPromptTokenCount > 0) { + return lastPromptTokenCount + estimateContentTokens([userMessage]); + } + return estimateContentTokens([...history, userMessage]); +} +``` + +应用位置: + +- `chatCompressionService.compress()` 的 cheap-gate:把 `originalTokenCount` 来源换成 `estimatePromptTokens(history, userMessage, lastPromptTokenCount)` +- `geminiChat.sendMessageStream` 入口的 hard 判断(见下一节) + +**估算只用于提前触发,不用于「跳过触发」。** 因为 char/4 是粗略下界估计,作为 false-positive 一侧是安全的(宁可早一点压),作为 false-negative 则不可靠。 + +## 触发链路改动 + +### chatCompressionService.ts + +1. **导出 `computeThresholds`**,供 cheap-gate / UI / 命令复用 +2. **`compress()` cheap-gate** (line 221-249): + ```ts + if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES && !force) { + return NOOP; + } + const { auto } = computeThresholds(contextLimit); + const effectiveTokens = estimatePromptTokens( + curatedHistory, + userMessage, + originalTokenCount, + ); + if (!force && effectiveTokens < auto) return NOOP; + ``` +3. **`compress()` 的 runSideQuery 调用** (line 356-380):关闭 thinking + 加 `maxOutputTokens`: + + ```ts + const summaryResult = await runSideQuery(config, { + // ... + config: { + thinkingConfig: { includeThoughts: false }, // 关闭 thinking(与 claude-code 一致) + maxOutputTokens: COMPACT_MAX_OUTPUT_TOKENS, // 硬上限 20K + }, + // ... + }); + ``` + + 或者直接删掉 `thinkingConfig` 让 `runSideQuery` 默认值([sideQuery.ts:118](packages/core/src/utils/sideQuery.ts:118) 默认 `includeThoughts: false`)接管。 + + 关 thinking 后,`maxOutputTokens` 直接约束总输出(不存在 thinking 单独 budget 的问题),`SUMMARY_RESERVE = maxOutput = 20K` 是干净的硬关系。 + + 同时更新 [chatCompressionService.ts:374-376](packages/core/src/services/chatCompressionService.ts:374) 的注释,从「Compression quality drives every subsequent main turn — keep reasoning on」改为说明「为保证跨 provider 可预测的输出上限,与 claude-code 设计对齐」。 + + token math 一段([:436-437](packages/core/src/services/chatCompressionService.ts:436))的 "may include non-persisted tokens (thoughts)" 注释也可以同步清理 + +### geminiChat.ts: `sendMessageStream` 入口(line 562) + +```ts +// 替换前:tryCompress(force=false) +// 替换后:用估算 token 判断是否触发 hard,决定 force 标志 + +const { hard } = computeThresholds(contextLimit); +const effectiveTokens = estimatePromptTokens( + this.getHistory(true), + createUserContent(params.message), + this.lastPromptTokenCount, +); +const shouldForceFromHard = effectiveTokens >= hard; + +if (shouldForceFromHard) { + // 重置熔断器,等同 force compress + this.consecutiveFailures = 0; +} + +compressionInfo = await this.tryCompress( + prompt_id, + model, + shouldForceFromHard, + params.config?.abortSignal, +); +``` + +### 失败处理升级 (`geminiChat.ts:504-510`) + +```ts +// 替换前 +hasFailedCompressionAttempt: boolean; + +// 替换后 +consecutiveFailures: number; // 默认 0 + +// 失败分支 +} else if (isCompressionFailureStatus(info.compressionStatus)) { + if (!force) { + this.consecutiveFailures += 1; + } +} + +// 成功分支 +this.consecutiveFailures = 0; +``` + +`force=true` 调用失败不计入计数(保持现有 reactive / manual 不"占额"的语义)。 + +## UI 改动 + +### tipRegistry.ts 重写三条 context-\* tip + +三层阈值正好与三条 tip 一一对应。映射关系(按 token 数从低到高): + +| Tip ID | 当前条件 | 新条件 | 文案变化 | +| ------------------ | --------------------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- | +| `compress-intro` | `pct >= 50 && < 80 && sessionPromptCount > 5` | `tokenCount >= warn && tokenCount < auto && sessionPromptCount > 5` | 保持不变 | +| `context-high` | `pct >= 80 && < 95` | `tokenCount >= auto && tokenCount < hard` | 保持不变 | +| `context-critical` | `pct >= 95` | `tokenCount >= hard` | 加一句「Auto-compact will force on next send.」反映新 hard 层行为 | + +**对触发频率的影响:** + +- 主路径(auto 正常工作):`tokenCount` 跨越 auto 后立即触发压缩,下一轮 tokenCount 回落,所以 `context-high` 仅在「触发到压缩生效之间」短暂可见 +- 边缘路径(auto 失败 / 熔断 / reactive 来不及):`tokenCount` 持续上涨,会依次穿过 warn → auto → hard 触发三条 tip,跟用户视角的"上下文越来越紧"一致 +- `context-critical` 触发时 hard 层已经在 send 前 force compress(spec 触发链路改动一节),所以这条 tip 实际上是「post-rescue 告知」而非「pre-rescue 警告」,文案补一句说明 + +`TipContext` 接口增加: + +```ts +export interface TipContext { + lastPromptTokenCount: number; + contextWindowSize: number; + sessionPromptCount: number; + sessionCount: number; + platform: string; + // 新增:让 isRelevant 函数能拿到阈值。 + // computeThresholds 在调用方算好后注入,避免 tipRegistry 直接依赖 core。 + thresholds?: CompactionThresholds; +} +``` + +`AppContainer.tsx:1150` 构造 `TipContext` 时同步注入。 + +### /context 命令同步 (`contextCommand.ts:177-183`) + +```ts +// 替换硬编码 (1 - threshold) * contextWindowSize +const { warn, auto, hard, effectiveWindow } = + computeThresholds(contextWindowSize); + +// 显示四行: +// Effective window: 180K (window − 20K reserve) +// Warn threshold: 147K (...) +// Auto threshold: 167K ← 当前位置 +// Hard threshold: 177K +// 标记当前 token count 落在哪个 tier +``` + +### Footer 持续提示(可选 follow-up) + +本 spec 不强制实现 footer 持续提示,理由: + +- 现有 tip 系统已经能在 history 里给出提示 +- Footer 持续提示需要改 ink 渲染、增加重绘频率 +- 可作为本 spec 后置 follow-up(独立 PR) + +如果后续要做,建议触发条件 `tokenCount >= warn && tokenCount < auto`,超过 auto 后隐藏(压缩已开始)。 + +## 测试覆盖 + +### 单元测试(chatCompressionService.test.ts) + +- `computeThresholds(32K)` → 比例兜底分支(warn/auto 均 pct,hard 退化) +- `computeThresholds(128K)` → 混合分支(warn=pct,auto=abs,hard=abs) +- `computeThresholds(200K)` → 绝对接管分支(warn/auto/hard 均 abs) +- `computeThresholds(1M)` → 全绝对分支 +- `computeThresholds(window=10K)` → 极小窗口(绝对值全负),公式不崩 +- 三层阈值始终满足 `warn <= auto <= hard` +- max() 公式在边界点(pct \* window == abs)稳定 + +### 单元测试(tokenEstimation.test.ts) + +- `estimateContentTokens` 对纯文本 / json / functionCall / functionResponse / image / document 分别走对应 bytesPerToken +- `estimatePromptTokens` 在 `lastPromptTokenCount > 0` 时走「主路径」,等于 0 时走「首轮路径」 +- 大 user message 在 cheap-gate 阶段被加上去后能跨越 auto 阈值 +- 估算与真实 API usage 的偏差在 ±30% 以内(用真实历史样本回归) + +### 集成测试(geminiChat.test.ts / chatCompressionService.test.ts) + +- 3 次连续失败后 cheap-gate NOOP;下一次 force 后恢复 +- 单次失败不再永久锁 +- 估算 token 跨越 hard 后 send 自动 force compress +- 压缩 sideQuery 调用 `maxOutputTokens = COMPACT_MAX_OUTPUT_TOKENS` 正确透传到 `runSideQuery`,`thinkingConfig.includeThoughts` 为 `false`(或被 sideQuery 默认值接管) +- **首轮覆盖**:构造一个 `lastPromptTokenCount = 0` 但 history 巨大的 chat(模拟 `--continue` 恢复),首次 send 时 auto 阈值能被估算路径触发 + +### 兼容性测试 + +- 设置 `contextPercentageThreshold = 0.5` 启动 → stderr 警告 + 字段被忽略,行为以内部 PCT 常量为准 + +### Tip 系统测试(tipRegistry.test.ts) + +- 三条 context-\* tip 在跨越 warn/auto/hard 时正确触发,且区间不重叠 +- 主路径下 auto 阈值触发压缩后 `context-high` 不持续可见 +- 边缘路径(熔断 + token 继续涨)下三条 tip 依次触发 +- TipContext 缺 `thresholds` 时(fallback)行为合理 + +## 实施分阶段 + +| Phase | 内容 | 独立性 | +| ----- | -------------------------------------------------------------------------------------------- | ------------------ | +| 1 | 内部常量 + `computeThresholds` + cheap-gate 改动(不含估算补偿) | 可独立合并 | +| 2 | 失败处理升级(1 → 3 熔断) | 可独立合并 | +| 3 | hard 层 force compress 提前 | 依赖 P1 + P7 | +| 4 | 配置面变更 + breaking change 警告 | 依赖 P1 | +| 5 | UI(tip 重写 + /context) | 依赖 P1 | +| 6 | 压缩 sideQuery 关 thinking + 加 `maxOutputTokens` 上限 | 独立可先于 P1 落地 | +| 7 | Token 估算补偿(`estimateContentTokens` + `estimatePromptTokens`,应用到 cheap-gate / hard) | 独立可与 P1 并行 | + +每个 Phase 可独立 PR。建议合并顺序 **P6 → P7 → P1 → P2 → P4 → P3 → P5**:先给压缩调用打上 `maxOutputTokens` 上限(让 buffer 假设可信);再加估算补偿(让 token 数判断更可靠);再把阈值基础设施落地;再做失败熔断、配置面变更;最后才打开 hard 层主动救场(这时已有可靠的 token 数 + 熔断器)。每个 PR 都能独立验证、独立回滚。 + +## 风险与注意事项 + +1. **关 thinking 可能影响摘要质量。** 原作者注释 "Compression quality drives every subsequent main turn — keep reasoning on" 表达过对此的担忧。本 spec 的判断是「可预测的 token 上限」优先于「最大化质量」,但落地后需要观察 telemetry 里 `compression_input_token_count` / `compression_output_token_count` 的分布,以及主对话在压缩后的质量变化(用户反馈、`COMPRESSION_FAILED_*` 状态率)。如果质量下降明显,再考虑回退到 thinking 开启 + provider-specific thinkingBudget 控制。 + +2. **`maxOutputTokens` 触顶可能导致 summary 被截断。** 关 thinking 后,20K 直接限制 summary 主体;claude-code 实测 p99.99 ≈ 17K,留 ~3K 安全冗余。但 qwen-code 的压缩 prompt 与 claude-code 不同,分布需要观测。建议在压缩失败分支([chatCompressionService.ts:464-491](packages/core/src/services/chatCompressionService.ts:464))追加「检测到 finish_reason = MAX_TOKENS」的 NOOP 路径,避免持久化半截 summary。 + +3. **跨 provider 的 maxOutputTokens 映射差异。** OpenAI compat (dashscope) → `max_tokens`、Anthropic → `max_tokens`、Gemini SDK → `maxOutputTokens`。当前 qwen-code 已有这层映射([contentGenerator.ts:94](packages/core/src/core/contentGenerator.ts:94) 等),需要在 P6 实现时验证 sideQuery 路径上 `maxOutputTokens` 字段确实贯穿到所有 provider 的请求体。 + +4. **Token 估算是粗略下界,不应反向用作"跳过触发"的依据。** `char/4` 与各 provider 真实 tokenizer 偏差可能 ±30%。本 spec 只用估算来「让阈值更早触发」(false-positive 方向,宁可早压不可晚压)。所有「降低 token 计数 / 跳过压缩」的代码路径仍应使用 `lastPromptTokenCount`(API 权威值)。 + +5. **估算函数与现有 `estimateContentChars` 的关系。** [compactionInputSlimming.ts](packages/core/src/services/compactionInputSlimming.ts) 已经有 `estimateContentChars`(用于压缩 split point 计算),新增的 `estimateContentTokens` 应复用它(除以 bytesPerToken)而非新写一套,避免两套估算口径出现分歧。 + +## 不在本 spec 范围 + +- Env 变量覆盖通道(D 方案):维持「配置面最小」原则 +- Footer 常驻可视化:留作 follow-up +- 摘要 prompt 改进、`MIN_COMPRESSION_FRACTION` 调整:与阈值设计正交 + +## 开放问题(等 review) + +1. **breaking change 强度**:警告 + 忽略字段 vs 启动报错。当前选警告,需要确认对企业部署/团队配置是否够友好 +2. **小窗口(32K)下 hard 与 auto 退化为同一值**:用户视角是否需要在 `/context` 明示「该窗口下 hard 已退化」 diff --git a/docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md b/docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md new file mode 100644 index 0000000000..1a7aaf3253 --- /dev/null +++ b/docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md @@ -0,0 +1,286 @@ +# Qwen Code Runtime Memory Benchmark Report + +Date: 2026-05-18 + +## Summary + +This report records local memory benchmarks for Qwen Code runtime behavior. It +compares Qwen Code across models and compares Qwen Code with Claude Code on the +same task shapes where equivalent model endpoints were available. + +The headline result is consistent across the latest matrix (single run per cell, +not statistically repeated): + +- Qwen Code process-tree RSS peak: about `852-1062 MiB` (`0.83-1.04 GiB`). +- Claude Code process-tree RSS peak: about `279-366 MiB` (`0.27-0.36 GiB`). +- Qwen Code was about `2.3x-3.6x` higher in the tested + non-interactive CLI task benchmarks. + +Note: process-tree RSS includes MCP child processes (~350 MiB overhead on the +Qwen side). This inflates the absolute numbers but the relative comparison +remains informative since both CLIs were measured the same way. + +The difference reproduced in small PR review, code navigation, and synthetic +diff workloads. It is therefore unlikely to be explained only by one large PR +or by one model provider. + +This report is intended to make the current performance investigation visible: +what has been measured, what conclusion is already supported, what remains +unknown, and what diagnostics should be added next. + +## Test Environment + +| Item | Value | +| --------------------------------------------- | ------------------------------------------ | +| Date | 2026-05-18 | +| Platform | macOS local development machine | +| Qwen Code version | `0.15.11` | +| Qwen Code binary | PATH-resolved `qwen` binary | +| Claude Code version used in the latest matrix | `2.1.129` | +| Claude Code binary used in the latest matrix | PATH-resolved `claude` binary | +| Node.js version | v22.x (default system install) | +| Sampling method | External `ps` RSS sampling once per second | +| Headline metric | Process-tree RSS peak | + +Process-tree RSS is used as the headline metric because Qwen Code launches a +root wrapper and a child Node/Qwen worker. Looking only at the root process can +understate the memory footprint seen by users. + +Temporary CLI config directories were used for matrix runs so the benchmarks +did not depend on global CLI state. + +## Benchmark Artifacts + +Five local reports were produced before this consolidated report: + +1. Qwen Code PR review memory run. +2. Qwen Code model comparison run. +3. Strict Qwen Code vs Claude Code comparison with `pai/glm-5`. +4. Qwen Code vs Claude Code, two CLIs by two models. +5. Qwen Code vs Claude Code, five-case matrix. + +This consolidated report covers the conclusions and headline metrics from all +five reports. It does not embed every raw sample row, terminal transcript, or +temporary runner artifact. Those raw artifacts stayed in local `tmp/` +directories because they are experiment outputs rather than stable repository +fixtures. + +The latest matrix is the strongest evidence because it covers multiple task +shapes rather than only one PR review workload. + +## Preliminary Conclusion + +The current data is strong enough to say that Qwen Code has a higher runtime +memory footprint than Claude Code in these local non-interactive CLI task +benchmarks. It is not strong enough to name one final root cause yet. + +The leading explanation is a Qwen Code runtime/path difference rather than a +model provider difference: + +- the gap reproduces with both `pai/glm-5` and `qwen3.6-plus`; +- the gap reproduces in small PR and code-navigation tasks, not only in large + diff tasks; +- Qwen Code repeatedly sends or accounts for more tokens than Claude Code for + similar work; +- Qwen Code's largest observed component is the child Node/Qwen worker process, + which points toward task-time process footprint, module loading, context + assembly, live history, tool-result retention, or subagent/saved-output + paths. + +The most useful next measurement is therefore not another external RSS-only +run. The next measurement should split RSS into V8 heap, native memory, +session/history size, retained tool-result size, and subagent/process-tree +activity. + +## Initial Cause Analysis + +The benchmark does not yet prove one root cause, but it does narrow the likely +problem area. + +| Signal | What it suggests | What it does not prove | +| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------- | +| Qwen remains near `1 GiB` in small PR and code-navigation cases | A high non-interactive task-time runtime cost is likely involved | It does not identify whether the footprint is V8 heap, native memory, module loading, or retained state | +| Diff size from 100 KiB to 5 MiB does not scale linearly with RSS | Raw diff bytes alone are probably not the primary driver | Large outputs can still amplify memory in real PR review flows | +| Qwen uses more tokens than Claude in every matrix cell | Qwen likely constructs or retains larger prompt/context/tool-result state for similar work | Token count is not the same as process memory and may be an effect rather than the cause | +| Tool call counts are similar, and Claude sometimes uses more turns/tool calls with lower RSS | A longer tool-call chain is unlikely to be the main explanation by itself | Tool output size and retention still need to be measured | +| Earlier large PR runs showed saved-output recovery and subagent amplification | Tool-output truncation and saved-output paths are likely heavy-workload amplifiers | They do not explain the entire small-task execution footprint | + +The current best explanation is therefore: + +1. **Task-time runtime cost first**: Qwen Code likely initializes or retains + more runtime state during non-interactive CLI task execution than Claude + Code. This may include agent runtime, tool registry, provider adapters, + session services, or UI/history structures that are not strictly needed for + a short non-interactive task. +2. **Context/tool-result volume second**: Qwen Code appears to carry larger + model-facing or session-facing context for similar work. The token gap makes + context assembly, tool result normalization, and history retention important + suspects. +3. **Large-output amplification third**: Large PR review can trigger additional + saved-output and subagent paths. These are probably not the only cause, but + they can make memory and token pressure worse in realistic review tasks. + +The next diagnostic run should answer where the `~1 GiB` sits: + +- high immediately after startup: module/runtime startup cost; +- jumps after tool execution: tool-output retention or result normalization; +- jumps during request assembly: context construction or duplicated histories; +- grows after streaming/compression: response retention or compression state; +- mostly RSS outside V8 heap: native buffers, loaded modules, or external + memory. + +## Latest Matrix + +The latest benchmark ran: + +- 2 CLIs: Qwen Code and Claude Code. +- 2 model labels: `pai/glm-5` and `qwen3.6-plus`. +- 5 cases: + - small PR review: PR `#4268`, one-line change + - code navigation: `rg` plus `sed` on compression-related files + - synthetic local diff, about 100 KiB + - synthetic local diff, about 1 MiB + - synthetic local diff, about 5 MiB + +All 20 runs exited `0` with no timeout. + +## Matrix Results + +| Case | Model | Qwen tree peak | Claude tree peak | Qwen / Claude | +| ---------------- | -------------- | -------------: | ---------------: | ------------: | +| small PR `#4268` | `pai/glm-5` | 1032.7 MiB | 357.8 MiB | 2.89x | +| small PR `#4268` | `qwen3.6-plus` | 852.2 MiB | 365.5 MiB | 2.33x | +| code navigation | `pai/glm-5` | 993.1 MiB | 359.6 MiB | 2.76x | +| code navigation | `qwen3.6-plus` | 996.9 MiB | 349.0 MiB | 2.86x | +| diff 100 KiB | `pai/glm-5` | 1012.1 MiB | 350.8 MiB | 2.89x | +| diff 100 KiB | `qwen3.6-plus` | 1001.1 MiB | 336.2 MiB | 2.98x | +| diff 1 MiB | `pai/glm-5` | 1008.3 MiB | 278.8 MiB | 3.62x | +| diff 1 MiB | `qwen3.6-plus` | 1003.3 MiB | 340.5 MiB | 2.95x | +| diff 5 MiB | `pai/glm-5` | 858.8 MiB | 323.2 MiB | 2.66x | +| diff 5 MiB | `qwen3.6-plus` | 1062.0 MiB | 331.2 MiB | 3.21x | + +Average process-tree RSS peak by case: + +| Case | Avg Qwen tree peak | Avg Claude tree peak | +| ---------------- | -----------------: | -------------------: | +| small PR `#4268` | 942.5 MiB | 361.6 MiB | +| code navigation | 995.0 MiB | 354.3 MiB | +| diff 100 KiB | 1006.6 MiB | 343.5 MiB | +| diff 1 MiB | 1005.8 MiB | 309.6 MiB | +| diff 5 MiB | 960.4 MiB | 327.2 MiB | + +## Runtime And Token Signals + +The same matrix also showed Qwen Code using more model-side tokens in every +tested case. + +Selected examples: + +| Case | Model | CLI | Duration | Turns | Total tokens | Tool calls | +| --------------- | -------------- | ------ | -------: | ----: | -----------: | ---------: | +| small PR | `pai/glm-5` | Qwen | 25.2s | 2 | 32,567 | 3 | +| small PR | `pai/glm-5` | Claude | 21.1s | 4 | 7,899 | 3 | +| code navigation | `qwen3.6-plus` | Qwen | 25.2s | 2 | 38,151 | 3 | +| code navigation | `qwen3.6-plus` | Claude | 46.9s | 6 | 25,861 | 5 | +| diff 100 KiB | `qwen3.6-plus` | Qwen | 16.5s | 3 | 57,185 | 2 | +| diff 100 KiB | `qwen3.6-plus` | Claude | 17.2s | 3 | 6,377 | 2 | +| diff 5 MiB | `pai/glm-5` | Qwen | 23.2s | 2 | 38,574 | 2 | +| diff 5 MiB | `pai/glm-5` | Claude | 9.8s | 3 | 5,285 | 2 | + +This token gap does not prove that token volume is the memory root cause, but it +does suggest that context assembly, tool result retention, or response +normalization should be measured alongside RSS and V8 heap statistics. + +## Token Usage Analysis + +The token gap is one of the strongest clues, but it needs internal request +metrics before it can be treated as a root cause. + +What the data supports today: + +- Qwen Code used more total tokens than Claude Code in every matrix cell. +- The gap appears even when tool-call counts are similar. +- Claude sometimes used more turns or tool calls while still using less memory. + +What this suggests: + +- The token delta is unlikely to come only from a longer tool-call chain. +- Qwen may be carrying larger static prompt/context state, larger tool schemas, + larger serialized tool results, or more retained conversation/session content. +- Large-output flows may add another layer through truncation, saved-output + recovery, or subagent paths. + +What is still missing: + +- per-request input token breakdown; +- system prompt and tool schema token sizes; +- retained message and tool-result sizes before each model request; +- whether large outputs are retained in multiple places, such as model history, + UI history, session recording, or saved-output storage. + +Those missing metrics are why the next step should add internal diagnostics +rather than only repeat the external RSS benchmark. + +## Earlier Large PR Review Signal + +An earlier strict PR review benchmark used PR `#4186` and showed the same broad +shape: + +| Model | CLI | Process-tree RSS peak | +| -------------- | ----------- | --------------------: | +| `pai/glm-5` | Qwen Code | 1000.7 MiB | +| `pai/glm-5` | Claude Code | 349.0 MiB | +| `qwen3.6-plus` | Qwen Code | 1095.8 MiB | +| `qwen3.6-plus` | Claude Code | 341.1 MiB | + +That earlier run was not enough by itself because a large PR can trigger unusual +tool-output and saved-output paths. The latest five-case matrix makes the +finding stronger because small PR and code-navigation tasks also reproduce the +gap. + +## Working Hypothesis + +The current evidence supports these hypotheses, in priority order: + +1. Qwen Code has a higher non-interactive task-time process footprint than + Claude Code. The Qwen child Node worker was typically the largest process in + local sampling, often around `0.7-0.8 GiB`. +2. Model choice is not the main explanation. Both `pai/glm-5` and + `qwen3.6-plus` showed the same broad Qwen-vs-Claude gap. +3. Large diff size alone is not the main explanation. The synthetic diff size + did not scale linearly from 100 KiB to 5 MiB, likely because tool-output + truncation caps how much output reaches the model. +4. Context/tool-result handling is still a likely contributor. Qwen Code used + more tokens than Claude Code in every matrix cell, and earlier large-PR runs + showed saved tool-output recovery and subagent amplification paths. +5. The next diagnostic layer should separate V8 heap, native RSS, loaded + module/runtime startup cost, session history, UI history, tool-result + retention, and subagent activity. External RSS alone cannot distinguish + those causes. + +## Caveats + +- These are single runs per matrix cell, not repeated statistical samples. +- RSS is external process RSS. It cannot distinguish V8 heap, native buffers, + module loading, retained tool output, UI state, or session history. +- Claude Code and Qwen Code use different runtime implementations and protocol + adapters, even when the model labels are the same. +- The benchmark was run locally on macOS. Linux servers should be tested before + drawing deployment-specific conclusions. + +## Recommended Follow-Up Measurements + +The next local investigation branch should add or use diagnostics for: + +- `process.memoryUsage()` before and after startup, tool execution, streaming, + compression, and session finalization. +- V8 heap statistics and heap spaces. +- Active handles and requests. +- Session message count and approximate retained character/token volume. +- Tool result count, total retained tool-result size, largest tool-result size, + and whether large outputs are retained by UI history or model history. +- Subagent count and child process/process-tree RSS. +- Tool-output truncation and saved-output recovery events. + +These measurements should be collected with the same benchmark matrix so the +current RSS comparison can be connected to internal Qwen Code state. diff --git a/docs/e2e-tests/2026-05-19-oom-reproduction-report.md b/docs/e2e-tests/2026-05-19-oom-reproduction-report.md new file mode 100644 index 0000000000..8716e208f5 --- /dev/null +++ b/docs/e2e-tests/2026-05-19-oom-reproduction-report.md @@ -0,0 +1,437 @@ +# OOM 压力测试与长任务 Replay 报告 + +**日期**: 2026-05-19 +**分支**: `codex/memory-diagnostics-local-run` +**测试人**: yiliang114 +**结论**: 成功复现并定位根因。v0.15.7 (#3735) 引入的 auto-compaction 使 `structuredClone` +调用频率倍增,在高 heap 压力时形成正反馈死循环导致 OOM。真实 debug 日志完整佐证了该机制。 + +--- + +## 一、背景 + +多个 issue(#4309, #4276, #4185, #4315, #4322, #2868)报告 qwen-code 在长会话中出现 V8 heap OOM crash: + +``` +FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory +``` + +用户报告的崩溃特征: +| Issue | 崩溃时 Heap | 运行时长 | 平台 | +|-------|------------|---------|------| +| #4276 | 4014 MB | ~110 分钟 | Linux x64 | +| #4315 | 2027 MB | ~19.6 小时 | macOS (默认 2GB limit) | +| #4322 | 4023 MB | ~7 小时 | Windows | +| #2868 | 2035 MB | ~1.7 分钟 | Linux | +| #4309 | 7020 MB | 未知 | Windows (设了 8GB limit 仍崩) | + +--- + +## 二、方法论修正 + +本报告区分两类测试: + +1. **低 heap 压力测试**:通过降低 `--max-old-space-size` 放大问题,用于快速定位 + “history 很大时整段复制导致瞬时峰值”的代码路径。它是诊断工具,不等价于用户真实 + 4G/8G OOM 复现。 +2. **默认 heap 长任务 replay**:不设置 `NODE_OPTIONS`,使用真实 JSONL 历史恢复并 + 继续执行 review 任务,同时从进程外采样 process-tree RSS。这类结果才用于判断 + 用户侧实际内存量级。 + +因此,低 heap 结果不能单独作为“真实 OOM 已修复”的证明。它只能说明某条路径在 +history 足够大时会产生峰值放大,需要再用默认 heap 长任务验证。 + +## 三、低 heap 压力测试条件 + +| 参数 | 值 | +| ------------------------ | ------------------------------------------------------------ | +| CLI 版本 | 0.15.11 (从 `codex/memory-diagnostics-local-run` 分支 build) | +| Model | `qwen3.6-plus` (128K context window) | +| Heap limit | `--max-old-space-size=512` | +| Heap-pressure safety net | **禁用** (HEAP_PRESSURE_COMPRESSION_RATIO 设为 99.0) | +| 操作模式 | YOLO + 自动化多轮 Read 文件任务 | +| 工作目录 | qwen-code monorepo (3538 .ts files, 1.26M lines) | + +### 关键配置修改 + +`packages/core/src/core/geminiChat.ts` 中将 heap-pressure compaction 阈值从 0.7 改为 99.0(使其永远不触发),模拟 #4186 修复前的状态。 + +--- + +## 四、低 heap 压力测试结果 + +### 崩溃时间线 + +``` +[21:26:59] #1 RSS:193.6MB Ctx:0% → Read geminiChat.ts (1500 行) +[21:27:46] #2 RSS:270.4MB Ctx:4.2% → Read agent.ts +[21:28:32] #3 RSS:397.5MB Ctx:4.3% → grep + Read 3 个文件 +[21:29:18] #4 RSS:452.7MB Ctx:5.7% → Read slashCommandProcessor.ts +[21:30:04] #5 RSS:515.0MB Ctx:5.9% → Read chatCompressionService.ts +[21:30:50] #6 RSS:649.1MB Ctx:4.0% ← TOKEN COMPACTION 触发 (5.9%→4.0%) + RSS 反增 134MB (structuredClone 峰值) +[21:31:36] #7 RSS:666.7MB Ctx:3.2% ← 再次 compaction, RSS 继续涨 +[21:32:22] CRASH — FATAL ERROR: Ineffective mark-compacts near heap limit +``` + +**总耗时**: ~5.5 分钟,7 轮任务后崩溃。 + +这证明在受限 heap 下,长 history + compaction/history clone 可以触发 V8 heap OOM。 +但该结果不代表默认 heap 下的真实用户 OOM 已经被完整复现。 + +### 更大 heap 的 synthetic 复现 + +为避免只依赖 512 MiB 低 heap 结论,补充了更大 heap 的 synthetic runtime +pressure 测试。该测试不调用模型,而是构造类似长 review/subagent 任务的历史: + +- root review turns: 10 +- subagent calls: 30 +- subagent transcript records: 780 +- retained tool result bytes: 193,986,560 +- serialized history bytes: 195,620,061 +- pressure mode: retained `structuredClone(history)` copies + +| Heap limit | Clone pressure | 结果 | 关键 GC / stack | +| ---------- | -----------------: | ---------------------------------------- | ------------------------------------------------------------ | +| 2 GiB | 8 retained clones | 未崩溃,RSS 2.42 GiB,heap used 1.87 GiB | 接近 heap limit | +| 2 GiB | 10 retained clones | OOM | `Reached heap limit`, `ValueDeserializer`, `StructuredClone` | +| 4 GiB | 20 retained clones | OOM | `Reached heap limit`, `ValueDeserializer`, `StructuredClone` | + +2 GiB 复现的 GC 摘要: + +``` +Mark-Compact 2042.9 (2081.9) -> 2042.9 (2081.1) MB +Mark-Compact 2048.9 (2087.2) -> 2048.9 (2087.2) MB +FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory +... +node::worker::(anonymous namespace)::StructuredClone +``` + +4 GiB 复现的 GC 摘要: + +``` +Mark-Compact 4082.5 (4126.8) -> 4082.5 (4126.3) MB +Mark-Compact 4095.1 (4139.0) -> 4095.1 (4139.0) MB +FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory +... +node::worker::(anonymous namespace)::StructuredClone +``` + +这组结果比 512 MiB 压力测试更接近用户报告的 2 GiB / 4 GiB heap OOM: +只要 history 中保留足够多的大 tool result / subagent transcript,对整段 history +做 retained 或瞬时 clone 都可以在 2-4 GiB heap 下触发 V8 OOM。它仍然是 synthetic +复现,不等价于完整业务长任务 replay,但能直接证明问题不是“小 heap 人为制造”的。 + +### 崩溃时 GC 状态 + +``` +[41381:0x130008000] 342468 ms: Mark-Compact 508.6 (526.7) -> 507.0 (526.9) MB, + pooled: 1 MB, 86.42 / 0.00 ms (average mu = 0.175, current mu = 0.150) + task; scavenge might not succeed + +[41381:0x130008000] 342568 ms: Mark-Compact 509.1 (526.9) -> 507.1 (528.2) MB, + pooled: 0 MB, 93.79 / 0.12 ms (average mu = 0.121, current mu = 0.068) + allocation failure; scavenge might not succeed + +FATAL ERROR: Ineffective mark-compacts near heap limit +Allocation failed - JavaScript heap out of memory +``` + +Mark-Compact 只能回收 1-2 MB(几乎所有对象都是 reachable),证明内存确实被合法持有的对象占满。 + +--- + +## 五、默认 heap 长任务 replay + +为了避免低 heap 结论过度外推,补充了默认 heap 的真实 JSONL replay: + +- 不设置 `NODE_OPTIONS` +- 不启用内部 runtime profiler,避免采样器自身影响 heap +- 每个 CLI 从同一份 rewound JSONL 复制出 fresh session +- 使用临时 `QWEN_HOME`,禁用 MCP 和 hooks,避免本地全局配置污染 +- 只用进程外采样统计 process-tree RSS + +| CLI | 结果 | 时长 | Tree RSS 峰值 | Root RSS 峰值 | Worker RSS 峰值 | 备注 | +| -------------------- | ---- | -----: | ------------: | ------------: | --------------: | ----------------------------------------------------------- | +| installed `qwen` | 成功 | 167.3s | 838.0 MiB | 230.2 MiB | 566.3 MiB | 第一次 fresh run 遇到模型服务端错误,未纳入结论;retry 成功 | +| local rebuilt bundle | 成功 | 106.3s | 527.5 MiB | 182.1 MiB | 345.4 MiB | 包含本地 clone 热路径修复 | + +默认 heap replay 的结论: + +1. 当前这份 review JSONL 可以稳定跑出数百 MiB 到约 0.8 GiB 的 process-tree RSS, + 但没有复现 4G/8G OOM。 +2. 本地 rebuilt bundle 在同起点 replay 上的峰值低于 installed CLI,说明减少 + history clone 热路径有实际收益。 +3. 这还不能证明所有用户 OOM 都已解决。真实 4G/8G OOM 仍需要更长任务、更大 + tool-result 累积,或保留 MCP/tool schema 压力的 replay 继续验证。 + +## 六、根因分析 + +### OOM 的三层机制 + +``` +┌─────────────────────────────────────────────────────────┐ +│ Layer 3: V8 Heap Limit (512MB/2GB/4GB) │ ← 用户最终撞到这里 +├─────────────────────────────────────────────────────────┤ +│ Layer 2: structuredClone() 峰值放大 (瞬时 ~2x) │ ← 直接诱因 +├─────────────────────────────────────────────────────────┤ +│ Layer 1: History 中 tool result 累积 (线性增长) │ ← 基础增长 +├─────────────────────────────────────────────────────────┤ +│ Layer 0: Token compaction 触发时机 │ ← 控制点 +└─────────────────────────────────────────────────────────┘ +``` + +### 精确崩溃路径 + +``` +sendMessage() + → tryCompress() + → heapPressureRatio < threshold (safety net disabled) + → ChatCompressionService.compress() + → chat.getHistory(true) + → structuredClone(this._history) ← 峰值分配! + → V8 需要额外 ~N MB 来容纳 clone + → 如果 existing heap + N > limit → OOM +``` + +### 关键证据 + +| 观察 | 含义 | +| --------------------------------------- | ---------------------------------------------- | +| Task #5→#6: Context 5.9%→4.0% (降了) | Token compaction **成功执行**了 | +| Task #5→#6: RSS 515→649 MB (涨了 134MB) | Compaction 过程的 `structuredClone` 制造了峰值 | +| GC 只能回收 1-2 MB | 所有对象都是 live(history + clone 都在) | +| #4309 设 8GB limit 仍崩 | history 足够大时,clone 峰值可超任何 limit | + +需要注意:以上证据来自低 heap 压力测试和 issue 现象的组合推断。默认 heap replay +目前支持”clone 热路径会显著影响峰值 RSS”,但尚未单独复现 4G/8G OOM。 + +### 为什么 128K context window 更容易触发 + +- 128K × 70% = ~90K tokens 触发 compaction +- 大 context window (1M) 的 70% = 700K tokens,几乎不会触发 +- **compaction 越频繁 → structuredClone 越频繁 → OOM 风险越高** +- DeepSeek 等未配置 contextWindowSize 的模型默认 128K,更易触发 + +--- + +## 六.5、真实运行日志佐证 + +以下日志提取自本地 crash session 的 debug 输出。为避免泄露本地路径和 session id, +报告只保留时间线和关键日志内容。 + +该 session 启动于 `2026-05-19T13:26:35Z` (本地 21:26:35),crash 于 +`2026-05-19T13:32:10Z` (本地 21:32:10)。 + +### Heap Pressure 与 Auto-Compaction 事件时间线 + +``` +13:29:43 [WARN] Heap pressure at 74.9%; attempting auto-compaction before token threshold. +13:30:06 [DEBUG] [FILE_READ_CACHE] clear after auto tryCompress ← compaction #1 执行成功 +13:30:13 [WARN] Heap pressure at 70.7%; attempting auto-compaction before token threshold. + ← 刚压完 heap 从 74.9% 仅降到 70.7%,仍超阈值,立即再次尝试 +13:30:52 [DEBUG] Heap pressure at 86.0%; skipping heap-pressure auto-compaction during cooldown. + ← 30s cooldown 期间拒绝执行 +13:30:56 [WARN] Heap pressure at 85.3%; attempting auto-compaction before token threshold. + ← cooldown 过期,heap 已升至 85.3% +13:31:21 [DEBUG] [FILE_READ_CACHE] clear after auto tryCompress ← compaction #2 执行成功 +13:31:37 [WARN] Heap pressure at 88.8%; attempting auto-compaction before token threshold. + ← 压完后 heap 反弹至 88.8% +13:32:09 [DEBUG] Heap pressure at 90.2%; skipping heap-pressure auto-compaction during cooldown. + ← heap 已达 90.2%,cooldown 中无法执行 +13:32:10 ← 日志终止(进程 OOM crash) +``` + +### 日志证据解读 + +| 日志观察 | 含义 | +| ------------------------------------------------------------------------------------- | --------------------------------------------------------- | +| 2.5 分钟内触发 **4 次** heap-pressure auto-compaction 尝试(另有 2 次 cooldown 拒绝) | #3735 引入的 `tryCompress` 在高压时频繁触发 | +| 每次 compaction 执行后 heap 占比仍 >70% | `structuredClone()` 制造的临时峰值抵消了压缩收益 | +| 74.9% → 70.7% → 86% → 85.3% → 88.8% → 90.2% → crash | 正反馈循环:压缩→clone 峰值→heap 更高→再压缩→更高 | +| 日志在 90.2% 后 1 秒内断裂 | 下一次 `getHistory(true)` 的 `structuredClone()` 瞬间超限 | +| `[FILE_READ_CACHE] clear after auto tryCompress` 出现 2 次 | 证实 compaction 确实走了完整的 compress → setHistory 路径 | + +### 正反馈死循环机制 + +``` +heap 占比高 (>70%) + → 触发 heap-pressure auto-compaction + → tryCompress() 内部调用 getHistory(true) + → structuredClone(this._history) ← 瞬时 heap 峰值 +30~40% + → compaction 成功,释放旧 history + → 但 clone 峰值已经把 heap 推高到更危险的水位 + → 下一轮 send 继续累积 + → heap 占比更高 → 更频繁触发 → crash +``` + +--- + +## 六.6、版本归因:为什么 0.15.7 ~ 0.15.11 期间 OOM 报告增多 + +### 关键 commit 时间线 + +| 版本 | PR | 改动 | 对 `structuredClone` 调用频率的影响 | +| ------------ | ---------------------------------------------------- | ----------------------------------------------------------------------------------- | ----------------------------------- | +| **v0.15.6** | — | `getHistory(true)` 仅在 `sendMessage` 入口调用 1 次 | 基线:每次 send 1 次 clone | +| **v0.15.7** | **#3735** `auto-compact subagent context` | 将 `tryCompress()` 下沉到 `GeminiChat`,**每次 send 前**先执行一次 compaction 检查 | **+1 次**:send 前 compress 检查 | +| **v0.15.10** | **#3879** `reactive compression on context overflow` | 当 provider 返回 context overflow 时,再次触发 `tryCompress()` + `getHistory(true)` | **+1~2 次**:overflow retry 路径 | +| **v0.15.10** | **#3985** `harden reactive compression` | 强化 reactive compression 重试逻辑 | 同上 | + +### v0.15.6 vs v0.15.11 的 `getHistory(true)` 调用点对比 + +**v0.15.6** (2 处): + +``` +L367: const requestContents = this.getHistory(true); ← send 构造 request +L618: const recoveryContents = self.getHistory(true); ← MAX_TOKENS escalation (极少触发) +``` + +**v0.15.11** (5 处): + +``` +L467: ChatCompressionService.compress() 内部调用 ← #3735: 每次 send 前的 auto-compact +L574: requestContents = this.getHistory(true); ← send 构造 request +L724: reactive tryCompress() 内部调用 ← #3879: context overflow 后 retry +L739: requestContents = self.getHistory(true); ← #3879: retry 构造新 request +L943: const recoveryContents = self.getHistory(true); ← MAX_TOKENS escalation +``` + +### 最坏路径:一次 send 可触发 4 次 `structuredClone` + +``` +sendMessage() + → tryCompress() ← #3735: getHistory(true) [clone #1] + → getHistory(true) ← 构造 request [clone #2] + → API 返回 context overflow + → reactive tryCompress() ← #3879: getHistory(true) [clone #3] + → getHistory(true) ← retry request [clone #4] +``` + +### 结论 + +**#3735 (v0.15.7)** 是 OOM 频率显著上升的最可能触发因素(非唯一根因)——它使每次 +`sendMessage` 都会先跑一次 `tryCompress()`,而 `tryCompress` 内部通过 +`ChatCompressionService.compress()` → `chat.getHistory(true)` 做全量 `structuredClone`。 +在 history 较大时,这个 “先 clone 再判断是否需要压缩” 的设计让内存峰值从 ~1.3x 升至 ~2x+。 +注:issue history 显示 OOM 报告在 #3735 之前就已存在,但 #3735 大幅增加了 structuredClone +的调用频率,从而显著提高了 OOM 的触发概率。 + +**#3879 (v0.15.10)** 进一步恶化了问题——在已经处于 heap 边界时 (provider 返回 context overflow) +再触发一次全量 clone,使原本就危险的 session 更容易 crash。 + +--- + +## 七、#4186 修复效果验证(对比测试) + +启用 heap-pressure safety net (HEAP_PRESSURE_COMPRESSION_RATIO = 0.7) 后的对比测试: + +| 指标 | 禁用 safety net | 启用 safety net | +| --------------- | ------------------ | ------------------------- | +| OOM 发生 | 是(7 轮后 crash) | 否(持续运行 >10 分钟) | +| RSS 峰值 | 666 MB → crash | 555 MB → GC 回收到 280 MB | +| Compaction 触发 | 仅 token threshold | heap 70% 时提前触发 | +| Context 行为 | 5.9%→4.0%→crash | 22.7%→17.0%(安全回落) | + +**结论**: #4186 的 heap-pressure safety net 有效防止了 OOM,但它是一个**缓解**而非根治: + +- 如果 history 本身已经占了 heap 的 60%+,即使提前 compact,clone 的峰值仍然可能超限 +- 这解释了为什么 #4309 用户设了 8GB limit 后仍然 crash + +--- + +## 八、内存占用分布 + +基于测试中的 RSS 增长模式估算: + +| 内存位置 | 占比 | 增长特征 | +| -------------------------------- | ------ | --------------------------- | +| `this._history[]` (tool results) | 40-50% | 线性累积,每轮 +30-100MB | +| `structuredClone()` 临时拷贝 | 30-40% | 瞬时峰值,compaction 时出现 | +| V8 runtime (GC metadata, code) | ~15% | 基本恒定 | +| UI/logging/stream buffers | ~5% | 缓慢增长 | + +--- + +## 九、复现脚本与环境 + +### 自动化驱动脚本 + +```bash +#!/bin/bash +# /tmp/oom-simple-driver.sh +SESSION="$1" + +TASKS=( + "用 Read 工具完整读取 packages/core/src/core/geminiChat.ts" + "用 Read 工具完整读取 packages/core/src/tools/agent/agent.ts" + "用 grep -rn structuredClone packages/core/src 然后 Read 前 3 个文件" + "用 Read 完整读取 packages/cli/src/ui/hooks/slashCommandProcessor.ts" + "用 Read 完整读取 packages/core/src/services/chatCompressionService.ts" + "用 find packages/cli/src/ui/commands -name '*.ts' 然后逐一 Read" + "用 Read 完整读取 packages/core/src/core/turn.ts" + # ... 更多任务 +) + +i=0 +while true; do + TASK="${TASKS[$((i % ${#TASKS[@]}))]}" + i=$((i + 1)) + + QWEN_PID=$(ps aux | grep "dist/index.js" | grep -v grep | awk '{print $2}' | sort -rn | head -1) + RSS=$(ps -o rss= -p $QWEN_PID 2>/dev/null) + [ -z "$RSS" ] && { echo "CRASH after $((i-1)) tasks!"; exit 0; } + + RSS_MB=$(echo "scale=1; $RSS/1024" | bc) + CTX=$(tmux capture-pane -t "$SESSION:1" -p 2>/dev/null | grep -oE "[0-9]+\.[0-9]+% 已用" | tail -1) + echo "[$(date +%H:%M:%S)] #$i RSS:${RSS_MB}MB Ctx:$CTX | ${TASK:0:55}" + + tmux send-keys -t "$SESSION:1" C-u + sleep 0.2 + tmux send-keys -t "$SESSION:1" "$TASK" Enter + sleep 0.5 + tmux send-keys -t "$SESSION:1" Enter + sleep 45 +done +``` + +### 启动命令 + +```bash +# 1. 禁用 heap-pressure safety net +# geminiChat.ts: HEAP_PRESSURE_COMPRESSION_RATIO = 99.0 + +# 2. Build +npm run build --workspace=packages/core && npm run build --workspace=packages/cli + +# 3. 启动 qwen (128K context model, 512MB heap) +SESSION="oom-test" +tmux new-session -d -s "$SESSION" -c "$REPO_DIR" +tmux send-keys -t "$SESSION" \ + "NODE_OPTIONS='--max-old-space-size=512' node packages/cli/dist/index.js --model 'qwen3.6-plus'" Enter + +# 4. 等待启动后运行驱动 +sleep 10 +bash /tmp/oom-simple-driver.sh "$SESSION" +``` + +--- + +## 十、后续建议 + +### 短期缓解(已有) + +- [x] #4186: heap-pressure auto-compaction safety net (0.7 threshold) +- [x] #4188: fileReadCache / crawlCache 上限 + +### 中期修复(建议) + +- [ ] 减少 `structuredClone()` 调用 — `nextSpeakerChecker` 只需最后一条消息,不需 clone 全量 +- [ ] Compaction 使用 slice + 引用替代全量 deep clone +- [ ] 大 tool result (>100KB) 写入临时文件,history 中只保留摘要引用 + +### 长期方向 + +- [ ] Tool result offload 到磁盘 + lazy load (#4184) +- [ ] 基于 RSS 的分级压缩策略(不仅是 token count) +- [ ] History 分段存储,避免单次全量操作 diff --git a/docs/e2e-tests/2026-05-19-qwen-runtime-diagnostics-benchmark-report.md b/docs/e2e-tests/2026-05-19-qwen-runtime-diagnostics-benchmark-report.md new file mode 100644 index 0000000000..e482f0f94c --- /dev/null +++ b/docs/e2e-tests/2026-05-19-qwen-runtime-diagnostics-benchmark-report.md @@ -0,0 +1,904 @@ +# Qwen Code Runtime Diagnostics Benchmark Report + +Date: 2026-05-19 + +## Scope + +This run repeats the previous Qwen Code benchmark shapes with the new opt-in +runtime diagnostics enabled. It only tests Qwen Code, not Claude Code. + +Initial model matrix: + +- `pai/glm-5` +- `qwen3.6-plus` + +Additional PR-size follow-up: + +- `DeepSeek/deepseek-v4-pro` through Anthropic-compatible protocol + +Cases: + +- small GitHub PR review: PR `#4268` +- code navigation: compression / compaction related code search and reads +- synthetic local diff: about 94.6 KiB +- synthetic local diff: about 968.5 KiB +- synthetic local diff: about 4.84 MiB + +The run used the local bundled CLI from the diagnostics branch, with +`QWEN_CODE_PROFILE_RUNTIME=1` and a temporary CLI home. Global MCP servers and +hooks were not loaded for this benchmark. + +Important caveat: these absolute RSS numbers are lower than the previous +PATH-resolved `qwen` runs because this run used `node dist/cli.js` from the +local branch plus a stripped temporary config. Treat this report as an internal +diagnostics distribution run, not a direct replacement for the earlier installed +CLI RSS comparison. + +## Installed CLI vs Local Bundle Sanity Check + +A follow-up sanity check used the same minimal prompt, model, and non-interactive +mode across the installed CLI and the local diagnostics bundle. The only +intentional variable was whether Qwen Code loaded a stripped temporary CLI home +or the normal user config. + +| CLI | Config mode | Total tokens | Tree RSS peak | Root RSS peak | Process count peak | Runtime diagnostics | +| ------------------- | --------------- | -----------: | ------------: | ------------: | -----------------: | ------------------- | +| PATH `qwen` | stripped config | 33,965 | 542.4 MiB | 249.9 MiB | 3 | no | +| local `dist/cli.js` | stripped config | 47,281 | 455.2 MiB | 214.2 MiB | 4 | yes | +| PATH `qwen` | normal config | 97,615 | 1,099.9 MiB | 250.1 MiB | 6 | no | +| local `dist/cli.js` | normal config | 97,954 | 1,105.4 MiB | 212.7 MiB | 8 | yes | + +This check changes the attribution: the earlier 1 GiB user-visible peak is +reproducible with the normal config even on the local diagnostics bundle. It is +therefore not primarily explained by the local branch including PR `#4186`. + +At the normal-config peak, the local process-tree sample was dominated by +multiple Node/MCP processes rather than the Qwen root process alone: + +| Role | Command shape | RSS at tree peak | +| ----- | ------------------------- | ---------------: | +| child | Node process | 252.9 MiB | +| child | Chrome DevTools MCP | 219.7 MiB | +| child | Node process | 219.2 MiB | +| root | Qwen Node process | 215.1 MiB | +| child | Chrome DevTools MCP setup | 175.2 MiB | + +PR `#4186` is present in the local diagnostics branch, but it is a V8 heap +pressure auto-compaction safety net. It triggers at about 70% V8 heap pressure; +on this environment the Node heap limit is about 4.1 GiB, while the stripped +benchmark end heap was about 99-143 MiB. Based on these numbers, the lower +stripped-config RSS is not caused by `#4186` actively compressing context during +these benchmark runs. + +### Bare Mode Config Attribution Check + +A second follow-up used `qwen3.6-plus` with the same PR-review prompt shape on +both the installed CLI and the local bundle. This is not a normal end-to-end +business benchmark. It is a controlled attribution check for startup/config +memory only. + +`--bare` changes the runtime inputs: it skips normal global settings discovery, +MCP startup, hooks, implicit context, skills, and other startup integrations. It +can therefore fail or behave differently when a model provider is configured +only in global settings. For this run, model credentials were supplied only +through the child-process environment because bare mode intentionally does not +load the normal provider settings. Nothing was written back to the user's global +config. + +This run did not produce useful token/tool-call statistics: the model completed +in one turn and did not call the requested shell command. Do not use these rows +as normal task benchmark results, and do not compare their token/tool-call +behavior with the matrix above. They are only useful for estimating how much +process-tree RSS comes from normal config and configured child processes. + +| CLI | Mode | Wall | Turns | Tool uses | Tree RSS peak | Root RSS peak | Process count peak | +| ------------------- | -------- | ---: | ----: | --------: | ------------: | ------------: | -----------------: | +| PATH `qwen` | normal | 5.5s | 1 | 0 | 1,021.3 MiB | 251.5 MiB | 5 | +| PATH `qwen` | `--bare` | 2.4s | 1 | 0 | 525.7 MiB | 246.4 MiB | 2 | +| local `dist/cli.js` | normal | 4.9s | 1 | 0 | 1,046.2 MiB | 213.3 MiB | 5 | +| local `dist/cli.js` | `--bare` | 2.3s | 1 | 0 | 454.3 MiB | 216.5 MiB | 3 | + +The result confirms the process-tree hypothesis for startup/config attribution. +On this machine, normal config adds roughly 0.50-0.59 GiB of user-visible +process-tree RSS over `--bare`, while root RSS stays in the same 0.21-0.25 GiB +band. At the normal-config peak, the extra RSS again came from additional +Node/MCP child processes, including a Chrome DevTools MCP process and its setup +wrapper. `--bare` removes those startup/config children and brings +installed/local runs back into the 0.45-0.53 GiB tree-RSS range. + +### Temporary Settings MCP / Hooks Isolation + +Because `--bare` changes too many runtime inputs to be treated as a normal +benchmark, a follow-up used temporary `QWEN_HOME` directories with generated +settings files derived from the normal settings. The run stayed on the normal +settings-loading path, but toggled only two config dimensions: + +- MCP disabled: `mcpServers` cleared and MCP allow/exclude lists emptied. +- Hooks disabled: `disableAllHooks` set to true. + +No global settings were modified. The case used `qwen3.6-plus` and a minimal +startup prompt, so it measures startup/config process-tree cost, not task +reasoning quality. + +| CLI | Temporary config | MCP servers | Tools | Tree RSS peak | Root RSS peak | Process count peak | +| ------------------- | -------------------- | ----------: | ----: | ------------: | ------------: | -----------------: | +| PATH `qwen` | full | 4 | 46 | 1,017.4 MiB | 249.8 MiB | 5 | +| PATH `qwen` | MCP disabled | 0 | 17 | 548.7 MiB | 252.4 MiB | 2 | +| PATH `qwen` | hooks disabled | 4 | 46 | 1,003.8 MiB | 246.4 MiB | 5 | +| PATH `qwen` | MCP + hooks disabled | 0 | 17 | 542.5 MiB | 248.0 MiB | 2 | +| local `dist/cli.js` | full | 4 | 48 | 865.9 MiB | 220.4 MiB | 6 | +| local `dist/cli.js` | MCP disabled | 0 | 19 | 442.9 MiB | 209.6 MiB | 2 | +| local `dist/cli.js` | hooks disabled | 4 | 48 | 848.3 MiB | 212.6 MiB | 5 | +| local `dist/cli.js` | MCP + hooks disabled | 0 | 19 | 447.2 MiB | 217.8 MiB | 2 | + +Interpretation: + +1. Disabling MCP is the dominant change. It removes 4 MCP servers, reduces the + advertised tool count by about 29 tools, and lowers process-tree RSS by about + 0.42-0.47 GiB in this startup/config case. +2. Disabling hooks alone barely changes RSS in this case. That is expected + because the prompt did not produce tool calls, so `PreToolUse` / + `PostToolUse` hooks were not executed. +3. The root process stays around 0.21-0.25 GiB across all rows. The large + difference is again process-tree composition, not root Qwen RSS. + +Two attempted code-navigation follow-ups with `qwen3.6-plus` and `pai/glm-5` +also reproduced the same MCP-vs-no-MCP memory split, but neither model produced +tool calls in those runs. Those rows are therefore not used as hooks execution +evidence. A valid hooks benchmark still needs a task/model combination that +reliably emits tool calls. + +### Per-MCP Isolation + +The previous row showed MCP as a group is the dominant startup/config memory +factor. A follow-up isolated each configured MCP server while keeping hooks +disabled for all rows. This keeps the test on the normal settings-loading path +but changes only the MCP server subset. + +Configured MCP server names: + +- `approval-bridge` +- `env-center` +- `chrome-devtools` +- `code` + +Single-pass isolation: + +| Variant | Enabled MCPs | Tools | MCP servers | Tree RSS peak | Root RSS peak | Interpretation | +| ------------------------- | -------------------------------------------------- | ----: | ----------: | ------------: | ------------: | ------------------------------------ | +| none | none | 19 | 0 | 444.4 MiB | 211.7 MiB | baseline without MCP | +| full | all 4 | 48 | 4 | 857.3 MiB | 215.9 MiB | full MCP startup shape | +| only `approval-bridge` | `approval-bridge` | 19 | 1 | 455.5 MiB | 214.0 MiB | near baseline | +| only `env-center` | `env-center` | 19 | 1 | 452.3 MiB | 214.4 MiB | near baseline | +| only `chrome-devtools` | `chrome-devtools` | 48 | 1 | 824.4 MiB | 209.5 MiB | large RSS increase and tool increase | +| only `code` | `code` | 19 | 1 | 452.1 MiB | 216.6 MiB | near baseline | +| without `approval-bridge` | `env-center`, `chrome-devtools`, `code` | 48 | 3 | 997.1 MiB | 215.4 MiB | still high; run showed variance | +| without `env-center` | `approval-bridge`, `chrome-devtools`, `code` | 48 | 3 | 863.8 MiB | 220.9 MiB | still high | +| without `chrome-devtools` | `approval-bridge`, `env-center`, `code` | 19 | 3 | 463.4 MiB | 221.6 MiB | returns near baseline | +| without `code` | `approval-bridge`, `env-center`, `chrome-devtools` | 48 | 3 | 858.1 MiB | 219.5 MiB | still high | + +Because startup RSS has some variance, the key variants were repeated twice: + +| Variant | Samples | Tree RSS range | Avg tree RSS | Result | +| ------------------------- | ------: | ------------------- | -----------: | ------------------------------ | +| none | 2 | 443.3-451.9 MiB | 447.6 MiB | stable no-MCP baseline | +| full | 2 | 856.1-922.8 MiB | 889.5 MiB | stable high-MCP range | +| only `chrome-devtools` | 2 | 1,007.1-1,021.2 MiB | 1,014.2 MiB | enough alone to reproduce high | +| without `chrome-devtools` | 2 | 461.1-461.6 MiB | 461.4 MiB | removes the high RSS | +| only `approval-bridge` | 2 | 449.1-449.9 MiB | 449.5 MiB | near baseline | +| only `env-center` | 2 | 438.7-449.5 MiB | 444.1 MiB | near baseline | +| only `code` | 2 | 450.6-451.3 MiB | 451.0 MiB | near baseline | + +Interpretation: + +1. `chrome-devtools` is the dominant MCP contributor in this environment. It is + sufficient by itself to reproduce the high process-tree RSS. +2. Removing `chrome-devtools` from the full MCP set returns RSS to the no-MCP + band. Removing other MCPs while keeping `chrome-devtools` does not. +3. The advertised tool count follows the same pattern: baseline is 19 tools, + while `chrome-devtools` raises the tool count to 48. That means this MCP is + also likely to increase request tool schema size and token pressure, not just + process-tree RSS. +4. `approval-bridge`, `env-center`, and `code` individually stay near the + no-MCP baseline in these startup/config runs. They emitted startup warnings + in this environment, so this result should be interpreted as "no persistent + startup RSS owner observed" rather than proof that they have zero cost in all + workflows. + +## Runtime Summary + +| Case | Model | Wall | Turns | Total tokens | Tree RSS peak | Root RSS peak | End heap | End RSS | +| ---------------- | -------------- | ----: | ----: | -----------: | ------------: | ------------: | --------: | --------: | +| small PR `#4268` | `pai/glm-5` | 20.1s | 7 | 173,216 | 362.1 MiB | 359.8 MiB | 103.1 MiB | 216.5 MiB | +| code navigation | `pai/glm-5` | 18.4s | 2 | 49,127 | 378.0 MiB | 376.0 MiB | 102.4 MiB | 313.4 MiB | +| diff 94.6 KiB | `pai/glm-5` | 16.6s | 6 | 135,716 | 367.9 MiB | 366.0 MiB | 99.1 MiB | 295.0 MiB | +| diff 968.5 KiB | `pai/glm-5` | 11.4s | 2 | 42,590 | 373.2 MiB | 362.5 MiB | 106.4 MiB | 345.6 MiB | +| diff 4.84 MiB | `pai/glm-5` | 12.0s | 4 | 95,119 | 414.2 MiB | 412.0 MiB | 123.6 MiB | 410.7 MiB | +| small PR `#4268` | `qwen3.6-plus` | 35.0s | 6 | 156,556 | 358.9 MiB | 356.9 MiB | 102.6 MiB | 293.1 MiB | +| code navigation | `qwen3.6-plus` | 28.9s | 4 | 99,800 | 370.3 MiB | 368.3 MiB | 105.8 MiB | 298.2 MiB | +| diff 94.6 KiB | `qwen3.6-plus` | 28.3s | 4 | 90,808 | 358.8 MiB | 356.9 MiB | 105.9 MiB | 307.0 MiB | +| diff 968.5 KiB | `qwen3.6-plus` | 30.9s | 6 | 151,782 | 366.1 MiB | 364.1 MiB | 101.0 MiB | 316.9 MiB | +| diff 4.84 MiB | `qwen3.6-plus` | 24.1s | 4 | 93,271 | 372.8 MiB | 366.0 MiB | 142.8 MiB | 366.0 MiB | + +Average by model: + +| Model | Avg tree RSS peak | Avg root RSS peak | Avg turns | Avg total tokens | Avg max wire body | Avg total tool result | +| -------------- | ----------------: | ----------------: | --------: | ---------------: | ----------------: | --------------------: | +| `pai/glm-5` | 379.1 MiB | 375.3 MiB | 4.2 | 99,154 | 111.8 KiB | 335.1 KiB | +| `qwen3.6-plus` | 365.4 MiB | 362.4 MiB | 4.8 | 118,443 | 119.3 KiB | 344.3 KiB | + +Overlapping small PR `#4268` model snapshot: + +| Model | Protocol | Wall | Turns | Total tokens | Tree RSS peak | Root RSS peak | Max wire body | +| -------------------------- | --------- | ----: | ----: | -----------: | ------------: | ------------: | ------------: | +| `pai/glm-5` | OpenAI | 20.1s | 7 | 173,216 | 362.1 MiB | 359.8 MiB | 113.8 KiB | +| `qwen3.6-plus` | OpenAI | 35.0s | 6 | 156,556 | 358.9 MiB | 356.9 MiB | 134.1 KiB | +| `DeepSeek/deepseek-v4-pro` | Anthropic | 39.7s | 2 | 43,362 | 346.9 MiB | 344.8 MiB | 103.0 KiB | + +## Request And Tool Diagnostics + +| Case | Model | Requests | Max wire body | Max system prompt | Max tool schema | Tool calls | Total tool result | Max tool result | Max function response in request | +| ---------------- | -------------- | -------: | ------------: | ----------------: | --------------: | ---------: | ----------------: | --------------: | -------------------------------: | +| small PR `#4268` | `pai/glm-5` | 7 | 113.8 KiB | 51.4 KiB | 40.2 KiB | 9 | 4.7 KiB | 3.9 KiB | 15.3 KiB | +| code navigation | `pai/glm-5` | 2 | 114.6 KiB | 51.5 KiB | 40.2 KiB | 3 | 17.5 KiB | 6.2 KiB | 18.4 KiB | +| diff 94.6 KiB | `pai/glm-5` | 6 | 111.2 KiB | 39.1 KiB | 37.2 KiB | 9 | 94.9 KiB | 92.6 KiB | 29.2 KiB | +| diff 968.5 KiB | `pai/glm-5` | 2 | 104.8 KiB | 39.1 KiB | 37.2 KiB | 2 | 772.1 KiB | 771.9 KiB | 25.6 KiB | +| diff 4.84 MiB | `pai/glm-5` | 4 | 114.7 KiB | 39.1 KiB | 37.2 KiB | 4 | 786.3 KiB | 783.2 KiB | 34.7 KiB | +| small PR `#4268` | `qwen3.6-plus` | 6 | 134.1 KiB | 51.4 KiB | 40.2 KiB | 5 | 34.6 KiB | 15.6 KiB | 36.6 KiB | +| code navigation | `qwen3.6-plus` | 4 | 114.9 KiB | 51.5 KiB | 40.2 KiB | 3 | 17.5 KiB | 6.2 KiB | 18.4 KiB | +| diff 94.6 KiB | `qwen3.6-plus` | 4 | 112.8 KiB | 39.1 KiB | 37.2 KiB | 3 | 92.9 KiB | 92.6 KiB | 33.0 KiB | +| diff 968.5 KiB | `qwen3.6-plus` | 6 | 113.1 KiB | 39.1 KiB | 37.2 KiB | 5 | 778.0 KiB | 771.9 KiB | 32.1 KiB | +| diff 4.84 MiB | `qwen3.6-plus` | 4 | 121.5 KiB | 39.1 KiB | 37.2 KiB | 4 | 798.5 KiB | 783.2 KiB | 41.3 KiB | + +## Observations + +1. Process-tree RSS is almost the same as root RSS in this local bundle run. + The root/tree gap is usually below 10 MiB. That means these runs did not + show a persistent child-process memory owner. The dominant process is the + main Node process. +2. The local bundle run peaks around 0.36-0.41 GiB, not the earlier + 0.83-1.04 GiB, because the matrix used a stripped temporary config. A + follow-up normal-config sanity check reproduced about 1.1 GiB tree RSS on + both PATH `qwen` and local `dist/cli.js`, with the extra memory coming from + child MCP/Node processes in the process tree. +3. V8 heap is much smaller than RSS. End heap is about 99-143 MiB while end RSS + is about 216-411 MiB. The remaining footprint is likely loaded modules, + native allocations, external buffers, or runtime overhead outside live JS + heap. +4. Static request overhead is large and repeated. The system prompt is about + 39-51 KiB per request, and tool schema is about 37-40 KiB per request. This + explains why even small tasks can produce high accumulated token counts when + the model takes several turns. +5. Large diff output is capped before it reaches the model request. The 968 KiB + and 4.84 MiB diff cases produced around 772-799 KiB of captured tool result, + but the largest model-facing function response in a request stayed around + 25-41 KiB, and max wire body stayed around 105-122 KiB. This points to + truncation / saved-output handling working on the model-facing path. +6. Memory still increases on large-output cases even though wire body remains + bounded. For example, the 4.84 MiB GLM run reached 414.2 MiB tree RSS and + 410.7 MiB end RSS, and the 4.84 MiB qwen3.6-plus run ended with 142.8 MiB + heap. That suggests large tool output can still affect local capture, + normalization, or retained runtime state even when the final request payload + is capped. +7. Model choice changed turns and token totals more than RSS in this run. + `qwen3.6-plus` averaged more tokens and turns than `pai/glm-5`, but its + average tree RSS peak was slightly lower. This supports the earlier + conclusion that model choice is not the main explanation for process memory. + +## Updated Working Inference + +The new diagnostics make the earlier hypothesis more precise: + +- The installed-CLI user-visible 1 GiB peak is now reproducible with the normal + config on the local diagnostics bundle. The stripped run should be used for + internal Qwen runtime attribution; the normal-config run should be used for + user-visible process-tree attribution. +- The largest observed difference between stripped and normal config is + process-tree shape: normal config starts additional MCP/Node child processes. + Those children explain most of the absolute jump from about 0.35-0.55 GiB to + about 1.1 GiB in the minimal prompt sanity check. +- The `--bare` follow-up confirms the same direction on `qwen3.6-plus`: normal + config costs about 0.50-0.59 GiB more process-tree RSS than bare mode for the + same prompt shape, while root RSS changes only slightly. +- The temporary-settings isolation is a better attribution test than `--bare`: + disabling MCP alone reduces process-tree RSS by about 0.42-0.47 GiB while + keeping the normal settings-loading path. Disabling hooks alone does not show + a meaningful RSS change in no-tool-call cases. +- Per-MCP isolation points to `chrome-devtools` as the dominant MCP contributor: + it is enough by itself to reproduce the high RSS band, and removing it returns + the run near the no-MCP baseline. +- Within the local Qwen runtime, the most suspicious areas are no longer "raw + diff bytes sent to the model". The model-facing request body is bounded. +- The stronger suspects are static per-request context cost, repeated request + rounds, tool schema size, and local retention/capture of large tool outputs + before or outside model-facing truncation. +- Because RSS remains much higher than V8 heap, the next profiling layer should + include module/startup accounting, external memory, and heap snapshots around + tool execution and final response emission. + +## RSS Attribution From Current Diagnostics + +The current counters do not identify an exact retained object or source file, +but they do narrow what is and is not driving RSS in these local runs: + +| Signal | Current evidence | RSS implication | +| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| Root RSS vs process-tree RSS | Root and tree peaks are usually within about 2-10 MiB; DeepSeek large PR is the widest gap at about 23.6 MiB | No persistent child process explains the RSS in this local bundle run; the main Node process dominates | +| Normal config process tree | Minimal-prompt normal-config runs reach about 1.1 GiB tree RSS while root RSS stays about 213-250 MiB | User-visible 1 GiB peaks can be dominated by MCP/Node child processes rather than Qwen root RSS alone | +| `--bare` comparison | `qwen3.6-plus` normal runs peak around 1.02-1.05 GiB tree RSS; bare runs peak around 0.45-0.53 GiB | Loading normal config adds about 0.50-0.59 GiB process-tree RSS in this environment | +| Temporary MCP isolation | Clearing MCP servers drops startup/config tree RSS from 865-1,017 MiB to 443-549 MiB | MCP startup and MCP child processes explain about 0.42-0.47 GiB of process-tree RSS in the controlled config check | +| Per-MCP isolation | `chrome-devtools` alone reaches about 1.0 GiB in repeated samples; without it the run stays around 461 MiB | `chrome-devtools` is the dominant MCP process-tree RSS contributor in this environment | +| Temporary hooks isolation | `disableAllHooks=true` with MCP still enabled changes tree RSS by only about 13-18 MiB in no-tool-call cases | Hook config alone is not a visible startup RSS driver here; hook execution still needs a tool-call benchmark | +| V8 heap vs RSS | End heap is about 99-143 MiB while end RSS is about 216-411 MiB | Live JS heap is not the whole footprint; loaded modules, native allocations, external buffers, or runtime overhead are likely significant | +| PR/diff size vs RSS | DeepSeek small/medium/large PRs scale from 1 to 4,750 changed lines, but tree RSS stays in a narrow 340.7-360.0 MiB band | Raw PR size is not linearly driving RSS once tool output is bounded | +| Tool output size | Large diff runs capture about 772-799 KiB tool results and show some higher end RSS / heap, but RSS does not scale linearly | Tool result capture/normalization contributes pressure, especially large-output cases, but is unlikely to be the only RSS driver | +| Request body size | Max model-facing body ranges from about 103-289 KiB while RSS stays near the same band | Request serialization size affects tokens and latency more clearly than RSS peak | +| Static per-request context | System prompt is about 39-51 KiB and tool schema about 37-48 KiB per request | Repeated rounds are a token/cost amplifier; this alone does not explain RSS but is a likely optimization target for token pressure | + +Working attribution: in the stripped local bundle benchmark, the RSS floor looks +mostly like task-time runtime/module/native footprint, with large tool output +adding incremental pressure. In the normal-config run, the user-visible 1 GiB +tree peak is mostly process-tree composition: Qwen root plus MCP/Node child +processes. The next targeted measurement should split Qwen root diagnostics +from configured MCP server diagnostics, then add startup/module/external-memory +checkpoints inside the Qwen root process. + +## Progress Snapshot + +Current confirmed signals: + +1. The user-visible 1 GiB startup/config peak is reproducible with both the + installed CLI and the local diagnostics bundle when the normal config is + loaded. It is not primarily explained by the diagnostics branch or PR `#4186`. +2. In this environment, that 1 GiB peak is mostly process-tree composition: + Qwen root process plus relaunch child process plus MCP child processes. +3. `chrome-devtools` is the dominant configured MCP contributor in the current + config. It is enough by itself to reproduce the high process-tree RSS band, + even when the prompt does not explicitly use that MCP. +4. The no-MCP normal relaunch shape still sits around 0.45 GiB process-tree RSS. + A single Qwen runtime process without the relaunch parent is closer to + 0.22-0.24 GiB in the startup attribution check. This means the 0.45 GiB + baseline is not a single-process root RSS number. +5. In stripped non-interactive task runs, model choice changes turns, token + totals, latency, and request sizes more clearly than RSS. RSS stayed in a + relatively narrow range across `pai/glm-5`, `qwen3.6-plus`, and + `DeepSeek/deepseek-v4-pro`. +6. Current short-task diagnostics show model-facing tool/function responses are + bounded, but local tool-result capture and runtime state can still increase + heap/RSS on large-output cases. This keeps large-output retention on the + investigation path. + +Current gaps: + +1. The short-task benchmark matrix is still short-lived. A later interactive + long-review run did reproduce a 41.9 min failure, but it is still one sample + and needs repeat runs plus heap/object attribution. +2. The current counters are enough to attribute process-tree RSS and request + size, but not enough to name the retained JS object graph during long + sessions. +3. Startup/config RSS and long-session OOM must remain separate tracks. MCP and + relaunch explain a large idle/startup RSS band; they do not by themselves + explain V8 heap OOM after long tasks. +4. Interactive TUI memory still needs a separate run from non-interactive mode, + because UI history and Ink static output are not exercised the same way. + +## Long-Task OOM Evidence From Issues And PRs + +Issue/PR evidence points to several different OOM shapes, not one single +failure mode: + +| Source | Evidence summary | Hypothesis to test | +| ---------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | +| [`#4309`](https://github.com/QwenLM/qwen-code/issues/4309) | User reports 5.84 GiB memory usage / 7.02 GiB warning with YOLO mode and DeepSeek backend; increasing Node memory to 8 GiB did not remove the symptom | Long autonomous tool loops can retain enough state that simply raising old-space limit is not a root fix | +| [`#4149`](https://github.com/QwenLM/qwen-code/issues/4149) | Multiple reports show `Ineffective mark-compacts near heap limit`, including 4 GiB and much larger heap-limit cases | A large fraction of heap is reachable application state, not immediately collectible garbage | +| [`#4116`](https://github.com/QwenLM/qwen-code/issues/4116) | OOM occurred while context display was around 9.5%; analysis points to `structuredClone`, UI history, Ink static tree, and large context windows | Token usage can be low while JS heap pressure is high; token threshold alone is not a reliable memory guard | +| [`#4167`](https://github.com/QwenLM/qwen-code/issues/4167) | User says the crash happened while compressing; analysis identifies compression peak memory as a distinct shape | Compression can itself create a peak when heap is already high, especially if history is cloned/stringified around the same time | +| [`#2128`](https://github.com/QwenLM/qwen-code/issues/2128) | Report identifies unbounded UI history, retained file diffs / terminal output, string-width caches, and checkpoint serialization | Interactive TUI long sessions may retain memory outside model history and outside non-interactive benchmarks | +| [`#2562`](https://github.com/QwenLM/qwen-code/issues/2562) | Report focuses on `GeminiChat.getHistory()` deep-cloning full history in long sessions | Full-history cloning can amplify memory peaks and should be measured separately from retained steady-state size | +| [`#4185`](https://github.com/QwenLM/qwen-code/issues/4185) | Tracks V8 heap pressure exceeding limit before token-based compaction runs | Heap-pressure guard is necessary, but it only mitigates symptoms if retained data remains large | +| [`#4184`](https://github.com/QwenLM/qwen-code/issues/4184) | Proposes diagnostics and offload/preview for large retained tool results | Large tool output may be bounded for model requests while still retained in local hot memory | +| [`#4186`](https://github.com/QwenLM/qwen-code/pull/4186) | Merged heap-pressure auto-compaction safety net and O(1) last-history access for `nextSpeakerChecker` | Covers part of heap-pressure and clone amplification, but does not claim to solve all OOM classes | +| [`#4127`](https://github.com/QwenLM/qwen-code/pull/4127), [`#4168`](https://github.com/QwenLM/qwen-code/pull/4168) | Open compaction-threshold PRs; one uses fixed heap thresholds, the other redesigns token thresholds and compression behavior | Useful related work, but long-task testing must verify whether heap, token, and compression signals line up in real runs | +| [`#3000`](https://github.com/QwenLM/qwen-code/issues/3000), [`#4183`](https://github.com/QwenLM/qwen-code/issues/4183) | Diagnostic roadmap calls out `/doctor memory`, heap snapshot, and bounded memory timeline | Snapshot/timeline support is needed to move from RSS attribution to retained-object attribution | + +Initial interpretation: + +- Unused configured MCP can consume memory because normal startup connects to + configured MCP servers and advertises their tools before the task needs them. + In the measured config, `chrome-devtools` starts extra Node/npm MCP processes + and also increases the tool schema count from 19 to 48. This explains a large + startup/config RSS band and can also increase repeated request overhead. +- The long-session OOM reports are a different layer. GC logs where + Mark-Compact frees very little memory suggest the heap is full of reachable + state. The strongest candidates are retained history/tool/UI objects, + full-history clones, compression intermediates, and streaming/logging + accumulators. +- PR `#4186` is a useful mitigation because it can compact based on heap + pressure before token thresholds trigger, and it removes one unnecessary + full-history clone. It should not be treated as proof that large tool-output + retention, UI history retention, or compression peak memory is already solved. + +## Long-Task Validation Plan + +The next benchmark should keep two tracks separate: + +1. Startup/config attribution: normal config vs MCP-disabled vs + `chrome-devtools`-only vs no-relaunch attribution. This explains what users + see before meaningful work begins. +2. Long-task runtime growth: repeated tool calls, large outputs, compression, + resume, and interactive UI history. This explains OOM after real work. + +Recommended long-task cases: + +| Case | Shape | Why it matters | +| ----------------------------- | ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- | +| Long PR review loop | Repeat medium/large PR review prompts for 30, 60, and 120 minutes, with fixed model and fixed config | Closest to reported agent workflows; captures turns, tool calls, token growth, and RSS/heap trend | +| Large tool-output retention | Repeatedly produce bounded 1 MiB / 5 MiB / 20 MiB command outputs, then ask follow-up questions | Tests whether raw output is retained locally after model-facing truncation | +| Compression pressure | Use a lower controlled old-space limit and large-context prompts to trigger heap-pressure compaction | Verifies PR `#4186` triggers before OOM and whether compression itself creates a new peak | +| Interactive TUI history | Run the same long loop in tmux TUI mode and compare with non-interactive mode | Isolates UI history, Ink static output, rendered diffs, and terminal-output display retention | +| Resume stress | Resume a large saved session and immediately continue work | Targets `/resume` OOM reports and session reconstruction cost | +| Streaming/logging accumulator | Force long streamed responses with telemetry/logging enabled vs disabled | Tests the suspected `collected responses` / logging-retention path from issue analysis | +| MCP idle vs MCP active | Run no-MCP, `chrome-devtools` configured-but-unused, and `chrome-devtools` actively used variants | Separates idle MCP child RSS from actual MCP tool execution and tool schema/token overhead | + +Metrics that should be recorded per turn or per sampling interval: + +- Root RSS current/peak and process-tree RSS current/peak. +- Child process count and top child command shapes. +- V8 `heapUsed`, `heapTotal`, `heap_size_limit`, `external`, and + `arrayBuffers`. +- Turn count, request count, tool-call count, and tool-call rounds. +- Input/output/cache/total tokens by request and by whole task. +- Request body bytes, system prompt bytes, tool schema bytes, and function + response bytes. +- Tool-result count, total captured tool-result bytes, max tool-result bytes, + and retained tool-result bytes if available. +- Conversation history message count and approximate history byte size. +- Interactive-only UI history item count and approximate retained display size. +- Compression attempts, compression trigger reason, tokens before/after, heap + pressure before/after, and compression failure status. +- Heap snapshot or bounded memory timeline artifacts when heap pressure crosses + a configured threshold. + +Validation criteria: + +1. Repeat at least the key long-task cases twice. Startup RSS has visible + variance, so single-run conclusions should be avoided. +2. Report root RSS and process-tree RSS separately. User-facing memory pressure + can come from child processes, while V8 OOM comes from the Qwen root heap. +3. Treat a flat RSS line as important evidence. If tokens and tool calls grow + but heap/RSS stays flat, the issue is likely elsewhere. +4. When RSS or heap grows, correlate the growth with a specific signal: + tool-result bytes, history bytes, UI history count, compression event, + streaming accumulator size, or MCP process start. +5. If a heap snapshot is taken, write a structured diagnostics JSON first, then + the snapshot. Heap snapshots may be large and can contain sensitive strings, + so they should remain opt-in and local. + +## Interactive Long-Review Reproduction + +After the short non-interactive prompts kept finishing before the target window, +an interactive TUI benchmark was run with remote input. The CLI process stayed +alive in one session while a controller submitted one real PR-review turn at a +time. The next turn was only submitted after the assistant emitted that turn's +completion marker. This avoids treating a short one-shot prompt as a long-task +reproduction. + +Setup: + +- Installed Qwen Code `0.15.11`, model `qwen-latest-series-invite-beta-v28`. +- Temporary CLI home derived from the normal settings, with MCP and hook config + removed. No global config was modified. +- Interactive TUI mode with dual JSON event output and remote JSONL input. +- Static PR review only. The prompt disallowed dependency install, build, test, + Playwright, Docker, and other long external build commands. +- External RSS samplers recorded both process-tree RSS and the Qwen Node root + RSS every 5 seconds. + +Outcome: + +| Signal | Value | +| ----------------------------- | ----------: | +| Wall time before exit | 41.9 min | +| Exit status | 1 | +| Completed PR-review turns | 6 | +| Main chat records | 1,076 | +| API response telemetry | 335 | +| Tool-call telemetry | 607 | +| MCP tool-call telemetry | 0 | +| Main/root API responses | 36 | +| Subagent API responses | 299 | +| Root total tokens | 2.08M | +| Subagent total tokens | 17.24M | +| Total API telemetry tokens | 19.32M | +| Max root input tokens | 85,655 | +| Max subagent input tokens | 215,207 | +| `/usr/bin/time -l` max RSS | 1,072.4 MiB | +| Sampled Qwen root RSS peak | 1,028.2 MiB | +| Sampled process-tree RSS peak | 1,038.1 MiB | + +The process exited with: + +```text +libc++abi: terminating due to uncaught exception of type std::__1::system_error: thread constructor failed: Resource temporarily unavailable +``` + +This is a **thread exhaustion** error, not a V8 heap OOM. The failure mechanism +is distinct: the OS refused to create a new thread, likely due to per-process +resource limits (`RLIMIT_NPROC`) or memory fragmentation preventing stack +allocation. It is still relevant because it occurred in a disabled-MCP, +no-build/test, interactive long-session review where the Qwen Node process +itself crossed about 1 GiB RSS. +The failure happened during the final summary phase, after the controller had +already completed six review turns. + +Turn timeline and sampled Qwen root RSS: + +| Window | Turn state | Qwen root RSS max | Qwen root RSS at window end | +| ------------- | -------------------- | ----------------: | --------------------------: | +| 0.0-9.0 min | turn 1 completed | 701.2 MiB | 255.3 MiB | +| 9.0-15.1 min | turn 2 completed | 503.2 MiB | 494.4 MiB | +| 15.1-24.1 min | turn 3 completed | 468.7 MiB | 457.5 MiB | +| 24.1-31.9 min | turn 4 completed | 619.3 MiB | 602.3 MiB | +| 31.9-40.3 min | turn 5 completed | 955.5 MiB | 955.5 MiB | +| 40.3-40.4 min | turn 6 completed | 988.6 MiB | 988.6 MiB | +| 40.4-41.9 min | final summary / exit | 1,028.2 MiB | 1,028.2 MiB | + +Token and tool distribution: + +| Owner | API responses | Input tokens | Output tokens | Total tokens | Max input | +| ------------ | ------------: | -----------: | ------------: | -----------: | --------: | +| Root session | 36 | 2.06M | 22.2K | 2.08M | 85,655 | +| Subagents | 299 | 17.08M | 154.6K | 17.24M | 215,207 | + +Tool-call telemetry by function: + +| Tool | Calls | Captured content length | +| ------------------- | ----: | ----------------------: | +| `read_file` | 271 | 1.46 MB | +| `run_shell_command` | 181 | 164.4 KB | +| `web_fetch` | 80 | 846.3 KB | +| `grep_search` | 25 | 15.0 KB | +| `glob` | 15 | 27.8 KB | +| `todo_write` | 16 | 16.1 KB | +| `list_directory` | 8 | 6.2 KB | +| `agent` | 10 | 0 | +| `tool_search` | 1 | 2.1 KB | + +The top visible TUI token counter for a single agent reached about 3.83M +tokens. Telemetry also shows the heaviest subagent at about 4.05M total tokens +with a 215K-token max input request. That makes subagent amplification the +dominant signal in this reproduction. + +Interpretation: + +1. This run separates long-session growth from MCP startup/config memory. MCP + was disabled and there were no MCP tool calls, yet the Qwen root process + still reached about 1 GiB RSS. +2. The late memory peak aligns with subagent-heavy review turns and final + summary/merge-back, not with external build/test child processes. +3. The RSS curve is not a simple linear leak. It falls after early turns, then + rises sharply after later subagent turns and remains high near exit. +4. The failure mode is native resource exhaustion rather than a V8 heap-limit + stack, so the next run should add heap/external/arrayBuffer/thread-count + sampling. RSS alone cannot distinguish JS heap from native allocations or + thread-resource pressure. +5. The strongest code paths to inspect remain subagent transcript retention, + agent-result merge-back, full-history cloning, checkpoint/session recording, + and final summary/history assembly. + +## Deterministic Huge-Task Clone-Pressure Reproduction + +A deterministic stress harness was added as +`scripts/memory-pressure-repro.mjs`. It does not call a model. Instead, it +constructs a Qwen-like long-session object graph with root review turns, +subagent transcripts, large tool results, checkpoint JSON, and retained +`structuredClone()` copies. This gives a repeatable reproduction for the clone +and checkpoint peak suspected from the user-provided OOM stack. + +The harness has a lightweight script test: + +```bash +npx vitest run --config ./scripts/tests/vitest.config.ts \ + scripts/tests/memory-pressure-repro.test.js +``` + +Result: passed, 1 test. + +Controlled runs used `node --max-old-space-size=256` unless otherwise noted. + +| Case | History shape | Clone/checkpoint pressure | Result | Max RSS | +| ------------------------------------------------- | ----------------------------------------------------------------------- | -------------------------------------------------- | --------------------------------- | --------: | +| Small sanity | 2 turns, 2 KiB tool result, 1 subagent | 1 clone + 1 checkpoint | passed; 2.6 MiB history JSON | 89.7 MiB | +| Huge build only | 12 turns, 256 KiB tool result, 2 subagents x 12 subagent turns | no retained clone/checkpoint | passed; 76.2 MiB history JSON | 491.5 MiB | +| Huge + 1 clone | same as above | 1 retained `structuredClone()` | passed | 569.6 MiB | +| Huge + 2 clones | same as above | 2 retained `structuredClone()` copies | OOM, exit 134 | 496.5 MiB | +| Huge + 1 checkpoint | same as above | one checkpoint with original + cloned history JSON | passed; 152.5 MiB checkpoint JSON | 926.9 MiB | +| Huge + 2 checkpoints | same as above | two checkpoint copies | OOM, exit 134 | 920.1 MiB | +| Huge + 2 clones, no retained subagent transcripts | same generated subagent output, but parent history keeps only summaries | passed; parent history JSON drops to 3.8 MiB | 136.8 MiB | + +The failing huge-clone run produced: + +```text +FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory +``` + +The native stack included: + +- `v8::internal::ValueDeserializer::ReadObjectInternal` +- `v8::internal::ValueDeserializer::ReadDenseJSArray` +- `node::worker::Message::Deserialize` +- `node::worker::StructuredClone` + +This matches the same stack family as the user-provided OOM log. The controlled +reproduction also shows why 4 GiB / 8 GiB user reports are plausible: the +failure is not caused by a single large object, but by large retained +history/tool-result/subagent state plus one or more full-history clone or +checkpoint copies. Raising `--max-old-space-size` can delay the crash while +preserving the same amplification pattern. + +Important attribution from this deterministic run: + +1. Building a 76.2 MiB parent history JSON can succeed under the reduced heap. + The OOM appears when additional full-history clone/checkpoint copies are + retained. +2. A single checkpoint copy can push RSS close to 1 GiB even before OOM. +3. Removing retained subagent transcripts from the parent hot history changes + the same generated workload from OOM to a small 136.8 MiB RSS run. That is + the clearest mitigation signal so far. +4. This reproducer is synthetic and intentionally adversarial, but it exercises + the same object-graph shape as the long interactive review: parent session, + subagents, large tool outputs, transcript merge-back, and full-history clone + pressure. + +## DeepSeek PR-Size Follow-Up + +After the initial model matrix, an additional Qwen Code-only run tested +`DeepSeek/deepseek-v4-pro` across three real PR sizes. This model is configured +through the Anthropic-compatible protocol; OpenAI-compatible execution returned +404 in a smoke check, so the successful benchmark uses `--auth-type anthropic`. + +The diagnostics branch was extended to record Anthropic wire request summaries +with the same privacy rule as the OpenAI path: aggregate counts and byte sizes +only, no prompt text, diff content, tool arguments, headers, base URL, or API +key. + +PR sizes: + +| Size | PR | State | Files | Changed lines | Title | +| ------ | ------- | ------ | ----: | ------------: | ----------------------------------------------------------------------- | +| small | `#4268` | merged | 1 | 1 | fix(serve): add mcp_guardrails to E2E capabilities expectation | +| medium | `#4186` | merged | 6 | 494 | fix(core): add heap-pressure auto-compaction safety net | +| large | `#4168` | open | 25 | 4,750 | feat(core)!: redesign auto-compaction thresholds with three-tier ladder | + +Runtime: + +| Size | PR | Wall | Turns | Total tokens | Cache-read tokens | Tree RSS peak | Root RSS peak | End heap | End RSS | +| ------ | ------- | -----: | ----: | -----------: | ----------------: | ------------: | ------------: | --------: | --------: | +| small | `#4268` | 39.7s | 2 | 43,362 | 28,672 | 346.9 MiB | 344.8 MiB | 115.2 MiB | 304.3 MiB | +| medium | `#4186` | 142.6s | 4 | 135,120 | 115,840 | 340.7 MiB | 337.3 MiB | 103.5 MiB | 285.6 MiB | +| large | `#4168` | 191.1s | 8 | 386,891 | 332,928 | 360.0 MiB | 336.3 MiB | 119.3 MiB | 237.9 MiB | + +Request and tool diagnostics: + +| Size | PR | Requests | Anthropic wire requests | Max Anthropic body | Max system | Max tool schema | Tool calls | Total tool result | Max tool result | Max function response in request | +| ------ | ------- | -------: | ----------------------: | -----------------: | ---------: | --------------: | ---------: | ----------------: | --------------: | -------------------------------: | +| small | `#4268` | 2 | 2 | 103.0 KiB | 50.8 KiB | 47.6 KiB | 3 | 0.6 KiB | 0.5 KiB | 1.1 KiB | +| medium | `#4186` | 4 | 4 | 159.8 KiB | 50.8 KiB | 47.6 KiB | 5 | 30.2 KiB | 29.3 KiB | 56.7 KiB | +| large | `#4168` | 8 | 8 | 289.5 KiB | 50.8 KiB | 47.6 KiB | 11 | 235.0 KiB | 232.1 KiB | 182.4 KiB | + +DeepSeek observations: + +1. PR size scaled turns, tokens, Anthropic wire body size, and tool result size + clearly, but did not scale RSS proportionally. The small/medium/large tree + RSS peaks stayed in a narrow `340.7-360.0 MiB` band. +2. The large PR was expensive mostly in model rounds and token volume: + 8 requests and 386,891 total tokens. Its max Anthropic body was 289.5 KiB, + much larger than the OpenAI-compatible runs, but RSS still stayed near the + same local-bundle band. +3. The static Anthropic request cost is also visible: system prompt is about + 50.8 KiB and tool schema about 47.6 KiB per request. Repeated rounds are + therefore a major token amplifier. +4. The large PR produced 235.0 KiB of captured tool results and 182.4 KiB max + function response in a request. This is higher than the earlier small PR / + code-navigation cases and shows large PRs still put pressure on local + tool-result handling and request assembly, even when RSS does not spike. +5. The DeepSeek run reinforces the model-choice conclusion: provider/model + choice strongly changes turns, latency, token volume, and wire payload shape, + but the local bundle RSS peak remains dominated by Qwen Code runtime shape + rather than scaling linearly with PR size. + +## Long-Review JSONL Replay: History Clone Pressure + +A recent long PR-review chat record was analyzed as a post-mortem shape for +the reported OOM class. The raw JSONL is not included here because it contains +prompt and tool output text. The aggregate shape is: + +| Signal | Value | +| ----------------------- | ----------------------------- | +| Duration | 87.0 min | +| Qwen Code version | 0.15.10 | +| Model | qwen-latest-series beta model | +| API responses | 380 | +| Tool-call telemetry | 507 events | +| MCP tool-call telemetry | 4 events | +| Subagent API responses | 313 | +| Root API responses | 67 | +| Root prompt growth | 38,622 -> 168,555 tokens | +| Max prompt tokens | 168,555 | +| Total response tokens | 31.28M | + +This shape does not support MCP as the primary OOM cause for this case. Only +4 of 507 tool-call telemetry events were MCP, and all four recorded +`content_length=0`. The dominant shape is long-session/subagent amplification: +15 `agent` calls produced 313 subagent API responses and 403 subagent tool-call +events. + +The replay then rebuilt the chat `Content[]` message shape from the JSONL and +ran controlled clone/stringify pressure tests. The base retained message payload +is small, so it is not itself enough to OOM: + +| Replay scale | Retained clones | History JSON | Checkpoint JSON | End heap | End RSS | +| ------------ | --------------: | -----------: | --------------: | -------: | -------: | +| 1x | 8 | 0.54 MB | 1.08 MB | 18.0 MB | 88.8 MB | +| 30x | 8 | 14.46 MB | 28.92 MB | 260.0 MB | 577.8 MB | +| 60x | 8 | 28.86 MB | 57.71 MB | 510.3 MB | 960.8 MB | + +The scaled replay is not a user-data claim; it is a controlled amplification of +the observed JSONL shape to test whether full-history clone and checkpoint +serialization can create the same failure mode as the reports. + +A low-heap reproduction with `--max-old-space-size=256` confirms the mechanism: + +| Case | History JSON | Result | +| ------------------------- | -----------: | ----------------------------------------------------- | +| Build history only | 38.4 MB | Succeeded; heap 131.6 MB, RSS 378.2 MB | +| Build + one clone | 38.4 MB | Succeeded; heap 183.3 MB, RSS 463.4 MB | +| Build + repeated clones | 38.4 MB | OOM after several retained `structuredClone()` copies | +| Checkpoint double-history | 38.4 MB | OOM while holding history plus cloned client history | + +The repeated-clone OOM stack contains `ValueDeserializer::ReadObjectInternal`, +`ValueDeserializer::ReadDenseJSArray`, +`node::worker::Message::Deserialize`, and +`node::worker::StructuredClone`, matching the same stack family seen in the +user-provided OOM log. This proves that full-history `structuredClone()` can be +the immediate OOM trigger without any MCP server involvement. + +Current working hypothesis for this JSONL class: + +1. MCP can explain normal-config startup RSS in separate benchmarks, but it is + not the likely trigger for this long-review OOM shape. +2. Long task growth comes from retained chat history, large tool outputs, + subagent histories, observable agent messages, and UI/tool-result state. +3. The immediate OOM trigger can be a full-history clone or checkpoint-style + double serialization after the heap is already high. +4. Compression can mitigate retained history, but compression itself may create + a temporary peak if it first clones or serializes large history. + +### Local Mitigation Validation: Disabled-MCP PR Review Case + +Two targeted mitigations were applied locally and validated before rerunning a +disabled-MCP PR review case: + +1. `checkNextSpeaker()` now reads only the last curated message with + `getHistoryTail(1, true)` and sends only that message to the next-speaker + side query. The next-speaker prompt only asks about the immediately previous + model response, so sending full history was unnecessary clone and token + pressure. +2. `AgentToolInvocation` no longer retains full `responseParts` arrays inside + the live `task_execution.toolCalls` display. The real response parts still + flow through transcript/history paths, but the parent UI display now keeps + only a bounded text summary for nested tool-result streaming instead of + holding another full copy of large subagent tool outputs during long runs. +3. `GeminiChat.sendMessageStream()` now builds model request contents through + an internal curated-history view instead of calling public + `getHistory(true)`. Public `getHistory()` still returns a defensive + `structuredClone()` for external callers, but the request hot path no longer + deep-clones the whole retained chat history before every model call. + +TDD checks added for these mitigations: + +| Test | Expected protection | +| -------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | +| `checkNextSpeaker > should send only the last curated model message to the side query` | Prevents full-history clone/send in next-speaker checks | +| `AgentTool > should not retain responseParts in live tool call display after TOOL_RESULT` | Prevents live subagent display from retaining large tool responses | +| `AgentTool > should keep only a bounded result summary in live tool call display` | Preserves nested result readability without retaining the full response body | +| `GeminiChat > sendMessageStream > does not deep-clone the full curated history when building request contents` | Prevents request setup from hitting the `ValueDeserializer` / `StructuredClone` OOM path | + +Additional reproduction and fix validation: + +| Step | Command shape | Result | +| ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| Pre-fix deterministic clone pressure | `node --max-old-space-size=256 scripts/memory-pressure-repro.mjs ... --clone-count=2 --mode=clone` | OOM, exit 134; stderr contained `Reached heap limit` and `ValueDeserializer` / `StructuredClone`; max RSS 528.1 MiB in the repeat run | +| Red test | targeted `GeminiChat` test with `structuredClone` forced to throw during request setup | failed at `GeminiChat.getHistory()` before the mitigation | +| Green test | same targeted `GeminiChat` test after the mitigation | passed | +| Built-code smoke | `node --max-old-space-size=256` against the built core package, with a 96-entry / about 48 MiB history and `structuredClone` forced to throw | passed; request had 97 contents; process RSS 161.4 MiB, `/usr/bin/time -l` max RSS 161.6 MiB | + +This narrows the earlier "same stack family" statement: the deterministic +synthetic OOM still proves retained full-history clones can fail in the same V8 +stack family as the user log, while the new `GeminiChat` red/green test proves +one real production request-setup path no longer reaches that clone point. +Checkpoint/resume and compression internals still need separate long-run +validation because they can legitimately need durable copied history. + +Verification commands: + +| Command | Result | +| ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `npx vitest run src/core/geminiChat.test.ts` | passed, 89 tests | +| `npx vitest run src/utils/nextSpeakerChecker.test.ts --coverage=false` | passed, 13 tests | +| `npx vitest run src/tools/agent/agent.test.ts --coverage=false` | passed, 77 tests | +| `npx vitest run --config ./scripts/tests/vitest.config.ts scripts/tests/memory-pressure-repro.test.js` | passed, 1 test | +| `npm run build --workspace=packages/core` | passed | +| `npm run build --workspace=packages/cli` | passed | +| `npm run typecheck --workspace=packages/core` | passed | +| `npm run typecheck --workspace=packages/cli` | passed | +| `npm run bundle` | passed | +| `npm run build` | failed in `packages/vscode-ide-companion` lint on existing internal-module import rules; core, CLI, bundle, and targeted tests above passed | + +The full root `npm run build` was not clean in this worktree because the +`vscode-ide-companion` package hit pre-existing `import/no-internal-modules` +lint errors. The core/CLI build and bundle needed for the local runtime test +completed successfully. + +The same PR review prompt was then run with a temporary config where MCP and +hooks were disabled. Both rows were interrupted after a bounded long-run window +instead of waiting for a full review to finish. **Caveat**: the two runs are +confounded by workload size (79K vs 390K tokens) and cannot be compared as a +controlled experiment. The comparison only shows directional evidence. + +| Variant | Runtime | MCP servers | Tools | Assistant messages | Tool use/result blocks | Parent tool ids | Total tokens | Max input tokens | Root max RSS | +| ----------------- | ------: | ----------: | ----: | -----------------: | ---------------------: | --------------: | -----------: | ---------------: | -----------: | +| before mitigation | 365.08s | 0 | 19 | 42 | 42 / 42 | 3 | 79,439 | 26,807 | 357.7 MiB | +| after mitigation | 404.52s | 0 | 19 | 58 | 52 / 42 | 2 | 390,339 | 54,000 | 310.5 MiB | + +This is not a deterministic apples-to-apples model benchmark: the patched run +did more work and consumed substantially more total tokens before the manual +cutoff. The useful signal is narrower: under a disabled-MCP review case with +more observed work, root max RSS did not increase and was about 47.2 MiB lower. +That supports the mitigation direction, but it does not prove the whole +long-task OOM class is fixed. + +Remaining high-risk clone/retention paths to inspect next: + +1. Compression still calls full `getHistory(true)` before summarization. If the + heap is already high, the compression attempt can create the peak that trips + OOM. +2. Checkpoint creation can hold original history, cloned client history, and a + serialized checkpoint payload at the same time. +3. Fork subagents still seed from parent history with `getHistory(true)`. +4. ACP/history export/summary/copy paths still call full `getHistory()` and + should be audited separately from the normal review loop. + +Version timing: + +| Issue | Created | Reported version | Signal | +| ----- | ---------- | ------------------------ | ---------------------------------------- | +| #2128 | 2026-03-05 | not specified | Long-session UI memory growth | +| #2562 | 2026-03-21 | not specified | `structuredClone` OOM in long sessions | +| #2868 | 2026-04-03 | 0.13.2 | Heap OOM | +| #2945 | 2026-04-07 | 0.14.0 | V8 heap OOM | +| #4116 | 2026-05-13 | 0.15.11 | OOM with structured-clone-style analysis | +| #4134 | 2026-05-14 | 0.15.11 | OOM | +| #4149 | 2026-05-14 | 0.15.10-nightly.20260513 | V8 heap OOM | +| #4167 | 2026-05-15 | 0.15.11 | Crash near compression | +| #4185 | 2026-05-15 | 0.15.11 | Heap pressure before token compaction | +| #4254 | 2026-05-17 | not specified | Memory keeps rising | +| #4276 | 2026-05-18 | 0.15.11 | V8 heap OOM | +| #4309 | 2026-05-19 | 0.15.11 | High memory warning around 7 GiB | + +The issue history does not prove that 0.15.10 introduced the OOM class; similar +reports existed in March and April. It does support a recent cluster beginning +around 2026-05-13, overlapping `v0.15.10`/`v0.15.11` releases. The relevant +diff between `v0.15.9` and `v0.15.10` touched subagent runtime, +non-interactive execution, `GeminiChat`, and compression code heavily, so this +range is a reasonable first bisect window. + +## Notes + +- The first code-navigation prompt allowed open-ended exploration and hit + `maxSessionTurns`; the successful rows above use a constrained command list. +- The first synthetic-diff attempt used a relative bundle path from inside the + temporary repositories; those failed immediately and are excluded from the + tables. The successful rows use the absolute local bundle path. +- Raw JSONL streams are not committed because they contain prompts, tool + commands, and tool output. The report only includes aggregate diagnostics. diff --git a/docs/e2e-tests/2026-05-21-qwen-0.15.11-default-heap-oom-stress-report.md b/docs/e2e-tests/2026-05-21-qwen-0.15.11-default-heap-oom-stress-report.md new file mode 100644 index 0000000000..e9579dee1b --- /dev/null +++ b/docs/e2e-tests/2026-05-21-qwen-0.15.11-default-heap-oom-stress-report.md @@ -0,0 +1,338 @@ +# Qwen Code 0.15.11 默认 Heap OOM 压测报告 + +日期:2026-05-21 + +## 测试范围 + +本报告记录了针对 Qwen Code `0.15.11` 最新本地构建的一轮默认 heap 压测。 +这轮测试的目标是验证:在不人为降低内存上限的情况下,当前代码是否还能复现 +issue 中提到的长会话 OOM,以及在更极端的大输出场景下还有没有新的风险。 + +本轮覆盖三个模型: + +- `pai/glm-5` +- `qwen3.6-plus` +- `DeepSeek/deepseek-v4-pro` + +测试分为两部分: + +1. 真实长任务、多 agent 并发 review 循环。 +2. amplified foreground stdout 压测,即用大规模前台 shell stdout 放大 + tool-output 路径压力。 + +## 测试环境 + +| 项目 | 值 | +| --------------------------- | --------------------------------------------- | +| 分支 | `codex/memory-investigation-draft-pr` | +| Commit | `c161e0aa4` | +| CLI | 本地 `dist/cli.js` | +| CLI 版本 | `0.15.11` | +| Node 默认 heap limit | `4144 MiB` | +| `NODE_OPTIONS` | 未设置 | +| 显式 `--max-old-space-size` | 未设置 | +| runner `ulimit` | runner 未设置 | +| 配置模式 | 临时复制 `~/.qwen`,并隔离 `QWEN_RUNTIME_DIR` | +| MCP / 正常配置 | 尽量按复制后的正常配置加载 | + +注意:这里的 CLI 版本显示为 `0.15.11`,是因为 package version 尚未 bump。 +实际测试对象是 commit `c161e0aa4` 下本地编译出的 `dist/cli.js`,不是 PATH +里的全局 `qwen` 可执行文件。 + +本轮没有修改全局 Qwen 配置。原始 runtime artifacts 在: + +- `.qwen/runtime-bench/2026-05-20T13-51-58-731Z-oom-stress` +- `.qwen/runtime-bench/2026-05-20T15-20-37-790Z-oom-amplified` + +注意:本轮里 `env-center` MCP server 启动失败,但其他内置工具和部分 +MCP/child process 仍然加载。因此这些结果代表当前本地环境,不是完全 stripped +的 `--bare` 环境。 + +## 核心结论 + +最新本地构建在 issue 最关心的“长会话 V8 heap OOM”路径上表现明显更好。 +基于这轮默认 heap、多模型、多 agent、长任务压测,可以认为本 PR 对此前遇到的 +long-session heap OOM 问题已经基本解决,至少在当前复现维度下已经不能再复现 +原始 heap OOM。 + +真实长任务、多 agent 并发测试一共执行了: + +- 23 个 worker turn +- 约 `719,094,118` reported total tokens +- 77 次 agent tool call +- 856 次总 tool call + +这部分没有复现任何传统 V8 heap OOM 特征: + +- `JavaScript heap out of memory` +- `Reached heap limit` +- `Ineffective mark-compacts near heap limit` +- `Allocation failed` + +真实长任务阶段最高 process-tree RSS 为 `874.7 MiB`,最高 root-process RSS 为 +`219.1 MiB`。这说明在默认 heap 下,当前代码没有轻易复现原 issue 中那种长任务 +跑挂的 heap OOM。 + +第二阶段 amplified stdout 压测更激进。它一共执行了 18 个 payload attempt, +覆盖三个模型和 `128 MiB` 到 `2048 MiB` 的 foreground stdout payload。 + +结果是: + +- 三个模型都成功跑过 `1536 MiB` payload。 +- 最高成功 process-tree RSS 是 `5964.7 MiB`,出现在 `qwen3.6-plus` + 的 `1536 MiB` payload。 +- 到 `2048 MiB` payload 时,出现了一个新的 extreme large-output failure。 + +`2048 MiB` 的结果: + +- `pai/glm-5`:`exit=1`,stdout 为空,没有标准 OOM 文本。 +- `qwen3.6-plus`:`exit=1`,stdout 为空,没有标准 OOM 文本。 +- `DeepSeek/deepseek-v4-pro`:出现 V8 fatal: + `Check failed: i::kMaxInt >= len`,栈在 + `v8::String::NewFromOneByte` / `node::StringBytes::Encode` / + `DecodeUTF8`。 + +这个新问题不是原 issue 中的传统 long-session heap OOM。它更像是 +multi-GiB foreground stdout 被解码/构造成 JS string 时触发的 V8 字符串长度 +限制或大输出处理问题。建议作为 large-output follow-up 跟踪,而不是把它当作 +当前长会话 heap-pressure 修复失败。 + +## Phase 1:真实长任务、多 Agent 并发压测 + +### 测试形态 + +每个模型 worker 都复用同一个 session,不断 `--resume`。每一轮要求 Qwen Code: + +- 进行只读代码审查和代码搜索; +- 在同一轮中并发启动至少 4 个 `agent` tool call; +- 重点检查 chat history、compaction、subagent runtime、non-interactive + streaming、provider adapters 等 memory 相关区域; +- 保留足够详细的最终回答,让 session history 自然增长。 + +runner 每秒采样 process-tree RSS,没有设置任何额外 heap cap。 + +这部分在观察到内存比较稳定后用 `SIGTERM` 主动停止,以便切换到第二阶段的 +amplified stdout 压测。因此表里的 `SIGTERM` 不是 OOM。 + +### 汇总结果 + +| Model | Worker turns | Total tokens | Agent calls | Tool calls | Peak tree RSS | Peak root RSS | Last exit | OOM | +| -------------------------- | -----------: | --------------: | ----------: | ---------: | ------------: | ------------: | --------- | ------ | +| `pai/glm-5` | 9 | 444,614,704 | 36 | 362 | 874.7 MiB | 217.4 MiB | `SIGTERM` | no | +| `qwen3.6-plus` | 7 | 101,425,927 | 17 | 346 | 862.7 MiB | 219.1 MiB | `SIGTERM` | no | +| `DeepSeek/deepseek-v4-pro` | 7 | 173,053,487 | 24 | 148 | 864.5 MiB | 213.8 MiB | `SIGTERM` | no | +| **Total / max** | **23** | **719,094,118** | **77** | **856** | **874.7 MiB** | **219.1 MiB** | - | **no** | + +### 分轮结果 + +| Model | Turn | Exit | Timed out | OOM | Peak tree RSS | Peak root RSS | Total tokens | Agent calls | Tool calls | +| -------------------------- | ---: | --------- | --------- | --- | ------------: | ------------: | -----------: | ----------: | ---------: | +| `DeepSeek/deepseek-v4-pro` | 1 | `0` | no | no | 709.1 MiB | 167.3 MiB | 5,565,147 | 4 | 37 | +| `DeepSeek/deepseek-v4-pro` | 2 | `0` | no | no | 674.5 MiB | 118.8 MiB | 13,989,721 | 4 | 29 | +| `DeepSeek/deepseek-v4-pro` | 3 | `0` | no | no | 734.1 MiB | 148.0 MiB | 22,621,542 | 4 | 24 | +| `DeepSeek/deepseek-v4-pro` | 4 | `0` | no | no | 771.1 MiB | 107.5 MiB | 33,470,249 | 4 | 22 | +| `DeepSeek/deepseek-v4-pro` | 5 | `0` | no | no | 864.5 MiB | 212.9 MiB | 43,540,313 | 4 | 19 | +| `DeepSeek/deepseek-v4-pro` | 6 | `0` | no | no | 807.6 MiB | 167.9 MiB | 53,866,515 | 4 | 17 | +| `DeepSeek/deepseek-v4-pro` | 7 | `SIGTERM` | no | no | 785.1 MiB | 213.8 MiB | n/a | n/a | n/a | +| `pai/glm-5` | 1 | `SIGTERM` | yes | no | 742.8 MiB | 170.5 MiB | 17,071,519 | 4 | 142 | +| `pai/glm-5` | 2 | `0` | no | no | 874.7 MiB | 217.4 MiB | 27,438,727 | 4 | 60 | +| `pai/glm-5` | 3 | `0` | no | no | 699.7 MiB | 102.1 MiB | 35,627,222 | 4 | 38 | +| `pai/glm-5` | 4 | `0` | no | no | 796.0 MiB | 194.0 MiB | 44,130,101 | 4 | 23 | +| `pai/glm-5` | 5 | `0` | no | no | 743.4 MiB | 152.1 MiB | 50,465,979 | 4 | 26 | +| `pai/glm-5` | 6 | `0` | no | no | 714.9 MiB | 125.2 MiB | 56,357,372 | 4 | 18 | +| `pai/glm-5` | 7 | `0` | no | no | 694.5 MiB | 96.6 MiB | 64,047,037 | 4 | 20 | +| `pai/glm-5` | 8 | `0` | no | no | 756.0 MiB | 136.8 MiB | 71,891,505 | 4 | 15 | +| `pai/glm-5` | 9 | `SIGTERM` | no | no | 755.7 MiB | 157.3 MiB | 77,585,242 | 4 | 20 | +| `qwen3.6-plus` | 1 | `0` | no | no | 735.1 MiB | 153.1 MiB | 3,890,508 | 4 | 83 | +| `qwen3.6-plus` | 2 | `0` | no | no | 702.4 MiB | 142.5 MiB | 4,300,186 | 1 | 9 | +| `qwen3.6-plus` | 3 | `0` | no | no | 862.7 MiB | 219.1 MiB | 8,635,953 | 4 | 88 | +| `qwen3.6-plus` | 4 | `SIGTERM` | yes | no | 685.8 MiB | 106.5 MiB | n/a | n/a | n/a | +| `qwen3.6-plus` | 5 | `0` | no | no | 610.5 MiB | 93.1 MiB | 40,191,337 | 4 | 87 | +| `qwen3.6-plus` | 6 | `0` | no | no | 723.6 MiB | 121.9 MiB | 44,407,943 | 4 | 79 | +| `qwen3.6-plus` | 7 | `SIGTERM` | no | no | 810.4 MiB | 116.0 MiB | n/a | n/a | n/a | + +### Phase 1 解读 + +这是本轮里最能说明原始 long-session OOM 已明显改善的数据。 + +这组测试比 5 月 18 日的小 PR review / code navigation 更重:它包含更多 +`--resume`、更多 subagent activity、更大的 reported token 量和更多 tool call。 +但 process-tree RSS 始终低于 `0.9 GiB`,也没有出现传统 V8 heap OOM。 + +这不能证明所有用户 OOM 都不可能再发生,但至少说明当前构建在默认 heap 下, +已经无法轻易复现 issue 中那类长会话 heap-pressure OOM。 + +## Phase 2:Amplified Foreground Stdout 压测 + +### 测试形态 + +第二阶段故意放大 shell-output 路径压力。每个模型、每个 payload size 都要求 +parent session 和并发 agents 运行前台 shell 命令,输出大量 `x` 到 stdout: + +```bash +node -e "const chunk='x'.repeat(1024*1024); for (let i=0; i= len. +... +v8::String::NewFromOneByte +node::StringBytes::Encode +node::encoding_binding::BindingData::DecodeUTF8 +``` + +触发条件: + +- Model:`DeepSeek/deepseek-v4-pro` +- Payload:`2048 MiB` +- Peak tree RSS:`4660.4 MiB` +- Largest process RSS:`4527.6 MiB` +- runner 记录 exit:`SIGTERM`,因为 fatal 输出已经捕获后,剩余子进程仍在高 CPU + 空转,被手动终止。 + +`pai/glm-5` 和 `qwen3.6-plus` 在 `2048 MiB` 也失败,表现为 stdout 为空、 +exit code `1`,但 stderr 没有捕获到 V8 fatal stack。 + +### 严重程度 + +这是一个真实的 robustness 问题,但触发条件是 multi-GiB foreground stdout, +不是正常代码审查任务。它也不能证明当前 long-session heap-pressure 修复失败。 + +### 是否是本 PR 引入? + +本轮没有证据表明 `2048 MiB` stdout failure 是当前 memory PR 引入的回归。 + +原因: + +- 失败路径是 foreground shell stdout decode / string construction。 +- 原 issue 路径是 long-session history、compaction、clone pressure。 +- 本轮没有做同 payload 的 pre-PR baseline,因此不能归因成 regression。 +- 该 failure 只在刻意极端的 `2048 MiB` payload 出现;`128 MiB` 到 + `1536 MiB` 都能完成。 + +建议把它作为 dedicated large-output follow-up:更早 stream / spool / hard-cap +foreground shell output,避免在内存里构造 multi-GiB JS string。除非当前 PR 的目标 +明确包含“任意 multi-GiB 前台 stdout 都必须可处理”,否则不建议把它作为当前 PR 的 +blocker。 + +## 结论 + +1. 最新本地 `0.15.11` 构建在 issue 报告的 long-session heap OOM 方向上明显更好。 + 基于当前默认 heap 压测结果,可以认为本 PR 已经基本解决此前遇到的 + long-session heap OOM 复现路径。 + +2. 在默认 Node heap 下,真实长任务 + 多 agent review loop 没有在 + `pai/glm-5`、`qwen3.6-plus`、`DeepSeek/deepseek-v4-pro` 三个模型上复现传统 + V8 heap OOM。 + +3. synthetic foreground stdout 压测仍能把 process-tree RSS 推得很高。当前构建在 + 三模型上都撑过了 `1536 MiB` payload,最高成功 tree RSS 是 `5964.7 MiB`。 + +4. 仍然存在一个独立的极端 large-output 问题:`2048 MiB` stdout 附近,Qwen Code + 可能在输出 JSON 结果前失败;DeepSeek case 捕获到了 V8 string-length fatal。 + +5. 这个新发现重要,但更像是后续 large-output robustness 问题,不应直接作为 + long-session heap-pressure mitigation 的 blocker。 + +## 建议发到 PR 的评论摘要 + +建议 PR 评论里只放精简摘要,完整数据放本文档: + +```markdown +I reran default-heap stress tests on the latest local build with +`pai/glm-5`, `qwen3.6-plus`, and `DeepSeek/deepseek-v4-pro`. + +No `NODE_OPTIONS`, `--max-old-space-size`, or runner `ulimit` was used. The +local Node heap limit was about 4144 MiB. + +Results: + +- Realistic long-session + multi-agent review loop: 23 worker turns, + ~719M reported total tokens, 77 agent calls, 856 total tool calls. + No traditional V8 heap OOM was reproduced. Peak process-tree RSS was + 874.7 MiB; peak root RSS was 219.1 MiB. +- Amplified stdout stress: 18 payload attempts across 128 MiB -> 2048 MiB. + All three models completed through 1536 MiB payloads without traditional + heap OOM. Highest successful process-tree RSS was 5964.7 MiB. +- At 2048 MiB foreground stdout, an extreme large-output failure remains. + DeepSeek captured a V8 fatal `Check failed: i::kMaxInt >= len` stack in + `String::NewFromOneByte` / `StringBytes::Encode` / `DecodeUTF8`. + +Conclusion: this PR appears to have effectively addressed the previously +observed long-session heap OOM reproduction path under default heap. The +2048 MiB stdout failure is a separate large-output/string-limit robustness issue +and should be tracked as a follow-up rather than treated as the same +long-session heap OOM regression. +``` diff --git a/docs/plans/2026-05-18-qwen-runtime-memory-investigation.md b/docs/plans/2026-05-18-qwen-runtime-memory-investigation.md new file mode 100644 index 0000000000..393e3dc8dc --- /dev/null +++ b/docs/plans/2026-05-18-qwen-runtime-memory-investigation.md @@ -0,0 +1,240 @@ +# Qwen Code Runtime Memory Investigation Plan + +Date: 2026-05-18 + +## Context + +Local benchmarks show Qwen Code using substantially more process-tree RSS than +Claude Code for similar non-interactive CLI task shapes. The latest five-case +matrix found Qwen Code peaking around `0.83-1.04 GiB` while Claude Code stayed +around `0.27-0.36 GiB`. + +This document proposes a draft investigation and optimization direction. It is +not intended to claim a final root cause yet. The immediate goal is to make the +memory gap reviewable, reproducible, and explainable with internal diagnostics. + +## Progress So Far + +The investigation has reached the evidence-and-direction stage: + +- A repeatable local matrix has been built for small PR review, code navigation, + and synthetic diff workloads. +- Qwen Code has been compared across multiple models. +- Qwen Code and Claude Code have been compared on the same task shapes where + equivalent model endpoints were available. +- The observed RSS gap is consistent enough to justify deeper runtime + diagnostics. +- Related upstream work has been mapped so this effort can build on existing + `/doctor memory` and memory-diagnostics follow-ups. + +The investigation has not yet reached the final root-cause stage because +external process RSS cannot show whether the retained memory is V8 heap, native +memory, loaded modules, live history, tool results, or request assembly state. + +## Current Evidence + +The companion benchmark report is: + +- `docs/e2e-tests/2026-05-18-qwen-memory-benchmark-report.md` + +The main evidence is: + +- The Qwen-vs-Claude RSS gap reproduced across small PR review, code + navigation, and synthetic diff workloads. +- The gap reproduced with both `pai/glm-5` and `qwen3.6-plus`. +- Qwen Code used more tokens than Claude Code in every tested matrix cell. +- Large diff size did not produce a clean linear memory increase, which suggests + the baseline and bounded/truncated output paths matter more than raw diff + bytes alone. + +## Related Work + +Relevant upstream work already exists: + +| Item | Status | Role in the memory work | +| ------- | --------------------- | --------------------------------------------------------------------------------------------------------------- | +| `#4180` | merged PR | Adds baseline `/doctor memory` diagnostics. This is the first instrumentation slice. | +| `#4181` | open issue, no PR yet | Adds interpretation and pressure classification for `/doctor memory`. | +| `#4182` | open issue, no PR yet | Adds structured `/doctor memory --json` output and safe session-scale stats. | +| `#4183` | open issue, no PR yet | Adds opt-in heap snapshots and bounded memory timeline diagnostics. | +| `#4184` | open issue, no PR yet | Adds large tool-result retention diagnostics and designs offload/preview mitigation. | +| `#4127` | open PR, conflicting | Adds heap-pressure safety nets for long-session OOM prevention. Useful mitigation, not enough for attribution. | +| `#4168` | open PR | Redesigns auto-compaction thresholds. Useful for context pressure, not enough for task-time footprint analysis. | +| `#4172` | open PR | Decouples auto-memory recall from the main request path. Useful for latency/blocking, not direct RSS proof. | +| `#4188` | merged PR | Bounds build/test caches to prevent OOM in parallel test runs. Important but separate from runtime benchmarks. | + +This investigation should build on that direction rather than wait for all +follow-up issues to land. + +Most of the remaining work is instrumentation-first. The open diagnostics +issues are designed to make memory reports explainable before attempting a +runtime fix. The open mitigation PRs may reduce specific OOM paths, but they do +not yet explain why short non-interactive CLI tasks repeatedly peak near +`1 GiB`. + +## Why This Draft Starts With Documentation + +This draft intentionally starts with benchmark evidence and an investigation +plan instead of bundling a runtime code change. + +Reasons: + +1. The current goal is to make the performance problem and direction visible, + not to claim a same-day fix. +2. Adding instrumentation and optimization in the same PR would make review + harder because it mixes measurement, diagnosis, and behavior changes. +3. The existing benchmark already supports the need for deeper diagnostics. +4. The next PR can be narrower and easier to validate: diagnostics-only, then + rerun the same matrix and compare internal metrics. + +The next implementation PR should add the missing counters and timeline points, +then rerun the benchmark matrix. Only after that should a targeted optimization +PR attempt to reduce memory. + +## Working Inference + +The current data points toward a Qwen Code runtime/path issue more than a model +provider issue. + +The strongest current inference is: + +> Qwen Code appears to carry a high non-interactive CLI task execution +> footprint, likely amplified by larger context/tool-result/session handling. +> The likely problem area is the CLI runtime and agent data path, not the +> selected model alone. + +More specifically, the evidence points away from "too many tool calls" as the +primary cause. Tool-call counts were similar across CLIs, and Claude sometimes +used more turns or tool calls while keeping lower RSS. The more plausible +problem is that Qwen Code initializes or retains heavier state for the same +short non-interactive CLI task, then amplifies that execution footprint with +larger context, tool-result, saved-output, or session-history data. + +The most likely buckets are: + +1. **Process and module startup/execution cost**: Qwen Code may initialize more + runtime, tools, UI/session infrastructure, or provider machinery than needed + for non-interactive CLI tasks. +2. **History and context assembly**: Qwen Code may retain or construct larger + model-facing context than Claude Code for the same task shape. +3. **Tool-result retention**: large or repeated tool results may be retained in + live history, UI history, chat recording, or saved-output recovery paths. +4. **Subagent and saved-output amplification**: previous large PR tests showed + saved-output recovery and subagent activity, which can add memory and token + pressure. +5. **MCP child processes**: the companion diagnostics report revealed that MCP + servers (e.g. chrome-devtools) contribute ~350 MiB to process-tree RSS. This + inflates the absolute numbers but is a constant overhead unrelated to session + length. +6. **Native memory versus JS heap split**: external RSS cannot tell whether the + pressure is V8 heap, native buffers, loaded modules, or retained data. + +This is deliberately phrased as an inference. The next step is to add enough +internal measurements to confirm or rule out each bucket. + +## Proposed Draft PR Scope + +The first draft PR should be evidence and diagnostics focused: + +1. Commit the benchmark report and investigation plan. +2. Add or extend local diagnostic output so Qwen Code can report: + - V8 heap and heap-space statistics. + - RSS versus heap split. + - session message count and approximate retained size. + - tool result count, total retained size, and largest retained result size. + - truncation and saved-output recovery counters. + - subagent/process-tree activity when available. +3. Re-run the existing matrix against: + - current published Qwen Code, + - current `main`, + - diagnostics-only branch, + - candidate optimization branch. +4. Use those measurements to choose one small optimization target. + +The first PR should avoid mixing several unrelated optimizations. It should +either remain documentation-only or add diagnostics-only code. A separate PR +should carry the first runtime memory reduction once the cause is clearer. + +## Candidate Optimization Directions + +These are candidates, not conclusions: + +1. **Bounded tool-output retention**: store large output out of the hot path and + keep only preview, metadata, and retrieval pointers in live history. +2. **Non-interactive lazy loading**: avoid initializing TUI-only or + interactive-only subsystems during non-interactive CLI task execution. +3. **Session/UI history caps**: degrade old or heavy history items into compact + transcript entries. +4. **Context assembly accounting**: measure and cap large tool results before + model request construction. +5. **Subagent accounting**: expose subagent lifecycle and memory impact in + diagnostics. + +Claude Code and OpenAI Codex (OpenAI's CLI coding agent) should be used as +design references for diagnostic separation, bounded output retention, and lazy +history loading. The implementation should still follow Qwen Code's own +architecture and tests. + +## Validation Plan + +The investigation should keep the same benchmark matrix so before/after results +remain comparable: + +- small PR review +- code navigation +- synthetic diff about 100 KiB +- synthetic diff about 1 MiB +- synthetic diff about 5 MiB + +For each run, record: + +- process-tree RSS peak +- root process RSS peak +- V8 heap peak +- heap-space summary +- duration +- turns +- token count +- tool call count +- largest retained tool result +- total retained tool-result size +- session/history item counts +- subagent count + +The minimum success condition for a candidate fix is not just "RSS went down". +It should also identify which internal metric changed and why. + +## Next PR Candidate + +The next PR should be diagnostics-only and should avoid changing runtime +behavior. A minimal useful slice would add: + +- model request input-size accounting; +- system prompt and tool schema size accounting; +- retained message count and approximate retained character size; +- retained tool-result count, total size, and largest item size; +- lifecycle samples around startup, first request assembly, tool execution, + streaming completion, compression, and final response; +- process memory samples that include RSS, heap used, heap total, external, and + heap-space stats. + +After that lands locally, rerun the same Qwen model matrix and compare: + +- published Qwen Code; +- current `main`; +- diagnostics-only branch; +- candidate optimization branch. + +## Non-Goals + +This draft does not claim that: + +- all memory pressure is caused by tool output; +- one existing open PR will solve the observed task-time footprint; +- model provider differences are irrelevant in every environment; +- single-run local measurements are sufficient for release-level performance + claims. + +The intended claim is narrower: Qwen Code shows a consistent local RSS gap in +the tested workloads, and the project needs internal diagnostics to explain and +reduce that gap. diff --git a/packages/cli/src/ui/commands/doctorCommand.test.ts b/packages/cli/src/ui/commands/doctorCommand.test.ts index f9afbd969c..315ebc8cbc 100644 --- a/packages/cli/src/ui/commands/doctorCommand.test.ts +++ b/packages/cli/src/ui/commands/doctorCommand.test.ts @@ -143,10 +143,13 @@ describe('doctorCommand', () => { }, ], resourceUsage: { - maxRSS: 4_000, + maxRSS: 4 * 1024, + maxRSSRaw: 4, + maxRSSUnit: 'KiB', userCPUTime: 10, systemCPUTime: 20, }, + processTree: null, activeHandles: 2, activeRequests: 0, openFileDescriptors: null, @@ -839,10 +842,13 @@ describe('doctorCommand', () => { nativeContexts: 1, }, resourceUsage: { - maxRSS: 8_000, + maxRSS: 8 * 1024, + maxRSSRaw: 8, + maxRSSUnit: 'KiB', userCPUTime: 10, systemCPUTime: 20, }, + processTree: null, activeHandles: 2, activeRequests: 0, v8HeapSpaces: null, @@ -946,10 +952,13 @@ describe('doctorCommand', () => { }, v8HeapSpaces: null, resourceUsage: { - maxRSS: 4_000, + maxRSS: 4 * 1024, + maxRSSRaw: 4, + maxRSSUnit: 'KiB', userCPUTime: 10, systemCPUTime: 20, }, + processTree: null, activeHandles: 2, activeRequests: 0, openFileDescriptors: null, @@ -992,7 +1001,14 @@ describe('doctorCommand', () => { detachedContexts: 0, nativeContexts: 1, }, - resourceUsage: { maxRSS: 0, userCPUTime: 0, systemCPUTime: 0 }, + resourceUsage: { + maxRSS: 0, + maxRSSRaw: 0, + maxRSSUnit: 'KiB', + userCPUTime: 0, + systemCPUTime: 0, + }, + processTree: null, activeHandles: 0, activeRequests: 0, v8HeapSpaces: null, diff --git a/packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts b/packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts index 987766b15d..2f6ba63cb5 100644 --- a/packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts +++ b/packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts @@ -35,6 +35,7 @@ import { } from '../../utils/runtimeFetchOptions.js'; import { DEFAULT_TIMEOUT } from '../openaiContentGenerator/constants.js'; import { createDebugLogger } from '../../utils/debugLogger.js'; +import { runtimeDiagnostics } from '../../utils/runtimeDiagnostics.js'; import { tokenLimit, CAPPED_DEFAULT_MAX_TOKENS, @@ -226,6 +227,7 @@ export class AnthropicContentGenerator implements ContentGenerator { let response: Message; try { const anthropicRequest = await this.buildRequest(request); + runtimeDiagnostics.recordAnthropicWireRequest(anthropicRequest); const headers = this.buildPerRequestHeaders(anthropicRequest); response = (await this.client.messages.create(anthropicRequest, { signal: request.config?.abortSignal, @@ -249,6 +251,7 @@ export class AnthropicContentGenerator implements ContentGenerator { ...anthropicRequest, stream: true, }; + runtimeDiagnostics.recordAnthropicWireRequest(streamingRequest); let stream: AsyncIterable; try { diff --git a/packages/core/src/core/client.ts b/packages/core/src/core/client.ts index 7f2d514ce0..efd0043fc1 100644 --- a/packages/core/src/core/client.ts +++ b/packages/core/src/core/client.ts @@ -301,10 +301,58 @@ export class GeminiClient { return this.getChat().getHistory(curated); } + getHistoryShallow(curated: boolean = false): Content[] { + const chat = this.getChat(); + return chat.getHistoryShallow?.(curated) ?? chat.getHistory(curated); + } + getHistoryTail(count: number, curated: boolean = false): Content[] { return this.getChat().getHistoryTail(count, curated); } + private getHistoryTailShallow( + count: number, + curated: boolean = false, + ): Content[] { + const chat = this.getChat(); + return ( + chat.getHistoryTailShallow?.(count, curated) ?? + chat.getHistoryTail?.(count, curated) ?? + chat.getHistory(curated).slice(-count) + ); + } + + private peekLastHistoryEntry(): Content | undefined { + const chat = this.getChat(); + return chat.peekLastHistoryEntry?.() ?? chat.getHistory().at(-1); + } + + private getHistoryLength(): number { + const chat = this.getChat(); + return chat.getHistoryLength?.() ?? chat.getHistory().length; + } + + private getLastModelMessageText(): string | undefined { + const chat = this.getChat(); + if (chat.getLastModelMessageText) { + return chat.getLastModelMessageText(); + } + const history = chat.getHistoryShallow?.() ?? chat.getHistory(); + for (let i = history.length - 1; i >= 0; i--) { + const message = history[i]; + if (message?.role !== 'model') continue; + const text = + message.parts + ?.filter( + (part): part is { text: string } => typeof part.text === 'string', + ) + .map((part) => part.text) + .join('') ?? ''; + return text || undefined; + } + return undefined; + } + /** * Pop orphaned trailing user entries from the in-memory chat history. * Used by: @@ -921,7 +969,7 @@ export class GeminiClient { ) { const projectRoot = this.config.getProjectRoot(); const sessionId = this.config.getSessionId(); - const history = this.getHistory(); + const history = this.getHistoryShallow(); const mgr = this.config.getMemoryManager(); const autoSkillEnabled = this.config.getAutoSkillEnabled(); @@ -985,7 +1033,7 @@ export class GeminiClient { const projectRoot = this.config.getProjectRoot(); const sessionId = this.config.getSessionId(); - const history = this.getHistory(); + const history = this.getHistoryShallow(); const mgr = this.config.getMemoryManager(); if (!this.config.getManagedAutoMemoryEnabled()) { @@ -1259,7 +1307,7 @@ export class GeminiClient { // retries/hooks) so that model latency during a tool-call loop // doesn't count as user idle time. const mcResult = microcompactHistory( - this.getChat().getHistory(), + this.getHistoryShallow(), this.lastApiCompletionTimestamp, this.config.getClearContextOnIdle(), ); @@ -1394,9 +1442,8 @@ export class GeminiClient { // part from the user immediately follows a functionCall part from the model // in the conversation history . The IDE context is not discarded; it will // be included in the next regular message sent to the model. - const history = this.getHistory(); - const lastMessage = - history.length > 0 ? history[history.length - 1] : undefined; + const historyLength = this.getHistoryLength(); + const lastMessage = this.peekLastHistoryEntry(); const hasPendingToolCall = !!lastMessage && lastMessage.role === 'model' && @@ -1407,7 +1454,7 @@ export class GeminiClient { if (this.config.getIdeMode() && !hasPendingToolCall) { const { contextParts, newIdeContext } = this.getIdeContextParts( - this.forceFullIdeContext || history.length === 0, + this.forceFullIdeContext || historyLength === 0, ); if (contextParts.length > 0) { ideContextText = wrapIdeContext(contextParts.join('\n')); @@ -1643,16 +1690,8 @@ export class GeminiClient { !signal.aborted && this.config.hasHooksForEvent('Stop') ) { - // Get response text from the chat history - const history = this.getHistory(); - const lastModelMessage = history - .filter((msg) => msg.role === 'model') - .pop(); const responseText = - lastModelMessage?.parts - ?.filter((p): p is { text: string } => 'text' in p) - .map((p) => p.text) - .join('') || '[no response text]'; + this.getLastModelMessageText() || '[no response text]'; const response = await messageBus.request< HookExecutionRequest, @@ -1817,12 +1856,11 @@ export class GeminiClient { // see the current turn's history regardless of which path exits below. try { const chat = this.getChat(); - const fullHistory = chat.getHistory(true); const maxHistoryForCache = 40; - const cachedHistory = - fullHistory.length > maxHistoryForCache - ? fullHistory.slice(-maxHistoryForCache) - : fullHistory; + const cachedHistory = this.getHistoryTailShallow( + maxHistoryForCache, + true, + ); saveCacheSafeParams( chat.getGenerationConfig(), cachedHistory, @@ -2008,7 +2046,8 @@ export class GeminiClient { signal, ); if (info.compressionStatus === CompressionStatus.COMPRESSED) { - const compressedHistory = this.getChat().getHistory(); + const chat = this.getChat(); + const compressedHistory = chat.getHistoryShallow?.() ?? chat.getHistory(); await this.startChat(compressedHistory, SessionStartSource.Compact); if ( !this.lastSessionStartContext && diff --git a/packages/core/src/core/geminiChat.test.ts b/packages/core/src/core/geminiChat.test.ts index fd25e8a220..5f3caa976b 100644 --- a/packages/core/src/core/geminiChat.test.ts +++ b/packages/core/src/core/geminiChat.test.ts @@ -27,18 +27,6 @@ import { CompressionStatus, type ChatCompressionInfo } from './turn.js'; import { ChatCompressionService } from '../services/chatCompressionService.js'; import { SessionStartSource } from '../hooks/types.js'; -const { mockGetHeapStatistics } = vi.hoisted(() => ({ - mockGetHeapStatistics: vi.fn(), -})); - -vi.mock('node:v8', async (importOriginal) => { - const actual = await importOriginal(); - return { - ...actual, - getHeapStatistics: mockGetHeapStatistics, - }; -}); - // Mock fs module to prevent actual file system operations during tests const mockFileSystem = new Map(); @@ -115,10 +103,6 @@ describe('GeminiChat', async () => { // Default mock implementation for tests that don't care about retry logic mockRetryWithBackoff.mockImplementation(async (apiCall) => apiCall()); - mockGetHeapStatistics.mockReturnValue({ - used_heap_size: 0, - heap_size_limit: Number.MAX_SAFE_INTEGER, - }); mockConfig = { getSessionId: () => 'test-session-id', getTelemetryLogPromptsEnabled: () => true, @@ -1077,6 +1061,61 @@ describe('GeminiChat', async () => { ); }); + it('does not deep-clone the full curated history when building request contents', async () => { + chat.setHistory([ + { role: 'user', parts: [{ text: 'prior question' }] }, + { role: 'model', parts: [{ text: 'prior answer' }] }, + ]); + const response = (async function* () { + yield { + candidates: [ + { + content: { + parts: [{ text: 'response' }], + role: 'model', + }, + finishReason: 'STOP', + index: 0, + safetyRatings: [], + }, + ], + text: () => 'response', + } as unknown as GenerateContentResponse; + })(); + vi.mocked(mockContentGenerator.generateContentStream).mockResolvedValue( + response, + ); + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('structuredClone should not build request contents'); + }); + + try { + const stream = await chat.sendMessageStream( + 'test-model', + { message: 'hello' }, + 'prompt-id-no-request-clone', + ); + for await (const _ of stream) { + // consume stream + } + } finally { + structuredCloneSpy.mockRestore(); + } + + expect(mockContentGenerator.generateContentStream).toHaveBeenCalledWith( + expect.objectContaining({ + contents: [ + { role: 'user', parts: [{ text: 'prior question' }] }, + { role: 'model', parts: [{ text: 'prior answer' }] }, + { role: 'user', parts: [{ text: 'hello' }] }, + ], + }), + 'prompt-id-no-request-clone', + ); + }); + it('should not update global telemetry when no telemetryService is provided (subagent isolation)', async () => { // Simulate a subagent GeminiChat: created without a telemetryService const subagentChat = new GeminiChat(mockConfig, config, []); @@ -1223,7 +1262,10 @@ describe('GeminiChat', async () => { compressionStatus: CompressionStatus.NOOP, }, }); - vi.spyOn(chat, 'getHistory').mockImplementationOnce(() => { + vi.spyOn( + chat as unknown as { getRequestHistory: () => Content[] }, + 'getRequestHistory', + ).mockImplementationOnce(() => { throw new Error('history setup failed'); }); @@ -1928,6 +1970,65 @@ describe('GeminiChat', async () => { }); }); + describe('getHistoryShallow', () => { + it('copies containers without structured-cloning large part payloads', () => { + const payload = { output: 'x'.repeat(128 * 1024) }; + const content: Content = { + role: 'user', + parts: [ + { + functionResponse: { + id: 'call-1', + name: 'read_file', + response: payload, + }, + }, + ], + }; + chat.addHistory(content); + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('unexpected deep clone'); + }); + + const history = chat.getHistoryShallow(); + + expect(structuredCloneSpy).not.toHaveBeenCalled(); + expect(history).toEqual([content]); + expect(history[0]).not.toBe(content); + expect(history[0]!.parts).not.toBe(content.parts); + const response = history[0]!.parts![0] as { + functionResponse: { response: typeof payload }; + }; + expect(response.functionResponse.response).toBe(payload); + }); + }); + + describe('getHistoryTailShallow', () => { + it('copies only recent containers without cloning payloads', () => { + const oldContent: Content = { role: 'user', parts: [{ text: 'old' }] }; + const recentContent: Content = { + role: 'model', + parts: [{ text: 'recent' }], + }; + chat.addHistory(oldContent); + chat.addHistory(recentContent); + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('unexpected deep clone'); + }); + + const tail = chat.getHistoryTailShallow(1); + + expect(structuredCloneSpy).not.toHaveBeenCalled(); + expect(tail).toEqual([recentContent]); + expect(tail[0]).not.toBe(recentContent); + expect(tail[0]!.parts).not.toBe(recentContent.parts); + }); + }); + describe('getLastHistoryEntry', () => { it('returns undefined for an empty history', () => { expect(chat.getLastHistoryEntry()).toBeUndefined(); @@ -1948,6 +2049,42 @@ describe('GeminiChat', async () => { }); }); + describe('peekLastHistoryEntry', () => { + it('returns the last entry without structured-cloning the full history', () => { + const first: Content = { role: 'user', parts: [{ text: 'a' }] }; + const last: Content = { role: 'model', parts: [{ text: 'b' }] }; + chat.addHistory(first); + chat.addHistory(last); + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('unexpected deep clone'); + }); + + expect(chat.peekLastHistoryEntry()).toBe(last); + expect(structuredCloneSpy).not.toHaveBeenCalled(); + }); + }); + + describe('getLastModelMessageText', () => { + it('returns text from the latest model message without cloning history', () => { + chat.addHistory({ role: 'model', parts: [{ text: 'older' }] }); + chat.addHistory({ role: 'user', parts: [{ text: 'question' }] }); + chat.addHistory({ + role: 'model', + parts: [{ text: 'new' }, { text: ' answer' }], + }); + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('unexpected deep clone'); + }); + + expect(chat.getLastModelMessageText()).toBe('new answer'); + expect(structuredCloneSpy).not.toHaveBeenCalled(); + }); + }); + describe('sendMessageStream with retries', () => { it('should retry on invalid content, succeed, and report metrics', async () => { vi.useFakeTimers(); @@ -3620,13 +3757,6 @@ describe('GeminiChat', async () => { return compressSpy; } - function mockHeapPressure(usedHeapSize: number, heapLimit = 1000) { - mockGetHeapStatistics.mockReturnValue({ - used_heap_size: usedHeapSize, - heap_size_limit: heapLimit, - }); - } - it('replaces history and updates per-chat lastPromptTokenCount on COMPRESSED', async () => { mockCompressionService('compressed'); chat.setHistory([userMsg('a'), modelMsg('b'), userMsg('c')]); @@ -3690,136 +3820,9 @@ describe('GeminiChat', async () => { it('forwards force=true to the compression service', async () => { const compressSpy = mockCompressionService('compressed'); - mockHeapPressure(900); await chat.tryCompress('p1', 'm1', true); expect(compressSpy.mock.calls[0][1].force).toBe(true); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); - expect(mockGetHeapStatistics).not.toHaveBeenCalled(); - }); - - it('uses heap pressure to bypass the token gate without manual force semantics', async () => { - const compressSpy = mockCompressionService('noop'); - mockHeapPressure(750); - vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({ - authType: AuthType.USE_GEMINI, - model: 'test-model', - contextWindowSize: 1000, - }); - - await chat.tryCompress('p1', 'm1'); - - expect(compressSpy.mock.calls[0][1].force).toBe(false); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - expect(compressSpy.mock.calls[0][1].originalTokenCount).toBe(0); - }); - - it('does not bypass the token gate below the heap-pressure threshold', async () => { - const compressSpy = mockCompressionService('noop'); - mockHeapPressure(650); - - await chat.tryCompress('p1', 'm1'); - - expect(compressSpy.mock.calls[0][1].force).toBe(false); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); - }); - - it('does not let a failed heap-pressure attempt latch off later auto-compaction', async () => { - const compressSpy = mockCompressionService('failed-inflated'); - mockHeapPressure(701); - - const first = await chat.tryCompress('p1', 'm1'); - expect(first.compressionStatus).toBe( - CompressionStatus.COMPRESSION_FAILED_INFLATED_TOKEN_COUNT, - ); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - - compressSpy.mockClear(); - compressSpy.mockResolvedValue({ - newHistory: null, - info: { - originalTokenCount: 0, - newTokenCount: 0, - compressionStatus: CompressionStatus.NOOP, - }, - }); - mockHeapPressure(0); - - await chat.tryCompress('p2', 'm1'); - - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); - expect(compressSpy.mock.calls[0][1].hasFailedCompressionAttempt).toBe( - false, - ); - }); - - it('backs off repeated heap-pressure bypasses after a heap-triggered failure', async () => { - vi.useFakeTimers(); - vi.setSystemTime(new Date('2026-05-16T00:00:00Z')); - try { - const compressSpy = mockCompressionService('failed-inflated'); - mockHeapPressure(800); - - await chat.tryCompress('p1', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - - compressSpy.mockClear(); - compressSpy.mockResolvedValue({ - newHistory: null, - info: { - originalTokenCount: 0, - newTokenCount: 0, - compressionStatus: CompressionStatus.NOOP, - }, - }); - - await chat.tryCompress('p2', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); - - vi.setSystemTime(new Date('2026-05-16T00:00:31Z')); - compressSpy.mockClear(); - - await chat.tryCompress('p3', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - } finally { - vi.useRealTimers(); - } - }); - - it('backs off repeated heap-pressure bypasses after a heap-triggered NOOP', async () => { - vi.useFakeTimers(); - vi.setSystemTime(new Date('2026-05-16T00:00:00Z')); - try { - const compressSpy = mockCompressionService('noop'); - mockHeapPressure(800); - - await chat.tryCompress('p1', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - - compressSpy.mockClear(); - - await chat.tryCompress('p2', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); - - vi.setSystemTime(new Date('2026-05-16T00:00:31Z')); - compressSpy.mockClear(); - - await chat.tryCompress('p3', 'm1'); - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(true); - } finally { - vi.useRealTimers(); - } - }); - - it('falls back to token-threshold behavior if heap statistics are unavailable', async () => { - const compressSpy = mockCompressionService('noop'); - mockGetHeapStatistics.mockImplementation(() => { - throw new Error('heap stats unavailable'); - }); - - await chat.tryCompress('p1', 'm1'); - - expect(compressSpy.mock.calls[0][1].bypassTokenThreshold).toBe(false); }); }); }); diff --git a/packages/core/src/core/geminiChat.ts b/packages/core/src/core/geminiChat.ts index c2fc71bbea..2655acd61c 100644 --- a/packages/core/src/core/geminiChat.ts +++ b/packages/core/src/core/geminiChat.ts @@ -17,7 +17,6 @@ import type { GenerateContentResponseUsageMetadata, } from '@google/genai'; import { createUserContent, FinishReason } from '@google/genai'; -import { getHeapStatistics } from 'node:v8'; import { retryWithBackoff, isUnattendedMode } from '../utils/retry.js'; import { getErrorStatus, isAbortError } from '../utils/errors.js'; import { createDebugLogger } from '../utils/debugLogger.js'; @@ -59,10 +58,6 @@ import { getCustomSystemPrompt } from './prompts.js'; const debugLogger = createDebugLogger('QWEN_CODE_CHAT'); -// Leave roughly 30% V8 heap headroom for compression's transient allocations. -const HEAP_PRESSURE_COMPRESSION_RATIO = 0.7; -const HEAP_PRESSURE_COMPRESSION_COOLDOWN_MS = 30_000; - /** * Replaces the args on a `structured_output` `functionCall` with the * same `__redacted` placeholder used by `ToolCallEvent` telemetry @@ -353,6 +348,13 @@ function extractCuratedHistory(comprehensiveHistory: Content[]): Content[] { return curatedHistory; } +function copyContentContainer(content: Content): Content { + return { + ...content, + ...(content.parts ? { parts: [...content.parts] } : {}), + }; +} + function stripThoughtPartsFromContent(content: Content): Content | null { if (!content.parts) { return content; @@ -441,14 +443,6 @@ export class GeminiChat { */ private hasFailedCompressionAttempt = false; - /** - * Heap-pressure compaction is process-wide pressure applied per chat. If one - * heap-triggered attempt cannot reduce history, briefly back off this chat - * so every subsequent send does not immediately pay for another compression - * side query while memory is already tight. - */ - private heapPressureCompressionCooldownUntil = 0; - /** * Creates a new GeminiChat instance. * @@ -482,6 +476,18 @@ export class GeminiChat { return this.lastPromptTokenCount; } + /** + * Builds request contents for the content generator without deep-cloning the + * whole chat history. This is an internal hot path: long sessions can make a + * full `structuredClone` larger than the remaining V8 heap headroom. + * + * Public history readers still use {@link getHistory}, which returns a + * defensive deep copy for caller mutation safety. + */ + private getRequestHistory(): Content[] { + return extractCuratedHistory(this.history).map(copyContentContainer); + } + /** * Seed the last-prompt-token-count for chats created with inherited * history (forks, subagents, speculation). Without this, the auto-compress @@ -509,33 +515,6 @@ export class GeminiChat { signal?: AbortSignal, options?: TryCompressOptions, ): Promise { - const heapPressureRatio = force ? null : this.getHeapPressureRatio(); - const heapPressureCooldownActive = - !force && Date.now() < this.heapPressureCompressionCooldownUntil; - const bypassTokenThreshold = - heapPressureRatio !== null && - heapPressureRatio >= HEAP_PRESSURE_COMPRESSION_RATIO && - !heapPressureCooldownActive; - if (bypassTokenThreshold) { - // Temporary safety net: token-based compaction can be too late for - // large-context sessions because JS heap pressure may hit first. - // Do not use force=true here because that carries manual /compress - // semantics in ChatCompressionService. - debugLogger.warn( - `Heap pressure at ${(heapPressureRatio * 100).toFixed(1)}%; ` + - 'attempting auto-compaction before token threshold.', - ); - } else if ( - heapPressureRatio !== null && - heapPressureRatio >= HEAP_PRESSURE_COMPRESSION_RATIO && - heapPressureCooldownActive - ) { - debugLogger.debug( - `Heap pressure at ${(heapPressureRatio * 100).toFixed(1)}%; ` + - 'skipping heap-pressure auto-compaction during cooldown.', - ); - } - const service = new ChatCompressionService(); const { newHistory, info } = await service.compress(this, { promptId, @@ -545,7 +524,6 @@ export class GeminiChat { hasFailedCompressionAttempt: this.hasFailedCompressionAttempt, originalTokenCount: options?.originalTokenCountOverride ?? this.lastPromptTokenCount, - bypassTokenThreshold, trigger: options?.trigger, signal, }); @@ -555,37 +533,13 @@ export class GeminiChat { info, compressedHistory: newHistory, }); - // Auto-compaction replaces history in place — no env-context refresh - // here. Manual /compress goes through GeminiClient.tryCompressChat, - // which calls startChat() to re-prepend a fresh env snapshot. See - // GeminiClient.sendMessageStream for the rationale behind the split. this.setHistory(newHistory); - // Compaction summarises away prior full-Read tool results, but the - // FileReadCache still treats those reads as "in this conversation". - // A follow-up Read could then return the file_unchanged placeholder - // pointing at content the model can no longer retrieve from history. debugLogger.debug('[FILE_READ_CACHE] clear after auto tryCompress'); this.config.getFileReadCache().clear(); this.lastPromptTokenCount = info.newTokenCount; - // Mirror to the global singleton only when wired (main session). - // Subagents pass `telemetryService=undefined` to keep their context - // usage out of the main agent's UI counters. this.telemetryService?.setLastPromptTokenCount(info.newTokenCount); - // Re-enable auto-compaction so a forced /compress recovers a chat - // that an earlier auto-attempt latched off. this.hasFailedCompressionAttempt = false; - this.heapPressureCompressionCooldownUntil = 0; - } else if (bypassTokenThreshold) { - // If heap-pressure compaction cannot reduce history (NOOP or failure), - // avoid repeatedly cloning history and/or paying side-query latency while - // the process-wide pressure remains high. - this.heapPressureCompressionCooldownUntil = - Date.now() + HEAP_PRESSURE_COMPRESSION_COOLDOWN_MS; } else if (isCompressionFailureStatus(info.compressionStatus)) { - // Track failed attempts (only mark as failed if not forced) so we - // stop spending compression-API calls on a chat that can't shrink. - // Heap-pressure attempts are a safety net, not evidence that normal - // token-threshold compaction should be latched off for this chat. if (!force) { this.hasFailedCompressionAttempt = true; } @@ -594,24 +548,6 @@ export class GeminiChat { return info; } - private getHeapPressureRatio(): number | null { - try { - const { used_heap_size: usedHeapSize, heap_size_limit: heapLimit } = - getHeapStatistics(); - if ( - !Number.isFinite(usedHeapSize) || - usedHeapSize < 0 || - !Number.isFinite(heapLimit) || - heapLimit <= 0 - ) { - return null; - } - return usedHeapSize / heapLimit; - } catch { - return null; - } - } - setSystemInstruction(sysInstr: string) { this.generationConfig.systemInstruction = sysInstr; } @@ -701,7 +637,7 @@ export class GeminiChat { // Add user content to history ONCE before any attempts. this.history.push(userContent); userContentAdded = true; - requestContents = this.getHistory(true); + requestContents = this.getRequestHistory(); } catch (error) { if (userContentAdded) { this.history.pop(); @@ -866,7 +802,7 @@ export class GeminiChat { reactiveInfo.compressionStatus === CompressionStatus.COMPRESSED ) { - requestContents = self.getHistory(true); + requestContents = self.getRequestHistory(); debugLogger.info( `Reactive compression succeeded: ` + `${reactiveInfo.originalTokenCount} -> ` + @@ -1070,7 +1006,7 @@ export class GeminiChat { // model's continuation appends to the previous partial output. yield { type: StreamEventType.RETRY, isContinuation: true }; // Re-send with the updated history (includes partial + recovery) - const recoveryContents = self.getHistory(true); + const recoveryContents = self.getRequestHistory(); escalatedFinishReason = undefined; try { const recoveryStream = await self.makeApiCallAndProcessStream( @@ -1237,6 +1173,29 @@ export class GeminiChat { return structuredClone(history.slice(-count)); } + /** + * Returns a shallow copy of the history and each entry's parts array without + * cloning large part payloads. Use only for read-only consumers or consumers + * that replace touched entries before mutating them. + */ + getHistoryShallow(curated: boolean = false): Content[] { + const history = curated + ? extractCuratedHistory(this.history) + : this.history; + return history.map(copyContentContainer); + } + + /** + * Shallow tail variant for hot paths that only need recent history. + */ + getHistoryTailShallow(count: number, curated: boolean = false): Content[] { + if (count <= 0) return []; + const history = curated + ? extractCuratedHistory(this.history) + : this.history; + return history.slice(-count).map(copyContentContainer); + } + /** * Returns a defensive copy of the last raw history entry without cloning the * full conversation. This avoids O(history) cloning, though cloning the last @@ -1246,6 +1205,35 @@ export class GeminiChat { return this.getHistoryTail(1)[0]; } + /** + * Returns the last raw history entry for read-only checks. Callers must not + * mutate the returned object. + */ + peekLastHistoryEntry(): Content | undefined { + return this.history.at(-1); + } + + /** + * Returns concatenated text from the last model entry without cloning the + * full history. Used by stop hooks, where only the latest assistant text is + * needed. + */ + getLastModelMessageText(): string | undefined { + for (let i = this.history.length - 1; i >= 0; i--) { + const message = this.history[i]; + if (message?.role !== 'model') continue; + const text = + message.parts + ?.filter( + (part): part is { text: string } => typeof part.text === 'string', + ) + .map((part) => part.text) + .join('') ?? ''; + return text || undefined; + } + return undefined; + } + /** * Returns the number of entries in the raw chat history. O(1) and * does not clone — use this when you only need the count and would diff --git a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts index 059104d5c6..cfd16be7f2 100644 --- a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts +++ b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts @@ -44,6 +44,7 @@ import { openaiRequestCaptureContext } from '../openaiContentGenerator/requestCa import type { RequestContext } from '../openaiContentGenerator/types.js'; import { OpenAILogger } from '../../utils/openaiLogger.js'; import { createDebugLogger } from '../../utils/debugLogger.js'; +import { runtimeDiagnostics } from '../../utils/runtimeDiagnostics.js'; import { getErrorMessage, getErrorStatus, @@ -226,6 +227,10 @@ export class LoggingContentGenerator implements ContentGenerator { const isInternal = isInternalPromptId(userPromptId); const session = this.startCaptureSession(); try { + runtimeDiagnostics.recordGenerateContentRequest(req, { + stream: false, + source: 'generateContent', + }); if (!isInternal) { addSystemPromptAttributes( this.config, @@ -336,6 +341,10 @@ export class LoggingContentGenerator implements ContentGenerator { let stream: AsyncGenerator; try { + runtimeDiagnostics.recordGenerateContentRequest(req, { + stream: true, + source: 'generateContentStream', + }); if (!isInternal) { addSystemPromptAttributes( this.config, diff --git a/packages/core/src/core/openaiContentGenerator/pipeline.ts b/packages/core/src/core/openaiContentGenerator/pipeline.ts index c814527d61..605ab4b45d 100644 --- a/packages/core/src/core/openaiContentGenerator/pipeline.ts +++ b/packages/core/src/core/openaiContentGenerator/pipeline.ts @@ -18,6 +18,7 @@ import { StreamingToolCallParser } from './streamingToolCallParser.js'; import { TaggedThinkingParser } from './taggedThinkingParser.js'; import type { PipelineConfig, RequestContext } from './types.js'; import { redactProxyError } from '../../utils/runtimeFetchOptions.js'; +import { runtimeDiagnostics } from '../../utils/runtimeDiagnostics.js'; /** * The OpenAI SDK adds an abort listener for every `chat.completions.create` @@ -515,6 +516,7 @@ export class ContentGenerationPipeline { // provider enhancement, post disable-reasoning) and before the SDK call // so the logger sees the exact bytes sent on the wire. openaiRequestCaptureContext.getStore()?.(openaiRequest); + runtimeDiagnostics.recordOpenAIWireRequest(openaiRequest); const result = await executor(openaiRequest, context); return result; diff --git a/packages/core/src/index.ts b/packages/core/src/index.ts index 8e69bacd66..73acc7e3ce 100644 --- a/packages/core/src/index.ts +++ b/packages/core/src/index.ts @@ -183,7 +183,11 @@ export * from './memory/writeContextFile.js'; export * from './ide/ide-client.js'; export * from './ide/ideContext.js'; export * from './ide/ide-installer.js'; -export { IDE_DEFINITIONS, type IdeInfo } from './ide/detect-ide.js'; +export { + detectIdeFromEnv, + IDE_DEFINITIONS, + type IdeInfo, +} from './ide/detect-ide.js'; export * from './ide/constants.js'; export * from './ide/types.js'; @@ -285,6 +289,7 @@ export * from './utils/errorParsing.js'; export * from './utils/errors.js'; export * from './utils/fileUtils.js'; export * from './utils/filesearch/fileSearch.js'; +export * as crawlCache from './utils/filesearch/crawlCache.js'; export { Ignore, loadIgnoreRules, @@ -301,6 +306,7 @@ export * from './utils/jsonl-utils.js'; export * from './utils/memoryDiagnostics.js'; export * from './utils/memoryDiscovery.js'; export * from './utils/modelId.js'; +export * from './utils/runtimeDiagnostics.js'; export { ConditionalRulesRegistry } from './utils/rulesDiscovery.js'; export type { RuleFile } from './utils/rulesDiscovery.js'; export { diff --git a/packages/core/src/services/chatCompressionService.test.ts b/packages/core/src/services/chatCompressionService.test.ts index 3aa349863e..e42d6e80d4 100644 --- a/packages/core/src/services/chatCompressionService.test.ts +++ b/packages/core/src/services/chatCompressionService.test.ts @@ -389,6 +389,9 @@ describe('ChatCompressionService', () => { service = new ChatCompressionService(); mockChat = { getHistory: vi.fn(), + getHistoryShallow: vi.fn((curated?: boolean) => + mockChat.getHistory(curated), + ), appendSystemInstruction: vi.fn(), } as unknown as GeminiChat; mockGetHookSystem = vi.fn().mockReturnValue({}); @@ -463,88 +466,6 @@ describe('ChatCompressionService', () => { expect(result.newHistory).toBeNull(); }); - it('should bypass the token threshold when requested without force=true', async () => { - const history: Content[] = [ - { role: 'user', parts: [{ text: 'msg1' }] }, - { role: 'model', parts: [{ text: 'msg2' }] }, - { role: 'user', parts: [{ text: 'msg3' }] }, - { role: 'model', parts: [{ text: 'msg4' }] }, - ]; - vi.mocked(mockChat.getHistory).mockReturnValue(history); - vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(100); - vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({ - model: 'gemini-pro', - contextWindowSize: 1000, - } as unknown as ReturnType); - - const mockGenerateContent = vi.fn().mockResolvedValue({ - text: 'Summary', - usage: { - promptTokenCount: 1100, - candidatesTokenCount: 50, - totalTokenCount: 1150, - }, - }); - vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({ - generateText: mockGenerateContent, - } as unknown as BaseLlmClient); - - const result = await service.compress(mockChat, { - promptId: mockPromptId, - force: false, - bypassTokenThreshold: true, - model: mockModel, - config: mockConfig, - hasFailedCompressionAttempt: false, - originalTokenCount: uiTelemetryService.getLastPromptTokenCount(), - }); - - expect(result.info.compressionStatus).toBe(CompressionStatus.COMPRESSED); - expect(result.newHistory).not.toBeNull(); - expect(mockGenerateContent).toHaveBeenCalled(); - }); - - it('should bypass the failed-attempt latch when heap pressure requests compaction', async () => { - const history: Content[] = [ - { role: 'user', parts: [{ text: 'msg1' }] }, - { role: 'model', parts: [{ text: 'msg2' }] }, - { role: 'user', parts: [{ text: 'msg3' }] }, - { role: 'model', parts: [{ text: 'msg4' }] }, - ]; - vi.mocked(mockChat.getHistory).mockReturnValue(history); - vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(100); - vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({ - model: 'gemini-pro', - contextWindowSize: 1000, - } as unknown as ReturnType); - - const mockGenerateContent = vi.fn().mockResolvedValue({ - text: 'Summary', - usage: { - promptTokenCount: 1100, - candidatesTokenCount: 50, - totalTokenCount: 1150, - }, - }); - vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({ - generateText: mockGenerateContent, - } as unknown as BaseLlmClient); - - const result = await service.compress(mockChat, { - promptId: mockPromptId, - force: false, - bypassTokenThreshold: true, - model: mockModel, - config: mockConfig, - hasFailedCompressionAttempt: true, - originalTokenCount: uiTelemetryService.getLastPromptTokenCount(), - }); - - expect(result.info.compressionStatus).toBe(CompressionStatus.COMPRESSED); - expect(result.newHistory).not.toBeNull(); - expect(mockGenerateContent).toHaveBeenCalled(); - }); - it('should return NOOP when contextPercentageThreshold is 0', async () => { const history: Content[] = [ { role: 'user', parts: [{ text: 'msg1' }] }, @@ -595,41 +516,6 @@ describe('ChatCompressionService', () => { expect(tokenLimit).not.toHaveBeenCalled(); }); - it('should return NOOP when contextPercentageThreshold is 0 even with token threshold bypass', async () => { - const history: Content[] = [ - { role: 'user', parts: [{ text: 'msg1' }] }, - { role: 'model', parts: [{ text: 'msg2' }] }, - ]; - vi.mocked(mockChat.getHistory).mockReturnValue(history); - vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(800); - vi.mocked(mockConfig.getChatCompression).mockReturnValue({ - contextPercentageThreshold: 0, - }); - - const mockGenerateContent = vi.fn(); - vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({ - generateText: mockGenerateContent, - } as unknown as BaseLlmClient); - - const result = await service.compress(mockChat, { - promptId: mockPromptId, - force: false, - bypassTokenThreshold: true, - model: mockModel, - config: mockConfig, - hasFailedCompressionAttempt: false, - originalTokenCount: uiTelemetryService.getLastPromptTokenCount(), - }); - - expect(result.info).toMatchObject({ - compressionStatus: CompressionStatus.NOOP, - originalTokenCount: 0, - newTokenCount: 0, - }); - expect(mockGenerateContent).not.toHaveBeenCalled(); - expect(tokenLimit).not.toHaveBeenCalled(); - }); - it('should return NOOP when historyToCompress is below MIN_COMPRESSION_FRACTION of total', async () => { // Construct a history where the split point lands on the 2nd regular user // message (index 2), but indices 0-1 are tiny relative to the huge content @@ -715,6 +601,72 @@ describe('ChatCompressionService', () => { expect(mockGetHookSystem).toHaveBeenCalled(); }); + it('does not deep-clone full history while compressing', async () => { + const largeToolOutput = 'x'.repeat(1024 * 1024); + const history: Content[] = [ + { role: 'user', parts: [{ text: 'review this PR' }] }, + { + role: 'model', + parts: [ + { + functionCall: { + id: 'read-1', + name: 'read_file', + args: { path: 'large.ts' }, + }, + }, + ], + }, + { + role: 'user', + parts: [ + { + functionResponse: { + id: 'read-1', + name: 'read_file', + response: { output: largeToolOutput }, + }, + }, + ], + }, + { role: 'model', parts: [{ text: 'analysis' }] }, + ]; + vi.mocked(mockChat.getHistory).mockImplementation(() => { + throw new Error('getHistory should not be called by compression'); + }); + vi.mocked(mockChat.getHistoryShallow).mockReturnValue(history); + vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({ + model: 'gemini-pro', + contextWindowSize: 1000, + } as unknown as ReturnType); + vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(800); + + const mockGenerateContent = vi.fn().mockResolvedValue({ + text: 'Summary', + usage: { + promptTokenCount: 1600, + candidatesTokenCount: 50, + totalTokenCount: 1650, + }, + }); + vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({ + generateText: mockGenerateContent, + } as unknown as BaseLlmClient); + + const result = await service.compress(mockChat, { + promptId: mockPromptId, + force: false, + model: mockModel, + config: mockConfig, + hasFailedCompressionAttempt: false, + originalTokenCount: uiTelemetryService.getLastPromptTokenCount(), + }); + + expect(result.info.compressionStatus).toBe(CompressionStatus.COMPRESSED); + expect(mockChat.getHistory).not.toHaveBeenCalled(); + expect(mockChat.getHistoryShallow).toHaveBeenCalledWith(true); + }); + it('should force compress even if under threshold', async () => { const history: Content[] = [ { role: 'user', parts: [{ text: 'msg1' }] }, diff --git a/packages/core/src/services/chatCompressionService.ts b/packages/core/src/services/chatCompressionService.ts index f704ee10fe..97934d819e 100644 --- a/packages/core/src/services/chatCompressionService.ts +++ b/packages/core/src/services/chatCompressionService.ts @@ -181,14 +181,6 @@ export interface CompressOptions { * the service does not read or write any global telemetry. */ originalTokenCount: number; - /** - * Bypass the token-count threshold gate and the failed-attempt latch while - * preserving automatic compaction semantics. Used for temporary heap-pressure - * relief where `force=true` would be too broad because it means manual - * `/compress`. The heap-pressure check that sets this lives in - * `GeminiChat.tryCompress()`. - */ - bypassTokenThreshold?: boolean; /** * Hook trigger to report for this compression. `force=true` bypasses the * threshold gate but does not always mean the user manually requested @@ -210,7 +202,6 @@ export class ChatCompressionService { config, hasFailedCompressionAttempt, originalTokenCount, - bypassTokenThreshold = false, trigger, signal, } = opts; @@ -221,13 +212,7 @@ export class ChatCompressionService { COMPRESSION_TOKEN_THRESHOLD; const slimmingConfig = resolveSlimmingConfig(chatCompressionSettings); - // Cheap gates first — these don't need the curated history. Heap-pressure - // bypass must also bypass the failed-attempt latch, otherwise one failed - // compression would disable this safety net for the rest of the chat. - if ( - threshold <= 0 || - (hasFailedCompressionAttempt && !force && !bypassTokenThreshold) - ) { + if (threshold <= 0 || (hasFailedCompressionAttempt && !force)) { return { newHistory: null, info: { @@ -238,10 +223,7 @@ export class ChatCompressionService { }; } - // Don't compress if not forced and we are under the token limit. This is - // the steady-state path on every send; heap pressure may bypass it because - // the JS heap can become the limiting resource before token count does. - if (!force && !bypassTokenThreshold) { + if (!force) { const contextLimit = config.getContentGeneratorConfig()?.contextWindowSize ?? DEFAULT_TOKEN_LIMIT; @@ -257,7 +239,12 @@ export class ChatCompressionService { } } - const curatedHistory = chat.getHistory(true); + // Compression only reads the existing history while deciding the split and + // preparing the side-query payload. Avoid `getHistory(true)` here: long + // tool-heavy sessions can make a defensive deep clone larger than the + // remaining V8 heap headroom at exactly the moment compaction is trying to + // reduce memory pressure. + const curatedHistory = chat.getHistoryShallow(true); if (curatedHistory.length === 0) { return { newHistory: null, diff --git a/packages/core/src/services/sessionService.test.ts b/packages/core/src/services/sessionService.test.ts index 24e5942587..83c574bfee 100644 --- a/packages/core/src/services/sessionService.test.ts +++ b/packages/core/src/services/sessionService.test.ts @@ -947,6 +947,57 @@ describe('SessionService', () => { expect(history).toEqual([recordA1.message, assistantA1.message]); }); + it('does not deep-clone stored messages when rebuilding resume API history', () => { + const largePayload = { + output: 'x'.repeat(128 * 1024), + nested: { keep: true }, + }; + const toolResult: ChatRecord = { + uuid: 'large-tool-result', + parentUuid: recordA1.uuid, + sessionId: sessionIdA, + timestamp: '2024-01-01T00:02:00Z', + type: 'tool_result', + message: { + role: 'user', + parts: [ + { + functionResponse: { + id: 'call-1', + name: 'read_file', + response: largePayload, + }, + }, + ], + }, + cwd: '/test/project/root', + version: '1.0.0', + }; + const conversation: ConversationRecord = { + sessionId: sessionIdA, + projectHash: 'test-project-hash', + startTime: '2024-01-01T00:00:00Z', + lastUpdated: '2024-01-01T00:02:00Z', + messages: [recordA1, toolResult], + }; + const structuredCloneSpy = vi + .spyOn(globalThis, 'structuredClone') + .mockImplementation(() => { + throw new Error('unexpected deep clone'); + }); + + const history = buildApiHistoryFromConversation(conversation); + + expect(structuredCloneSpy).not.toHaveBeenCalled(); + expect(history).toEqual([recordA1.message, toolResult.message]); + expect(history[1]).not.toBe(toolResult.message); + expect(history[1].parts).not.toBe(toolResult.message!.parts); + const response = history[1].parts![0] as { + functionResponse: { response: typeof largePayload }; + }; + expect(response.functionResponse.response).toBe(largePayload); + }); + it('merges mid-turn user messages into the preceding tool result on resume', () => { const assistantWithToolCall: ChatRecord = { uuid: 'a2', diff --git a/packages/core/src/services/sessionService.ts b/packages/core/src/services/sessionService.ts index 3ccf2152fa..ffd0d0c721 100644 --- a/packages/core/src/services/sessionService.ts +++ b/packages/core/src/services/sessionService.ts @@ -1191,10 +1191,38 @@ function stripThoughtsFromContent(content: Content): Content | null { }; } +function copyContentForApiHistory(content: Content): Content { + return { + ...content, + parts: content.parts?.map((part) => { + if ('functionCall' in part && part.functionCall) { + return { + ...part, + functionCall: { + ...part.functionCall, + args: part.functionCall.args + ? { ...part.functionCall.args } + : part.functionCall.args, + }, + }; + } + if ('functionResponse' in part && part.functionResponse) { + return { + ...part, + functionResponse: { + ...part.functionResponse, + }, + }; + } + return { ...part }; + }), + }; +} + function appendApiHistoryRecord(history: Content[], record: ChatRecord): void { if (!record.message) return; - const message = structuredClone(record.message as Content); + const message = copyContentForApiHistory(record.message as Content); if (record.subtype === 'mid_turn_user_message') { const previous = history.at(-1); if (previous?.role === 'user') { @@ -1240,7 +1268,9 @@ export function buildApiHistoryFromConversation( }); if (compressedHistory && lastCompressionIndex >= 0) { - const baseHistory: Content[] = structuredClone(compressedHistory); + const baseHistory: Content[] = compressedHistory.map( + copyContentForApiHistory, + ); // Append everything after the compression record (newer turns) for (let i = lastCompressionIndex + 1; i < messages.length; i++) { diff --git a/packages/core/src/tools/agent/agent.ts b/packages/core/src/tools/agent/agent.ts index ba871d3c4a..05f8cc2bd3 100644 --- a/packages/core/src/tools/agent/agent.ts +++ b/packages/core/src/tools/agent/agent.ts @@ -960,7 +960,10 @@ class AgentToolInvocation extends BaseToolInvocation { toolConfig: ToolConfig; }> { const geminiClient = this.config.getGeminiClient(); - const rawHistory = geminiClient ? geminiClient.getHistory(true) : []; + const rawHistory = geminiClient + ? (geminiClient.getHistoryShallow?.(true) ?? + geminiClient.getHistory(true)) + : []; // Build the history that will seed the fork's chat. Must end with a // model message so agent-headless can send the task_prompt as a user diff --git a/packages/core/src/utils/forkedAgent.ts b/packages/core/src/utils/forkedAgent.ts index c9c56ef936..f8b13ebf43 100644 --- a/packages/core/src/utils/forkedAgent.ts +++ b/packages/core/src/utils/forkedAgent.ts @@ -66,7 +66,7 @@ import { export interface CacheSafeParams { /** Full generation config including systemInstruction and tools */ generationConfig: GenerateContentConfig; - /** Curated conversation history (deep clone) */ + /** Curated conversation history (shallow copy; consumers must not mutate) */ history: Content[]; /** Model identifier */ model: string; diff --git a/packages/core/src/utils/memoryDiagnostics.test.ts b/packages/core/src/utils/memoryDiagnostics.test.ts index 0e7c3de4a0..5a1daa24e3 100644 --- a/packages/core/src/utils/memoryDiagnostics.test.ts +++ b/packages/core/src/utils/memoryDiagnostics.test.ts @@ -83,6 +83,9 @@ describe('collectMemoryDiagnostics', () => { activeRequests: () => 3, openFileDescriptors: async () => 501, smapsRollup: async () => 'Rss: 5000 kB', + processTree: async () => { + throw new Error('not available'); + }, platform: 'linux', nodeVersion: 'v20.19.0', }); @@ -117,10 +120,13 @@ describe('collectMemoryDiagnostics', () => { }, ], resourceUsage: { - maxRSS: 6, + maxRSS: 6 * 1024, + maxRSSRaw: 6, + maxRSSUnit: 'KiB', userCPUTime: 10, systemCPUTime: 20, }, + processTree: null, activeHandles: 300, activeRequests: 3, openFileDescriptors: 501, @@ -226,7 +232,7 @@ describe('collectMemoryDiagnostics', () => { ); }); - it('treats maxRSS as bytes on all platforms', async () => { + it('normalizes resourceUsage maxRSS from KiB to bytes', async () => { const diagnostics = await collectMemoryDiagnostics({ memoryUsage: () => ({ heapUsed: 100, @@ -273,8 +279,70 @@ describe('collectMemoryDiagnostics', () => { nodeVersion: 'v20.19.0', }); - // Node.js >=14.10.0 returns maxRSS in bytes on all platforms. - expect(diagnostics.resourceUsage.maxRSS).toBe(4_096); + expect(diagnostics.resourceUsage.maxRSS).toBe(4_096 * 1024); + expect(diagnostics.resourceUsage.maxRSSRaw).toBe(4_096); + expect(diagnostics.resourceUsage.maxRSSUnit).toBe('KiB'); + }); + + it('includes process tree RSS when the optional probe is available', async () => { + const diagnostics = await collectMemoryDiagnostics({ + memoryUsage: () => ({ + heapUsed: 100, + heapTotal: 200, + rss: 300, + external: 10, + arrayBuffers: 5, + }), + heapStatistics: () => ({ + heap_size_limit: 1_000, + total_heap_size: 200, + total_heap_size_executable: 0, + total_physical_size: 200, + used_heap_size: 100, + malloced_memory: 0, + peak_malloced_memory: 0, + does_zap_garbage: 0, + number_of_native_contexts: 1, + number_of_detached_contexts: 0, + total_available_size: 900, + total_global_handles_size: 0, + used_global_handles_size: 0, + external_memory: 10, + }), + resourceUsage: () => ({ + userCPUTime: 10, + systemCPUTime: 20, + maxRSS: 4_096, + sharedMemorySize: 0, + unsharedDataSize: 0, + unsharedStackSize: 0, + minorPageFault: 0, + majorPageFault: 0, + swappedOut: 0, + fsRead: 0, + fsWrite: 0, + ipcSent: 0, + ipcReceived: 0, + signalsCount: 0, + voluntaryContextSwitches: 0, + involuntaryContextSwitches: 0, + }), + processTree: async () => ({ + rootPid: 123, + processCount: 3, + rootRSS: 10 * 1024 * 1024, + treeRSS: 25 * 1024 * 1024, + }), + platform: 'darwin', + nodeVersion: 'v20.19.0', + }); + + expect(diagnostics.processTree).toEqual({ + rootPid: 123, + processCount: 3, + rootRSS: 10 * 1024 * 1024, + treeRSS: 25 * 1024 * 1024, + }); }); it('treats unsupported optional probes as unavailable instead of failing', async () => { diff --git a/packages/core/src/utils/memoryDiagnostics.ts b/packages/core/src/utils/memoryDiagnostics.ts index 2ebe3c88e6..e5f3718b61 100644 --- a/packages/core/src/utils/memoryDiagnostics.ts +++ b/packages/core/src/utils/memoryDiagnostics.ts @@ -5,7 +5,9 @@ */ import { readdir, readFile } from 'node:fs/promises'; +import { execFile } from 'node:child_process'; import process from 'node:process'; +import { promisify } from 'node:util'; import v8 from 'node:v8'; import { createDebugLogger } from './debugLogger.js'; import { formatMemoryUsage } from './formatters.js'; @@ -20,6 +22,7 @@ const ACTIVE_HANDLES_THRESHOLD = 256; const ACTIVE_REQUESTS_THRESHOLD = 100; const OPEN_FD_THRESHOLD = 500; const debugLogger = createDebugLogger('MEMORY_DIAGNOSTICS'); +const execFileAsync = promisify(execFile); export interface MemoryDiagnostics { timestamp: string; @@ -30,6 +33,7 @@ export interface MemoryDiagnostics { v8HeapStats: V8HeapStats; v8HeapSpaces: V8HeapSpaceStats[] | null; resourceUsage: MemoryResourceUsage; + processTree: ProcessTreeMemoryUsage | null; activeHandles: number; activeRequests: number; openFileDescriptors: number | null; @@ -57,11 +61,21 @@ export interface V8HeapSpaceStats { } export interface MemoryResourceUsage { + /** Normalized bytes. Node/resourceUsage reports maxRSS in KiB. */ maxRSS: number; + maxRSSRaw: number; + maxRSSUnit: 'KiB'; userCPUTime: number; systemCPUTime: number; } +export interface ProcessTreeMemoryUsage { + rootPid: number; + processCount: number; + rootRSS: number; + treeRSS: number; +} + export interface MemoryDiagnosticsAnalysis { risks: MemoryRisk[]; recommendation: string; @@ -92,6 +106,7 @@ export interface MemoryDiagnosticsOptions { activeRequests?: () => number; openFileDescriptors?: () => Promise; smapsRollup?: () => Promise; + processTree?: () => Promise; platform?: NodeJS.Platform; nodeVersion?: string; } @@ -114,7 +129,7 @@ export async function collectMemoryDiagnostics( const heapStatistics = options.heapStatistics?.() ?? v8.getHeapStatistics(); const resourceUsage = options.resourceUsage?.() ?? process.resourceUsage(); const uptimeSeconds = options.uptimeSeconds?.() ?? process.uptime(); - const [openFileDescriptors, smapsRollup, heapSpaceStatistics] = + const [openFileDescriptors, smapsRollup, heapSpaceStatistics, processTree] = await Promise.all([ optionalProbe( 'openFileDescriptors', @@ -125,12 +140,15 @@ export async function collectMemoryDiagnostics( 'heapSpaceStatistics', options.heapSpaceStatistics ?? (() => v8.getHeapSpaceStatistics()), ), + optionalProbe( + 'processTree', + options.processTree ?? (() => collectProcessTreeMemoryUsage(platform)), + ), ]); const v8HeapSpaces = mapHeapSpaces(heapSpaceStatistics); - // Node.js >=14.10.0 returns maxRSS in bytes on all platforms. - // This project requires Node >=22. - const maxRSSBytes = resourceUsage.maxRSS; + const maxRSSRaw = resourceUsage.maxRSS; + const maxRSSBytes = normalizeMaxRSSBytes(maxRSSRaw); const diagnostics = { timestamp: now().toISOString(), @@ -142,9 +160,12 @@ export async function collectMemoryDiagnostics( v8HeapSpaces, resourceUsage: { maxRSS: maxRSSBytes, + maxRSSRaw, + maxRSSUnit: 'KiB' as const, userCPUTime: resourceUsage.userCPUTime, systemCPUTime: resourceUsage.systemCPUTime, }, + processTree, activeHandles: getProcessInternalCount( 'activeHandles', '_getActiveHandles', @@ -167,6 +188,10 @@ export async function collectMemoryDiagnostics( }; } +function normalizeMaxRSSBytes(maxRSSKiB: number): number { + return maxRSSKiB * 1024; +} + function mapHeapStats(heapInfo: v8.HeapInfo): V8HeapStats { return { heapSizeLimit: heapInfo.heap_size_limit, @@ -233,6 +258,85 @@ async function readProcSmapsRollup(): Promise { return readFile('/proc/self/smaps_rollup', 'utf8'); } +async function collectProcessTreeMemoryUsage( + platform: NodeJS.Platform, +): Promise { + if (platform === 'win32') { + throw new Error('process tree RSS probe is unavailable on win32'); + } + + const { stdout } = await execFileAsync('ps', ['-axo', 'pid=,ppid=,rss='], { + maxBuffer: 1024 * 1024, + timeout: 5000, + }); + const rows = parsePsRows(stdout); + const rootPid = process.pid; + const rowsByPid = new Map(rows.map((row) => [row.pid, row])); + const childrenByParent = new Map(); + for (const row of rows) { + const children = childrenByParent.get(row.ppid); + if (children) { + children.push(row); + } else { + childrenByParent.set(row.ppid, [row]); + } + } + + const queue = [rootPid]; + const seen = new Set(); + let rootRSS = 0; + let treeRSS = 0; + let processCount = 0; + while (queue.length > 0) { + const pid = queue.shift()!; + if (seen.has(pid)) { + continue; + } + seen.add(pid); + const row = rowsByPid.get(pid); + if (row) { + const rssBytes = row.rssKiB * 1024; + if (pid === rootPid) { + rootRSS = rssBytes; + } + treeRSS += rssBytes; + processCount += 1; + } + for (const child of childrenByParent.get(pid) ?? []) { + queue.push(child.pid); + } + } + + return { + rootPid, + processCount, + rootRSS, + treeRSS, + }; +} + +interface PsRow { + pid: number; + ppid: number; + rssKiB: number; +} + +function parsePsRows(output: string): PsRow[] { + return output + .trim() + .split(/\r?\n/) + .map((line) => { + const [pid, ppid, rssKiB] = line.trim().split(/\s+/).map(Number); + return { pid, ppid, rssKiB }; + }) + .filter( + (row) => + Number.isFinite(row.pid) && + Number.isFinite(row.ppid) && + Number.isFinite(row.rssKiB), + ); +} + async function optionalProbe( name: string, probe: () => Promise, diff --git a/packages/core/src/utils/nextSpeakerChecker.test.ts b/packages/core/src/utils/nextSpeakerChecker.test.ts index 5ccb9dd434..451f38ee94 100644 --- a/packages/core/src/utils/nextSpeakerChecker.test.ts +++ b/packages/core/src/utils/nextSpeakerChecker.test.ts @@ -88,6 +88,7 @@ describe('checkNextSpeaker', () => { // Spy on getHistory for chatInstance vi.spyOn(chatInstance, 'getHistory'); + vi.spyOn(chatInstance, 'getHistoryTail'); vi.spyOn(chatInstance, 'getLastHistoryEntry'); }); @@ -97,6 +98,9 @@ describe('checkNextSpeaker', () => { function mockChatHistory(history: Content[]): void { vi.mocked(chatInstance.getHistory).mockReturnValue(history); + vi.mocked(chatInstance.getHistoryTail).mockReturnValue( + history.length > 0 ? [structuredClone(history[history.length - 1]!)] : [], + ); vi.mocked(chatInstance.getLastHistoryEntry).mockReturnValue( history.length > 0 ? structuredClone(history[history.length - 1]!) @@ -279,8 +283,36 @@ describe('checkNextSpeaker', () => { expect(generateJsonCall[0].promptId).toBe(promptId); }); + it('should send only the last curated model message to the side query', async () => { + const oldHistory: Content[] = [ + { role: 'user', parts: [{ text: 'old user context'.repeat(1000) }] }, + { role: 'model', parts: [{ text: 'old model context'.repeat(1000) }] }, + ]; + const lastModelMessage: Content = { + role: 'model', + parts: [{ text: 'Some model output.' }], + }; + mockChatHistory([...oldHistory, lastModelMessage]); + (mockBaseLlmClient.generateJson as Mock).mockResolvedValue({ + reasoning: 'Model made a statement, awaiting user input.', + next_speaker: 'user', + } satisfies NextSpeakerResponse); + + await checkNextSpeaker(chatInstance, mockConfig, abortSignal, promptId); + + const generateJsonCall = (mockBaseLlmClient.generateJson as Mock).mock + .calls[0]; + expect(generateJsonCall[0].contents).toHaveLength(2); + expect(generateJsonCall[0].contents[0]).toEqual(lastModelMessage); + expect(generateJsonCall[0].contents[1]).toMatchObject({ + role: 'user', + }); + expect(chatInstance.getHistory).not.toHaveBeenCalled(); + expect(chatInstance.getHistoryTail).toHaveBeenCalledWith(1, true); + }); + it('should use raw last history entry to detect function responses', async () => { - vi.mocked(chatInstance.getHistory).mockReturnValue([ + vi.mocked(chatInstance.getHistoryTail).mockReturnValue([ { role: 'model', parts: [{ functionCall: { name: 'read_file', args: {} } }], @@ -310,7 +342,8 @@ describe('checkNextSpeaker', () => { 'The last message was a function response, so the model should speak next.', next_speaker: 'model', }); - expect(chatInstance.getHistory).toHaveBeenCalledWith(true); + expect(chatInstance.getHistory).not.toHaveBeenCalled(); + expect(chatInstance.getHistoryTail).not.toHaveBeenCalled(); expect(chatInstance.getLastHistoryEntry).toHaveBeenCalledTimes(1); expect(mockBaseLlmClient.generateJson).not.toHaveBeenCalled(); }); @@ -327,8 +360,9 @@ describe('checkNextSpeaker', () => { await checkNextSpeaker(chatInstance, mockConfig, abortSignal, promptId); - expect(chatInstance.getHistory).toHaveBeenCalledTimes(1); - expect(chatInstance.getHistory).toHaveBeenCalledWith(true); + expect(chatInstance.getHistory).not.toHaveBeenCalled(); + expect(chatInstance.getHistoryTail).toHaveBeenCalledTimes(1); + expect(chatInstance.getHistoryTail).toHaveBeenCalledWith(1, true); expect(chatInstance.getLastHistoryEntry).toHaveBeenCalledTimes(1); }); }); diff --git a/packages/core/src/utils/nextSpeakerChecker.ts b/packages/core/src/utils/nextSpeakerChecker.ts index c36a8eb9ac..33b4e2f06c 100644 --- a/packages/core/src/utils/nextSpeakerChecker.ts +++ b/packages/core/src/utils/nextSpeakerChecker.ts @@ -48,23 +48,9 @@ export async function checkNextSpeaker( abortSignal: AbortSignal, promptId: string, ): Promise { - // We need to capture the curated history because there are many moments when the model will return invalid turns - // that when passed back up to the endpoint will break subsequent calls. An example of this is when the model decides - // to respond with an empty part collection if you were to send that message back to the server it will respond with - // a 400 indicating that model part collections MUST have content. - const curatedHistory = chat.getHistory(/* curated */ true); - - // Ensure there's a model response to analyze - if (curatedHistory.length === 0) { - // Cannot determine next speaker if history is empty. - return null; - } - // Read the last raw history entry by design: functionResponse turns can be // stripped from curated history, but they are decisive for next-speaker flow. const lastComprehensiveMessage = chat.getLastHistoryEntry(); - // Raw history can still be empty even if the curated-history guard above is - // the normal empty-chat path, so keep this defensive check local. if (!lastComprehensiveMessage) { return null; } @@ -94,7 +80,10 @@ export async function checkNextSpeaker( // Things checked out. Let's proceed to potentially making an LLM request. - const lastMessage = curatedHistory[curatedHistory.length - 1]; + // The next-speaker prompt only analyzes the immediately preceding response. + // Keep the side query and its structuredClone cost bounded to that one + // curated message rather than cloning and sending the entire chat history. + const [lastMessage] = chat.getHistoryTail(1, /* curated */ true); if (!lastMessage || lastMessage.role !== 'model') { // Cannot determine next speaker if the last turn wasn't from the model // or if history is empty. @@ -102,7 +91,7 @@ export async function checkNextSpeaker( } const contents: Content[] = [ - ...curatedHistory, + lastMessage, { role: 'user', parts: [{ text: CHECK_PROMPT }] }, ]; diff --git a/packages/core/src/utils/runtimeDiagnostics.test.ts b/packages/core/src/utils/runtimeDiagnostics.test.ts new file mode 100644 index 0000000000..cff3de3c33 --- /dev/null +++ b/packages/core/src/utils/runtimeDiagnostics.test.ts @@ -0,0 +1,237 @@ +/** + * @license + * Copyright 2026 Qwen + * SPDX-License-Identifier: Apache-2.0 + */ + +import { describe, expect, it } from 'vitest'; +import type { GenerateContentParameters } from '@google/genai'; +import { + RuntimeDiagnosticsCollector, + summarizeAnthropicWireRequest, + summarizeOpenAIWireRequest, +} from './runtimeDiagnostics.js'; + +describe('RuntimeDiagnosticsCollector', () => { + it('summarizes generate-content requests without retaining prompt text or tool args', () => { + const collector = new RuntimeDiagnosticsCollector({ + enabled: true, + now: () => '2026-05-19T00:00:00.000Z', + }); + const request = { + model: 'diagnostic-model', + contents: [ + { + role: 'user', + parts: [{ text: 'secret user prompt' }], + }, + { + role: 'user', + parts: [ + { + functionResponse: { + id: 'tool-1', + name: 'read_file', + response: { output: 'secret tool output' }, + }, + }, + ], + }, + ], + config: { + systemInstruction: { parts: [{ text: 'secret system prompt' }] }, + tools: [ + { + functionDeclarations: [ + { + name: 'read_file', + description: 'Read file', + parametersJsonSchema: { + type: 'object', + properties: { path: { type: 'string' } }, + }, + }, + ], + }, + ], + }, + } satisfies GenerateContentParameters; + + collector.recordGenerateContentRequest(request, { + stream: true, + source: 'generateContentStream', + }); + + const snapshot = collector.snapshot(); + expect(snapshot.requests).toHaveLength(1); + expect(snapshot.requests[0]).toMatchObject({ + index: 1, + source: 'generateContentStream', + model: 'diagnostic-model', + stream: true, + contents: { + count: 2, + roleCounts: { user: 2 }, + partCount: 2, + textBytes: Buffer.byteLength('secret user prompt'), + functionResponseCount: 1, + functionResponseBytes: expect.any(Number), + }, + systemInstructionBytes: Buffer.byteLength('secret system prompt'), + tools: { + count: 1, + functionDeclarationCount: 1, + schemaBytes: expect.any(Number), + }, + }); + expect(JSON.stringify(snapshot)).not.toContain('secret user prompt'); + expect(JSON.stringify(snapshot)).not.toContain('secret tool output'); + expect(JSON.stringify(snapshot)).not.toContain('secret system prompt'); + }); + + it('summarizes OpenAI wire requests by size and role only', () => { + const summary = summarizeOpenAIWireRequest({ + model: 'wire-model', + stream: true, + messages: [ + { role: 'system', content: 'secret system' }, + { role: 'user', content: [{ type: 'text', text: 'secret user' }] }, + ], + tools: [ + { + type: 'function', + function: { + name: 'run_shell_command', + description: 'Run shell command', + parameters: { + type: 'object', + properties: { command: { type: 'string' } }, + }, + }, + }, + ], + }); + + expect(summary).toMatchObject({ + model: 'wire-model', + stream: true, + messageCount: 2, + messageBytesByRole: { + system: Buffer.byteLength('secret system'), + user: expect.any(Number), + }, + toolsCount: 1, + toolSchemaBytes: expect.any(Number), + bodyBytes: expect.any(Number), + topLevelKeys: ['messages', 'model', 'stream', 'tools'], + }); + expect(JSON.stringify(summary)).not.toContain('secret system'); + expect(JSON.stringify(summary)).not.toContain('secret user'); + }); + + it('summarizes Anthropic wire requests by size and role only', () => { + const summary = summarizeAnthropicWireRequest({ + model: 'anthropic-wire-model', + stream: true, + system: [{ type: 'text', text: 'secret system' }], + messages: [ + { role: 'user', content: 'secret user' }, + { + role: 'assistant', + content: [ + { + type: 'tool_use', + id: 'tool-1', + name: 'run_shell_command', + input: { command: 'secret command' }, + }, + ], + }, + ], + tools: [ + { + name: 'run_shell_command', + description: 'Run shell command', + input_schema: { + type: 'object', + properties: { command: { type: 'string' } }, + }, + }, + ], + max_tokens: 1024, + }); + + expect(summary).toMatchObject({ + model: 'anthropic-wire-model', + stream: true, + messageCount: 2, + messageBytesByRole: { + user: Buffer.byteLength('secret user'), + assistant: expect.any(Number), + }, + systemBytes: expect.any(Number), + toolsCount: 1, + toolSchemaBytes: expect.any(Number), + bodyBytes: expect.any(Number), + topLevelKeys: [ + 'max_tokens', + 'messages', + 'model', + 'stream', + 'system', + 'tools', + ], + }); + expect(JSON.stringify(summary)).not.toContain('secret system'); + expect(JSON.stringify(summary)).not.toContain('secret user'); + expect(JSON.stringify(summary)).not.toContain('secret command'); + }); + + it('aggregates tool use and tool result sizes without retaining payloads', () => { + const collector = new RuntimeDiagnosticsCollector({ enabled: true }); + + collector.recordToolUse('read_file', { path: '/private/path.txt' }); + collector.recordToolResult({ + name: 'read_file', + callId: 'tool-1', + resultBytes: 2048, + isError: false, + }); + collector.recordToolResult({ + name: 'run_shell_command', + callId: 'tool-2', + resultBytes: 512, + isError: true, + }); + + const snapshot = collector.snapshot(); + expect(snapshot.tools).toMatchObject({ + toolUseCount: 1, + toolResultCount: 2, + toolResultErrorCount: 1, + totalToolUseArgBytes: expect.any(Number), + maxToolUseArgBytes: expect.any(Number), + totalToolResultBytes: 2560, + maxToolResultBytes: 2048, + byName: { + read_file: { + uses: 1, + argBytes: expect.any(Number), + maxArgBytes: expect.any(Number), + results: 1, + errors: 0, + resultBytes: 2048, + maxResultBytes: 2048, + }, + run_shell_command: { + uses: 0, + results: 1, + errors: 1, + resultBytes: 512, + maxResultBytes: 512, + }, + }, + }); + expect(JSON.stringify(snapshot)).not.toContain('/private/path.txt'); + }); +}); diff --git a/packages/core/src/utils/runtimeDiagnostics.ts b/packages/core/src/utils/runtimeDiagnostics.ts new file mode 100644 index 0000000000..74f367bba9 --- /dev/null +++ b/packages/core/src/utils/runtimeDiagnostics.ts @@ -0,0 +1,557 @@ +/** + * @license + * Copyright 2026 Qwen + * SPDX-License-Identifier: Apache-2.0 + */ + +import type { GenerateContentParameters } from '@google/genai'; +import type Anthropic from '@anthropic-ai/sdk'; +import type OpenAI from 'openai'; + +export interface RuntimeDiagnosticsSnapshot { + enabled: boolean; + startedAt: string; + requests: GenerateContentRequestDiagnostics[]; + openaiWireRequests: OpenAIWireRequestDiagnostics[]; + anthropicWireRequests: AnthropicWireRequestDiagnostics[]; + tools: RuntimeToolDiagnostics; +} + +export interface GenerateContentRequestDiagnostics { + index: number; + timestamp: string; + source: 'generateContent' | 'generateContentStream'; + model: string; + stream: boolean; + serializedBytes: number; + contents: RuntimeContentDiagnostics; + systemInstructionBytes: number; + generationConfigBytes: number; + tools: RuntimeToolSchemaDiagnostics; +} + +export interface RuntimeContentDiagnostics { + count: number; + roleCounts: Record; + partCount: number; + textBytes: number; + functionCallCount: number; + functionCallArgBytes: number; + functionResponseCount: number; + functionResponseBytes: number; + inlineDataCount: number; + inlineDataBytes: number; + fileDataCount: number; +} + +export interface RuntimeToolSchemaDiagnostics { + count: number; + functionDeclarationCount: number; + schemaBytes: number; +} + +export interface OpenAIWireRequestDiagnostics { + index?: number; + timestamp?: string; + model: string; + stream: boolean; + bodyBytes: number; + messageCount: number; + messageBytesByRole: Record; + toolsCount: number; + toolSchemaBytes: number; + topLevelKeys: string[]; +} + +export interface AnthropicWireRequestDiagnostics { + index?: number; + timestamp?: string; + model: string; + stream: boolean; + bodyBytes: number; + messageCount: number; + messageBytesByRole: Record; + systemBytes: number; + toolsCount: number; + toolSchemaBytes: number; + topLevelKeys: string[]; +} + +export interface RuntimeToolDiagnostics { + toolUseCount: number; + toolResultCount: number; + toolResultErrorCount: number; + totalToolUseArgBytes: number; + maxToolUseArgBytes: number; + totalToolResultBytes: number; + maxToolResultBytes: number; + byName: Record; +} + +export interface RuntimeToolNameDiagnostics { + uses: number; + argBytes: number; + maxArgBytes: number; + results: number; + errors: number; + resultBytes: number; + maxResultBytes: number; +} + +export interface RuntimeToolResultRecord { + name: string; + callId: string; + resultBytes: number; + isError: boolean; +} + +export interface RuntimeDiagnosticsCollectorOptions { + enabled?: boolean; + now?: () => string; +} + +const RUNTIME_PROFILE_ENV = 'QWEN_CODE_PROFILE_RUNTIME'; + +export function isRuntimeDiagnosticsEnabled( + env: NodeJS.ProcessEnv = process.env, +): boolean { + return env[RUNTIME_PROFILE_ENV] === '1'; +} + +export class RuntimeDiagnosticsCollector { + private enabled: boolean; + private readonly now: () => string; + private startedAt: string; + private requestIndex = 0; + private openAIWireRequestIndex = 0; + private anthropicWireRequestIndex = 0; + private requests: GenerateContentRequestDiagnostics[] = []; + private openaiWireRequests: OpenAIWireRequestDiagnostics[] = []; + private anthropicWireRequests: AnthropicWireRequestDiagnostics[] = []; + private tools: RuntimeToolDiagnostics = createInitialToolDiagnostics(); + + constructor(options: RuntimeDiagnosticsCollectorOptions = {}) { + this.enabled = options.enabled ?? isRuntimeDiagnosticsEnabled(); + this.now = options.now ?? (() => new Date().toISOString()); + this.startedAt = this.now(); + } + + reset(options: { enabled?: boolean } = {}): void { + this.enabled = options.enabled ?? isRuntimeDiagnosticsEnabled(); + this.startedAt = this.now(); + this.requestIndex = 0; + this.openAIWireRequestIndex = 0; + this.anthropicWireRequestIndex = 0; + this.requests = []; + this.openaiWireRequests = []; + this.anthropicWireRequests = []; + this.tools = createInitialToolDiagnostics(); + } + + isEnabled(): boolean { + return this.enabled; + } + + recordGenerateContentRequest( + request: GenerateContentParameters, + options: { + stream: boolean; + source: 'generateContent' | 'generateContentStream'; + }, + ): void { + if (!this.enabled) { + return; + } + + this.requestIndex += 1; + this.requests.push({ + index: this.requestIndex, + timestamp: this.now(), + source: options.source, + model: request.model, + stream: options.stream, + serializedBytes: utf8Bytes(toJsonSafeRequest(request)), + contents: summarizeContents(request.contents), + systemInstructionBytes: summarizeContentTextBytes( + request.config?.systemInstruction, + ), + generationConfigBytes: utf8Bytes(toJsonSafeConfig(request.config)), + tools: summarizeToolSchemas(request.config?.tools), + }); + } + + recordOpenAIWireRequest( + request: OpenAI.Chat.ChatCompletionCreateParams, + ): void { + if (!this.enabled) { + return; + } + + this.openAIWireRequestIndex += 1; + this.openaiWireRequests.push({ + index: this.openAIWireRequestIndex, + timestamp: this.now(), + ...summarizeOpenAIWireRequest(request), + }); + } + + recordAnthropicWireRequest( + request: + | Anthropic.MessageCreateParamsNonStreaming + | Anthropic.MessageCreateParamsStreaming, + ): void { + if (!this.enabled) { + return; + } + + this.anthropicWireRequestIndex += 1; + this.anthropicWireRequests.push({ + index: this.anthropicWireRequestIndex, + timestamp: this.now(), + ...summarizeAnthropicWireRequest(request), + }); + } + + recordToolUse(name: string, args: unknown): void { + if (!this.enabled) { + return; + } + + const argBytes = utf8Bytes(args); + const tool = this.getToolNameDiagnostics(name); + tool.uses += 1; + tool.argBytes += argBytes; + tool.maxArgBytes = Math.max(tool.maxArgBytes, argBytes); + this.tools.toolUseCount += 1; + this.tools.totalToolUseArgBytes += argBytes; + this.tools.maxToolUseArgBytes = Math.max( + this.tools.maxToolUseArgBytes, + argBytes, + ); + } + + recordToolResult(record: RuntimeToolResultRecord): void { + if (!this.enabled) { + return; + } + + const tool = this.getToolNameDiagnostics(record.name); + tool.results += 1; + tool.resultBytes += record.resultBytes; + tool.maxResultBytes = Math.max(tool.maxResultBytes, record.resultBytes); + if (record.isError) { + tool.errors += 1; + this.tools.toolResultErrorCount += 1; + } + this.tools.toolResultCount += 1; + this.tools.totalToolResultBytes += record.resultBytes; + this.tools.maxToolResultBytes = Math.max( + this.tools.maxToolResultBytes, + record.resultBytes, + ); + } + + snapshot(): RuntimeDiagnosticsSnapshot { + return { + enabled: this.enabled, + startedAt: this.startedAt, + requests: this.requests.map((request) => ({ + ...request, + contents: { + ...request.contents, + roleCounts: { ...request.contents.roleCounts }, + }, + tools: { ...request.tools }, + })), + openaiWireRequests: this.openaiWireRequests.map((request) => ({ + ...request, + messageBytesByRole: { ...request.messageBytesByRole }, + topLevelKeys: [...request.topLevelKeys], + })), + anthropicWireRequests: this.anthropicWireRequests.map((request) => ({ + ...request, + messageBytesByRole: { ...request.messageBytesByRole }, + topLevelKeys: [...request.topLevelKeys], + })), + tools: { + ...this.tools, + byName: Object.fromEntries( + Object.entries(this.tools.byName).map(([name, value]) => [ + name, + { ...value }, + ]), + ), + }, + }; + } + + private getToolNameDiagnostics(name: string): RuntimeToolNameDiagnostics { + const existing = this.tools.byName[name]; + if (existing) { + return existing; + } + const created = createInitialToolNameDiagnostics(); + this.tools.byName[name] = created; + return created; + } +} + +export const runtimeDiagnostics = new RuntimeDiagnosticsCollector(); + +export function summarizeOpenAIWireRequest( + request: OpenAI.Chat.ChatCompletionCreateParams, +): OpenAIWireRequestDiagnostics { + const requestRecord = asRecord(request); + const messages = Array.isArray(requestRecord['messages']) + ? requestRecord['messages'] + : []; + const tools = Array.isArray(requestRecord['tools']) + ? requestRecord['tools'] + : []; + const messageBytesByRole: Record = {}; + for (const message of messages) { + const messageRecord = asRecord(message); + const role = + typeof messageRecord['role'] === 'string' + ? messageRecord['role'] + : 'unknown'; + messageBytesByRole[role] = + (messageBytesByRole[role] ?? 0) + utf8Bytes(messageRecord['content']); + } + + return { + model: + typeof requestRecord['model'] === 'string' + ? requestRecord['model'] + : 'unknown', + stream: requestRecord['stream'] === true, + bodyBytes: utf8Bytes(request), + messageCount: messages.length, + messageBytesByRole, + toolsCount: tools.length, + toolSchemaBytes: utf8Bytes(tools), + topLevelKeys: Object.keys(requestRecord).sort(), + }; +} + +export function summarizeAnthropicWireRequest( + request: + | Anthropic.MessageCreateParamsNonStreaming + | Anthropic.MessageCreateParamsStreaming, +): AnthropicWireRequestDiagnostics { + const requestRecord = asRecord(request); + const messages = Array.isArray(requestRecord['messages']) + ? requestRecord['messages'] + : []; + const tools = Array.isArray(requestRecord['tools']) + ? requestRecord['tools'] + : []; + const messageBytesByRole: Record = {}; + for (const message of messages) { + const messageRecord = asRecord(message); + const role = + typeof messageRecord['role'] === 'string' + ? messageRecord['role'] + : 'unknown'; + messageBytesByRole[role] = + (messageBytesByRole[role] ?? 0) + utf8Bytes(messageRecord['content']); + } + + return { + model: + typeof requestRecord['model'] === 'string' + ? requestRecord['model'] + : 'unknown', + stream: requestRecord['stream'] === true, + bodyBytes: utf8Bytes(request), + messageCount: messages.length, + messageBytesByRole, + systemBytes: utf8Bytes(requestRecord['system']), + toolsCount: tools.length, + toolSchemaBytes: utf8Bytes(tools), + topLevelKeys: Object.keys(requestRecord).sort(), + }; +} + +function createInitialToolDiagnostics(): RuntimeToolDiagnostics { + return { + toolUseCount: 0, + toolResultCount: 0, + toolResultErrorCount: 0, + totalToolUseArgBytes: 0, + maxToolUseArgBytes: 0, + totalToolResultBytes: 0, + maxToolResultBytes: 0, + byName: Object.create(null) as Record, + }; +} + +function createInitialToolNameDiagnostics(): RuntimeToolNameDiagnostics { + return { + uses: 0, + argBytes: 0, + maxArgBytes: 0, + results: 0, + errors: 0, + resultBytes: 0, + maxResultBytes: 0, + }; +} + +function summarizeContents(contents: unknown): RuntimeContentDiagnostics { + const summary: RuntimeContentDiagnostics = { + count: 0, + roleCounts: {}, + partCount: 0, + textBytes: 0, + functionCallCount: 0, + functionCallArgBytes: 0, + functionResponseCount: 0, + functionResponseBytes: 0, + inlineDataCount: 0, + inlineDataBytes: 0, + fileDataCount: 0, + }; + const contentItems = Array.isArray(contents) + ? contents + : contents === undefined || contents === null + ? [] + : [contents]; + + for (const content of contentItems) { + summary.count += 1; + if (typeof content === 'string') { + summary.roleCounts['user'] = (summary.roleCounts['user'] ?? 0) + 1; + summary.partCount += 1; + summary.textBytes += utf8Bytes(content); + continue; + } + + const contentRecord = asRecord(content); + const role = + typeof contentRecord['role'] === 'string' + ? contentRecord['role'] + : 'unknown'; + summary.roleCounts[role] = (summary.roleCounts[role] ?? 0) + 1; + const parts = Array.isArray(contentRecord['parts']) + ? contentRecord['parts'] + : []; + summarizeParts(parts, summary); + } + + return summary; +} + +function summarizeContentTextBytes(content: unknown): number { + const summary = summarizeContents(content); + return summary.textBytes; +} + +function summarizeParts( + parts: unknown[], + summary: RuntimeContentDiagnostics, +): void { + for (const part of parts) { + summary.partCount += 1; + if (typeof part === 'string') { + summary.textBytes += utf8Bytes(part); + continue; + } + const partRecord = asRecord(part); + if (typeof partRecord['text'] === 'string') { + summary.textBytes += utf8Bytes(partRecord['text']); + } + const functionCall = asOptionalRecord(partRecord['functionCall']); + if (functionCall) { + summary.functionCallCount += 1; + summary.functionCallArgBytes += utf8Bytes(functionCall['args']); + } + const functionResponse = asOptionalRecord(partRecord['functionResponse']); + if (functionResponse) { + summary.functionResponseCount += 1; + summary.functionResponseBytes += + utf8Bytes(functionResponse['response']) + + utf8Bytes(functionResponse['parts']); + } + const inlineData = asOptionalRecord(partRecord['inlineData']); + if (inlineData) { + summary.inlineDataCount += 1; + summary.inlineDataBytes += utf8Bytes(inlineData['data']); + } + if (partRecord['fileData']) { + summary.fileDataCount += 1; + } + } +} + +function summarizeToolSchemas(tools: unknown): RuntimeToolSchemaDiagnostics { + const toolList = Array.isArray(tools) ? tools : []; + let functionDeclarationCount = 0; + for (const tool of toolList) { + const toolRecord = asRecord(tool); + const declarations = Array.isArray(toolRecord['functionDeclarations']) + ? toolRecord['functionDeclarations'] + : []; + functionDeclarationCount += declarations.length; + } + return { + count: toolList.length, + functionDeclarationCount, + schemaBytes: utf8Bytes(toolList), + }; +} + +function toJsonSafeRequest(request: GenerateContentParameters): unknown { + return { + model: request.model, + contents: request.contents, + config: toJsonSafeConfig(request.config), + }; +} + +function toJsonSafeConfig( + config: GenerateContentParameters['config'], +): unknown { + if (!config) { + return undefined; + } + const configRecord = asRecord(config); + const safeConfig: Record = {}; + for (const [key, value] of Object.entries(configRecord)) { + if (key === 'abortSignal') { + continue; + } + safeConfig[key] = value; + } + return safeConfig; +} + +function utf8Bytes(value: unknown): number { + if (value === undefined || value === null) { + return 0; + } + if (typeof value === 'string') { + return Buffer.byteLength(value, 'utf8'); + } + return Buffer.byteLength(safeStringify(value), 'utf8'); +} + +function safeStringify(value: unknown): string { + try { + return JSON.stringify(value) ?? ''; + } catch { + return '[unserializable]'; + } +} + +function asRecord(value: unknown): Record { + return typeof value === 'object' && value !== null + ? (value as Record) + : {}; +} + +function asOptionalRecord(value: unknown): Record | null { + return typeof value === 'object' && value !== null + ? (value as Record) + : null; +} diff --git a/packages/vscode-ide-companion/src/diff-manager.ts b/packages/vscode-ide-companion/src/diff-manager.ts index 755143a4e4..ccabe3657e 100644 --- a/packages/vscode-ide-companion/src/diff-manager.ts +++ b/packages/vscode-ide-companion/src/diff-manager.ts @@ -7,7 +7,7 @@ import { IdeDiffAcceptedNotificationSchema, IdeDiffClosedNotificationSchema, -} from '@qwen-code/qwen-code-core/src/ide/types.js'; +} from '@qwen-code/qwen-code-core'; import { type JSONRPCNotification } from '@modelcontextprotocol/sdk/types.js'; import * as path from 'node:path'; import * as vscode from 'vscode'; diff --git a/packages/vscode-ide-companion/src/extension.test.ts b/packages/vscode-ide-companion/src/extension.test.ts index 72c3d476ef..d22062515d 100644 --- a/packages/vscode-ide-companion/src/extension.test.ts +++ b/packages/vscode-ide-companion/src/extension.test.ts @@ -7,18 +7,14 @@ import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import * as vscode from 'vscode'; import { activate } from './extension.js'; -import { - IDE_DEFINITIONS, - detectIdeFromEnv, -} from '@qwen-code/qwen-code-core/src/ide/detect-ide.js'; - -vi.mock('@qwen-code/qwen-code-core/src/ide/detect-ide.js', async () => { - const actual = await vi.importActual( - '@qwen-code/qwen-code-core/src/ide/detect-ide.js', - ); +import { IDE_DEFINITIONS, detectIdeFromEnv } from '@qwen-code/qwen-code-core'; + +vi.mock('@qwen-code/qwen-code-core', async (importOriginal) => { + const actual = + await importOriginal(); return { ...actual, - detectIdeFromEnv: vi.fn(() => IDE_DEFINITIONS.vscode), + detectIdeFromEnv: vi.fn(() => actual.IDE_DEFINITIONS.vscode), }; }); diff --git a/packages/vscode-ide-companion/src/extension.ts b/packages/vscode-ide-companion/src/extension.ts index 3f83a67942..56c441af61 100644 --- a/packages/vscode-ide-companion/src/extension.ts +++ b/packages/vscode-ide-companion/src/extension.ts @@ -13,7 +13,7 @@ import { detectIdeFromEnv, IDE_DEFINITIONS, type IdeInfo, -} from '@qwen-code/qwen-code-core/src/ide/detect-ide.js'; +} from '@qwen-code/qwen-code-core'; import { WebViewProvider } from './webview/providers/WebViewProvider.js'; import { ChatProviderRegistry } from './webview/providers/ChatProviderRegistry.js'; import { registerChatViewProviders } from './webview/providers/chatViewRegistration.js'; diff --git a/packages/vscode-ide-companion/src/ide-server.test.ts b/packages/vscode-ide-companion/src/ide-server.test.ts index 9c51d50215..ee99ce105e 100644 --- a/packages/vscode-ide-companion/src/ide-server.test.ts +++ b/packages/vscode-ide-companion/src/ide-server.test.ts @@ -38,9 +38,17 @@ vi.mock('node:os', async (importOriginal) => { }; }); -vi.mock('@qwen-code/qwen-code-core/src/ide/detect-ide.js', () => ({ - detectIdeFromEnv: vi.fn(() => ({ name: 'vscode', displayName: 'VS Code' })), -})); +vi.mock('@qwen-code/qwen-code-core', async (importOriginal) => { + const actual = + await importOriginal(); + return { + ...actual, + detectIdeFromEnv: vi.fn(() => ({ + name: 'vscode', + displayName: 'VS Code', + })), + }; +}); const vscodeMock = vi.hoisted(() => ({ workspace: { @@ -62,13 +70,6 @@ const vscodeMock = vi.hoisted(() => ({ vi.mock('vscode', () => vscodeMock); -vi.mock('@qwen-code/qwen-code-core/src/ide/detect-ide.js', () => ({ - detectIdeFromEnv: vi.fn(() => ({ - name: 'vscode', - displayName: 'VS Code', - })), -})); - vi.mock('./open-files-manager', () => { const OpenFilesManager = vi.fn(); OpenFilesManager.prototype.onDidChange = vi.fn(() => ({ dispose: vi.fn() })); diff --git a/packages/vscode-ide-companion/src/ide-server.ts b/packages/vscode-ide-companion/src/ide-server.ts index 1122677b76..2f19fbbc92 100644 --- a/packages/vscode-ide-companion/src/ide-server.ts +++ b/packages/vscode-ide-companion/src/ide-server.ts @@ -7,10 +7,10 @@ import * as vscode from 'vscode'; import { CloseDiffRequestSchema, + detectIdeFromEnv, IdeContextNotificationSchema, OpenDiffRequestSchema, -} from '@qwen-code/qwen-code-core/src/ide/types.js'; -import { detectIdeFromEnv } from '@qwen-code/qwen-code-core/src/ide/detect-ide.js'; +} from '@qwen-code/qwen-code-core'; import { isInitializeRequest } from '@modelcontextprotocol/sdk/types.js'; import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js'; diff --git a/packages/vscode-ide-companion/src/open-files-manager.ts b/packages/vscode-ide-companion/src/open-files-manager.ts index ee7f595e18..30c9029ac8 100644 --- a/packages/vscode-ide-companion/src/open-files-manager.ts +++ b/packages/vscode-ide-companion/src/open-files-manager.ts @@ -5,10 +5,7 @@ */ import * as vscode from 'vscode'; -import type { - File, - IdeContext, -} from '@qwen-code/qwen-code-core/src/ide/types.js'; +import type { File, IdeContext } from '@qwen-code/qwen-code-core'; import { isFileUri, isNotebookFileUri, diff --git a/packages/vscode-ide-companion/src/services/open-files-manager/notebook-handler.ts b/packages/vscode-ide-companion/src/services/open-files-manager/notebook-handler.ts index 40e6637446..64907fe315 100644 --- a/packages/vscode-ide-companion/src/services/open-files-manager/notebook-handler.ts +++ b/packages/vscode-ide-companion/src/services/open-files-manager/notebook-handler.ts @@ -5,7 +5,7 @@ */ import * as vscode from 'vscode'; -import type { File } from '@qwen-code/qwen-code-core/src/ide/types.js'; +import type { File } from '@qwen-code/qwen-code-core'; import { MAX_FILES, MAX_SELECTED_TEXT_LENGTH } from './constants.js'; import { deactivateCurrentActiveFile, diff --git a/packages/vscode-ide-companion/src/services/open-files-manager/text-handler.ts b/packages/vscode-ide-companion/src/services/open-files-manager/text-handler.ts index 88853f31bf..a1e7dda5b4 100644 --- a/packages/vscode-ide-companion/src/services/open-files-manager/text-handler.ts +++ b/packages/vscode-ide-companion/src/services/open-files-manager/text-handler.ts @@ -5,7 +5,7 @@ */ import type * as vscode from 'vscode'; -import type { File } from '@qwen-code/qwen-code-core/src/ide/types.js'; +import type { File } from '@qwen-code/qwen-code-core'; import { MAX_FILES, MAX_SELECTED_TEXT_LENGTH } from './constants.js'; import { deactivateCurrentActiveFile, diff --git a/packages/vscode-ide-companion/src/services/open-files-manager/utils.ts b/packages/vscode-ide-companion/src/services/open-files-manager/utils.ts index dd4b46126a..ea59ccdbd7 100644 --- a/packages/vscode-ide-companion/src/services/open-files-manager/utils.ts +++ b/packages/vscode-ide-companion/src/services/open-files-manager/utils.ts @@ -5,7 +5,7 @@ */ import * as vscode from 'vscode'; -import type { File } from '@qwen-code/qwen-code-core/src/ide/types.js'; +import type { File } from '@qwen-code/qwen-code-core'; export function isFileUri(uri: vscode.Uri): boolean { return uri.scheme === 'file'; diff --git a/packages/vscode-ide-companion/src/services/qwenSessionManager.ts b/packages/vscode-ide-companion/src/services/qwenSessionManager.ts index a39a37ebed..34a2f1349a 100644 --- a/packages/vscode-ide-companion/src/services/qwenSessionManager.ts +++ b/packages/vscode-ide-companion/src/services/qwenSessionManager.ts @@ -7,7 +7,7 @@ import * as fs from 'fs'; import * as path from 'path'; import * as crypto from 'crypto'; -import { getProjectHash } from '@qwen-code/qwen-code-core/src/utils/paths.js'; +import { getProjectHash } from '@qwen-code/qwen-code-core'; import { getRuntimeBaseDir } from '../utils/paths.js'; import type { QwenSession } from './qwenSessionReader.js'; diff --git a/packages/vscode-ide-companion/src/services/qwenSessionReader.ts b/packages/vscode-ide-companion/src/services/qwenSessionReader.ts index 1b15598f97..abfdb126e0 100644 --- a/packages/vscode-ide-companion/src/services/qwenSessionReader.ts +++ b/packages/vscode-ide-companion/src/services/qwenSessionReader.ts @@ -8,8 +8,7 @@ import * as fs from 'fs'; import * as path from 'path'; import * as readline from 'readline'; import * as crypto from 'crypto'; -import { getProjectHash } from '@qwen-code/qwen-code-core/src/utils/paths.js'; -import { getGitBranch } from '@qwen-code/qwen-code-core/src/utils/gitUtils.js'; +import { getGitBranch, getProjectHash } from '@qwen-code/qwen-code-core'; import { getRuntimeBaseDir } from '../utils/paths.js'; import { truncatePanelTitle } from '../webview/utils/panelTitleUtils.js'; diff --git a/packages/vscode-ide-companion/src/utils/acpModelInfo.ts b/packages/vscode-ide-companion/src/utils/acpModelInfo.ts index 53d14c5bcf..120873f705 100644 --- a/packages/vscode-ide-companion/src/utils/acpModelInfo.ts +++ b/packages/vscode-ide-companion/src/utils/acpModelInfo.ts @@ -5,7 +5,7 @@ */ import type { ModelInfo } from '@agentclientprotocol/sdk'; -import { knownTokenLimit } from '@qwen-code/qwen-code-core/src/core/tokenLimits.js'; +import { knownTokenLimit } from '@qwen-code/qwen-code-core'; import type { ApprovalModeValue } from '../types/approvalModeValueTypes.js'; type AcpMeta = Record; diff --git a/packages/vscode-ide-companion/src/utils/editorGroupUtils.ts b/packages/vscode-ide-companion/src/utils/editorGroupUtils.ts index 3326cd3368..53575d4884 100644 --- a/packages/vscode-ide-companion/src/utils/editorGroupUtils.ts +++ b/packages/vscode-ide-companion/src/utils/editorGroupUtils.ts @@ -29,7 +29,9 @@ function findNeighborGroup( ): vscode.ViewColumn | undefined { let candidate: vscode.ViewColumn | undefined; for (const g of vscode.window.tabGroups.all) { - if (!isOnSide(g.viewColumn)) continue; + if (!isOnSide(g.viewColumn)) { + continue; + } if (candidate === undefined || isCloser(candidate, g.viewColumn)) { candidate = g.viewColumn; } diff --git a/packages/vscode-ide-companion/src/utils/imageSupport.test.ts b/packages/vscode-ide-companion/src/utils/imageSupport.test.ts index b2b78d0ce5..b7948655d9 100644 --- a/packages/vscode-ide-companion/src/utils/imageSupport.test.ts +++ b/packages/vscode-ide-companion/src/utils/imageSupport.test.ts @@ -5,7 +5,7 @@ */ import { describe, expect, it } from 'vitest'; -import { SUPPORTED_IMAGE_MIME_TYPES } from '@qwen-code/qwen-code-core/src/utils/request-tokenizer/supportedImageFormats.js'; +import { SUPPORTED_IMAGE_MIME_TYPES } from '@qwen-code/qwen-code-core'; import { SUPPORTED_PASTED_IMAGE_MIME_TYPES } from './imageSupport.js'; describe('imageSupport constants', () => { diff --git a/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.test.ts b/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.test.ts index faeaa8f19f..3d16f841a7 100644 --- a/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.test.ts +++ b/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.test.ts @@ -61,24 +61,25 @@ const vscodeMock = vi.hoisted(() => { }); vi.mock('vscode', () => vscodeMock); -vi.mock( - '@qwen-code/qwen-code-core/src/services/fileDiscoveryService.js', - () => ({ +vi.mock('@qwen-code/qwen-code-core', async (importOriginal) => { + const actual = + await importOriginal(); + return { + ...actual, FileDiscoveryService: class { shouldIgnoreFile(filePath: string, options?: unknown) { return shouldIgnoreFileMock(filePath, options); } }, - }), -); -vi.mock('@qwen-code/qwen-code-core/src/utils/filesearch/fileSearch.js', () => ({ - FileSearchFactory: { - create: () => fileSearchMock, - }, -})); -vi.mock('@qwen-code/qwen-code-core/src/utils/filesearch/crawlCache.js', () => ({ - clear: vi.fn(), -})); + FileSearchFactory: { + create: () => fileSearchMock, + }, + crawlCache: { + ...actual.crawlCache, + clear: vi.fn(), + }, + }; +}); const readonlyProviderMock = vi.hoisted(() => ({ createUri: vi.fn(), diff --git a/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts b/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts index 547cd6108a..eaf527a147 100644 --- a/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts +++ b/packages/vscode-ide-companion/src/webview/handlers/FileMessageHandler.ts @@ -13,12 +13,12 @@ import { findRightGroupOfChatWebview, } from '../../utils/editorGroupUtils.js'; import { ReadonlyFileSystemProvider } from '../../services/readonlyFileSystemProvider.js'; -import { FileDiscoveryService } from '@qwen-code/qwen-code-core/src/services/fileDiscoveryService.js'; import { + crawlCache, + FileDiscoveryService, FileSearchFactory, type FileSearch, -} from '@qwen-code/qwen-code-core/src/utils/filesearch/fileSearch.js'; -import * as crawlCache from '@qwen-code/qwen-code-core/src/utils/filesearch/crawlCache.js'; +} from '@qwen-code/qwen-code-core'; import { getErrorMessage } from '../../utils/errorMessage.js'; /**