diff --git a/docs/design/auto-compaction-threshold-redesign.md b/docs/design/auto-compaction-threshold-redesign.md
index 79bd6a8afc..544f5baecd 100644
--- a/docs/design/auto-compaction-threshold-redesign.md
+++ b/docs/design/auto-compaction-threshold-redesign.md
@@ -4,6 +4,8 @@
 
 ## Background
 
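+> This section describes the state **before** this PR landed (pre-redesign behavior). The `COMPRESSION_TOKEN_THRESHOLD`, `thinkingConfig.includeThoughts = true`, `hasFailedCompressionAttempt`, and specific file:line references below all correspond to the code before PR #4345 was merged; after the merge these symbols and line numbers are no longer valid.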
+
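 qwen-code's auto-compaction currently uses a single proportional threshold, `COMPRESSION_TOKEN_THRESHOLD = 0.7` (`chatCompressionService.ts:33`), shared by every window size. Compared with claude-code's "absolute token ladder" (autoCompact.ts:62-65), qwen-code has three concrete problems: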
 
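 1. **Over-reservation on large windows**: on a 1M model the 70% threshold fires at 700K, and the remaining 300K far exceeds the ~33K actually needed for summary + output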
@@ -136,12 +138,22 @@ export interface ChatCompressionSettings {
 
 ### Breaking change handling
 
-On startup, when `Config` loading finds `chatCompression.contextPercentageThreshold` present:
+**User-facing:** On startup, when `Config` loading finds `chatCompression.contextPercentageThreshold` present:
 
 - Writes a one-line warning to stderr: `"chatCompression.contextPercentageThreshold has been removed and is now controlled by built-in thresholds."`
 - Does **not** raise an error and does **not** block startup
 - The field's value is ignored
 
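+**SDK-facing (R5.4):** The `hasFailedCompressionAttempt: boolean` field of `CompressOptions` is renamed to `consecutiveFailures: number`. Three differences:
+
+|           | Old field                                     | New field                                                                                               |
+| --------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
+| Name      | `hasFailedCompressionAttempt`                 | `consecutiveFailures`                                                                                   |
+| Type      | `boolean`                                     | `number`                                                                                                |
+| Semantics | `true` = auto-compact is permanently disabled | `>= MAX_CONSECUTIVE_FAILURES` (default 3) = auto-compact paused until a successful forced run resets it |
+
+The repository has only one internal consumer, `GeminiChat.tryCompress`, so the internal migration risk is low. However, `@qwen-code/qwen-code-core` is a published package and `CompressOptions` is visible in its d.ts, so downstream SDK code that calls `service.compress({ ..., hasFailedCompressionAttempt: true })` directly will get a TS compile error. **Migration guide:** replace `true` with `MAX_CONSECUTIVE_FAILURES` (or any integer >= 3) and `false` with `0`. Callers that maintain their own failure count can pass it in directly.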
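+
+The migration guidance above can be sketched as a small adapter. This is a hypothetical shim for downstream callers, not part of the package: `toConsecutiveFailures` is an illustrative name, and only the two field names and the default of 3 come from this document.
+
+```typescript
+// Mirrors the documented default; the real constant lives in the core package.
+const MAX_CONSECUTIVE_FAILURES = 3;
+
+/** Hypothetical helper: maps the legacy boolean flag onto the new counter. */
+function toConsecutiveFailures(legacy: boolean | number): number {
+  // Callers that already keep their own failure count pass it through unchanged.
+  if (typeof legacy === 'number') return legacy;
+  // `true` used to mean "disabled", so start at the breaker limit; `false` means no failures.
+  return legacy ? MAX_CONSECUTIVE_FAILURES : 0;
+}
+```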
+
 ## Token estimation compensation
 
 qwen-code's `lastPromptTokenCount` comes from `usageMetadata.totalTokenCount` in the previous turn's API response ([geminiChat.ts:1217-1232](packages/core/src/core/geminiChat.ts:1217)). As a result:
@@ -415,4 +427,10 @@ const { warn, auto, hard, effectiveWindow } =
 ## Open questions (pending review)
 
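 1. **Breaking change severity**: warn and ignore the field vs. fail at startup. The current choice is to warn; need to confirm whether this is friendly enough for enterprise deployments / team configs
-2. **On small windows (32K) hard and auto degrade to the same value**: from the user's perspective, should `/context` state explicitly that "hard has degraded on this window"
+
+## Resolved
+
+2. **On small windows (≤ ~76.7K) hard and auto degrade to the same value**: decided **not to surface this in `/context`**. Rationale:
+   - The collapse is not limited to 32K: every window with `effectiveWindow - HARD_BUFFER ≤ 0.7 × window` collapses (64K included)
+   - User-facing behavior is unchanged: on a collapsed window `currentTier` skips `'auto'` and reports `'hard'` directly (`contextCommand.ts:43-44` checks `>= hard` first), so the `context-high` band (`auto ≤ t < hard`) becomes empty. Losing one tier of hints is reasonable on a small window: the window itself is small, and the user is most likely managing context manually
+   - If real users later report that "small windows never show the middle-tier hint", decide then whether to add a UI annotation or adjust the `context-high` trigger condition (that is UI work, not spec work). For now the choice is not to add UI complexity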
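+
+The collapse condition in the first bullet can be checked numerically. A minimal sketch, assuming the constant values from the implementation plan (`SUMMARY_RESERVE = 20_000`, `HARD_BUFFER = 3_000`, proportional threshold 0.7); `hardCollapsesToAuto` and `collapseBoundary` are illustrative names, not symbols in the codebase.
+
+```typescript
+// Values taken from the implementation plan; redefined here so the sketch is self-contained.
+const SUMMARY_RESERVE = 20_000;
+const HARD_BUFFER = 3_000;
+const DEFAULT_PCT = 0.7;
+
+/** True when the absolute hard tier falls at or below the proportional auto tier. */
+function hardCollapsesToAuto(window: number): boolean {
+  const effectiveWindow = window - SUMMARY_RESERVE;
+  return effectiveWindow - HARD_BUFFER <= DEFAULT_PCT * window;
+}
+
+// Solving window - 23_000 <= 0.7 * window for window gives the largest collapsing size.
+const collapseBoundary = (SUMMARY_RESERVE + HARD_BUFFER) / (1 - DEFAULT_PCT);
+```
+
+Under these values both 32K and 64K collapse while 128K does not, and the boundary works out to roughly 76.7K, which is where the bound in the heading comes from.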
diff --git a/docs/plans/2026-05-14-auto-compaction-threshold-redesign.md b/docs/plans/2026-05-14-auto-compaction-threshold-redesign.md
new file mode 100644
index 0000000000..41efc45d78
--- /dev/null
+++ b/docs/plans/2026-05-14-auto-compaction-threshold-redesign.md
@@ -0,0 +1,1752 @@
+# Auto-Compaction Threshold Redesign Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
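+**Goal:** Upgrade qwen-code's auto-compaction from a single proportional threshold (70%) to a hybrid "proportional + absolute" three-tier threshold ladder (warn / auto / hard). In the same effort: cap the compression call itself with `maxOutputTokens`, disable thinking, introduce a failure circuit breaker, fix the lag / first-turn gaps of `lastPromptTokenCount`, and clean up the user-facing config surface.
+
+**Architecture:**
+
+- `chatCompressionService.ts` adds `computeThresholds(window)`, which returns `{ warn, auto, hard, effectiveWindow }`; the cheap-gate uses `auto`, and the `sendMessageStream` entry point adds a proactive hard-tier rescue.
+- A new `tokenEstimation.ts` provides a local char/4 estimator that compensates for the two gaps of `lastPromptTokenCount`: it lags by one turn, and it is 0 on the first turn.
+- Failure handling is upgraded from the one-shot lock `hasFailedCompressionAttempt: boolean` to the three-strike circuit breaker `consecutiveFailures: number`.
+- The compression sideQuery call disables thinking and adds `maxOutputTokens: 20K`.
+- The `chatCompression.contextPercentageThreshold` settings field is removed; on startup, legacy config triggers a stderr warning and is ignored.
+- The three context-\* tips in `tipRegistry.ts` are rewritten to follow the new thresholds; the `/context` command displays the three tier values.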
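+
+The circuit breaker in the third bullet can be pictured as follows. A minimal sketch under stated assumptions: `CompactionBreaker` is a hypothetical class (the plan tracks `consecutiveFailures` in `geminiChat.ts`), and resetting on any successful compaction is an assumption; the design doc only states that a successful forced run resets the counter.
+
+```typescript
+const MAX_CONSECUTIVE_FAILURES = 3; // value from the plan
+
+/** Hypothetical wrapper around the failure counter. */
+class CompactionBreaker {
+  private consecutiveFailures = 0;
+
+  /** Auto-compaction is skipped once the limit is hit; forced runs always proceed. */
+  shouldAttempt(force: boolean): boolean {
+    return force || this.consecutiveFailures < MAX_CONSECUTIVE_FAILURES;
+  }
+
+  /** Assumption: any success clears the counter, any failure increments it. */
+  record(succeeded: boolean): void {
+    this.consecutiveFailures = succeeded ? 0 : this.consecutiveFailures + 1;
+  }
+}
+```
+
+Unlike the old boolean lock, a tripped breaker here is recoverable: one successful forced compaction re-enables the automatic path.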
+
+**Tech Stack:** TypeScript, Vitest, `@google/genai`, and the existing `compactionInputSlimming` estimation utilities.
+
+**Merge order:** P6 → P7 → P1 → P2 → P4 → P3 → P5. Each Task is a candidate for its own PR.
+
+---
+
+## File structure
+
+| Path                                                        | Action        | Responsibility |
+| ----------------------------------------------------------- | ------------- | -------------- |
+| `packages/core/src/services/tokenEstimation.ts`             | Create        | Char-level token estimation + `estimatePromptTokens` entry point |
+| `packages/core/src/services/tokenEstimation.test.ts`        | Create        | Unit tests for the estimation functions |
+| `packages/core/src/services/chatCompressionService.ts`      | Modify        | New constants + `computeThresholds`; cheap-gate change; thinking off + maxOutput; failure-count change |
+| `packages/core/src/services/chatCompressionService.test.ts` | Modify        | computeThresholds unit tests + cheap-gate / sideQuery config assertions |
+| `packages/core/src/core/geminiChat.ts`                      | Modify        | hard check at the `sendMessageStream` entry; `hasFailedCompressionAttempt` → `consecutiveFailures` |
+| `packages/core/src/core/geminiChat.test.ts`                 | Modify        | Integration tests: hard trigger + circuit breaker + first-turn coverage |
+| `packages/core/src/config/config.ts`                        | Modify        | Remove `contextPercentageThreshold` from `ChatCompressionSettings`; startup warning |
+| `packages/cli/src/services/tips/tipRegistry.ts`             | Modify        | The three context-\* tips compare against absolute thresholds; `TipContext` gains `thresholds` |
+| `packages/cli/src/services/tips/tipRegistry.test.ts`        | Create/Modify | Tip trigger-range tests |
+| `packages/cli/src/ui/commands/contextCommand.ts`            | Modify        | Display the new three-tier thresholds |
+| `packages/cli/src/ui/commands/contextCommand.test.ts`       | Modify        | Output snapshot |
+| `packages/cli/src/ui/AppContainer.tsx`                      | Modify        | Inject `thresholds` when constructing `TipContext` |
+
+---
+
+## Phase P6: Disable thinking and add maxOutputTokens on the compression sideQuery
+
+Lands first so that the assumptions behind the later thresholds hold. Standalone PR.
+
+### Task 1: Change the sideQuery call in chatCompressionService
+
+**Files:**
+
+- Modify: `packages/core/src/services/chatCompressionService.ts:374-376`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+Add the spy entry point to the imports at the top of `chatCompressionService.test.ts`, and add the test inside a suitable describe block. `runSideQuery` is already a module export, so it can be spied on:
+
+```ts
+import * as sideQueryModule from '../utils/sideQuery.js';
+
+describe('ChatCompressionService.compress sideQuery config', () => {
+  it('passes maxOutputTokens=20_000 and includeThoughts=false to runSideQuery', async () => {
+    const spy = vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>summary</state_snapshot>',
+      usage: {
+        promptTokenCount: 1000,
+        candidatesTokenCount: 500,
+        totalTokenCount: 1500,
+      },
+    } as any);
+
+    const service = new ChatCompressionService();
+    await service.compress(makeFakeChat(), {
+      promptId: 'p',
+      force: true,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      hasFailedCompressionAttempt: false,
+      originalTokenCount: 180_000,
+    });
+
+    expect(spy).toHaveBeenCalledTimes(1);
+    const callArg = spy.mock.calls[0]![1];
+    expect(callArg.config?.thinkingConfig?.includeThoughts).toBe(false);
+    expect(callArg.config?.maxOutputTokens).toBe(20_000);
+  });
+});
+```
+
+`makeFakeChat` / `makeFakeConfig` reuse the existing test helpers (use them directly if the file already has them; otherwise inline a minimal stub).
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts -t 'passes maxOutputTokens=20_000'
+```
+
+Expected: FAIL. The call currently passes `{ thinkingConfig: { includeThoughts: true } }` and no `maxOutputTokens`.
+
+- [ ] **Step 3: Implement (modify chatCompressionService.ts)**
+
+Replace the whole `config:` block at [chatCompressionService.ts:374-376](packages/core/src/services/chatCompressionService.ts:374):
+
+```ts
+const summaryResult = await runSideQuery(config, {
+  purpose: 'chat-compression',
+  model,
+  maxAttempts: 1,
+  systemInstruction: getCompressionPrompt(),
+  contents: [
+    ...slim.slimmedHistory,
+    {
+      role: 'user',
+      parts: [
+        {
+          text: 'First, reason in your scratchpad. Then, generate the <state_snapshot>.',
+        },
+      ],
+    },
+  ],
+  // Compression output is bounded by maxOutputTokens to guarantee a predictable
+  // reserve across providers (see docs/design/auto-compaction-threshold-redesign.md).
+  // Thinking is disabled because per-provider thinking-budget semantics are
+  // inconsistent (Anthropic/OpenAI count it separately, Gemini varies by model).
+  config: {
+    thinkingConfig: { includeThoughts: false },
+    maxOutputTokens: COMPACT_MAX_OUTPUT_TOKENS,
+  },
+  abortSignal: signal ?? new AbortController().signal,
+  promptId,
+});
+```
+
+In the constants area at the top of the file (right after `TOOL_ROUND_RETAIN_COUNT`), add:
+
+```ts
+/**
+ * Hard cap on the compression sideQuery output (summary text only, since
+ * thinking is disabled). Mirrors claude-code's MAX_OUTPUT_TOKENS_FOR_SUMMARY
+ * (autoCompact.ts:30) which is based on p99.99 of real compaction outputs.
+ */
+export const COMPACT_MAX_OUTPUT_TOKENS = 20_000;
+```
+
+Also clean up the `"may include non-persisted tokens (thoughts)"` comment in the token math section of `compress()` (around lines 436-437). There is no thinking output any more, so change the sentence to "compressionOutputTokenCount reflects the summary tokens only since thinking is disabled".
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts
+```
+
+Expected: PASS (the new test passes and the existing tests do not regress)
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+Expected: no errors.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/services/chatCompressionService.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): cap compression sideQuery output and disable thinking
+
+Add COMPACT_MAX_OUTPUT_TOKENS=20_000 and pass maxOutputTokens to the
+runSideQuery call, disable thinkingConfig.includeThoughts. Aligns with
+claude-code's autoCompact reserve so the downstream threshold ladder
+(P1/P3) can rely on a predictable upper bound on summary output across
+providers (Anthropic / OpenAI / Gemini handle thinking budgets
+inconsistently).
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P7: Token estimation compensation
+
+Fixes the lag / first-turn gaps of `lastPromptTokenCount`. 3 Tasks.
+
+### Task 2: Create the tokenEstimation.ts unit
+
+**Files:**
+
+- Create: `packages/core/src/services/tokenEstimation.ts`
+- Create: `packages/core/src/services/tokenEstimation.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+`packages/core/src/services/tokenEstimation.test.ts`:
+
+```ts
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, it, expect } from 'vitest';
+import type { Content } from '@google/genai';
+import {
+  estimateContentTokens,
+  estimatePromptTokens,
+} from './tokenEstimation.js';
+
+const textContent = (text: string): Content => ({
+  role: 'user',
+  parts: [{ text }],
+});
+
+describe('estimateContentTokens', () => {
+  it('returns 0 for empty array', () => {
+    expect(estimateContentTokens([])).toBe(0);
+  });
+
+  it('estimates plain text at ~chars/4', () => {
+    // "hello world" = 11 chars → ceil(11/4) = 3
+    expect(estimateContentTokens([textContent('hello world')])).toBe(3);
+  });
+
+  it('sums tokens across multiple messages', () => {
+    const a = textContent('aaaa'); // 4/4 = 1
+    const b = textContent('bbbbbbbb'); // 8/4 = 2
+    expect(estimateContentTokens([a, b])).toBe(3);
+  });
+
+  it('estimates inlineData via imageTokenEstimate', () => {
+    const c: Content = {
+      role: 'user',
+      parts: [{ inlineData: { mimeType: 'image/png', data: 'xxx' } }],
+    };
+    expect(estimateContentTokens([c], 1600)).toBe(1600);
+  });
+
+  it('returns a positive estimate for functionCall parts', () => {
+    const c: Content = {
+      role: 'model',
+      parts: [{ functionCall: { name: 'foo', args: { a: 1, b: 2 } } }],
+    };
+    // estimateContentChars stringifies the call; the resulting JSON is short,
+    // so only assert that it contributes a non-zero count. estimateContentTokens
+    // applies the same char/4 division to every part type.
+    const result = estimateContentTokens([c]);
+    expect(result).toBeGreaterThan(0);
+  });
+});
+
+describe('estimatePromptTokens', () => {
+  const history: Content[] = [
+    textContent('older message a'),
+    textContent('older message b'),
+  ];
+  const user = textContent('current user message');
+
+  it('uses lastPromptTokenCount + user-message estimate when count > 0', () => {
+    const userEst = estimateContentTokens([user]);
+    expect(estimatePromptTokens(history, user, 5000)).toBe(5000 + userEst);
+  });
+
+  it('falls back to full estimate when lastPromptTokenCount is 0', () => {
+    const fullEst = estimateContentTokens([...history, user]);
+    expect(estimatePromptTokens(history, user, 0)).toBe(fullEst);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/tokenEstimation.test.ts
+```
+
+Expected: FAIL. `tokenEstimation.ts` has not been created yet.
+
+- [ ] **Step 3: Implement (create tokenEstimation.ts)**
+
+`packages/core/src/services/tokenEstimation.ts`:
+
+```ts
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import type { Content } from '@google/genai';
+import {
+  DEFAULT_IMAGE_TOKEN_ESTIMATE,
+  estimateContentChars,
+} from './compactionInputSlimming.js';
+
+/**
+ * Average characters per token for char-based token estimation. Despite the
+ * name, the divisor is applied to the count from `estimateContentChars`.
+ * Matches claude-code's roughTokenCountEstimation default (tokens.ts).
+ */
+const BYTES_PER_TOKEN = 4;
+
+/**
+ * Estimate the token count of a list of Content objects via char/4.
+ *
+ * Reuses `estimateContentChars` so that inlineData / functionCall /
+ * functionResponse get the same treatment they receive when computing
+ * compression split points — keeping the two estimators in sync prevents
+ * the auto-compaction trigger and the splitter from disagreeing on size.
+ *
+ * Intended for the pre-send threshold gate only. Char/4 is a rough
+ * approximation, not a bound (real tokenizers vary ±30%): an overestimate
+ * only triggers compaction early (a safe false positive), while an
+ * underestimate can skip a compaction that was needed.
+ */
+export function estimateContentTokens(
+  contents: Content[],
+  imageTokenEstimate: number = DEFAULT_IMAGE_TOKEN_ESTIMATE,
+): number {
+  let totalChars = 0;
+  for (const content of contents) {
+    totalChars += estimateContentChars(content, imageTokenEstimate);
+  }
+  return Math.ceil(totalChars / BYTES_PER_TOKEN);
+}
+
+/**
+ * Compute an effective prompt-token count for the auto-compaction gate.
+ *
+ * `lastPromptTokenCount` (from the previous turn's usage metadata) lacks
+ * two things: the current user message, and any initial value on the
+ * very first send. This helper closes both gaps via local estimation.
+ */
+export function estimatePromptTokens(
+  history: Content[],
+  userMessage: Content,
+  lastPromptTokenCount: number,
+  imageTokenEstimate: number = DEFAULT_IMAGE_TOKEN_ESTIMATE,
+): number {
+  if (lastPromptTokenCount > 0) {
+    return (
+      lastPromptTokenCount +
+      estimateContentTokens([userMessage], imageTokenEstimate)
+    );
+  }
+  return estimateContentTokens([...history, userMessage], imageTokenEstimate);
+}
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/tokenEstimation.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/services/tokenEstimation.ts packages/core/src/services/tokenEstimation.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): add token estimation helper for compaction gate
+
+Introduce estimateContentTokens / estimatePromptTokens built on the
+existing estimateContentChars (compactionInputSlimming) divided by a
+char/4 ratio. Will replace raw lastPromptTokenCount usage at the cheap-
+gate and hard-threshold checks so the system can react to (a) the
+current user message and (b) the very first send (where the API-
+reported count is 0).
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+### Task 3: Apply the estimate in the chatCompressionService cheap-gate
+
+**Files:**
+
+- Modify: `packages/core/src/services/chatCompressionService.ts`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+This Task lands before P1, so it uses the **existing** `threshold * contextLimit` formula (70% \* 200K = 140K) and only replaces `originalTokenCount` with `estimatePromptTokens(...)`:
+
+```ts
+import * as sideQueryModule from '../utils/sideQuery.js';
+
+describe('ChatCompressionService.compress cheap-gate uses estimated tokens', () => {
+  it('triggers compaction when API-reported tokens are below threshold but estimated tokens with the pending user message exceed it', async () => {
+    // Current threshold for a 200K window = 0.7 * 200K = 140K
+    // originalTokenCount = 135K (5K short)
+    // user message estimated at ~10K → 145K, crossing 140K
+    const userMessage: Content = {
+      role: 'user',
+      parts: [{ text: 'x'.repeat(40_000) }], // 40K chars ≈ 10K tokens
+    };
+    const chat = makeFakeChat({ historyChars: 500_000 });
+
+    // Mock runSideQuery so the later steps of compress do not blow up
+    vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>x</state_snapshot>',
+      usage: {
+        promptTokenCount: 100,
+        candidatesTokenCount: 50,
+        totalTokenCount: 150,
+      },
+    } as any);
+
+    const result = await new ChatCompressionService().compress(chat, {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      hasFailedCompressionAttempt: false,
+      originalTokenCount: 135_000,
+      pendingUserMessage: userMessage,
+    });
+    expect(result.info.compressionStatus).not.toBe(CompressionStatus.NOOP);
+  });
+
+  it('NOOPs when neither originalTokenCount nor estimated total reaches threshold', async () => {
+    const chat = makeFakeChat();
+    const result = await new ChatCompressionService().compress(chat, {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      hasFailedCompressionAttempt: false,
+      originalTokenCount: 80_000,
+      pendingUserMessage: {
+        role: 'user',
+        parts: [{ text: 'short' }],
+      },
+    });
+    expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
+  });
+});
+```
+
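+`makeFakeChat({ historyChars })` is an inline helper in the test file: it builds a `GeminiChat` stand-in whose `getHistory()` returns a Content array with a total length approximately matching `historyChars` (reuse the helper if the file already has one).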
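+
+If the test file has no such helpers, a minimal stub could look like the sketch below. It is hypothetical: the local `Content` type is a simplified stand-in for the `@google/genai` type, the stubs expose only the two methods the cheap-gate code in this plan calls (`getHistory` and `getContentGeneratorConfig`), and a real test would still need to cast them to `GeminiChat` / `Config`.
+
+```typescript
+// Simplified stand-in for the @google/genai Content type (text parts only).
+type Content = { role: 'user' | 'model'; parts: Array<{ text: string }> };
+
+/** Hypothetical chat stub: history totals roughly `historyChars` characters. */
+function makeFakeChat(opts: { historyChars?: number } = {}) {
+  const chunk = 'x'.repeat(100);
+  const count = Math.ceil((opts.historyChars ?? 0) / chunk.length);
+  const history: Content[] = Array.from(
+    { length: count },
+    (_, i): Content => ({
+      role: i % 2 === 0 ? 'user' : 'model',
+      parts: [{ text: chunk }],
+    }),
+  );
+  // The argument mirrors getHistory(curated) but is ignored by this stub.
+  return { getHistory: (_curated?: boolean): Content[] => [...history] };
+}
+
+/** Hypothetical config stub exposing only the context window size. */
+function makeFakeConfig(opts: { contextWindowSize: number }) {
+  return {
+    getContentGeneratorConfig: () => ({
+      contextWindowSize: opts.contextWindowSize,
+    }),
+  };
+}
+```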
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts -t 'cheap-gate uses estimated tokens'
+```
+
+Expected: FAIL. The current cheap-gate only looks at `originalTokenCount` and returns NOOP.
+
+- [ ] **Step 3: Implement (change the compress() cheap-gate)**
+
+Modify this block at [chatCompressionService.ts:235-249](packages/core/src/services/chatCompressionService.ts:235):
+
+```ts
+// Don't compress if not forced and we are under the limit. This is the
+// steady-state path on every send; we want to exit before paying for the
+// full `getHistory(true)` clone below.
+if (!force) {
+  const contextLimit =
+    config.getContentGeneratorConfig()?.contextWindowSize ??
+    DEFAULT_TOKEN_LIMIT;
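+  const pendingUserMessage = opts.pendingUserMessage;
+  // estimatePromptTokens only reads `history` when originalTokenCount is 0
+  // (first send), so skip the `getHistory(true)` clone on the steady-state path.
+  const effectiveTokens = pendingUserMessage
+    ? estimatePromptTokens(
+        originalTokenCount > 0 ? [] : chat.getHistory(true),
+        pendingUserMessage,
+        originalTokenCount,
+        slimmingConfig.imageTokenEstimate,
+      )
+    : originalTokenCount;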
+  if (effectiveTokens < threshold * contextLimit) {
+    return {
+      newHistory: null,
+      info: {
+        originalTokenCount,
+        newTokenCount: originalTokenCount,
+        compressionStatus: CompressionStatus.NOOP,
+      },
+    };
+  }
+}
+```
+
+Add a new field to the `CompressOptions` interface ([:172-196](packages/core/src/services/chatCompressionService.ts:172)):
+
+```ts
+export interface CompressOptions {
+  // ... existing fields ...
+  /**
+   * Pending user message about to be sent. When present, the cheap-gate
+   * adds its estimated token count to `originalTokenCount` (which reflects
+   * only the prior turn's API usage) so the gate sees the real prompt size.
+   * Optional for backward compatibility with callers that don't have a
+   * user message in hand (e.g. manual /compress force=true paths).
+   */
+  pendingUserMessage?: Content;
+}
+```
+
+Add the import: `import { estimatePromptTokens } from './tokenEstimation.js';`
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/services/chatCompressionService.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): cheap-gate uses estimated tokens when user message is pending
+
+Add `pendingUserMessage` to CompressOptions and feed it through
+estimatePromptTokens at the auto-compaction cheap-gate. Closes the
+'lag by one turn' gap where the threshold check missed the user
+message about to be sent.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+### Task 4: Pass pendingUserMessage through at the geminiChat sendMessageStream entry
+
+**Files:**
+
+- Modify: `packages/core/src/core/geminiChat.ts`
+- Modify: `packages/core/src/core/geminiChat.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+Add to `packages/core/src/core/geminiChat.test.ts`:
+
+```ts
+describe('sendMessageStream first-turn estimation', () => {
+  it('triggers auto-compaction on the very first send when inherited history is huge', async () => {
+    // Simulates a sub-agent inheriting a large history / the --continue scenario:
+    // lastPromptTokenCount = 0, but history is already filled close to the auto threshold
+    const chat = makeChatWithLargeInheritedHistory(/* ~150K chars worth */);
+    expect(chat.getLastPromptTokenCount()).toBe(0);
+
+    const mockGen = mockContentGeneratorWithUsage({
+      totalTokenCount: 80_000,
+    });
+    chat.setContentGenerator(mockGen);
+
+    const stream = await chat.sendMessageStream(
+      'qwen-test',
+      { message: 'next user prompt' },
+      'prompt-1',
+    );
+    // Collect the first event from the stream; it should be COMPRESSED
+    const first = await stream.next();
+    expect(first.value?.type).toBe(StreamEventType.COMPRESSED);
+  });
+});
+```
+
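+The helper `makeChatWithLargeInheritedHistory` is inlined in the test file: it constructs a `GeminiChat` whose `history` holds 1500 simple user/model contents of 100 chars each, ~150K chars in total. At char/4 that is roughly 37.5K estimated tokens, so the chat's configured context window must be small enough for that estimate to cross the current 70% threshold (a window of about 53K or less).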
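+
+For orientation, the size of that inherited history under the char/4 estimate can be worked out directly. A minimal sketch: `buildInheritedHistory` is a hypothetical builder for the history array only (the real helper must also wrap it in a `GeminiChat`), and the local `Content` type is a simplified stand-in for the `@google/genai` type.
+
+```typescript
+// Simplified stand-in for the @google/genai Content type (text parts only).
+type Content = { role: 'user' | 'model'; parts: Array<{ text: string }> };
+
+/** Hypothetical builder: alternating user/model messages of fixed length. */
+function buildInheritedHistory(messages = 1500, charsEach = 100): Content[] {
+  return Array.from(
+    { length: messages },
+    (_, i): Content => ({
+      role: i % 2 === 0 ? 'user' : 'model',
+      parts: [{ text: 'x'.repeat(charsEach) }],
+    }),
+  );
+}
+
+const history = buildInheritedHistory();
+const totalChars = history.reduce(
+  (sum, c) => sum + c.parts.reduce((n, p) => n + p.text.length, 0),
+  0,
+);
+// Same char/4 rule as estimateContentTokens, applied to plain text only.
+const estimatedTokens = Math.ceil(totalChars / 4);
+```
+
+With the default arguments this yields 150,000 characters and 37,500 estimated tokens, which is the figure the first-send fallback would compare against the gate.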
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts -t 'first-turn estimation'
+```
+
+Expected: FAIL. `tryCompress` currently uses `lastPromptTokenCount = 0`, so the cheap-gate returns NOOP.
+
+- [ ] **Step 3: Implement (change sendMessageStream and tryCompress)**
+
+Change [geminiChat.ts:562](packages/core/src/core/geminiChat.ts:562) to:
+
+```ts
+compressionInfo = await this.tryCompress(
+  prompt_id,
+  model,
+  false,
+  params.config?.abortSignal,
+  {
+    pendingUserMessage: createUserContent(params.message),
+  },
+);
+```
+
+In the `tryCompress` signature (around [:460-478](packages/core/src/core/geminiChat.ts:460)), add to the `options` interface `TryCompressOptions`:
+
+```ts
+interface TryCompressOptions {
+  originalTokenCountOverride?: number;
+  trigger?: CompactTrigger;
+  pendingUserMessage?: Content; // ← new
+}
+```
+
+Pass `pendingUserMessage` through to `service.compress`:
+
+```ts
+const { newHistory, info } = await service.compress(this, {
+  // ... existing fields ...
+  pendingUserMessage: options?.pendingUserMessage,
+});
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/core/geminiChat.ts packages/core/src/core/geminiChat.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): pass pendingUserMessage from sendMessageStream to tryCompress
+
+Closes the 'first send after inherited history' gap where
+lastPromptTokenCount is 0 and the cheap-gate would always NOOP.
+estimatePromptTokens falls back to a full-history estimate in that
+case once the user message is provided.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P1: Three-tier threshold constants + computeThresholds + cheap-gate
+
+### Task 5: Add the constants and the computeThresholds function
+
+**Files:**
+
+- Modify: `packages/core/src/services/chatCompressionService.ts`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+Add to `chatCompressionService.test.ts`:
+
+```ts
+import { computeThresholds } from './chatCompressionService.js';
+
+describe('computeThresholds', () => {
+  it('32K window — proportional fallback for all tiers, hard degrades to auto', () => {
+    const t = computeThresholds(32_000);
+    expect(t.warn).toBe(19_200); // 0.6 * 32K
+    expect(t.auto).toBe(22_400); // 0.7 * 32K
+    expect(t.hard).toBe(22_400); // max(window-23K=9K, auto=22.4K) = auto
+    expect(t.effectiveWindow).toBe(12_000);
+  });
+
+  it('128K window — mixed (warn=pct, auto/hard=abs)', () => {
+    const t = computeThresholds(128_000);
+    expect(t.warn).toBe(76_800); // 0.6 * 128K (pct wins: 76.8K vs auto-20K=75K)
+    expect(t.auto).toBe(95_000); // abs: window-33K (abs wins: 95K vs 0.7*128K=89.6K)
+    expect(t.hard).toBe(105_000); // abs: window-23K
+    expect(t.effectiveWindow).toBe(108_000);
+  });
+
+  it('200K window — absolute takes over all tiers', () => {
+    const t = computeThresholds(200_000);
+    expect(t.warn).toBe(147_000); // abs: auto-20K (abs wins: 147K vs 0.6*200K=120K)
+    expect(t.auto).toBe(167_000); // abs: 200K-33K
+    expect(t.hard).toBe(177_000); // abs: 200K-23K
+  });
+
+  it('1M window — fully absolute', () => {
+    const t = computeThresholds(1_000_000);
+    expect(t.warn).toBe(947_000);
+    expect(t.auto).toBe(967_000);
+    expect(t.hard).toBe(977_000);
+  });
+
+  it('extreme small window (10K) does not crash; returns sane values', () => {
+    const t = computeThresholds(10_000);
+    expect(t.warn).toBeGreaterThan(0);
+    expect(t.auto).toBeGreaterThan(0);
+    expect(t.warn).toBeLessThanOrEqual(t.auto);
+    expect(t.auto).toBeLessThanOrEqual(t.hard);
+  });
+
+  it('thresholds always satisfy warn <= auto <= hard', () => {
+    for (const w of [32_000, 64_000, 128_000, 200_000, 256_000, 1_000_000]) {
+      const t = computeThresholds(w);
+      expect(t.warn).toBeLessThanOrEqual(t.auto);
+      expect(t.auto).toBeLessThanOrEqual(t.hard);
+    }
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts -t 'computeThresholds'
+```
+
+Expected: FAIL. `computeThresholds` does not exist.
+
+- [ ] **Step 3: Implement (add the constants and function)**
+
+In the constants area of [chatCompressionService.ts](packages/core/src/services/chatCompressionService.ts) (right after `COMPACT_MAX_OUTPUT_TOKENS`), add:
+
+```ts
+/**
+ * Default proportional auto-compaction threshold (legacy semantics
+ * preserved as a small-window fallback / safety net).
+ */
+export const DEFAULT_PCT = 0.7;
+
+/**
+ * Warn-tier proportional offset: warn-pct = PCT - WARN_PCT_OFFSET (= 0.6).
+ */
+export const WARN_PCT_OFFSET = 0.1;
+
+/**
+ * Token budget reserved for compression output. Matches COMPACT_MAX_OUTPUT_TOKENS
+ * because thinking is disabled (see Task 1) so maxOutputTokens is the hard
+ * ceiling on summary output.
+ */
+export const SUMMARY_RESERVE = COMPACT_MAX_OUTPUT_TOKENS; // 20_000
+
+/** Distance between auto threshold and effectiveWindow. */
+export const AUTOCOMPACT_BUFFER = 13_000;
+
+/** Distance between warn threshold and auto threshold. */
+export const WARN_BUFFER = 20_000;
+
+/** Distance between hard threshold and effectiveWindow (claude-code MANUAL_COMPACT_BUFFER). */
+export const HARD_BUFFER = 3_000;
+
+/** Auto-compaction consecutive-failure circuit breaker. */
+export const MAX_CONSECUTIVE_FAILURES = 3;
+
+export interface CompactionThresholds {
+  /** Token count at which UI warn tier triggers. */
+  warn: number;
+  /** Token count at which auto-compaction triggers. */
+  auto: number;
+  /** Token count at which auto-compaction is forced (resets failure counter). */
+  hard: number;
+  /** Window minus SUMMARY_RESERVE; the budget left for the prompt once the summary output is reserved. */
+  effectiveWindow: number;
+}
+
+/**
+ * Compute the three-tier threshold ladder for a given context window.
+ *
+ * Each tier is `max(proportional, absolute)`:
+ *   auto  = max(DEFAULT_PCT * window,                     effectiveWindow - AUTOCOMPACT_BUFFER)
+ *   warn  = max((DEFAULT_PCT - WARN_PCT_OFFSET) * window, auto - WARN_BUFFER)
+ *   hard  = max(effectiveWindow - HARD_BUFFER, auto)  // hard degrades to auto for tiny windows
+ *
+ * Small windows (where the absolute branch drops below the proportional one,
+ * e.g. window < 110K for auto) automatically fall back to the proportional
+ * branch. Large windows are dominated by the absolute branch, capping wasted
+ * reservation to ~33K instead of 30% of the window.
+ */
+export function computeThresholds(window: number): CompactionThresholds {
+  const effectiveWindow = window - SUMMARY_RESERVE;
+
+  const absAuto = effectiveWindow - AUTOCOMPACT_BUFFER;
+  const auto = Math.max(DEFAULT_PCT * window, absAuto);
+
+  const absWarn = auto - WARN_BUFFER;
+  const warn = Math.max((DEFAULT_PCT - WARN_PCT_OFFSET) * window, absWarn);
+
+  const rawHard = effectiveWindow - HARD_BUFFER;
+  const hard = Math.max(rawHard, auto);
+
+  return { warn, auto, hard, effectiveWindow };
+}
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/services/chatCompressionService.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): add computeThresholds for three-tier compaction ladder
+
+Introduces warn/auto/hard thresholds combining proportional fallback
+(small windows) with absolute reservation (large windows). Matches the
+formula in docs/design/auto-compaction-threshold-redesign.md. Pure
+function with full coverage across 32K/128K/200K/1M/extreme-small
+windows.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+### Task 6: Switch the cheap gate to computeThresholds.auto
+
+**Files:**
+
+- Modify: `packages/core/src/services/chatCompressionService.ts`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+```ts
+describe('compress cheap-gate uses computeThresholds.auto', () => {
+  it('on a 200K window with originalTokenCount=160K, NOOP (below auto=167K)', async () => {
+    const chat = makeFakeChat();
+    const result = await new ChatCompressionService().compress(chat, {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      hasFailedCompressionAttempt: false,
+      originalTokenCount: 160_000,
+    });
+    expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
+  });
+
+  it('on a 200K window with originalTokenCount=168K, proceeds past gate', async () => {
+    // 168K > 167K (auto): the cheap gate lets it through into the curatedHistory stage
+    const chat = makeFakeChat({ historyChars: 500_000 });
+    const result = await new ChatCompressionService().compress(chat, {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      hasFailedCompressionAttempt: false,
+      originalTokenCount: 168_000,
+    });
+    // The final status depends on the mocked sideQuery; only assert that this is not the early NOOP from the cheap gate
+    expect(result.info.compressionStatus).not.toBe(CompressionStatus.NOOP);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts -t 'cheap-gate uses computeThresholds'
+```
+
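+Expected: FAIL. The current threshold is `threshold * contextLimit = 0.7 * 200K = 140K`, so 160K already exceeds 140K and passes the cheap gate, which contradicts the NOOP expected by the first test. The 168K case also passes the legacy gate, so the second test may already pass before the change; only the first test is expected to fail.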
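The difference between the two gates can be reproduced with two small predicates. This is a sketch rather than the service code: `legacyGateBlocks` and `ladderGateBlocks` are hypothetical names, and the `auto` value of 167K is the 200K-window figure used in the test above.

```ts
// Returns true when the cheap gate short-circuits to NOOP.
function legacyGateBlocks(
  tokens: number,
  contextLimit: number,
  pct = 0.7,
): boolean {
  return tokens < pct * contextLimit;
}

// Same contract, but compared against the precomputed `auto` tier.
function ladderGateBlocks(tokens: number, auto: number): boolean {
  return tokens < auto;
}

const CONTEXT_LIMIT = 200_000;
const AUTO_200K = 167_000; // value quoted in the test above

for (const tokens of [130_000, 160_000, 168_000]) {
  console.log(tokens, {
    legacyNoop: legacyGateBlocks(tokens, CONTEXT_LIMIT),
    ladderNoop: ladderGateBlocks(tokens, AUTO_200K),
  });
}
```

At 160K the two predicates disagree (the legacy gate proceeds, the ladder gate returns NOOP), which is the case the first test pins down; at 130K and 168K they agree.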
+
+- [ ] **Step 3: Implement (switch the cheap-gate formula)**
+
+Modify the `if (!force) { ... }` block at [chatCompressionService.ts:235-249](packages/core/src/services/chatCompressionService.ts:235):
+
+```ts
+if (!force) {
+  const contextLimit =
+    config.getContentGeneratorConfig()?.contextWindowSize ??
+    DEFAULT_TOKEN_LIMIT;
+  const { auto } = computeThresholds(contextLimit);
+  const pendingUserMessage = opts.pendingUserMessage;
+  const effectiveTokens = pendingUserMessage
+    ? estimatePromptTokens(
+        chat.getHistory(true),
+        pendingUserMessage,
+        originalTokenCount,
+        slimmingConfig.imageTokenEstimate,
+      )
+    : originalTokenCount;
+  if (effectiveTokens < auto) {
+    return {
+      newHistory: null,
+      info: {
+        originalTokenCount,
+        newTokenCount: originalTokenCount,
+        compressionStatus: CompressionStatus.NOOP,
+      },
+    };
+  }
+}
+```
+
+Also delete the `const threshold = chatCompressionSettings?.contextPercentageThreshold ?? COMPRESSION_TOKEN_THRESHOLD;` statement at [chatCompressionService.ts:214-217](packages/core/src/services/chatCompressionService.ts:214), since `threshold` is no longer used by the cheap gate. Remove the `threshold <= 0` branch at line 221 as well (it implemented the implicit-disable semantics; the details are handled in P4).
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/services/chatCompressionService.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/services/chatCompressionService.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+refactor(core): cheap-gate uses computeThresholds.auto
+
+Replace the legacy `threshold * contextLimit` formula with
+computeThresholds.auto, which combines proportional fallback with
+absolute reservation. On large windows (>=128K) the gate now triggers
+later than 70% but reserves a fixed ~33K, freeing tens of thousands of
+context tokens that the old formula wasted.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P2: Failure handling upgrade (one-shot lock → three-strike circuit breaker)
+
+### Task 7: hasFailedCompressionAttempt → consecutiveFailures
+
+**Files:**
+
+- Modify: `packages/core/src/core/geminiChat.ts`
+- Modify: `packages/core/src/services/chatCompressionService.ts`
+- Modify: `packages/core/src/core/geminiChat.test.ts`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+`geminiChat.test.ts`:
+
+```ts
+describe('compression failure circuit breaker', () => {
+  it('tolerates 3 consecutive failures, NOOPs the fourth', async () => {
+    const chat = makeChatWithMockedFailingCompression();
+    // Trigger MAX_CONSECUTIVE_FAILURES (3) consecutive failures:
+    await chat.sendMessageStream('m', { message: 'a' }, 'p1'); // attempt 1 fails
+    await chat.sendMessageStream('m', { message: 'b' }, 'p2'); // attempt 2 fails
+    await chat.sendMessageStream('m', { message: 'c' }, 'p3'); // attempt 3 fails, breaker trips
+    const events = await collectEvents(
+      await chat.sendMessageStream('m', { message: 'd' }, 'p4'), // attempt 4 should NOOP
+    );
+    expect(
+      events.find((e) => e.type === StreamEventType.COMPRESSED),
+    ).toBeUndefined();
+    // The 4th send is stopped by the breaker at the cheap gate
+    // (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES), so no 4th attempt is made
+    expect(getCompressCallCount()).toBe(3);
+  });
+
+  it('resets counter on a successful force compress', async () => {
+    const chat = makeChatWithMockedFailingCompression();
+    await chat.sendMessageStream('m', { message: 'a' }, 'p1'); // fail
+    await chat.sendMessageStream('m', { message: 'b' }, 'p2'); // fail
+    await chat.sendMessageStream('m', { message: 'c' }, 'p3'); // fail, breaker trips
+    // The user runs /compress manually and it succeeds
+    setMockedCompressionToSucceed();
+    await chat.tryCompress('p4', 'm', /* force */ true);
+    // The breaker should now be reset
+    expect(chat.getConsecutiveFailures()).toBe(0);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts -t 'circuit breaker'
+```
+
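+Expected: FAIL. Today a single failure latches the lock permanently, so every later non-forced send is NOOPed at the cheap gate; the tests instead expect the breaker to tolerate repeated failures before tripping and to recover once a forced compress succeeds.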
+
+- [ ] **Step 3: Implement (replace the field)**
+
+Internal field in [geminiChat.ts](packages/core/src/core/geminiChat.ts) (grep for `hasFailedCompressionAttempt`):
+
+```ts
+// Before
+private hasFailedCompressionAttempt = false;
+
+// After
+private consecutiveFailures = 0;
+```
+
+The fields that `tryCompress` at [geminiChat.ts:467-478](packages/core/src/core/geminiChat.ts:467) passes to `service.compress`:
+
+```ts
+const { newHistory, info } = await service.compress(this, {
+  promptId,
+  force,
+  model,
+  config: this.config,
+  consecutiveFailures: this.consecutiveFailures, // replaces hasFailedCompressionAttempt
+  originalTokenCount:
+    options?.originalTokenCountOverride ?? this.lastPromptTokenCount,
+  pendingUserMessage: options?.pendingUserMessage,
+  trigger: options?.trigger,
+  signal,
+});
+```
+
+Failure/success branches at [geminiChat.ts:503-510](packages/core/src/core/geminiChat.ts:503):
+
+```ts
+if (info.compressionStatus === CompressionStatus.COMPRESSED && newHistory) {
+  // ... existing logic ...
+  this.setHistory(newHistory);
+  this.config.getFileReadCache().clear();
+  this.lastPromptTokenCount = info.newTokenCount;
+  this.telemetryService?.setLastPromptTokenCount(info.newTokenCount);
+  this.consecutiveFailures = 0; // replaces hasFailedCompressionAttempt = false
+} else if (isCompressionFailureStatus(info.compressionStatus)) {
+  if (!force) {
+    this.consecutiveFailures += 1; // replaces hasFailedCompressionAttempt = true
+  }
+}
+```
+
+The `CompressOptions` interface in [chatCompressionService.ts](packages/core/src/services/chatCompressionService.ts):
+
+```ts
+export interface CompressOptions {
+  // ... existing fields ...
+  /**
+   * Number of consecutive auto-compaction failures for this chat. When
+   * it reaches MAX_CONSECUTIVE_FAILURES, the gate stops trying until a
+   * successful force=true call resets it.
+   */
+  consecutiveFailures: number;
+  // hasFailedCompressionAttempt is removed
+}
+```
+
+The cheap-gate check inside `compress()` at [:221](packages/core/src/services/chatCompressionService.ts:221):
+
+```ts
+// Cheap gates first — these don't need the curated history.
+if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES && !force) {
+  return {
+    newHistory: null,
+    info: {
+      originalTokenCount: 0,
+      newTokenCount: 0,
+      compressionStatus: CompressionStatus.NOOP,
+    },
+  };
+}
+```
+
+Update the destructuring `const { ... } = opts;` so that `hasFailedCompressionAttempt` is replaced with `consecutiveFailures`.
+
+In `chatCompressionService.test.ts`, change every `hasFailedCompressionAttempt: false/true` to `consecutiveFailures: 0` / `consecutiveFailures: MAX_CONSECUTIVE_FAILURES`, and adjust the test expectations one by one.
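The counter semantics described in this step can be sketched in isolation. This is a minimal model under stated assumptions, not the `GeminiChat` code: `CompactionBreakerSketch` and `Outcome` are hypothetical names, and `MAX_CONSECUTIVE_FAILURES = 3` is the default quoted in the design notes.

```ts
const MAX_CONSECUTIVE_FAILURES = 3;

type Outcome = 'compressed' | 'failed' | 'noop';

class CompactionBreakerSketch {
  private consecutiveFailures = 0;

  /** True when a non-forced attempt should be skipped at the cheap gate. */
  isOpen(force: boolean): boolean {
    return !force && this.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES;
  }

  /** Mirrors the success/failure branches shown for tryCompress. */
  record(outcome: Outcome, force: boolean): void {
    if (outcome === 'compressed') {
      this.consecutiveFailures = 0; // any success resets the counter
    } else if (outcome === 'failed' && !force) {
      this.consecutiveFailures += 1; // forced failures are not counted
    }
  }

  get failures(): number {
    return this.consecutiveFailures;
  }
}

// Four non-forced sends against a compaction that always fails.
const breaker = new CompactionBreakerSketch();
const attempted: boolean[] = [];
for (let i = 0; i < 4; i++) {
  const skip = breaker.isOpen(false);
  attempted.push(!skip);
  if (!skip) breaker.record('failed', false);
}
console.log(attempted); // three attempts run, the fourth is skipped

breaker.record('compressed', true); // a successful forced compress
console.log(breaker.failures, breaker.isOpen(false));
```

With a `>=` comparison against 3, the breaker only opens after the third recorded failure, so three attempts run before non-forced calls start returning NOOP.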
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts packages/core/src/services/chatCompressionService.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/core/geminiChat.ts packages/core/src/services/chatCompressionService.ts packages/core/src/core/geminiChat.test.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+refactor(core): replace hasFailedCompressionAttempt with circuit breaker
+
+Switches from a one-shot permanent lock to a three-strike circuit
+breaker (MAX_CONSECUTIVE_FAILURES=3). Successful force compress
+(manual /compress, reactive overflow, or hard-tier rescue) resets the
+counter. Aligns with claude-code's design and unblocks recovery from
+transient failures (rate limits, transient model errors) that
+previously disabled auto-compaction for the rest of the session.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P4: Configuration surface (remove contextPercentageThreshold + breaking-change warning)
+
+### Task 8: Remove the field + startup warning
+
+**Files:**
+
+- Modify: `packages/core/src/config/config.ts`
+- Create or modify: `packages/core/src/config/config.test.ts`
+- Modify: `packages/cli/src/config/settingsSchema.ts` (only if it references the field)
+- Modify: `packages/core/src/services/chatCompressionService.ts`
+- Modify: `packages/core/src/services/chatCompressionService.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+`packages/core/src/config/config.test.ts` (create it if it does not exist):
+
+```ts
+import { describe, it, expect, vi } from 'vitest';
+import { Config } from './config.js';
+
+describe('Config — chatCompression.contextPercentageThreshold deprecation', () => {
+  it('logs a stderr warning when the deprecated field is set', () => {
+    const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+    new Config({
+      // ... minimal required Config params ...
+      chatCompression: { contextPercentageThreshold: 0.5 } as any,
+    });
+    expect(warnSpy).toHaveBeenCalledWith(
+      expect.stringContaining(
+        'chatCompression.contextPercentageThreshold has been removed',
+      ),
+    );
+    warnSpy.mockRestore();
+  });
+
+  it('does not warn when the deprecated field is absent', () => {
+    const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+    new Config({
+      // ... minimal params, no chatCompression.contextPercentageThreshold ...
+    });
+    expect(warnSpy).not.toHaveBeenCalledWith(
+      expect.stringContaining('chatCompression.contextPercentageThreshold'),
+    );
+    warnSpy.mockRestore();
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/config/config.test.ts
+```
+
+Expected: FAIL. `Config` currently accepts this field without any warning.
+
+- [ ] **Step 3: Implement (update ChatCompressionSettings and the Config constructor)**
+
+[config.ts:217-227](packages/core/src/config/config.ts:217):
+
+```ts
+export interface ChatCompressionSettings {
+  /**
+   * Estimated tokens for a single inline image / document part when
+   * apportioning chars across history in `findCompressSplitPoint`.
+   * Also used as the placeholder budget when stripping inline media
+   * out of the side-query compaction prompt. Default 1600.
+   * Env override: `QWEN_IMAGE_TOKEN_ESTIMATE`.
+   */
+  imageTokenEstimate?: number;
+}
+```
+
+(The `contextPercentageThreshold` field is deleted.)
+
+In [config.ts](packages/core/src/config/config.ts), find where the Config constructor handles `params.chatCompression` (around line 933) and add the following before the assignment:
+
+```ts
+if (
+  params.chatCompression &&
+  typeof (params.chatCompression as Record<string, unknown>)
+    .contextPercentageThreshold !== 'undefined'
+) {
+  console.warn(
+    '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' +
+      'and is now controlled by built-in thresholds. Setting will be ignored.',
+  );
+}
+this.chatCompression = params.chatCompression;
+```
+
+Clean up `chatCompressionService.ts` at the same time. The statement at [:214-217](packages/core/src/services/chatCompressionService.ts:214) was already deleted in Task 6; check whether the file still contains `chatCompressionSettings?.contextPercentageThreshold` or the exported constant `COMPRESSION_TOKEN_THRESHOLD`:
+
+- If `COMPRESSION_TOKEN_THRESHOLD` no longer has any references, delete the constant.
+- If references remain (for example in telemetry or docs), point them at `DEFAULT_PCT` instead.
+
+`cli/config/settingsSchema.ts` should not need changes: `chatCompression` is still `type: 'object'` with no nested schema fields ([settingsSchema.ts:1020-1028](packages/cli/src/config/settingsSchema.ts:1020)). If the schema does reference `contextPercentageThreshold` anywhere, delete that reference.
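If spying on `console.warn` turns out to be awkward in the Config tests, one alternative is to pull the check into a small pure helper that takes the logger as a parameter. The sketch below is only an illustration of that idea: `warnIfRemovedThreshold` and `WarnFn` are hypothetical names that do not appear in the plan, and the message text is copied from the snippet above.

```ts
type WarnFn = (message: string) => void;

/** Returns true when the removed field is present and a warning was emitted. */
function warnIfRemovedThreshold(
  chatCompression: unknown,
  warn: WarnFn,
): boolean {
  if (typeof chatCompression !== 'object' || chatCompression === null) {
    return false;
  }
  const value = (chatCompression as Record<string, unknown>)[
    'contextPercentageThreshold'
  ];
  if (value === undefined) {
    return false;
  }
  warn(
    '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' +
      'and is now controlled by built-in thresholds. Setting will be ignored.',
  );
  return true;
}

// Collect warnings in memory instead of writing to stderr.
const captured: string[] = [];
const collect: WarnFn = (message) => {
  captured.push(message);
};

warnIfRemovedThreshold({ contextPercentageThreshold: 0.5 }, collect); // warns
warnIfRemovedThreshold({ imageTokenEstimate: 1600 }, collect); // silent
warnIfRemovedThreshold(undefined, collect); // silent
console.log(captured.length);
```

In production the constructor would pass `console.warn` as the logger, so startup is never blocked and the field value is simply dropped.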
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core
+npm test --workspace=packages/cli
+```
+
+Expected: PASS (including the existing compression-related tests)
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/config/config.ts packages/core/src/config/config.test.ts packages/core/src/services/chatCompressionService.ts packages/core/src/services/chatCompressionService.test.ts
+git commit -m "$(cat <<'EOF'
+refactor(core)!: remove chatCompression.contextPercentageThreshold setting
+
+The proportional threshold is now an internal constant (DEFAULT_PCT) and
+the auto-compaction threshold is computed from a mixed proportional /
+absolute formula (computeThresholds). User-facing tuning of the bare
+percentage no longer maps to meaningful behavior on large-window models.
+
+Existing settings.json files containing the field will log a one-line
+stderr warning on startup; the field is otherwise ignored.
+
+BREAKING CHANGE: chatCompression.contextPercentageThreshold is removed.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P3: Proactive rescue at the hard tier
+
+### Task 9: Add a hard check + force compress at the sendMessageStream entry
+
+**Files:**
+
+- Modify: `packages/core/src/core/geminiChat.ts`
+- Modify: `packages/core/src/core/geminiChat.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+```ts
+describe('sendMessageStream hard-tier rescue', () => {
+  it('triggers force compress when estimated tokens cross hard threshold', async () => {
+    // 200K window: hard = 177K
+    const chat = makeChatWithLastPromptTokenCount(176_000);
+    // This turn's estimated user message plus 176K crosses 177K
+    const userMessage = makeBigUserMessage(/* ~3K tokens */);
+    const stream = await chat.sendMessageStream(
+      'm',
+      { message: userMessage },
+      'p',
+    );
+    const first = await stream.next();
+    expect(first.value?.type).toBe(StreamEventType.COMPRESSED);
+    expect(getLastCompressCallForce()).toBe(true);
+  });
+
+  it('hard rescue resets consecutiveFailures before forcing', async () => {
+    const chat = makeChatWithLastPromptTokenCount(176_000);
+    // First produce 3 failures so that consecutiveFailures = 3
+    setMockedCompressionToFail(3);
+    await chat.sendMessageStream('m', { message: 'a' }, 'p1');
+    await chat.sendMessageStream('m', { message: 'b' }, 'p2');
+    await chat.sendMessageStream('m', { message: 'c' }, 'p3');
+    expect(chat.getConsecutiveFailures()).toBe(3);
+    // 4th send: a ~3K-token message pushes the estimate past hard (177K),
+    // so hard rescue resets the breaker and calls tryCompress with force=true
+    setMockedCompressionToSucceed();
+    const bigMessage = makeBigUserMessage(/* ~3K tokens */);
+    await chat.sendMessageStream('m', { message: bigMessage }, 'p4');
+    expect(getLastCompressCallForce()).toBe(true);
+    expect(chat.getConsecutiveFailures()).toBe(0);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts -t 'hard-tier rescue'
+```
+
+Expected: FAIL. `sendMessageStream` currently always calls `tryCompress` with `force=false`.
+
+- [ ] **Step 3: Implement (add the hard check at the sendMessageStream entry)**
+
+[geminiChat.ts:560-567](packages/core/src/core/geminiChat.ts:560):
+
+```ts
+// Hard-tier rescue: if pending prompt is large enough to risk overflow,
+// force compress before the send and reset the failure counter so a
+// session already in circuit-breaker NOOP can recover. This proactively
+// covers what reactive overflow (line ~711) would otherwise catch
+// after a wasted round-trip.
+const contextLimit =
+  this.config.getContentGeneratorConfig()?.contextWindowSize ??
+  DEFAULT_TOKEN_LIMIT;
+const { hard } = computeThresholds(contextLimit);
+const pendingUserMessage = createUserContent(params.message);
+const effectiveTokens = estimatePromptTokens(
+  this.getHistory(true),
+  pendingUserMessage,
+  this.lastPromptTokenCount,
+);
+const shouldForceFromHard = effectiveTokens >= hard;
+if (shouldForceFromHard) {
+  this.consecutiveFailures = 0;
+}
+
+compressionInfo = await this.tryCompress(
+  prompt_id,
+  model,
+  shouldForceFromHard,
+  params.config?.abortSignal,
+  { pendingUserMessage },
+);
+```
+
+Note: `sendMessageStream` already calls `createUserContent` once at [:569](packages/core/src/core/geminiChat.ts:569). Because the call now happens earlier, the line `const userContent = createUserContent(params.message);` at [:569](packages/core/src/core/geminiChat.ts:569) can be deleted or replaced with `const userContent = pendingUserMessage;`.
+
+Add imports:
+
+- `import { computeThresholds } from '../services/chatCompressionService.js';`
+- `import { estimatePromptTokens } from '../services/tokenEstimation.js';`
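The decision made at the `sendMessageStream` entry reduces to one comparison plus a conditional reset. The sketch below isolates that logic so the boundary cases are easy to see; `decideHardRescue` and `RescueDecision` are hypothetical names, and 177K is the 200K-window `hard` value quoted in the test above.

```ts
interface RescueDecision {
  /** Whether tryCompress should be called with force=true. */
  force: boolean;
  /** Value of the failure counter after the hard check. */
  failuresAfterCheck: number;
}

function decideHardRescue(
  effectiveTokens: number,
  hard: number,
  consecutiveFailures: number,
): RescueDecision {
  const force = effectiveTokens >= hard;
  // The counter is only reset when the hard tier is actually crossed.
  return { force, failuresAfterCheck: force ? 0 : consecutiveFailures };
}

const HARD_200K = 177_000;

// 176K of history plus a ~3K-token message crosses the hard tier.
const crossing = decideHardRescue(176_000 + 3_000, HARD_200K, 3);
// 176K plus a very short message stays below it; the breaker stays open.
const belowHard = decideHardRescue(176_000 + 10, HARD_200K, 3);

console.log(crossing, belowHard);
```

Note that a short pending message on top of 176K does not reach `hard`, so a test that wants to exercise the rescue path needs a pending message large enough to cross the threshold.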
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/core -- --run packages/core/src/core/geminiChat.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck --workspace=packages/core
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/core/src/core/geminiChat.ts packages/core/src/core/geminiChat.test.ts
+git commit -m "$(cat <<'EOF'
+feat(core): hard-tier rescue forces compaction before oversized send
+
+When estimated tokens cross computeThresholds.hard, sendMessageStream
+now resets the consecutive-failure counter and calls tryCompress with
+force=true. This pulls reactive overflow recovery forward to before
+the send, saving one wasted round-trip and unblocking sessions whose
+circuit breaker had latched off.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Phase P5: UI changes (tip rewrite + /context display)
+
+### Task 10: Rewrite the three context-\* tips in tipRegistry
+
+**Files:**
+
+- Modify: `packages/cli/src/services/tips/tipRegistry.ts`
+- Modify: `packages/cli/src/services/tips/tipRegistry.test.ts` (create it if it does not exist)
+- Modify: `packages/cli/src/ui/AppContainer.tsx`
+
+- [ ] **Step 1: Write the failing test**
+
+`packages/cli/src/services/tips/tipRegistry.test.ts`:
+
+```ts
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, it, expect } from 'vitest';
+import { tipRegistry, type TipContext } from './tipRegistry.js';
+
+const baseCtx: TipContext = {
+  lastPromptTokenCount: 0,
+  contextWindowSize: 200_000,
+  sessionPromptCount: 10,
+  sessionCount: 1,
+  platform: 'darwin',
+  thresholds: {
+    warn: 147_000,
+    auto: 167_000,
+    hard: 177_000,
+    effectiveWindow: 180_000,
+  },
+};
+
+function tipById(id: string) {
+  return tipRegistry.find((t) => t.id === id)!;
+}
+
+describe('context-* tip thresholds align with computeThresholds', () => {
+  it('compress-intro fires between warn and auto', () => {
+    const t = tipById('compress-intro');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 100_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 150_000 })).toBe(
+      true,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 168_000 })).toBe(
+      false,
+    );
+  });
+
+  it('context-high fires between auto and hard', () => {
+    const t = tipById('context-high');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 150_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 170_000 })).toBe(
+      true,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 178_000 })).toBe(
+      false,
+    );
+  });
+
+  it('context-critical fires at or above hard', () => {
+    const t = tipById('context-critical');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 170_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 178_000 })).toBe(
+      true,
+    );
+  });
+
+  it('falls back gracefully when thresholds undefined (legacy callers)', () => {
+    const ctx = { ...baseCtx, thresholds: undefined };
+    // Without thresholds there is nothing to compare against, so none of the three tips should fire
+    expect(tipById('compress-intro').isRelevant(ctx)).toBe(false);
+    expect(tipById('context-high').isRelevant(ctx)).toBe(false);
+    expect(tipById('context-critical').isRelevant(ctx)).toBe(false);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/cli -- --run packages/cli/src/services/tips/tipRegistry.test.ts
+```
+
+Expected: FAIL. `TipContext` has no `thresholds` field yet, and the three tips still fire at the fixed 50/80/95 percent marks.
+
+- [ ] **Step 3: Implement (update tipRegistry)**
+
+[tipRegistry.ts:15-21](packages/cli/src/services/tips/tipRegistry.ts:15):
+
+```ts
+import type { CompactionThresholds } from '@qwen-code/qwen-code-core';
+import { DEFAULT_TOKEN_LIMIT } from '@qwen-code/qwen-code-core';
+
+export type TipTrigger = 'startup' | 'post-response';
+
+export interface TipContext {
+  lastPromptTokenCount: number;
+  contextWindowSize: number;
+  sessionPromptCount: number;
+  sessionCount: number;
+  platform: string;
+  /**
+   * Three-tier auto-compaction thresholds, computed by callers.
+   * Optional for backward compat; tip checks return false when missing.
+   */
+  thresholds?: CompactionThresholds;
+}
+```
+
+`getContextUsagePercent` stays (other startup tips may use it), but the context-\* tips no longer depend on it.
+
+Replace the `isRelevant` of the three tips at [tipRegistry.ts:37-69](packages/cli/src/services/tips/tipRegistry.ts:37):
+
+```ts
+export const tipRegistry: ContextualTip[] = [
+  // --- Post-response contextual tips (priority: higher = more urgent) ---
+  {
+    id: 'context-critical',
+    content:
+      'Context near hard limit — auto-compact will force on next send. Consider /clear if you want to start fresh.',
+    trigger: 'post-response',
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.hard,
+    cooldownPrompts: 3,
+    priority: 100,
+  },
+  {
+    id: 'context-high',
+    content: 'Context is getting full. Use /compress to free up space.',
+    trigger: 'post-response',
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.auto &&
+      ctx.lastPromptTokenCount < ctx.thresholds.hard,
+    cooldownPrompts: 5,
+    priority: 90,
+  },
+  {
+    id: 'compress-intro',
+    content: 'Long conversation? /compress summarizes history to free context.',
+    trigger: 'post-response',
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.warn &&
+      ctx.lastPromptTokenCount < ctx.thresholds.auto &&
+      ctx.sessionPromptCount > 5,
+    cooldownPrompts: 10,
+    priority: 50,
+  },
+
+  // --- Startup tips --- (unchanged)
+  // ... the startup tips that follow are left as-is ...
+```
+
+Around `packages/cli/src/ui/AppContainer.tsx:1150` (the known construction point for contextual tips), change the code to:
+
+```tsx
+// pseudo: the exact shape depends on the existing code
+const thresholds = computeThresholds(contextWindowSize);
+const tipCtx: TipContext = {
+  lastPromptTokenCount,
+  contextWindowSize,
+  sessionPromptCount,
+  sessionCount,
+  platform: process.platform,
+  thresholds,
+};
+```
+
+Add the import to AppContainer.tsx:
+
+```tsx
+import { computeThresholds } from '@qwen-code/qwen-code-core';
+```
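Because the three `isRelevant` predicates split the token range into disjoint bands, at most one context-\* tip can be relevant for a given count. The sketch below collapses them into a single function to make that visible; `relevantContextTip`, `TierThresholds` and `ContextTipId` are hypothetical names, and the ladder values are the 200K-window figures from the test above.

```ts
interface TierThresholds {
  warn: number;
  auto: number;
  hard: number;
}

type ContextTipId = 'compress-intro' | 'context-high' | 'context-critical';

function relevantContextTip(
  tokens: number,
  sessionPromptCount: number,
  thresholds?: TierThresholds,
): ContextTipId | undefined {
  if (thresholds === undefined) return undefined; // legacy callers: no tip
  if (tokens >= thresholds.hard) return 'context-critical';
  if (tokens >= thresholds.auto) return 'context-high';
  if (tokens >= thresholds.warn && sessionPromptCount > 5) {
    return 'compress-intro';
  }
  return undefined;
}

const tipLadder: TierThresholds = {
  warn: 147_000,
  auto: 167_000,
  hard: 177_000,
};

for (const tokens of [100_000, 150_000, 170_000, 178_000]) {
  console.log(tokens, relevantContextTip(tokens, 10, tipLadder));
}
```

The registry keeps three separate entries so that each tip can carry its own cooldown and priority; the single function here is only a way to read the bands at a glance.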
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/cli -- --run packages/cli/src/services/tips/tipRegistry.test.ts
+npm test --workspace=packages/cli
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/cli/src/services/tips/tipRegistry.ts packages/cli/src/services/tips/tipRegistry.test.ts packages/cli/src/ui/AppContainer.tsx
+git commit -m "$(cat <<'EOF'
+feat(cli): align context-* tips with new compaction thresholds
+
+The three context-usage tips now compare tokenCount against the
+warn/auto/hard ladder from computeThresholds instead of fixed 50/80/95
+percentages. compress-intro fires between warn and auto, context-high
+between auto and hard, context-critical at or above hard. Threshold
+data is injected into TipContext from the AppContainer.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+### Task 11: Show the three-tier thresholds in the /context command
+
+**Files:**
+
+- Modify: `packages/cli/src/ui/commands/contextCommand.ts`
+- Modify: `packages/cli/src/ui/commands/contextCommand.test.ts`
+
+- [ ] **Step 1: Write the failing test**
+
+```ts
+describe('/context shows three-tier thresholds', () => {
+  it('renders warn/auto/hard with current tier marker', () => {
+    const result = renderContextCommand({
+      contextWindowSize: 200_000,
+      lastPromptTokenCount: 150_000, // between warn and auto
+    });
+    expect(result).toMatch(/Warn threshold:\s+147[,.]?000/);
+    expect(result).toMatch(/Auto threshold:\s+167[,.]?000/);
+    expect(result).toMatch(/Hard threshold:\s+177[,.]?000/);
+    expect(result).toMatch(/current tier:\s+warn/i);
+  });
+
+  it('correctly identifies "below warn" tier when tokens are low', () => {
+    const result = renderContextCommand({
+      contextWindowSize: 200_000,
+      lastPromptTokenCount: 50_000,
+    });
+    expect(result).toMatch(/current tier:\s+(safe|below warn|normal)/i);
+  });
+});
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+```bash
+npm test --workspace=packages/cli -- --run packages/cli/src/ui/commands/contextCommand.test.ts -t 'three-tier'
+```
+
+Expected: FAIL. [contextCommand.ts:177-183](packages/cli/src/ui/commands/contextCommand.ts:177) currently uses the `(1 - threshold) * contextWindowSize` formula and only displays a single "autocompactBuffer" number.
+
+- [ ] **Step 3: Implement (update the contextCommand output)**
+
+Replace the block at [contextCommand.ts:177-183](packages/cli/src/ui/commands/contextCommand.ts:177):
+
+```ts
+import { computeThresholds } from '@qwen-code/qwen-code-core';
+
+// ... inside buildContextSummary or a similar entry point:
+const thresholds = computeThresholds(contextWindowSize);
+const { warn, auto, hard, effectiveWindow } = thresholds;
+
+function currentTier(tokens: number): string {
+  if (tokens >= hard) return 'hard (force compress imminent)';
+  if (tokens >= auto) return 'auto (compaction in progress / just ran)';
+  if (tokens >= warn) return 'warn';
+  return 'safe';
+}
+
+// Append to the formatted output:
+const lines = [
+  // ... existing output ...
+  `Effective window:   ${formatNum(effectiveWindow)}  (window − 20K reserve)`,
+  `Warn threshold:     ${formatNum(warn)}`,
+  `Auto threshold:     ${formatNum(auto)}`,
+  `Hard threshold:     ${formatNum(hard)}`,
+  `Current tier:       ${currentTier(lastPromptTokenCount)}`,
+];
+```
+
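+Note: `formatNum` stands for whatever number formatter the project already uses (for example `.toLocaleString()`); if the file has none, inline `(n: number) => n.toLocaleString('en-US')`.
+
+Also **delete** the original code that computes `autocompactBuffer` ([:180-183](packages/cli/src/ui/commands/contextCommand.ts:180)) and the uses of `compressionThreshold`; the display now reads `auto` directly.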
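To check the rendered text against the regexes in the Step 1 test without wiring up the command, the formatting can be exercised on its own. This is a sketch under assumptions: `renderThresholdLines`, `tierLabel` and `Ladder` are hypothetical names, the inline `formatNum` is the fallback suggested in the note above, and the ladder values are the 200K-window figures.

```ts
interface Ladder {
  warn: number;
  auto: number;
  hard: number;
  effectiveWindow: number;
}

const formatNum = (n: number): string => n.toLocaleString('en-US');

// Boundaries are inclusive: a count exactly at a threshold is in that tier.
function tierLabel(tokens: number, ladder: Ladder): string {
  if (tokens >= ladder.hard) return 'hard (force compress imminent)';
  if (tokens >= ladder.auto) return 'auto (compaction in progress / just ran)';
  if (tokens >= ladder.warn) return 'warn';
  return 'safe';
}

function renderThresholdLines(tokens: number, ladder: Ladder): string[] {
  return [
    `Effective window:   ${formatNum(ladder.effectiveWindow)}`,
    `Warn threshold:     ${formatNum(ladder.warn)}`,
    `Auto threshold:     ${formatNum(ladder.auto)}`,
    `Hard threshold:     ${formatNum(ladder.hard)}`,
    `Current tier:       ${tierLabel(tokens, ladder)}`,
  ];
}

const contextLadder: Ladder = {
  warn: 147_000,
  auto: 167_000,
  hard: 177_000,
  effectiveWindow: 180_000,
};

const rendered = renderThresholdLines(150_000, contextLadder).join('\n');
console.log(rendered);
```

Taking the ladder as a parameter, rather than closing over module-level values, keeps the tier function easy to unit-test at the exact boundaries.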
+
+- [ ] **Step 4: Run test to verify it passes**
+
+```bash
+npm test --workspace=packages/cli -- --run packages/cli/src/ui/commands/contextCommand.test.ts
+```
+
+Expected: PASS
+
+- [ ] **Step 5: Typecheck + lint**
+
+```bash
+npm run typecheck
+npm run lint
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add packages/cli/src/ui/commands/contextCommand.ts packages/cli/src/ui/commands/contextCommand.test.ts
+git commit -m "$(cat <<'EOF'
+feat(cli): /context shows three-tier thresholds and current tier
+
+Replace the legacy single-buffer display with effective window + warn /
+auto / hard threshold lines and a "current tier" label so users can see
+exactly where in the ladder the session sits.
+
+Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+EOF
+)"
+```
+
+---
+
+## Acceptance (final full regression)
+
+After all tasks have landed, run one final full verification pass:
+
+- [ ] **Step 1: Full test suite**
+
+```bash
+npm test
+```
+
+Expected: tests pass in every workspace.
+
+- [ ] **Step 2: Full typecheck**
+
+```bash
+npm run typecheck
+```
+
+- [ ] **Step 3: Full lint**
+
+```bash
+npm run lint
+```
+
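+- [ ] **Step 4: Manual smoke test**
+
+Start the CLI and run:
+
+1. `/context`: check that the new three-tier display looks reasonable
+2. Run a conversation that triggers compaction (for example, fill the prompt to 170K+ on a 200K-window model)
+3. Start with `chatCompression.contextPercentageThreshold = 0.5` set and check that stderr prints the deprecation warning
+4. Resume a huge session with `--continue` and check that compaction is triggered by the first-turn estimation path on the first send
+
+- [ ] **Step 5: Consistent PR descriptions (optional)**
+
+If the work is submitted as several PRs, each PR description should link [docs/design/auto-compaction-threshold-redesign.md](docs/design/auto-compaction-threshold-redesign.md) and name its Phase / Task.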
diff --git a/docs/users/configuration/settings.md b/docs/users/configuration/settings.md
index 48d26e8ef7..1b23b327d5 100644
--- a/docs/users/configuration/settings.md
+++ b/docs/users/configuration/settings.md
@@ -144,7 +144,7 @@ Settings are organized into categories. Most settings should be placed within th
 | `model.name`                                       | string  | The Qwen model to use for conversations.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | `undefined` |
 | `model.maxSessionTurns`                            | number  | Maximum number of user/model/tool turns to keep in a session. -1 means unlimited.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | `-1`        |
 | `model.generationConfig`                           | object  | Advanced overrides passed to the underlying content generator. Supports request controls such as `timeout`, `maxRetries`, `enableCacheControl`, `splitToolMedia` (set `true` for strict OpenAI-compatible servers like LM Studio that reject non-text content on `role: "tool"` messages — splits media into a follow-up user message), `contextWindowSize` (override model's context window size), `modalities` (override auto-detected input modalities), `customHeaders` (custom HTTP headers for API requests), and `extra_body` (additional body parameters for OpenAI-compatible API requests only), along with fine-tuning knobs under `samplingParams` (for example `temperature`, `top_p`, `max_tokens`). Leave unset to rely on provider defaults. | `undefined` |
-| `model.chatCompression.contextPercentageThreshold` | number  | Sets the threshold for chat history compression as a percentage of the model's total token limit. This is a value between 0 and 1 that applies to both automatic compression and the manual `/compress` command. For example, a value of `0.6` will trigger compression when the chat history exceeds 60% of the token limit. Use `0` to disable compression entirely.                                                                                                                                                                                                                                                                                                                                                                                       | `0.7`       |
+| `model.chatCompression.contextPercentageThreshold` | number  | **REMOVED.** Auto-compaction now uses a three-tier threshold ladder (warn / auto / hard) that `computeThresholds()` derives internally from the model's context window; the ladder is not user-configurable. A value set for this field in `settings.json` is ignored: startup is not blocked, and a one-line warning is written to stderr. There is currently no replacement for the old `0` setting that disabled compression entirely; reactive overflow recovery at the API layer remains the safety net if compression itself fails. (See PR #4345 / `docs/design/auto-compaction-threshold-redesign.md` for the redesign rationale.) | `N/A`       |
 | `model.skipNextSpeakerCheck`                       | boolean | Skip the next speaker check.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | `false`     |
 | `model.skipLoopDetection`                          | boolean | Disables loop detection checks. Loop detection prevents infinite loops in AI responses but can generate false positives that interrupt legitimate workflows. Enable this option if you experience frequent false positive loop detection interruptions.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | `false`     |
 | `model.skipStartupContext`                         | boolean | Skips sending the startup workspace context (environment summary and acknowledgement) at the beginning of each session. Enable this if you prefer to provide context manually or want to save tokens on startup.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | `false`     |
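Reviewer note: the removed-setting behavior described in the row above (value ignored, one warning line, startup not blocked) can be sketched as a pure check. This is a minimal illustration, not the PR's `Config` code: `SettingsLike` and `removedThresholdWarning` are hypothetical names, and the settings shape is assumed from the dotted key in the table. Only the message text is taken from the design doc.

```typescript
// Message text quoted from the design doc; everything else here is assumed.
const REMOVED_THRESHOLD_WARNING =
  'chatCompression.contextPercentageThreshold has been removed and is now controlled by built-in thresholds.';

// Hypothetical minimal shape of the relevant part of settings.json.
interface SettingsLike {
  model?: { chatCompression?: { contextPercentageThreshold?: unknown } };
}

// Returns the warning when the removed field is present, otherwise undefined.
// The field's value is never read for anything else: it is ignored.
function removedThresholdWarning(settings: SettingsLike): string | undefined {
  const value = settings.model?.chatCompression?.contextPercentageThreshold;
  return value === undefined ? undefined : REMOVED_THRESHOLD_WARNING;
}

const legacySettings: SettingsLike = {
  model: { chatCompression: { contextPercentageThreshold: 0.6 } },
};
const startupWarning = removedThresholdWarning(legacySettings);
if (startupWarning !== undefined) {
  // console.error writes to stderr in Node; startup continues afterwards.
  console.error(startupWarning);
}
```

Returning the message instead of writing it directly keeps the check easy to unit-test; the caller decides where the line goes.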
diff --git a/packages/cli/src/services/tips/index.ts b/packages/cli/src/services/tips/index.ts
index aac01be57c..e0429bb264 100644
--- a/packages/cli/src/services/tips/index.ts
+++ b/packages/cli/src/services/tips/index.ts
@@ -10,7 +10,6 @@ export { TipHistory } from './tipHistory.js';
 export { selectTip } from './tipScheduler.js';
 export {
   tipRegistry,
-  getContextUsagePercent,
   type ContextualTip,
   type TipContext,
   type TipTrigger,
diff --git a/packages/cli/src/services/tips/tipRegistry.test.ts b/packages/cli/src/services/tips/tipRegistry.test.ts
new file mode 100644
index 0000000000..8573d2335b
--- /dev/null
+++ b/packages/cli/src/services/tips/tipRegistry.test.ts
@@ -0,0 +1,92 @@
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, it, expect } from 'vitest';
+import { tipRegistry, type TipContext } from './tipRegistry.js';
+
+const baseCtx: TipContext = {
+  lastPromptTokenCount: 0,
+  contextWindowSize: 200_000,
+  sessionPromptCount: 10,
+  sessionCount: 1,
+  platform: 'darwin',
+  thresholds: {
+    warn: 147_000,
+    auto: 167_000,
+    hard: 177_000,
+    effectiveWindow: 180_000,
+  },
+};
+
+function tipById(id: string) {
+  return tipRegistry.find((t) => t.id === id)!;
+}
+
+describe('context-* tip thresholds align with computeThresholds', () => {
+  it('compress-intro fires between warn and auto', () => {
+    const t = tipById('compress-intro');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 100_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 150_000 })).toBe(
+      true,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 168_000 })).toBe(
+      false,
+    );
+  });
+
+  it('context-high fires between auto and hard', () => {
+    const t = tipById('context-high');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 150_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 170_000 })).toBe(
+      true,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 178_000 })).toBe(
+      false,
+    );
+  });
+
+  it('context-critical fires at or above hard', () => {
+    const t = tipById('context-critical');
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 170_000 })).toBe(
+      false,
+    );
+    expect(t.isRelevant({ ...baseCtx, lastPromptTokenCount: 178_000 })).toBe(
+      true,
+    );
+  });
+
+  it('falls back gracefully when thresholds undefined (legacy callers)', () => {
+    const ctx = { ...baseCtx, thresholds: undefined };
+    // All three context-* tips return false when thresholds are missing
+    // (the comparison would be unsafe without them).
+    expect(tipById('compress-intro').isRelevant(ctx)).toBe(false);
+    expect(tipById('context-high').isRelevant(ctx)).toBe(false);
+    expect(tipById('context-critical').isRelevant(ctx)).toBe(false);
+  });
+
+  it('compress-intro additionally gates on sessionPromptCount > 5', () => {
+    const t = tipById('compress-intro');
+    // Above warn, below auto, but session is too new.
+    expect(
+      t.isRelevant({
+        ...baseCtx,
+        lastPromptTokenCount: 150_000,
+        sessionPromptCount: 3,
+      }),
+    ).toBe(false);
+    expect(
+      t.isRelevant({
+        ...baseCtx,
+        lastPromptTokenCount: 150_000,
+        sessionPromptCount: 6,
+      }),
+    ).toBe(true);
+  });
+});
diff --git a/packages/cli/src/services/tips/tipRegistry.ts b/packages/cli/src/services/tips/tipRegistry.ts
index cb655783b2..9870f29c09 100644
--- a/packages/cli/src/services/tips/tipRegistry.ts
+++ b/packages/cli/src/services/tips/tipRegistry.ts
@@ -8,7 +8,7 @@
  * Contextual tip registry — defines tips, their conditions, and display rules.
  */
 
-import { DEFAULT_TOKEN_LIMIT } from '@qwen-code/qwen-code-core';
+import { type CompactionThresholds } from '@qwen-code/qwen-code-core';
 
 export type TipTrigger = 'startup' | 'post-response';
 
@@ -18,6 +18,12 @@ export interface TipContext {
   sessionPromptCount: number;
   sessionCount: number;
   platform: string;
+  /**
+   * Three-tier auto-compaction thresholds, computed by callers via
+   * `computeThresholds(contextWindowSize)`. Optional for backward compat;
+   * context-* tip checks return false when missing.
+   */
+  thresholds?: CompactionThresholds;
 }
 
 export interface ContextualTip {
@@ -29,19 +35,16 @@ export interface ContextualTip {
   priority: number;
 }
 
-export function getContextUsagePercent(ctx: TipContext): number {
-  const windowSize = ctx.contextWindowSize || DEFAULT_TOKEN_LIMIT;
-  return (ctx.lastPromptTokenCount / windowSize) * 100;
-}
-
 export const tipRegistry: ContextualTip[] = [
   // --- Post-response contextual tips (priority: higher = more urgent) ---
   {
     id: 'context-critical',
     content:
-      'Context is almost full! Run /compress now or start /new to continue.',
+      'Context near hard limit — auto-compact will force on next send. Consider /clear if you want to start fresh.',
     trigger: 'post-response',
-    isRelevant: (ctx) => getContextUsagePercent(ctx) >= 95,
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.hard,
     cooldownPrompts: 3,
     priority: 100,
   },
@@ -49,10 +52,10 @@ export const tipRegistry: ContextualTip[] = [
     id: 'context-high',
     content: 'Context is getting full. Use /compress to free up space.',
     trigger: 'post-response',
-    isRelevant: (ctx) => {
-      const pct = getContextUsagePercent(ctx);
-      return pct >= 80 && pct < 95;
-    },
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.auto &&
+      ctx.lastPromptTokenCount < ctx.thresholds.hard,
     cooldownPrompts: 5,
     priority: 90,
   },
@@ -60,10 +63,11 @@ export const tipRegistry: ContextualTip[] = [
     id: 'compress-intro',
     content: 'Long conversation? /compress summarizes history to free context.',
     trigger: 'post-response',
-    isRelevant: (ctx) => {
-      const pct = getContextUsagePercent(ctx);
-      return pct >= 50 && pct < 80 && ctx.sessionPromptCount > 5;
-    },
+    isRelevant: (ctx) =>
+      ctx.thresholds !== undefined &&
+      ctx.lastPromptTokenCount >= ctx.thresholds.warn &&
+      ctx.lastPromptTokenCount < ctx.thresholds.auto &&
+      ctx.sessionPromptCount > 5,
     cooldownPrompts: 10,
     priority: 50,
   },
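Reviewer note: the three `context-*` predicates above split the token range into non-overlapping bands. The sketch below restates that partition as one standalone function so the band edges are easy to check. `Thresholds`, `TipInput`, and `contextTipFor` are names invented for this illustration and are not part of the PR; the sample numbers are the 200K-window fixture from `tipRegistry.test.ts`.

```typescript
// Hypothetical, standalone restatement of the three context-* predicates.
// The real code keeps them as separate isRelevant() callbacks in tipRegistry.ts.
interface Thresholds {
  warn: number;
  auto: number;
  hard: number;
  effectiveWindow: number;
}

interface TipInput {
  lastPromptTokenCount: number;
  sessionPromptCount: number;
  thresholds?: Thresholds;
}

type ContextTipId = 'compress-intro' | 'context-high' | 'context-critical';

function contextTipFor(ctx: TipInput): ContextTipId | undefined {
  const t = ctx.thresholds;
  // Without thresholds no comparison is possible, so no context tip fires.
  if (t === undefined) return undefined;
  const used = ctx.lastPromptTokenCount;
  if (used >= t.hard) return 'context-critical';
  if (used >= t.auto) return 'context-high';
  // compress-intro additionally requires more than 5 prompts in the session.
  if (used >= t.warn && ctx.sessionPromptCount > 5) return 'compress-intro';
  return undefined;
}

// Fixture values copied from tipRegistry.test.ts (200K window).
const sampleThresholds: Thresholds = {
  warn: 147_000,
  auto: 167_000,
  hard: 177_000,
  effectiveWindow: 180_000,
};

for (const used of [100_000, 150_000, 170_000, 178_000]) {
  const tip = contextTipFor({
    lastPromptTokenCount: used,
    sessionPromptCount: 10,
    thresholds: sampleThresholds,
  });
  console.log(used, tip ?? 'none');
}
```

Because each band is half-open (`>=` lower edge, `<` upper edge), at most one of the three tips is relevant for any token count.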
diff --git a/packages/cli/src/ui/commands/contextCommand.test.ts b/packages/cli/src/ui/commands/contextCommand.test.ts
index 99d0d74693..a89d1fedd7 100644
--- a/packages/cli/src/ui/commands/contextCommand.test.ts
+++ b/packages/cli/src/ui/commands/contextCommand.test.ts
@@ -6,28 +6,59 @@
 
 import { describe, it, expect, vi, beforeEach } from 'vitest';
 import type { Config } from '@qwen-code/qwen-code-core';
-import { collectContextData } from './contextCommand.js';
+import {
+  collectContextData,
+  formatContextUsageText,
+} from './contextCommand.js';
 
 // uiTelemetryService is consumed inside collectContextData via the
 // re-export from core; mock it here so the function returns deterministic
-// numbers without needing a real session.
+// numbers without needing a real session. The mock fns live inside
+// vi.hoisted so they are available when vi.mock's factory runs (vi.mock
+// is hoisted above module-level const declarations).
+const { mockGetLastPromptTokenCount, mockGetLastCachedContentTokenCount } =
+  vi.hoisted(() => ({
+    mockGetLastPromptTokenCount: vi.fn().mockReturnValue(0),
+    mockGetLastCachedContentTokenCount: vi.fn().mockReturnValue(0),
+  }));
+
 vi.mock('@qwen-code/qwen-code-core', async (importOriginal) => {
   const original =
     await importOriginal<typeof import('@qwen-code/qwen-code-core')>();
   return {
     ...original,
     uiTelemetryService: {
-      getLastPromptTokenCount: vi.fn().mockReturnValue(0),
-      getLastCachedContentTokenCount: vi.fn().mockReturnValue(0),
+      getLastPromptTokenCount: mockGetLastPromptTokenCount,
+      getLastCachedContentTokenCount: mockGetLastCachedContentTokenCount,
     },
   };
 });
 
+function makeMockConfig(contextWindowSize = 32_000): Config {
+  return {
+    getModel: vi.fn().mockReturnValue('test-model'),
+    getContentGeneratorConfig: vi.fn().mockReturnValue({
+      contextWindowSize,
+    }),
+    getToolRegistry: vi.fn().mockReturnValue({
+      getAllTools: vi.fn().mockReturnValue([]),
+      getFunctionDeclarations: vi.fn().mockReturnValue([]),
+    }),
+    getUserMemory: vi.fn().mockReturnValue(''),
+    getSkillManager: vi.fn().mockReturnValue({
+      listSkills: vi.fn().mockResolvedValue([]),
+    }),
+    getChatCompression: vi.fn().mockReturnValue(undefined),
+  } as unknown as Config;
+}
+
 describe('collectContextData (contextCommand)', () => {
   let getFunctionDeclarationsSpy: ReturnType<typeof vi.fn>;
   let mockConfig: Config;
 
   beforeEach(() => {
+    mockGetLastPromptTokenCount.mockReturnValue(0);
+    mockGetLastCachedContentTokenCount.mockReturnValue(0);
     getFunctionDeclarationsSpy = vi.fn().mockReturnValue([]);
     mockConfig = {
       getModel: vi.fn().mockReturnValue('test-model'),
@@ -62,3 +93,76 @@ describe('collectContextData (contextCommand)', () => {
     });
   });
 });
+
+describe('/context shows three-tier thresholds', () => {
+  beforeEach(() => {
+    mockGetLastPromptTokenCount.mockReturnValue(0);
+    mockGetLastCachedContentTokenCount.mockReturnValue(0);
+  });
+
+  it('renders warn/auto/hard and reports the warn tier when usage sits between warn and auto', async () => {
+    // 200K window. computeThresholds(200K) = {
+    //   warn: 147,000, auto: 167,000, hard: 177,000, effectiveWindow: 180,000
+    // }
+    // lastPromptTokenCount = 150K → between warn and auto → tier = warn.
+    mockGetLastPromptTokenCount.mockReturnValue(150_000);
+    const data = await collectContextData(makeMockConfig(200_000), false);
+    const text = formatContextUsageText(data);
+
+    expect(text).toMatch(/Effective window:\s+180,000/);
+    expect(text).toMatch(/Warn threshold:\s+147,000/);
+    expect(text).toMatch(/Auto threshold:\s+167,000/);
+    expect(text).toMatch(/Hard threshold:\s+177,000/);
+    expect(text).toMatch(/Current tier:\s+warn/);
+    expect(data.breakdown.currentTier).toBe('warn');
+    expect(data.breakdown.thresholds).toEqual({
+      effectiveWindow: 180_000,
+      warn: 147_000,
+      auto: 167_000,
+      hard: 177_000,
+    });
+  });
+
+  it('classifies usage below the warn threshold as the safe tier', async () => {
+    mockGetLastPromptTokenCount.mockReturnValue(50_000);
+    const data = await collectContextData(makeMockConfig(200_000), false);
+    const text = formatContextUsageText(data);
+
+    expect(text).toMatch(/Current tier:\s+safe/);
+    expect(data.breakdown.currentTier).toBe('safe');
+  });
+
+  it('classifies usage at or above the hard threshold as the hard tier', async () => {
+    mockGetLastPromptTokenCount.mockReturnValue(180_000);
+    const data = await collectContextData(makeMockConfig(200_000), false);
+    expect(data.breakdown.currentTier).toBe('hard');
+  });
+
+  it('classifies usage between auto and hard as the auto tier', async () => {
+    // 200K window — between 167K (auto) and 177K (hard) → tier = auto.
+    mockGetLastPromptTokenCount.mockReturnValue(170_000);
+    const data = await collectContextData(makeMockConfig(200_000), false);
+    expect(data.breakdown.currentTier).toBe('auto');
+    const text = formatContextUsageText(data);
+    expect(text).toMatch(/Current tier:\s+auto/);
+  });
+
+  it('treats no-API-data sessions as safe and omits the threshold section from text', async () => {
+    // lastPromptTokenCount = 0 → collectContextData uses the estimated branch
+    // (classifies against `rawOverhead`, not apiTotalTokens). With these
+    // default fixtures rawOverhead lands well below `warn`, so currentTier
+    // resolves to `safe`. On heavy system-prompt / skill / MCP loads the
+    // estimated branch can return warn/auto/hard — this test only covers
+    // the default-fixture safe case. formatContextUsageText must NOT emit
+    // the "Compaction thresholds" section because the estimated path
+    // renders a different layout.
+    mockGetLastPromptTokenCount.mockReturnValue(0);
+    const data = await collectContextData(makeMockConfig(200_000), false);
+    expect(data.breakdown.currentTier).toBe('safe');
+    // Thresholds are still computed and exposed on the breakdown for downstream
+    // consumers, even though the text layout suppresses them.
+    expect(data.breakdown.thresholds.auto).toBe(167_000);
+    const text = formatContextUsageText(data);
+    expect(text).not.toMatch(/Compaction thresholds/);
+  });
+});
diff --git a/packages/cli/src/ui/commands/contextCommand.ts b/packages/cli/src/ui/commands/contextCommand.ts
index a58fc59681..7486230f9e 100644
--- a/packages/cli/src/ui/commands/contextCommand.ts
+++ b/packages/cli/src/ui/commands/contextCommand.ts
@@ -13,6 +13,7 @@ import {
   MessageType,
   type HistoryItemContextUsage,
   type ContextCategoryBreakdown,
+  type ContextTier,
   type ContextToolDetail,
   type ContextMemoryDetail,
   type ContextSkillDetail,
@@ -24,14 +25,26 @@ import {
   DEFAULT_TOKEN_LIMIT,
   ToolNames,
   buildSkillLlmContent,
+  computeThresholds,
+  type CompactionThresholds,
 } from '@qwen-code/qwen-code-core';
 import { t } from '../../i18n/index.js';
 
 /**
- * Default compression token threshold (triggers compression at 70% usage).
- * The autocompact buffer is (1 - threshold) * contextWindowSize.
+ * Classify a token count against the three-tier compaction ladder. Mirrors
+ * the gating logic in `chatCompressionService` / `geminiChat` so the
+ * `/context` output's "current tier" label reflects exactly which tier the
+ * runtime would treat the session as sitting in.
  */
-const DEFAULT_COMPRESSION_THRESHOLD = 0.7;
+function currentTier(
+  tokens: number,
+  thresholds: CompactionThresholds,
+): ContextTier {
+  if (tokens >= thresholds.hard) return 'hard';
+  if (tokens >= thresholds.auto) return 'auto';
+  if (tokens >= thresholds.warn) return 'warn';
+  return 'safe';
+}
 
 /**
  * Estimate token count for a string using a character-based heuristic.
@@ -174,13 +187,16 @@ export async function collectContextData(
 
   const skillsTokens = skillToolDefinitionTokens + loadedBodiesTokens;
 
-  const compressionThreshold =
-    config.getChatCompression()?.contextPercentageThreshold ??
-    DEFAULT_COMPRESSION_THRESHOLD;
-  const autocompactBuffer =
-    compressionThreshold > 0
-      ? Math.round((1 - compressionThreshold) * contextWindowSize)
-      : 0;
+  const thresholds = computeThresholds(contextWindowSize);
+  // Keep the `(window - auto)` buffer for the legacy three-segment progress
+  // bar in ContextUsage.tsx — it visualizes the headroom between the auto
+  // threshold and the window edge, which is exactly `contextWindowSize -
+  // thresholds.auto`. New consumers should read `breakdown.thresholds`
+  // directly.
+  const autocompactBuffer = Math.max(
+    0,
+    Math.round(contextWindowSize - thresholds.auto),
+  );
 
   const rawOverhead =
     systemPromptTokens +
@@ -287,6 +303,26 @@ export async function collectContextData(
         : skills;
   }
 
+  // Tier classification: prefer the API-reported total when available.
+  // When no API call has happened yet (first /context, --continue resume,
+  // sub-agent inheritance), classify against `rawOverhead` so a session
+  // dominated by system prompt / skills / MCP tools doesn't silently show
+  // "safe". (R2.2)
+  //
+  // SCOPE GAP (R5.1): `rawOverhead` excludes `messagesTokens` — the actual
+  // chat history. A `--continue` restore with 100K of historical messages
+  // (but small overhead) will still display "safe" here, even though the
+  // cheap-gate inside chatCompressionService will trigger compression on
+  // the very next send (it uses `estimatePromptTokens(history, ...)` which
+  // walks the real history). This is a UI/runtime divergence — for a
+  // single render — that resolves the moment any send happens.
+  //
+  // TODO: plumb the chat history into collectContextData and use
+  // estimatePromptTokens(history, undefined, 0, imageTokenEstimate) here
+  // for same-source-of-truth as the cheap-gate. Defer because Config
+  // doesn't expose the active chat instance today.
+  const tierTokens = isEstimated ? rawOverhead : apiTotalTokens;
+
   const breakdown: ContextCategoryBreakdown = {
     systemPrompt: displaySystemPrompt,
     builtinTools: displayBuiltinTools,
@@ -296,6 +332,8 @@ export async function collectContextData(
     messages: messagesTokens,
     freeSpace,
     autocompactBuffer,
+    thresholds,
+    currentTier: currentTier(tierTokens, thresholds),
   };
 
   return {
@@ -340,6 +378,11 @@ function fmtCategoryRow(
   return `${leftPart}${' '.repeat(dots)}${right}`;
 }
 
+/** Locale-grouped integer (e.g. 147000 -> "147,000"). */
+function formatNum(n: number): string {
+  return Math.round(n).toLocaleString('en-US');
+}
+
 /**
  * Convert a HistoryItemContextUsage to a human-readable text string,
  * mirroring the layout of the interactive ContextUsage component.
@@ -377,13 +420,15 @@ export function formatContextUsageText(data: HistoryItemContextUsage): string {
     lines.push('');
     lines.push(fmtCategoryRow('Used', totalTokens, contextWindowSize));
     lines.push(fmtCategoryRow('Free', breakdown.freeSpace, contextWindowSize));
+    lines.push('');
+    lines.push('**Compaction thresholds**');
     lines.push(
-      fmtCategoryRow(
-        'Autocompact buffer',
-        breakdown.autocompactBuffer,
-        contextWindowSize,
-      ),
+      `  Effective window:   ${formatNum(breakdown.thresholds.effectiveWindow)}  (window − ${formatNum(contextWindowSize - breakdown.thresholds.effectiveWindow)} reserve)`,
     );
+    lines.push(`  Warn threshold:     ${formatNum(breakdown.thresholds.warn)}`);
+    lines.push(`  Auto threshold:     ${formatNum(breakdown.thresholds.auto)}`);
+    lines.push(`  Hard threshold:     ${formatNum(breakdown.thresholds.hard)}`);
+    lines.push(`  Current tier:       ${breakdown.currentTier}`);
     lines.push('');
     lines.push('**Usage by category**');
   }
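Reviewer note: taken together, the `contextCommand.ts` hunks classify a token count against the ladder, derive the legacy buffer as `window - auto`, and print the five threshold lines. The following is a minimal standalone sketch of that flow under stated assumptions: `Ladder`, `classifyTier`, `legacyBuffer`, and `renderThresholdLines` are illustrative names rather than the PR's exports, and the ladder values are the 200K fixture quoted in `contextCommand.test.ts`, not something computed here.

```typescript
type Tier = 'safe' | 'warn' | 'auto' | 'hard';

// Hypothetical local stand-in for the thresholds object used by the PR.
interface Ladder {
  effectiveWindow: number;
  warn: number;
  auto: number;
  hard: number;
}

// Highest threshold reached wins; below warn the session is "safe".
function classifyTier(tokens: number, ladder: Ladder): Tier {
  if (tokens >= ladder.hard) return 'hard';
  if (tokens >= ladder.auto) return 'auto';
  if (tokens >= ladder.warn) return 'warn';
  return 'safe';
}

// Headroom between the auto threshold and the window edge, never negative.
function legacyBuffer(windowSize: number, ladder: Ladder): number {
  return Math.max(0, Math.round(windowSize - ladder.auto));
}

// Integer with en-US digit grouping.
function formatNum(n: number): string {
  return Math.round(n).toLocaleString('en-US');
}

function renderThresholdLines(
  windowSize: number,
  usedTokens: number,
  ladder: Ladder,
): string[] {
  const reserve = windowSize - ladder.effectiveWindow;
  return [
    '**Compaction thresholds**',
    `  Effective window:   ${formatNum(ladder.effectiveWindow)}  (window − ${formatNum(reserve)} reserve)`,
    `  Warn threshold:     ${formatNum(ladder.warn)}`,
    `  Auto threshold:     ${formatNum(ladder.auto)}`,
    `  Hard threshold:     ${formatNum(ladder.hard)}`,
    `  Current tier:       ${classifyTier(usedTokens, ladder)}`,
  ];
}

// Fixture copied from the comment in contextCommand.test.ts (200K window).
const ladder200k: Ladder = {
  effectiveWindow: 180_000,
  warn: 147_000,
  auto: 167_000,
  hard: 177_000,
};

console.log(renderThresholdLines(200_000, 150_000, ladder200k).join('\n'));
```

Note that this text path only runs when API token data exists; per the test file, the estimated path still exposes `breakdown.thresholds` but omits this section from the text output.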
diff --git a/packages/cli/src/ui/components/Tips.test.ts b/packages/cli/src/ui/components/Tips.test.ts
index 9a93d7d2f0..418b6ab901 100644
--- a/packages/cli/src/ui/components/Tips.test.ts
+++ b/packages/cli/src/ui/components/Tips.test.ts
@@ -40,6 +40,14 @@ function createContext(overrides: Partial<TipContext> = {}): TipContext {
     sessionPromptCount: 0,
     sessionCount: 1,
     platform: 'linux',
+    // Matches computeThresholds(1_000_000). Hard-coded here so this test
+    // exercises the registry's tier logic without depending on computeThresholds().
+    thresholds: {
+      warn: 947_000,
+      auto: 967_000,
+      hard: 977_000,
+      effectiveWindow: 980_000,
+    },
     ...overrides,
   };
 }
@@ -59,7 +67,8 @@ describe('selectTip', () => {
 
   it('returns context-high tip when context usage is high', () => {
     const ctx = createContext({
-      lastPromptTokenCount: 850_000,
+      // Between auto (967K) and hard (977K) — context-high band.
+      lastPromptTokenCount: 970_000,
       contextWindowSize: 1_000_000,
       sessionPromptCount: 10,
     });
@@ -71,7 +80,8 @@ describe('selectTip', () => {
 
   it('returns context-critical tip when context usage is critical', () => {
     const ctx = createContext({
-      lastPromptTokenCount: 960_000,
+      // At/above hard (977K) — context-critical band.
+      lastPromptTokenCount: 980_000,
       contextWindowSize: 1_000_000,
       sessionPromptCount: 10,
     });
@@ -83,7 +93,8 @@ describe('selectTip', () => {
 
   it('returns compress-intro tip when context is moderate and session is long', () => {
     const ctx = createContext({
-      lastPromptTokenCount: 550_000,
+      // Between warn (947K) and auto (967K) — compress-intro band.
+      lastPromptTokenCount: 955_000,
       contextWindowSize: 1_000_000,
       sessionPromptCount: 10,
     });
@@ -106,7 +117,7 @@ describe('selectTip', () => {
 
   it('respects cooldown — does not re-show same tip within cooldown period', () => {
     const ctx = createContext({
-      lastPromptTokenCount: 850_000,
+      lastPromptTokenCount: 970_000,
       contextWindowSize: 1_000_000,
       sessionPromptCount: 10,
     });
diff --git a/packages/cli/src/ui/components/views/ContextUsage.test.tsx b/packages/cli/src/ui/components/views/ContextUsage.test.tsx
new file mode 100644
index 0000000000..6a40e17566
--- /dev/null
+++ b/packages/cli/src/ui/components/views/ContextUsage.test.tsx
@@ -0,0 +1,135 @@
+/**
+ * @license
+ * Copyright 2025 Qwen
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, it, expect, afterEach } from 'vitest';
+import { render, cleanup } from 'ink-testing-library';
+import { ContextUsage } from './ContextUsage.js';
+import type {
+  ContextCategoryBreakdown,
+  ContextThresholds,
+  ContextTier,
+} from '../../types.js';
+
+afterEach(() => {
+  cleanup();
+});
+
+const thresholds: ContextThresholds = {
+  effectiveWindow: 108_000,
+  warn: 76_800,
+  auto: 95_000,
+  hard: 105_000,
+};
+
+function makeBreakdown(
+  currentTier: ContextTier,
+  overrides: Partial<ContextCategoryBreakdown> = {},
+): ContextCategoryBreakdown {
+  return {
+    systemPrompt: 5000,
+    builtinTools: 8000,
+    mcpTools: 0,
+    memoryFiles: 200,
+    skills: 1000,
+    messages: 0,
+    freeSpace: 80_000,
+    autocompactBuffer: 33_000,
+    thresholds,
+    currentTier,
+    ...overrides,
+  };
+}
+
+describe('ContextUsage — CompactionThresholds section (review #4168 R1.6)', () => {
+  it('renders the new three-tier section with all four threshold rows', () => {
+    const { lastFrame } = render(
+      <ContextUsage
+        modelName="qwen3-coder"
+        totalTokens={0}
+        contextWindowSize={128_000}
+        breakdown={makeBreakdown('safe')}
+        builtinTools={[]}
+        mcpTools={[]}
+        memoryFiles={[]}
+        skills={[]}
+        isEstimated={true}
+      />,
+    );
+    const frame = lastFrame() ?? '';
+    expect(frame).toContain('Compaction thresholds');
+    expect(frame).toContain('Effective window');
+    expect(frame).toContain('Warn threshold');
+    expect(frame).toContain('Auto threshold');
+    expect(frame).toContain('Hard threshold');
+    expect(frame).toContain('Current tier');
+  });
+
+  it('shows safe tier without any ▶ marker', () => {
+    const { lastFrame } = render(
+      <ContextUsage
+        modelName="qwen3-coder"
+        totalTokens={50_000}
+        contextWindowSize={128_000}
+        breakdown={makeBreakdown('safe')}
+        builtinTools={[]}
+        mcpTools={[]}
+        memoryFiles={[]}
+        skills={[]}
+      />,
+    );
+    const frame = lastFrame() ?? '';
+    // safe tier → no ▶ marker on any threshold row
+    expect(frame).not.toContain('▶');
+    // The literal word "safe" appears as the Current tier value
+    expect(frame).toMatch(/Current tier[\s\S]*safe/);
+  });
+
+  it('places ▶ on the warn row when currentTier === warn', () => {
+    const { lastFrame } = render(
+      <ContextUsage
+        modelName="qwen3-coder"
+        totalTokens={80_000}
+        contextWindowSize={128_000}
+        breakdown={makeBreakdown('warn')}
+        builtinTools={[]}
+        mcpTools={[]}
+        memoryFiles={[]}
+        skills={[]}
+      />,
+    );
+    const frame = lastFrame() ?? '';
+    expect(frame).toContain('▶');
+    // The ▶ should appear on the Warn-threshold line and nowhere else.
+    const lines = frame.split('\n');
+    const warnLine = lines.find((l) => l.includes('Warn threshold')) ?? '';
+    expect(warnLine).toContain('▶');
+    const autoLine = lines.find((l) => l.includes('Auto threshold')) ?? '';
+    expect(autoLine).not.toContain('▶');
+    const hardLine = lines.find((l) => l.includes('Hard threshold')) ?? '';
+    expect(hardLine).not.toContain('▶');
+  });
+
+  it('places ▶ on the hard row when currentTier === hard', () => {
+    const { lastFrame } = render(
+      <ContextUsage
+        modelName="qwen3-coder"
+        totalTokens={106_000}
+        contextWindowSize={128_000}
+        breakdown={makeBreakdown('hard')}
+        builtinTools={[]}
+        mcpTools={[]}
+        memoryFiles={[]}
+        skills={[]}
+      />,
+    );
+    const frame = lastFrame() ?? '';
+    const lines = frame.split('\n');
+    const hardLine = lines.find((l) => l.includes('Hard threshold')) ?? '';
+    expect(hardLine).toContain('▶');
+    // Current tier reads `hard`
+    expect(frame).toMatch(/Current tier[\s\S]*hard/);
+  });
+});
diff --git a/packages/cli/src/ui/components/views/ContextUsage.tsx b/packages/cli/src/ui/components/views/ContextUsage.tsx
index fefe909564..53ee3333a1 100644
--- a/packages/cli/src/ui/components/views/ContextUsage.tsx
+++ b/packages/cli/src/ui/components/views/ContextUsage.tsx
@@ -9,9 +9,11 @@ import { Box, Text } from 'ink';
 import { theme } from '../../semantic-colors.js';
 import type {
   ContextCategoryBreakdown,
-  ContextToolDetail,
   ContextMemoryDetail,
   ContextSkillDetail,
+  ContextThresholds,
+  ContextTier,
+  ContextToolDetail,
 } from '../../types.js';
 import { t } from '../../../i18n/index.js';
 
@@ -140,6 +142,106 @@ const CategoryRow: React.FC<{
   );
 };
 
+/**
+ * A row inside the "Compaction thresholds" section: label + token count, with
+ * a left-edge ▶ marker when this row is the session's current tier.
+ */
+const ThresholdRow: React.FC<{
+  label: string;
+  tokens: number;
+  isCurrent?: boolean;
+  hint?: string;
+}> = ({ label, tokens, isCurrent, hint }) => {
+  const tokenStr = `${formatTokens(tokens)} ${t('tokens')}`;
+  return (
+    <Box width={CONTENT_WIDTH}>
+      <Box width={2}>
+        <Text color={isCurrent ? theme.status.warning : theme.text.secondary}>
+          {isCurrent ? '▶' : ' '}
+        </Text>
+      </Box>
+      <Box width={22}>
+        <Text color={theme.text.primary}>{label}</Text>
+      </Box>
+      <Box flexGrow={1} justifyContent="flex-end">
+        <Text color={theme.text.secondary}>
+          {tokenStr}
+          {hint ? `  ${hint}` : ''}
+        </Text>
+      </Box>
+    </Box>
+  );
+};
+
+/**
+ * Color associated with each compaction tier — green for safe, escalating to
+ * red for hard. Keep these aligned with how `theme.status.*` is used elsewhere
+ * so the tier badge feels native to the existing design.
+ */
+function tierColor(tier: ContextTier): string {
+  switch (tier) {
+    case 'safe':
+      return theme.status.success;
+    case 'warn':
+      return theme.status.warning;
+    case 'auto':
+      return theme.status.warning;
+    case 'hard':
+      return theme.status.error;
+    default:
+      return theme.text.secondary;
+  }
+}
+
+/**
+ * Renders the three-tier compaction threshold ladder (warn / auto / hard) with
+ * the effective window and a current-tier marker. Source of the data is
+ * `breakdown.thresholds` + `breakdown.currentTier`, which the context command
+ * derives from `computeThresholds()` in core.
+ */
+const CompactionThresholds: React.FC<{
+  thresholds: ContextThresholds;
+  currentTier: ContextTier;
+}> = ({ thresholds, currentTier }) => (
+  <Box flexDirection="column" marginTop={1}>
+    <Text bold color={theme.text.primary}>
+      {t('Compaction thresholds')}
+    </Text>
+    <ThresholdRow
+      label={t('Effective window')}
+      tokens={thresholds.effectiveWindow}
+    />
+    <ThresholdRow
+      label={t('Warn threshold')}
+      tokens={thresholds.warn}
+      isCurrent={currentTier === 'warn'}
+    />
+    <ThresholdRow
+      label={t('Auto threshold')}
+      tokens={thresholds.auto}
+      isCurrent={currentTier === 'auto'}
+    />
+    <ThresholdRow
+      label={t('Hard threshold')}
+      tokens={thresholds.hard}
+      isCurrent={currentTier === 'hard'}
+    />
+    <Box width={CONTENT_WIDTH}>
+      <Box width={2}>
+        <Text> </Text>
+      </Box>
+      <Box width={22}>
+        <Text color={theme.text.primary}>{t('Current tier')}</Text>
+      </Box>
+      <Box flexGrow={1} justifyContent="flex-end">
+        <Text bold color={tierColor(currentTier)}>
+          {currentTier}
+        </Text>
+      </Box>
+    </Box>
+  </Box>
+);
+
 /**
  * A detail row for individual items (MCP tools, memory files, skills).
  */
@@ -348,6 +450,15 @@ export const ContextUsage: React.FC<ContextUsageProps> = ({
         />
       )}
 
+      {/* Three-tier compaction thresholds — visible even when isEstimated so
+          the user can see the auto-compact landscape before any API call. */}
+      {breakdown.thresholds && breakdown.currentTier && (
+        <CompactionThresholds
+          thresholds={breakdown.thresholds}
+          currentTier={breakdown.currentTier}
+        />
+      )}
+
       {showDetails ? (
         <>
           {/* Built-in tools detail */}
diff --git a/packages/cli/src/ui/hooks/useContextualTips.ts b/packages/cli/src/ui/hooks/useContextualTips.ts
index ecdd706ea2..743d6f4945 100644
--- a/packages/cli/src/ui/hooks/useContextualTips.ts
+++ b/packages/cli/src/ui/hooks/useContextualTips.ts
@@ -10,7 +10,11 @@
  */
 
 import { useEffect, useRef } from 'react';
-import { type Config, DEFAULT_TOKEN_LIMIT } from '@qwen-code/qwen-code-core';
+import {
+  type Config,
+  DEFAULT_TOKEN_LIMIT,
+  computeThresholds,
+} from '@qwen-code/qwen-code-core';
 import {
   StreamingState,
   MessageType,
@@ -81,6 +85,7 @@ export function useContextualTips({
       sessionPromptCount,
       sessionCount: tipHistory.sessionCount,
       platform: process.platform,
+      thresholds: computeThresholds(contextWindowSize),
     };
 
     const tip = selectTip('post-response', tipContext, tipRegistry, tipHistory);
diff --git a/packages/cli/src/ui/types.ts b/packages/cli/src/ui/types.ts
index a39771cb2e..d6433524f8 100644
--- a/packages/cli/src/ui/types.ts
+++ b/packages/cli/src/ui/types.ts
@@ -5,6 +5,7 @@
  */
 
 import type {
+  CompactionThresholds,
   CompressionStatus,
   MCPServerConfig,
   ThoughtSummary,
@@ -342,6 +343,17 @@ export type HistoryItemMcpStatus = HistoryItemBase & {
 
 // --- Context Usage types ---
 
+export type ContextTier = 'safe' | 'warn' | 'auto' | 'hard';
+
+/**
+ * Alias for the core compaction-thresholds shape, re-exported under a
+ * CLI-friendly name so consumers in this package do not need to import
+ * from the core module path. Structurally identical to
+ * `CompactionThresholds`; the alias preserves the core type's `readonly`
+ * modifiers, which have no effect on UI rendering.
+ */
+export type ContextThresholds = CompactionThresholds;
+
 export interface ContextCategoryBreakdown {
   systemPrompt: number;
   builtinTools: number;
@@ -350,7 +362,20 @@ export interface ContextCategoryBreakdown {
   skills: number;
   messages: number;
   freeSpace: number;
+  /**
+   * Distance from the auto-compaction threshold to the window edge.
+   * Derived from `thresholds.auto` (= `contextWindowSize - auto`); retained
+   * so the legacy three-segment progress bar in `ContextUsage.tsx` keeps
+   * working without a separate code path.
+   */
   autocompactBuffer: number;
+  /** Three-tier ladder used by auto-compaction (warn / auto / hard) plus the effective window. */
+  thresholds: ContextThresholds;
+  /**
+   * Which tier the current usage sits in. `safe` is below `warn`; `warn` /
+   * `auto` / `hard` mean `totalTokens` has crossed the corresponding tier.
+   */
+  currentTier: ContextTier;
 }
 
 export interface ContextToolDetail {
diff --git a/packages/core/src/config/config.test.ts b/packages/core/src/config/config.test.ts
index a9badd5878..06cf11a4d7 100644
--- a/packages/core/src/config/config.test.ts
+++ b/packages/core/src/config/config.test.ts
@@ -6,7 +6,11 @@
 
 import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
 import type { Mock } from 'vitest';
-import type { ConfigParameters, SandboxConfig } from './config.js';
+import type {
+  ChatCompressionSettings,
+  ConfigParameters,
+  SandboxConfig,
+} from './config.js';
 import {
   Config,
   ApprovalMode,
@@ -3333,4 +3337,55 @@ describe('Model Switching and Config Updates', () => {
       );
     });
   });
+
+  describe('chatCompression.contextPercentageThreshold deprecation', () => {
+    // The proportional-threshold knob `contextPercentageThreshold` was
+    // removed in the auto-compaction threshold redesign (Task 8) — the
+    // value is now derived from `computeThresholds(...)` in the
+    // ChatCompressionService and is no longer user-tunable. Existing
+    // settings.json files that still set the field should keep working
+    // but get a stderr warning at startup so users know to remove it.
+    let warnSpy: ReturnType<typeof vi.spyOn>;
+
+    beforeEach(() => {
+      warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});
+    });
+
+    afterEach(() => {
+      warnSpy.mockRestore();
+    });
+
+    it('logs a stderr warning when the deprecated field is set', () => {
+      new Config({
+        ...baseParams,
+        chatCompression: {
+          contextPercentageThreshold: 0.5,
+        } as ChatCompressionSettings,
+      });
+      expect(warnSpy).toHaveBeenCalledWith(
+        expect.stringContaining(
+          'chatCompression.contextPercentageThreshold has been removed',
+        ),
+      );
+    });
+
+    it('does not warn when chatCompression is absent', () => {
+      new Config({ ...baseParams });
+      const warnCalls = warnSpy.mock.calls.map((c) => String(c[0]));
+      expect(
+        warnCalls.some((m) => m.includes('contextPercentageThreshold')),
+      ).toBe(false);
+    });
+
+    it('does not warn when chatCompression is set without the deprecated field', () => {
+      new Config({
+        ...baseParams,
+        chatCompression: { imageTokenEstimate: 1600 },
+      });
+      const warnCalls = warnSpy.mock.calls.map((c) => String(c[0]));
+      expect(
+        warnCalls.some((m) => m.includes('contextPercentageThreshold')),
+      ).toBe(false);
+    });
+  });
 });
diff --git a/packages/core/src/config/config.ts b/packages/core/src/config/config.ts
index 03ec0cd0e0..b73b8f8917 100644
--- a/packages/core/src/config/config.ts
+++ b/packages/core/src/config/config.ts
@@ -268,7 +268,6 @@ export interface BugCommandSettings {
 }
 
 export interface ChatCompressionSettings {
-  contextPercentageThreshold?: number;
   /**
    * Estimated tokens for a single inline image / document part when
    * apportioning chars across history in `findCompressSplitPoint`.
@@ -1038,6 +1037,24 @@ export class Config {
     this.loadMemoryFromIncludeDirectories =
       params.loadMemoryFromIncludeDirectories ?? false;
     this.importFormat = params.importFormat ?? 'tree';
+    // Auto-compaction threshold moved to built-in constants (computeThresholds
+    // in chatCompressionService.ts). The old `contextPercentageThreshold`
+    // field has been removed; if present in user settings, emit a warning
+    // when the Config is constructed and ignore the value.
+    if (
+      params.chatCompression &&
+      typeof (params.chatCompression as Record<string, unknown>)[
+        'contextPercentageThreshold'
+      ] !== 'undefined'
+    ) {
+      // eslint-disable-next-line no-console
+      console.warn(
+        '[qwen-code] chatCompression.contextPercentageThreshold has been removed ' +
+          'and is now controlled by built-in thresholds. Setting will be ignored. ' +
+          'Remove this key from your settings.json to silence this warning; ' +
+          'see docs/users/configuration/settings.md for current compaction behavior.',
+      );
+    }
     this.chatCompression = params.chatCompression;
     this.interactive = params.interactive ?? false;
     this.trustedFolder = params.trustedFolder;
diff --git a/packages/core/src/core/client.test.ts b/packages/core/src/core/client.test.ts
index 95b2e7c254..0d3a378a4b 100644
--- a/packages/core/src/core/client.test.ts
+++ b/packages/core/src/core/client.test.ts
@@ -1909,7 +1909,7 @@ describe('Gemini Client (client.ts)', () => {
   // tryCompressChat is now a thin wrapper around GeminiChat.tryCompress.
   // The compression logic itself is exercised in chatCompressionService.test.ts
   // (token math, threshold checks, hook firing) and geminiChat.test.ts (history
-  // mutation, recording, hasFailedCompressionAttempt). The tests below cover
+  // mutation, recording, consecutiveFailures circuit breaker). The tests below cover
   // only what the wrapper itself adds: argument forwarding and the IDE-context
   // flag flip.
   describe('tryCompressChat (delegation)', () => {
diff --git a/packages/core/src/core/client.ts b/packages/core/src/core/client.ts
index 14cc4421f3..07302a8569 100644
--- a/packages/core/src/core/client.ts
+++ b/packages/core/src/core/client.ts
@@ -47,10 +47,7 @@ import {
 } from './turn.js';
 
 // Services
-import {
-  COMPRESSION_PRESERVE_THRESHOLD,
-  COMPRESSION_TOKEN_THRESHOLD,
-} from '../services/chatCompressionService.js';
+import { COMPRESSION_PRESERVE_THRESHOLD } from '../services/chatCompressionService.js';
 import { LoopDetectionService } from '../services/loopDetectionService.js';
 import { CommitAttributionService } from '../services/commitAttribution.js';
 
@@ -2095,5 +2092,4 @@ export class GeminiClient {
 
 export const TEST_ONLY = {
   COMPRESSION_PRESERVE_THRESHOLD,
-  COMPRESSION_TOKEN_THRESHOLD,
 };
diff --git a/packages/core/src/core/geminiChat.test.ts b/packages/core/src/core/geminiChat.test.ts
index 5f3caa976b..c77880a517 100644
--- a/packages/core/src/core/geminiChat.test.ts
+++ b/packages/core/src/core/geminiChat.test.ts
@@ -24,7 +24,10 @@ import type { Config } from '../config/config.js';
 import { setSimulate429 } from '../utils/testUtils.js';
 import { uiTelemetryService } from '../telemetry/uiTelemetry.js';
 import { CompressionStatus, type ChatCompressionInfo } from './turn.js';
-import { ChatCompressionService } from '../services/chatCompressionService.js';
+import {
+  ChatCompressionService,
+  MAX_CONSECUTIVE_FAILURES,
+} from '../services/chatCompressionService.js';
 import { SessionStartSource } from '../hooks/types.js';
 
 // Mock fs module to prevent actual file system operations during tests
@@ -75,6 +78,10 @@ const { mockLogContentRetry, mockLogContentRetryFailure } = vi.hoisted(() => ({
 vi.mock('../telemetry/loggers.js', () => ({
   logContentRetry: mockLogContentRetry,
   logContentRetryFailure: mockLogContentRetryFailure,
+  // Real ChatCompressionService.compress() calls logChatCompression on
+  // every attempt; the R3.4 integration test exercises that path, so the
+  // mock has to expose it (no-op).
+  logChatCompression: vi.fn(),
 }));
 
 vi.mock('../telemetry/uiTelemetry.js', () => ({
@@ -123,6 +130,7 @@ describe('GeminiChat', async () => {
         getTool: vi.fn(),
       }),
       getContentGenerator: vi.fn().mockReturnValue(mockContentGenerator),
+      getBaseLlmClient: vi.fn().mockReturnValue(undefined),
       getChatCompression: vi.fn().mockReturnValue(undefined),
       getHookSystem: vi.fn().mockReturnValue(undefined),
       getDebugLogger: vi
@@ -1262,6 +1270,10 @@ describe('GeminiChat', async () => {
             compressionStatus: CompressionStatus.NOOP,
           },
         });
+      // The hard-tier rescue calls getHistoryShallow(true) (when
+      // lastPromptTokenCount=0) for its estimator; the post-compression
+      // history-load is getRequestHistory(). The "after compression" failure
+      // scenario this test targets is the latter — mock that call to throw.
       vi.spyOn(
         chat as unknown as { getRequestHistory: () => Content[] },
         'getRequestHistory',
@@ -1378,13 +1390,127 @@ describe('GeminiChat', async () => {
       ).toBe(200);
     });
 
-    it('clears hasFailedCompressionAttempt after a forced successful compression', async () => {
+    it('forwards the pending user message to the compression cheap-gate', async () => {
+      // The cheap-gate inside ChatCompressionService.compress uses
+      // estimatePromptTokens(history, pendingUserMessage, lastPromptTokenCount)
+      // so the very first send after inherited history (where
+      // lastPromptTokenCount === 0) can still trigger compaction. This test
+      // pins the wiring: sendMessageStream MUST pass the user message it just
+      // built through to tryCompress -> service.compress.
+      expect(chat.getLastPromptTokenCount()).toBe(0);
+
+      const compressedHistory: Content[] = [
+        { role: 'user', parts: [{ text: 'summary' }] },
+        { role: 'model', parts: [{ text: 'ack' }] },
+      ];
+      const compressSpy = vi
+        .spyOn(ChatCompressionService.prototype, 'compress')
+        .mockResolvedValueOnce({
+          newHistory: compressedHistory,
+          info: {
+            originalTokenCount: 150_000,
+            newTokenCount: 40_000,
+            compressionStatus: CompressionStatus.COMPRESSED,
+          },
+        });
+      vi.mocked(mockContentGenerator.generateContentStream).mockResolvedValue(
+        makeStreamResponse('answer'),
+      );
+
+      const userMessageText = 'next user prompt';
+      const stream = await chat.sendMessageStream(
+        'test-model',
+        { message: userMessageText },
+        'prompt-id-first-turn',
+      );
+      // The first event in the stream should be COMPRESSED because the
+      // cheap-gate, fed the pending user message, can now size the prompt.
+      const first = await stream.next();
+      expect(first.done).toBe(false);
+      expect(first.value?.type).toBe(StreamEventType.COMPRESSED);
+
+      // Drain the rest so the send-lock releases cleanly.
+      for await (const _ of stream) {
+        /* consume */
+      }
+
+      expect(compressSpy).toHaveBeenCalledTimes(1);
+      const passedOpts = compressSpy.mock.calls[0][1];
+      expect(passedOpts.pendingUserMessage).toBeDefined();
+      expect(passedOpts.pendingUserMessage?.role).toBe('user');
+      expect(
+        passedOpts.pendingUserMessage?.parts?.some(
+          (part) => part.text === userMessageText,
+        ),
+      ).toBe(true);
+    });
+
+    it('triggers compaction end-to-end through the real ChatCompressionService when lastPromptTokenCount === 0 and inherited history is large (R3.4)', async () => {
+      // Reviewer R3.4: the "forwards the pending user message" test above
+      // mocks the service entirely, so the real cheap-gate (the actual
+      // estimatePromptTokens fallback branch when lastPromptTokenCount===0)
+      // never runs. Exercise the full chain here:
+      //   sendMessageStream → tryCompress → service.compress (REAL) →
+      //   cheap-gate (real estimate via getHistory + userMessage) →
+      //   splitter (real) → runSideQuery (mocked at baseLlmClient) →
+      //   persistence.
+      const largeChars = 'x'.repeat(400_000); // ~100K estimated tokens
+      const inheritedHistory: Content[] = [
+        { role: 'user', parts: [{ text: largeChars }] },
+        { role: 'model', parts: [{ text: 'ack' }] },
+        { role: 'user', parts: [{ text: 'follow up' }] },
+        { role: 'model', parts: [{ text: 'response' }] },
+      ];
+      chat.setHistory(inheritedHistory);
+      expect(chat.getLastPromptTokenCount()).toBe(0);
+
+      // DEFAULT_TOKEN_LIMIT = 128K → auto ≈ 95K. The 100K estimate crosses
+      // it, so the cheap-gate must let compaction proceed.
+      const generateText = vi.fn().mockResolvedValue({
+        text: '<state_snapshot>compressed</state_snapshot>',
+        usage: {
+          promptTokenCount: 99_000,
+          candidatesTokenCount: 1500,
+          totalTokenCount: 100_500,
+        },
+      });
+      vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({
+        generateText,
+      } as unknown as ReturnType<typeof mockConfig.getBaseLlmClient>);
+      vi.mocked(mockContentGenerator.generateContentStream).mockResolvedValue(
+        makeStreamResponse('done'),
+      );
+
+      const stream = await chat.sendMessageStream(
+        'test-model',
+        { message: 'follow-up after restore' },
+        'prompt-r3-4',
+      );
+      const events: StreamEvent[] = [];
+      for await (const event of stream) {
+        events.push(event);
+      }
+
+      const compressed = events.find(
+        (e) => e.type === StreamEventType.COMPRESSED,
+      );
+      expect(compressed).toBeDefined();
+      expect(
+        (compressed as { type: StreamEventType; info: ChatCompressionInfo })
+          .info.compressionStatus,
+      ).toBe(CompressionStatus.COMPRESSED);
+      // Real runSideQuery was hit (proves the cheap-gate didn't short-circuit
+      // and the splitter produced a non-empty historyToCompress).
+      expect(generateText).toHaveBeenCalled();
+    });
+
+    it('clears consecutiveFailures after a forced successful compression', async () => {
       const compressSpy = vi.spyOn(
         ChatCompressionService.prototype,
         'compress',
       );
 
-      // Step 1: auto-compression fails — latch is set on the chat.
+      // Step 1: auto-compression fails — counter increments on the chat.
       compressSpy.mockResolvedValueOnce({
         newHistory: null,
         info: {
@@ -1405,14 +1531,12 @@ describe('GeminiChat', async () => {
       for await (const _ of stream1) {
         /* consume */
       }
-      // Latch passed to service was false on this attempt; service marks it
-      // failed and tryCompress flips the chat's flag to true.
-      expect(compressSpy.mock.calls[0][1].hasFailedCompressionAttempt).toBe(
-        false,
-      );
+      // Counter passed to service was 0 on this attempt; the failure branch
+      // in tryCompress then increments it to 1.
+      expect(compressSpy.mock.calls[0][1].consecutiveFailures).toBe(0);
 
-      // Step 2: a forced /compress succeeds. After this, the latch must
-      // be cleared so future auto-compressions are not suppressed.
+      // Step 2: a forced /compress succeeds. After this, the counter must
+      // be reset so future auto-compressions are not suppressed.
       compressSpy.mockResolvedValueOnce({
         newHistory: [
           { role: 'user', parts: [{ text: 'summary' }] },
@@ -1425,13 +1549,12 @@ describe('GeminiChat', async () => {
         },
       });
       await chat.tryCompress('prompt-latch-force', 'test-model', true);
-      // tryCompress was called with force=true, so the service got latch=true
-      // (the gate is `hasFailedCompressionAttempt && !force`, force overrides).
-      expect(compressSpy.mock.calls[1][1].hasFailedCompressionAttempt).toBe(
-        true,
-      );
+      // tryCompress was called with force=true, so the service got
+      // consecutiveFailures=1 (carried from step 1's increment); force
+      // bypasses the breaker, but the counter was still forwarded as-is.
+      expect(compressSpy.mock.calls[1][1].consecutiveFailures).toBe(1);
 
-      // Step 3: next auto-compression sees the cleared latch.
+      // Step 3: next auto-compression sees the reset counter.
       compressSpy.mockResolvedValueOnce({
         newHistory: null,
         info: {
@@ -1451,9 +1574,7 @@ describe('GeminiChat', async () => {
       for await (const _ of stream2) {
         /* consume */
       }
-      expect(compressSpy.mock.calls[2][1].hasFailedCompressionAttempt).toBe(
-        false,
-      );
+      expect(compressSpy.mock.calls[2][1].consecutiveFailures).toBe(0);
     });
 
     it('reactively compresses and retries once after a context overflow error', async () => {
@@ -1829,9 +1950,12 @@ describe('GeminiChat', async () => {
       }
 
       expect(compressSpy).toHaveBeenCalledTimes(3);
-      expect(compressSpy.mock.calls[2][1].hasFailedCompressionAttempt).toBe(
-        true,
-      );
+      // Reactive compression is force=true, so tryCompress's own failure
+      // branch doesn't increment the counter (force=true skips it). The
+      // reactive overflow handler bumps the counter by 1 so a transient
+      // network error doesn't permanently latch the breaker; only
+      // MAX_CONSECUTIVE_FAILURES repeated reactive failures will. (R1.2)
+      expect(compressSpy.mock.calls[2][1].consecutiveFailures).toBe(1);
     });
 
     it('releases the send-lock when reactive compression throws', async () => {
@@ -1896,6 +2020,238 @@ describe('GeminiChat', async () => {
     });
   });
 
+  // Task 9 (P3): the hard-tier rescue pulls reactive overflow recovery
+  // forward to BEFORE the API call. When the estimated prompt size already
+  // crosses `computeThresholds(window).hard`, sendMessageStream must:
+  //   1) forward consecutiveFailures as-is (no pre-call reset; a COMPRESSED
+  //      result resets it afterwards, which lets a latched breaker recover)
+  //   2) call tryCompress with force=true so the breaker cannot gate it.
+  describe('sendMessageStream hard-tier rescue', () => {
+    function makeStreamResponse(text = 'ok') {
+      return (async function* () {
+        yield {
+          candidates: [
+            {
+              content: { parts: [{ text }], role: 'model' },
+              finishReason: 'STOP',
+              index: 0,
+              safetyRatings: [],
+            },
+          ],
+          text: () => text,
+        } as unknown as GenerateContentResponse;
+      })();
+    }
+
+    /**
+     * Default 200K window in our mocks; computeThresholds:
+     *   effectiveWindow = 200K - 20K (SUMMARY_RESERVE) = 180K
+     *   hard            = max(180K - 3K, auto) = 177K
+     * So lastPromptTokenCount=176_999 plus a small user message reaches 177K.
+     */
+    beforeEach(() => {
+      vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({
+        authType: AuthType.USE_GEMINI,
+        model: 'test-model',
+        contextWindowSize: 200_000,
+      });
+    });
+
+    it('forces compaction with force=true when estimated tokens cross hard threshold', async () => {
+      const compressedHistory: Content[] = [
+        { role: 'user', parts: [{ text: 'summary' }] },
+        { role: 'model', parts: [{ text: 'ack' }] },
+      ];
+      const compressSpy = vi
+        .spyOn(ChatCompressionService.prototype, 'compress')
+        .mockResolvedValueOnce({
+          newHistory: compressedHistory,
+          info: {
+            originalTokenCount: 176_000,
+            newTokenCount: 40_000,
+            compressionStatus: CompressionStatus.COMPRESSED,
+          },
+        });
+      vi.mocked(mockContentGenerator.generateContentStream).mockResolvedValue(
+        makeStreamResponse('after rescue'),
+      );
+
+      // Seed lastPromptTokenCount JUST under the 177K hard threshold; the
+      // pending user message adds a handful of estimate-tokens that pushes
+      // effective >= 177K, so the rescue must trigger.
+      chat.setLastPromptTokenCount(176_999);
+
+      const userMessage = 'this is the next user message';
+      const stream = await chat.sendMessageStream(
+        'test-model',
+        { message: userMessage },
+        'prompt-id-hard-rescue-forces',
+      );
+      for await (const _ of stream) {
+        /* consume */
+      }
+
+      expect(compressSpy).toHaveBeenCalledTimes(1);
+      const passedOpts = compressSpy.mock.calls[0][1];
+      expect(passedOpts.force).toBe(true);
+      // trigger='auto' is the orphan-strip safety wire: without it the
+      // service would see force=true, default compactTrigger to 'manual',
+      // and strip the trailing model+functionCall mid tool-loop. Asserting
+      // the wiring here guards C1 from silent regression.
+      expect(passedOpts.trigger).toBe('auto');
+      expect(passedOpts.pendingUserMessage).toBeDefined();
+      expect(passedOpts.pendingUserMessage?.role).toBe('user');
+      expect(
+        passedOpts.pendingUserMessage?.parts?.some(
+          (part) => part.text === userMessage,
+        ),
+      ).toBe(true);
+    });
+
+    it('forwards latched consecutiveFailures into hard-rescue (no pre-call reset); success recovers via the post-call branch', async () => {
+      // Hard-rescue uses force=true, which already bypasses the
+      // chatCompressionService breaker (the `!force` check in compress's
+      // cheap-gate) regardless of the counter value — so a pre-call reset
+      // is unnecessary for "let the latched breaker recover".
+      //
+      // Pre-resetting would in fact DEFEAT the breaker on
+      // persistent-failure sessions: hard-rescue failures don't increment
+      // via tryCompress (force=true skips the `if (!force)` increment in
+      // the failure branch), and only the reactive overflow handler
+      // explicitly increments. If hard-rescue zeroed the counter on every
+      // send, the reactive-overflow increment would be wiped next send
+      // and the counter would oscillate 0↔1 indefinitely.
+      //
+      // Correct behavior asserted here: hard-rescue forwards the existing
+      // counter value as-is; on COMPRESSED success the post-call branch
+      // in tryCompress's COMPRESSED handler resets to 0 (recovering a
+      // latched session).
+      const compressSpy = vi.spyOn(
+        ChatCompressionService.prototype,
+        'compress',
+      );
+
+      // Step 1: latch the breaker via MAX_CONSECUTIVE_FAILURES below-hard
+      // failures (cheap-gate path, force=false).
+      compressSpy.mockResolvedValue({
+        newHistory: null,
+        info: {
+          originalTokenCount: 100_000,
+          newTokenCount: 100_000,
+          compressionStatus:
+            CompressionStatus.COMPRESSION_FAILED_INFLATED_TOKEN_COUNT,
+        },
+      });
+      vi.mocked(mockContentGenerator.generateContentStream).mockImplementation(
+        async () => makeStreamResponse(),
+      );
+      chat.setLastPromptTokenCount(50_000);
+      for (let i = 0; i < MAX_CONSECUTIVE_FAILURES; i++) {
+        const s = await chat.sendMessageStream(
+          'test-model',
+          { message: `latch-${i}` },
+          `prompt-latch-${i}`,
+        );
+        for await (const _ of s) {
+          /* consume */
+        }
+        expect(compressSpy.mock.calls[i][1].force).toBe(false);
+      }
+      // Each call sees the counter before its own failure is recorded (call
+      // i sees i); the chat's counter is now MAX_CONSECUTIVE_FAILURES.
+      expect(compressSpy.mock.calls.at(-1)![1].consecutiveFailures).toBe(
+        MAX_CONSECUTIVE_FAILURES - 1,
+      );
+
+      // Step 2: bump lastPromptTokenCount into hard tier and send again.
+      // Hard-rescue fires (force=true) and the COMPRESSED result triggers
+      // the post-call reset in tryCompress's COMPRESSED handler.
+      compressSpy.mockClear();
+      compressSpy.mockResolvedValueOnce({
+        newHistory: [
+          { role: 'user', parts: [{ text: 'summary' }] },
+          { role: 'model', parts: [{ text: 'ack' }] },
+        ],
+        info: {
+          originalTokenCount: 178_000,
+          newTokenCount: 40_000,
+          compressionStatus: CompressionStatus.COMPRESSED,
+        },
+      });
+      chat.setLastPromptTokenCount(176_999);
+      const rescueStream = await chat.sendMessageStream(
+        'test-model',
+        { message: 'rescue me' },
+        'prompt-hard-rescue-no-prereset',
+      );
+      for await (const _ of rescueStream) {
+        /* consume */
+      }
+
+      expect(compressSpy).toHaveBeenCalledTimes(1);
+      expect(compressSpy.mock.calls[0][1].force).toBe(true);
+      // Counter forwarded as-is — the LATCHED value, NOT zero.
+      expect(compressSpy.mock.calls[0][1].consecutiveFailures).toBe(
+        MAX_CONSECUTIVE_FAILURES,
+      );
+
+      // Step 3: verify the post-call reset took effect on the chat. A
+      // follow-up below-hard send (cheap-gate path, force=false) should
+      // forward consecutiveFailures=0, proving the post-call reset in
+      // tryCompress's COMPRESSED handler ran on the Step 2 result.
+      compressSpy.mockClear();
+      compressSpy.mockResolvedValueOnce({
+        newHistory: null,
+        info: {
+          originalTokenCount: 40_000,
+          newTokenCount: 40_000,
+          compressionStatus: CompressionStatus.NOOP,
+        },
+      });
+      chat.setLastPromptTokenCount(50_000);
+      const followUpStream = await chat.sendMessageStream(
+        'test-model',
+        { message: 'after recovery' },
+        'prompt-hard-rescue-after-recovery',
+      );
+      for await (const _ of followUpStream) {
+        /* consume */
+      }
+      expect(compressSpy.mock.calls[0][1].consecutiveFailures).toBe(0);
+      expect(compressSpy.mock.calls[0][1].force).toBe(false);
+    });
+
+    it('does not force when tokens are below hard threshold (normal auto path)', async () => {
+      const compressSpy = vi
+        .spyOn(ChatCompressionService.prototype, 'compress')
+        .mockResolvedValueOnce({
+          newHistory: null,
+          info: {
+            originalTokenCount: 0,
+            newTokenCount: 0,
+            compressionStatus: CompressionStatus.NOOP,
+          },
+        });
+      vi.mocked(mockContentGenerator.generateContentStream).mockResolvedValue(
+        makeStreamResponse(),
+      );
+
+      // Well below 177K hard threshold — normal auto path.
+      chat.setLastPromptTokenCount(50_000);
+      const stream = await chat.sendMessageStream(
+        'test-model',
+        { message: 'small message' },
+        'prompt-id-hard-rescue-below',
+      );
+      for await (const _ of stream) {
+        /* consume */
+      }
+
+      expect(compressSpy).toHaveBeenCalledTimes(1);
+      expect(compressSpy.mock.calls[0][1].force).toBe(false);
+    });
+  });
+
   describe('addHistory', () => {
     it('should add a new content item to the history', () => {
       const newContent: Content = {
@@ -3699,9 +4055,9 @@ describe('GeminiChat', async () => {
   });
 
   // Compression logic is tested in chatCompressionService.test.ts; this
-  // suite covers per-chat state on GeminiChat: hasFailedCompressionAttempt
-  // stickiness, token-count mutation, history replacement, and conditional
-  // telemetry mirroring.
+  // suite covers per-chat state on GeminiChat: consecutiveFailures
+  // circuit breaker, token-count mutation, history replacement, and
+  // conditional telemetry mirroring.
   describe('tryCompress (per-chat state)', () => {
     const userMsg = (text: string) => ({
       role: 'user' as const,
@@ -3789,7 +4145,7 @@ describe('GeminiChat', async () => {
       expect(uiTelemetryService.setLastPromptTokenCount).not.toHaveBeenCalled();
     });
 
-    it('marks hasFailedCompressionAttempt and suppresses subsequent unforced auto-compactions', async () => {
+    it('increments consecutiveFailures and forwards it to subsequent unforced auto-compactions', async () => {
       const compressSpy = mockCompressionService('failed-inflated');
 
       const first = await chat.tryCompress('p1', 'm1');
@@ -3799,9 +4155,10 @@ describe('GeminiChat', async () => {
       expect(compressSpy).toHaveBeenCalledTimes(1);
 
       // The next unforced call should reach the service with
-      // hasFailedCompressionAttempt=true; the service's threshold check then
-      // returns NOOP. The important thing here is that GeminiChat actually
-      // forwards the sticky flag.
+      // consecutiveFailures=1 (incremented after the first failure). The
+      // important thing here is that GeminiChat actually forwards the
+      // updated counter — the service's own threshold logic is tested
+      // separately in chatCompressionService.test.ts.
       compressSpy.mockClear();
       compressSpy.mockResolvedValue({
         newHistory: null,
@@ -3813,9 +4170,7 @@ describe('GeminiChat', async () => {
       });
       await chat.tryCompress('p2', 'm1');
       expect(compressSpy).toHaveBeenCalledTimes(1);
-      expect(compressSpy.mock.calls[0][1].hasFailedCompressionAttempt).toBe(
-        true,
-      );
+      expect(compressSpy.mock.calls[0][1].consecutiveFailures).toBe(1);
     });
 
     it('forwards force=true to the compression service', async () => {
@@ -3825,4 +4180,146 @@ describe('GeminiChat', async () => {
       expect(compressSpy.mock.calls[0][1].force).toBe(true);
     });
   });
+
+  // The circuit breaker is the three-strike replacement for the old
+  // single-shot hasFailedCompressionAttempt lock. After
+  // MAX_CONSECUTIVE_FAILURES failures the chat stops trying to auto-compact
+  // until a successful force compress (or any successful compress) resets
+  // the counter.
+  describe('compression failure circuit breaker', () => {
+    const userMsg = (text: string) => ({
+      role: 'user' as const,
+      parts: [{ text }],
+    });
+    const modelMsg = (text: string) => ({
+      role: 'model' as const,
+      parts: [{ text }],
+    });
+
+    it('tolerates MAX_CONSECUTIVE_FAILURES - 1 failures and increments the counter each time', async () => {
+      // Mock the service to "fail" every call (the chat's counter increments
+      // each time). After (MAX - 1) failures, the next tryCompress should
+      // still call the service. The actual NOOP-at-threshold gating is the
+      // service's job (and verified separately) — here we just observe that
+      // GeminiChat keeps forwarding the incremented counter.
+      const compressSpy = vi.spyOn(
+        ChatCompressionService.prototype,
+        'compress',
+      );
+      compressSpy.mockResolvedValue({
+        newHistory: null,
+        info: {
+          originalTokenCount: 100_000,
+          newTokenCount: 100_000,
+          compressionStatus:
+            CompressionStatus.COMPRESSION_FAILED_INFLATED_TOKEN_COUNT,
+        },
+      });
+      chat.setHistory([userMsg('a'), modelMsg('b'), userMsg('c')]);
+
+      for (let i = 0; i < MAX_CONSECUTIVE_FAILURES; i++) {
+        await chat.tryCompress(`p${i}`, 'm1');
+        // The i-th call sees consecutiveFailures = i (counter pre-increment).
+        expect(compressSpy.mock.calls[i][1].consecutiveFailures).toBe(i);
+      }
+      // After MAX_CONSECUTIVE_FAILURES failures, the breaker is tripped.
+      // The next call will still be made by GeminiChat (it does not
+      // short-circuit on its side), but the service's cheap-gate will NOOP.
+      expect(compressSpy).toHaveBeenCalledTimes(MAX_CONSECUTIVE_FAILURES);
+      await chat.tryCompress('p-last', 'm1');
+      expect(
+        compressSpy.mock.calls[MAX_CONSECUTIVE_FAILURES][1].consecutiveFailures,
+      ).toBe(MAX_CONSECUTIVE_FAILURES);
+    });
+
+    it('does not increment the counter on forced-call failures', async () => {
+      // Forced compressions (manual /compress, reactive overflow) bypass
+      // the breaker AND must not count toward it. Otherwise a flaky
+      // manual /compress would burn the breaker for auto-compaction.
+      const compressSpy = vi.spyOn(
+        ChatCompressionService.prototype,
+        'compress',
+      );
+      compressSpy.mockResolvedValue({
+        newHistory: null,
+        info: {
+          originalTokenCount: 100_000,
+          newTokenCount: 100_000,
+          compressionStatus: CompressionStatus.COMPRESSION_FAILED_EMPTY_SUMMARY,
+        },
+      });
+      for (let i = 0; i < 5; i++) {
+        await chat.tryCompress(`p-force-${i}`, 'm1', true);
+      }
+      // After 5 forced failures, an unforced call must still see counter=0.
+      compressSpy.mockResolvedValueOnce({
+        newHistory: null,
+        info: {
+          originalTokenCount: 0,
+          newTokenCount: 0,
+          compressionStatus: CompressionStatus.NOOP,
+        },
+      });
+      await chat.tryCompress('p-unforced', 'm1');
+      const lastCall = compressSpy.mock.calls.at(-1);
+      expect(lastCall![1].consecutiveFailures).toBe(0);
+    });
+
+    it('resets the counter to 0 on a successful (forced) compress', async () => {
+      // After two failures, a successful force compress should reset the
+      // counter — the next unforced send tries again with consecutiveFailures=0.
+      const compressSpy = vi.spyOn(
+        ChatCompressionService.prototype,
+        'compress',
+      );
+      compressSpy
+        .mockResolvedValueOnce({
+          newHistory: null,
+          info: {
+            originalTokenCount: 100_000,
+            newTokenCount: 100_000,
+            compressionStatus:
+              CompressionStatus.COMPRESSION_FAILED_INFLATED_TOKEN_COUNT,
+          },
+        })
+        .mockResolvedValueOnce({
+          newHistory: null,
+          info: {
+            originalTokenCount: 100_000,
+            newTokenCount: 100_000,
+            compressionStatus:
+              CompressionStatus.COMPRESSION_FAILED_EMPTY_SUMMARY,
+          },
+        })
+        .mockResolvedValueOnce({
+          newHistory: [userMsg('summary'), modelMsg('ack')],
+          info: {
+            originalTokenCount: 100_000,
+            newTokenCount: 30_000,
+            compressionStatus: CompressionStatus.COMPRESSED,
+          },
+        })
+        .mockResolvedValueOnce({
+          newHistory: null,
+          info: {
+            originalTokenCount: 0,
+            newTokenCount: 0,
+            compressionStatus: CompressionStatus.NOOP,
+          },
+        });
+
+      // Two failures → counter is 2 (the second call was sent with 1).
+      await chat.tryCompress('p1', 'm1');
+      await chat.tryCompress('p2', 'm1');
+      expect(compressSpy.mock.calls[1][1].consecutiveFailures).toBe(1);
+
+      // Forced success: the call carries 2, then the counter resets to 0.
+      await chat.tryCompress('p-force', 'm1', true);
+      expect(compressSpy.mock.calls[2][1].consecutiveFailures).toBe(2);
+
+      // Next unforced call: counter is back to 0.
+      await chat.tryCompress('p3', 'm1');
+      expect(compressSpy.mock.calls[3][1].consecutiveFailures).toBe(0);
+    });
+  });
 });
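The counter rules exercised by the tests above can be sketched in isolation. This is a minimal illustration, not the code in `geminiChat.ts`: the class `CompactionBreaker`, its method names, and the `Outcome` type are hypothetical stand-ins. Only the default of 3 for `MAX_CONSECUTIVE_FAILURES` and the force/unforced rules are taken from the diff.

```typescript
// Hypothetical stand-in for the per-chat counter; names are illustrative.
// Assumption taken from the diff: MAX_CONSECUTIVE_FAILURES defaults to 3.
const MAX_CONSECUTIVE_FAILURES = 3;

type Outcome = 'compressed' | 'failed' | 'noop';

class CompactionBreaker {
  private consecutiveFailures = 0;

  /** True when an unforced attempt should NOOP; force always bypasses. */
  isOpen(force: boolean): boolean {
    return !force && this.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES;
  }

  /**
   * Success (forced or not) resets the counter, an unforced failure adds
   * one strike, and a forced failure leaves the counter unchanged.
   */
  record(outcome: Outcome, force: boolean): void {
    if (outcome === 'compressed') {
      this.consecutiveFailures = 0;
    } else if (outcome === 'failed' && !force) {
      this.consecutiveFailures += 1;
    }
  }

  /** Reactive overflow is forced, so its failure is counted explicitly. */
  recordReactiveFailure(): void {
    this.consecutiveFailures += 1;
  }

  get count(): number {
    return this.consecutiveFailures;
  }
}
```

Keeping the gate (`isOpen`) separate from the bookkeeping (`record`) follows the split the tests describe: the chat always forwards the counter, and the service decides whether to NOOP.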
diff --git a/packages/core/src/core/geminiChat.ts b/packages/core/src/core/geminiChat.ts
index 2655acd61c..8684e818d5 100644
--- a/packages/core/src/core/geminiChat.ts
+++ b/packages/core/src/core/geminiChat.ts
@@ -44,8 +44,12 @@ import {
 import { type ChatRecordingService } from '../services/chatRecordingService.js';
 import {
   ChatCompressionService,
+  computeThresholds,
+  MAX_CONSECUTIVE_FAILURES,
   type CompactTrigger,
 } from '../services/chatCompressionService.js';
+import { resolveSlimmingConfig } from '../services/compactionInputSlimming.js';
+import { estimatePromptTokens } from '../services/tokenEstimation.js';
 import {
   ContentRetryEvent,
   ContentRetryFailureEvent,
@@ -97,7 +101,8 @@ function isCompressionFailureStatus(status: CompressionStatus): boolean {
   return (
     status === CompressionStatus.COMPRESSION_FAILED_INFLATED_TOKEN_COUNT ||
     status === CompressionStatus.COMPRESSION_FAILED_EMPTY_SUMMARY ||
-    status === CompressionStatus.COMPRESSION_FAILED_TOKEN_COUNT_ERROR
+    status === CompressionStatus.COMPRESSION_FAILED_TOKEN_COUNT_ERROR ||
+    status === CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED
   );
 }
 
@@ -139,6 +144,19 @@ interface ContentRetryOptions {
 interface TryCompressOptions {
   originalTokenCountOverride?: number;
   trigger?: CompactTrigger;
+  /**
+   * Pending user message about to be sent. Threaded through to the
+   * compression service's cheap-gate so it can see the real prompt size
+   * even when `lastPromptTokenCount === 0` (first send after inherited
+   * history). See `estimatePromptTokens` for the fallback math.
+   */
+  pendingUserMessage?: Content;
+  /**
+   * Pre-computed `estimatePromptTokens` value from the caller. When set,
+   * the cheap-gate uses this instead of recomputing — avoids a second
+   * `getHistory(true)` clone per send. (review #4168 R1.3 / R1.4)
+   */
+  precomputedEffectiveTokens?: number;
 }
 
 const INVALID_CONTENT_RETRY_OPTIONS: ContentRetryOptions = {
@@ -436,12 +454,35 @@ export class GeminiChat {
   private lastPromptTokenCount = 0;
 
   /**
-   * Per-chat sticky flag. After an unforced compression attempt fails (empty
-   * summary or inflated token count), automatic compaction is suppressed
-   * for the remainder of this chat to avoid burning compression API calls
-   * in a loop. Manual `/compress` still works (it passes `force=true`).
+   * Number of consecutive auto-compaction failures for this chat. The
+   * cheap-gate NOOPs once this reaches MAX_CONSECUTIVE_FAILURES (default 3)
+   * until a successful compress (forced or not) resets it to 0. Replaces the
+   * single-shot hasFailedCompressionAttempt lock that previously disabled
+   * auto-compaction for the rest of the session on any failure.
+   *
+   * SEMANTICS (R5.3): this counter tracks "non-force, non-hard-rescue
+   * consecutive failures", NOT every failure literally.
+   *   - Auto-compaction failures (cheap-gate path): increment by 1.
+   *   - Manual `/compress` failures: skipped (`force=true` → `!force`
+   *     guard in the failure branch).
+   *   - Hard-tier rescue failures: skipped (force=true → `!force` guard
+   *     in tryCompress's failure branch). The counter is NOT pre-reset
+   *     before the rescue call: force=true already bypasses the breaker
+   *     check in compress's cheap-gate, and a pre-reset on every send
+   *     would defeat the breaker, because hard-rescue failures do not
+   *     increment via tryCompress and the reset would wipe the
+   *     reactive-overflow increment. The forwarded counter value is
+   *     whatever the chat carried. On COMPRESSED success, tryCompress's
+   *     COMPRESSED handler resets the counter to 0, which is the
+   *     correct recovery path for a previously-latched session.
+   *   - Reactive overflow failures: the explicit-increment safety net
+   *     for the force=true path. Its handler bumps the counter by 1, so
+   *     N reactive failures will still trip the breaker.
+   *
+   * If you're debugging "why is hard-rescue firing but the counter is 0",
+   * that's by design.
    */
-  private hasFailedCompressionAttempt = false;
+  private consecutiveFailures = 0;
 
   /**
    * Creates a new GeminiChat instance.
@@ -521,9 +562,11 @@ export class GeminiChat {
       force,
       model,
       config: this.config,
-      hasFailedCompressionAttempt: this.hasFailedCompressionAttempt,
+      consecutiveFailures: this.consecutiveFailures,
       originalTokenCount:
         options?.originalTokenCountOverride ?? this.lastPromptTokenCount,
+      pendingUserMessage: options?.pendingUserMessage,
+      precomputedEffectiveTokens: options?.precomputedEffectiveTokens,
       trigger: options?.trigger,
       signal,
     });
@@ -538,10 +581,21 @@ export class GeminiChat {
       this.config.getFileReadCache().clear();
       this.lastPromptTokenCount = info.newTokenCount;
       this.telemetryService?.setLastPromptTokenCount(info.newTokenCount);
-      this.hasFailedCompressionAttempt = false;
+      // Reset the consecutive-failure counter on success so a forced /compress
+      // (or any successful compaction) recovers a chat whose breaker had
+      // tripped.
+      this.consecutiveFailures = 0;
     } else if (isCompressionFailureStatus(info.compressionStatus)) {
+      // Track failed attempts (only count if not forced) so we stop spending
+      // compression-API calls on a chat that can't shrink after
+      // MAX_CONSECUTIVE_FAILURES strikes in a row.
       if (!force) {
-        this.hasFailedCompressionAttempt = true;
+        this.consecutiveFailures += 1;
+        if (this.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
+          debugLogger.warn(
+            `[compaction] circuit breaker tripped after ${this.consecutiveFailures} consecutive failures (cheap-gate path); auto-compaction will NOOP until a successful force compaction resets the counter.`,
+          );
+        }
       }
     }
 
@@ -625,15 +679,87 @@ export class GeminiChat {
       // resolves it) has not run yet. Any setup error before returning the
       // generator must release the lock or subsequent sends will block forever
       // at `await this.sendPromise`.
+      // Build the user content BEFORE compression so the cheap-gate can size
+      // the upcoming prompt — closes the "first send after inherited history"
+      // gap where `lastPromptTokenCount === 0` and the gate would otherwise
+      // see only the stale prior-turn count (0).
+      const userContent = createUserContent(params.message);
+
+      // Hard-tier rescue: when the estimated prompt size is at or above the
+      // hard threshold (effectiveWindow - HARD_BUFFER), force compaction in
+      // this send instead of waiting for the API to reject the request as too
+      // large.
+      //
+      // We compute `effectiveTokens` ONCE here and pass it through to
+      // tryCompress → service.compress so the cheap-gate doesn't redo the
+      // estimation (which involves another `getHistory(true)` clone). This
+      // reuse also fixes a per-config-knob inconsistency: previously the
+      // hard-tier rescue used the default imageTokenEstimate while the
+      // cheap-gate inside tryCompress used the user's resolved value.
+      // (review #4168 R1.3 + R1.4)
+      //
+      // The consecutive-failure counter is NOT pre-reset here. force=true
+      // already bypasses the breaker (the `!force` check in
+      // `chatCompressionService.compress`'s cheap-gate), so a latched session
+      // can still attempt hard-rescue; pre-resetting would defeat the breaker
+      // entirely because hard-rescue failures don't increment via tryCompress
+      // (force=true skips the `if (!force)` increment in the failure branch),
+      // and only the reactive overflow handler explicitly increments. With a
+      // pre-reset the counter would oscillate 0↔1 across sends and never trip.
+      // On COMPRESSED success, the post-call branch in `tryCompress` (the
+      // `consecutiveFailures = 0` line in the COMPRESSED handler) still resets
+      // to 0, which is the correct recovery path for a previously-latched
+      // session.
+      const contextLimit =
+        this.config.getContentGeneratorConfig()?.contextWindowSize ??
+        DEFAULT_TOKEN_LIMIT;
+      const { hard } = computeThresholds(contextLimit);
+      const imageTokenEstimate = resolveSlimmingConfig(
+        this.config.getChatCompression(),
+      ).imageTokenEstimate;
+      // When lastPromptTokenCount > 0, estimatePromptTokens uses the
+      // API-authoritative count + a tiny estimate of just the new user
+      // message — it does NOT touch the history at all in that branch, so
+      // skip the costly `getHistory(true)` clone on the steady-state path.
+      // The lastPromptTokenCount=0 branch (first send after --continue
+      // restore / subagent inheritance) walks history with a char/4
+      // heuristic that can under-count by ~15-20K tokens; the reactive
+      // overflow recovery path inside the async iterator below (the
+      // `getContextLengthExceededInfo` → `tryCompress` → RETRY branch)
+      // is the documented safety net when this under-count causes
+      // hard-rescue to miss.
+      const effectiveTokens = estimatePromptTokens(
+        this.lastPromptTokenCount > 0 ? [] : this.getHistoryShallow(true),
+        userContent,
+        this.lastPromptTokenCount,
+        imageTokenEstimate,
+      );
+      const shouldForceFromHard = effectiveTokens >= hard;
+      if (shouldForceFromHard) {
+        debugLogger.warn(
+          `[compaction] hard-tier rescue triggered: effectiveTokens=${effectiveTokens}, hard=${hard}, consecutiveFailures=${this.consecutiveFailures}.`,
+        );
+      }
+
       compressionInfo = await this.tryCompress(
         prompt_id,
         model,
-        false,
+        shouldForceFromHard,
         params.config?.abortSignal,
+        {
+          pendingUserMessage: userContent,
+          precomputedEffectiveTokens: effectiveTokens,
+          // Hard-rescue is force=true to bypass the cheap-gate breaker
+          // but it's an AUTOMATIC trigger. Explicit trigger='auto' tells
+          // the service to skip the manual-only orphan-strip that would
+          // otherwise drop the active funcCall whose matching
+          // funcResponse is sitting in `pendingUserMessage` waiting to
+          // be pushed. Without this, hard-rescue mid tool-use loop
+          // corrupts the next API request's tool-call/response pairing.
+          trigger: shouldForceFromHard ? 'auto' : undefined,
+        },
       );
 
-      const userContent = createUserContent(params.message);
-
       // Add user content to history ONCE before any attempts.
       this.history.push(userContent);
       userContentAdded = true;
@@ -827,7 +953,21 @@ export class GeminiChat {
                   if (
                     isCompressionFailureStatus(reactiveInfo.compressionStatus)
                   ) {
-                    self.hasFailedCompressionAttempt = true;
+                    // Reactive compression is force=true so tryCompress's
+                    // failure branch did not increment the counter. Count it
+                    // explicitly as one strike — a single transient error
+                    // (network blip, model 5xx) should not permanently latch
+                    // the breaker; only repeated reactive failures should.
+                    // The only recovery path for a latched counter is a
+                    // successful compaction (post-call reset at the COMPRESSED
+                    // branch in tryCompress); hard-rescue forwards the counter
+                    // as-is since force=true bypasses the breaker.
+                    self.consecutiveFailures += 1;
+                    if (self.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
+                      debugLogger.warn(
+                        `[compaction] circuit breaker tripped after ${self.consecutiveFailures} consecutive failures (reactive overflow path); auto-compaction will NOOP on the cheap-gate until a successful force compaction resets the counter.`,
+                      );
+                    }
                   }
                 } catch (compressionError) {
                   if (
diff --git a/packages/core/src/core/turn.ts b/packages/core/src/core/turn.ts
index 8847120a84..9a46509c43 100644
--- a/packages/core/src/core/turn.ts
+++ b/packages/core/src/core/turn.ts
@@ -171,6 +171,19 @@ export enum CompressionStatus {
 
   /** The compression was not necessary and no action was taken */
   NOOP,
+
+  /**
+   * The compression call produced a summary, but the output hit
+   * COMPACT_MAX_OUTPUT_TOKENS, indicating likely truncation. The summary
+   * is dropped (newHistory=null) and the attempt is treated as a failure:
+   * `isCompressionFailureStatus` returns true so it counts toward the
+   * per-chat circuit breaker. Kept distinct from
+   * `COMPRESSION_FAILED_EMPTY_SUMMARY` so telemetry can separate
+   * prompt-quality failures (empty / nonsensical summary) from capacity
+   * failures (output cap hit, may need a higher cap or finer-grained
+   * splitter). (R5.2)
+   */
+  COMPRESSION_FAILED_OUTPUT_TRUNCATED,
 }
 
 export interface ChatCompressionInfo {
diff --git a/packages/core/src/index.ts b/packages/core/src/index.ts
index fa6b98522d..336f23ff33 100644
--- a/packages/core/src/index.ts
+++ b/packages/core/src/index.ts
@@ -137,6 +137,10 @@ export * from './providers/index.js';
 // Services
 // ============================================================================
 
+export {
+  computeThresholds,
+  type CompactionThresholds,
+} from './services/chatCompressionService.js';
 export * from './services/chatRecordingService.js';
 export * from './services/cronScheduler.js';
 export * from './services/fileDiscoveryService.js';
diff --git a/packages/core/src/services/chatCompressionService.test.ts b/packages/core/src/services/chatCompressionService.test.ts
index e42d6e80d4..c73f08fcd7 100644
--- a/packages/core/src/services/chatCompressionService.test.ts
+++ b/packages/core/src/services/chatCompressionService.test.ts
@@ -7,7 +7,9 @@
 import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
 import {
   ChatCompressionService,
+  computeThresholds,
   findCompressSplitPoint,
+  MAX_CONSECUTIVE_FAILURES,
   TOOL_ROUND_RETAIN_COUNT,
 } from './chatCompressionService.js';
 import type { Content } from '@google/genai';
@@ -18,6 +20,7 @@ import type { GeminiChat } from '../core/geminiChat.js';
 import type { Config } from '../config/config.js';
 import type { BaseLlmClient } from '../core/baseLlmClient.js';
 import { PreCompactTrigger, PostCompactTrigger } from '../hooks/types.js';
+import * as sideQueryModule from '../utils/sideQuery.js';
 
 vi.mock('../telemetry/uiTelemetry.js');
 vi.mock('../core/tokenLimits.js');
@@ -423,27 +426,94 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
     expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
     expect(result.newHistory).toBeNull();
   });
 
-  it('should return NOOP if previously failed and not forced', async () => {
+  it('should return NOOP when consecutiveFailures has hit the breaker and not forced', async () => {
     vi.mocked(mockChat.getHistory).mockReturnValue([
       { role: 'user', parts: [{ text: 'hi' }] },
     ]);
+    // Seed a non-zero originalTokenCount so we can assert the breaker-NOOP
+    // path forwards it (rather than zeroing the field — see R4-1). Telemetry
+    // consumers rely on this to distinguish "breaker tripped at N tokens"
+    // from "empty session".
+    vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(
+      120_000,
+    );
     const result = await service.compress(mockChat, {
       promptId: mockPromptId,
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: true,
+      consecutiveFailures: MAX_CONSECUTIVE_FAILURES,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
     expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
     expect(result.newHistory).toBeNull();
+    expect(result.info.originalTokenCount).toBe(120_000);
+    expect(result.info.newTokenCount).toBe(120_000);
+  });
+
+  it('falls through when consecutiveFailures is below the breaker threshold', async () => {
+    // Below MAX_CONSECUTIVE_FAILURES, the cheap-gate must NOT NOOP on the
+    // failure counter alone — it should fall through. Use force=true to
+    // bypass the token-threshold check too, then prove we reached the
+    // post-cheap-gate path by observing chat.getHistory(true) being called.
+    vi.mocked(mockChat.getHistory).mockReturnValue([
+      { role: 'user', parts: [{ text: 'hi' }] },
+    ]);
+
+    await service.compress(mockChat, {
+      promptId: mockPromptId,
+      // force=true skips the token-threshold check. It also bypasses the
+      // breaker, so this asserts fall-through rather than the MAX-1 boundary.
+      force: true,
+      model: mockModel,
+      config: mockConfig,
+      consecutiveFailures: MAX_CONSECUTIVE_FAILURES - 1,
+      originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
+    });
+    // Reaching the curated-history clone is the proof we got past the
+    // cheap-gate. The service calls chat.getHistory(true) once it falls
+    // through — if the breaker had tripped, it would have returned the
+    // cheap-gate NOOP without ever touching the history clone.
+    expect(mockChat.getHistory).toHaveBeenCalledWith(true);
+  });
+
+  it('trips the circuit breaker only when consecutiveFailures has reached MAX_CONSECUTIVE_FAILURES', async () => {
+    vi.mocked(mockChat.getHistory).mockReturnValue([
+      { role: 'user', parts: [{ text: 'hi' }] },
+    ]);
+    // At exactly MAX (unforced) -> NOOP at cheap-gate.
+    const tripped = await service.compress(mockChat, {
+      promptId: mockPromptId,
+      force: false,
+      model: mockModel,
+      config: mockConfig,
+      consecutiveFailures: MAX_CONSECUTIVE_FAILURES,
+      originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
+    });
+    expect(tripped.info.compressionStatus).toBe(CompressionStatus.NOOP);
+
+    // force=true bypasses the breaker even when tripped.
+    vi.mocked(mockChat.getHistory).mockClear();
+    vi.mocked(mockChat.getHistory).mockReturnValue([
+      { role: 'user', parts: [{ text: 'hi' }] },
+    ]);
+    await service.compress(mockChat, {
+      promptId: mockPromptId,
+      force: true,
+      model: mockModel,
+      config: mockConfig,
+      consecutiveFailures: MAX_CONSECUTIVE_FAILURES,
+      originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
+    });
+    // Force bypasses the cheap-gate; service reaches the curated-history clone.
+    expect(mockChat.getHistory).toHaveBeenCalledWith(true);
   });
 
   it('should return NOOP if under token threshold and not forced', async () => {
@@ -459,25 +529,51 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
     expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
     expect(result.newHistory).toBeNull();
   });
 
-  it('should return NOOP when contextPercentageThreshold is 0', async () => {
+  it('silently ignores the deprecated chatCompression.contextPercentageThreshold = 0 (no longer disables compaction)', async () => {
+    // Pre-PR #4168, setting contextPercentageThreshold = 0 short-circuited
+    // compress() at the cheap-gate (NOOP). The field was removed from
+    // ChatCompressionSettings as part of the redesign; leftover values
+    // in stale settings.json must be ignored without suppressing the gate.
+    // Drive the non-force path with originalTokenCount above auto so the
+    // gate would have to actively pass, and verify the side-query fires.
     const history: Content[] = [
       { role: 'user', parts: [{ text: 'msg1' }] },
       { role: 'model', parts: [{ text: 'msg2' }] },
+      { role: 'user', parts: [{ text: 'msg3' }] },
+      { role: 'model', parts: [{ text: 'msg4' }] },
     ];
     vi.mocked(mockChat.getHistory).mockReturnValue(history);
-    vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(800);
+    vi.mocked(uiTelemetryService.getLastPromptTokenCount).mockReturnValue(
+      100_000,
+    );
+    // The deprecated field is no longer in ChatCompressionSettings; cast so
+    // we can simulate a leftover value coming from a stale settings.json.
     vi.mocked(mockConfig.getChatCompression).mockReturnValue({
       contextPercentageThreshold: 0,
-    });
+    } as unknown as ReturnType<typeof mockConfig.getChatCompression>);
+    // 128K window → auto ≈ 95K; originalTokenCount 100K crosses.
+    vi.mocked(mockConfig.getContentGeneratorConfig).mockReturnValue({
+      model: 'gemini-pro',
+      contextWindowSize: 128_000,
+    } as unknown as ReturnType<typeof mockConfig.getContentGeneratorConfig>);
 
-    const mockGenerateContent = vi.fn();
+    const mockGenerateContent = vi.fn().mockResolvedValue({
+      text: 'Summary',
+      usage: {
+        // Realistic compression usage so the inflation guard doesn't fire:
+        //   newTokens = max(0, 100000 - (99000 - 1000) + 1500) = 3500 → COMPRESSED
+        promptTokenCount: 99_000,
+        candidatesTokenCount: 1500,
+        totalTokenCount: 100_500,
+      },
+    });
     vi.mocked(mockConfig.getBaseLlmClient).mockReturnValue({
       generateText: mockGenerateContent,
     } as unknown as BaseLlmClient);
@@ -487,33 +583,12 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
-    expect(result.info).toMatchObject({
-      compressionStatus: CompressionStatus.NOOP,
-      originalTokenCount: 0,
-      newTokenCount: 0,
-    });
-    expect(mockGenerateContent).not.toHaveBeenCalled();
-    expect(tokenLimit).not.toHaveBeenCalled();
-
-    const forcedResult = await service.compress(mockChat, {
-      promptId: mockPromptId,
-      force: true,
-      model: mockModel,
-      config: mockConfig,
-      hasFailedCompressionAttempt: false,
-      originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
-    });
-    expect(forcedResult.info).toMatchObject({
-      compressionStatus: CompressionStatus.NOOP,
-      originalTokenCount: 0,
-      newTokenCount: 0,
-    });
-    expect(mockGenerateContent).not.toHaveBeenCalled();
-    expect(tokenLimit).not.toHaveBeenCalled();
+    expect(result.info.compressionStatus).toBe(CompressionStatus.COMPRESSED);
+    expect(mockGenerateContent).toHaveBeenCalled();
   });
 
   it('should return NOOP when historyToCompress is below MIN_COMPRESSION_FRACTION of total', async () => {
@@ -548,7 +623,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -589,7 +664,7 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -658,7 +733,7 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -697,7 +772,7 @@ describe('ChatCompressionService', () => {
       // forced
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -733,7 +808,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -769,7 +844,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       signal: abortController.signal,
     });
@@ -818,7 +893,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -860,19 +935,21 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
-    // Compression quality depends on thinkingConfig.includeThoughts being on
-    // and maxAttempts being short (best-effort); a future refactor that drops
-    // any of these would silently regress quality without this assertion.
+    // Thinking is intentionally disabled (per-provider budget semantics are
+    // inconsistent) and the output is hard-capped by COMPACT_MAX_OUTPUT_TOKENS
+    // so subsequent threshold math has a predictable reserve. maxAttempts=1
+    // keeps the call best-effort (next turn re-triggers on failure).
     expect(mockGenerateText).toHaveBeenCalledWith(
       expect.objectContaining({
         model: mockModel,
         maxAttempts: 1,
         config: expect.objectContaining({
-          thinkingConfig: { includeThoughts: true },
+          thinkingConfig: { includeThoughts: false },
+          maxOutputTokens: 20_000,
         }),
       }),
     );
@@ -904,7 +981,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -942,7 +1019,7 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -976,7 +1053,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -1010,7 +1087,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -1046,7 +1123,7 @@ describe('ChatCompressionService', () => {
       force: true,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -1087,7 +1164,7 @@ describe('ChatCompressionService', () => {
       force: false,
       model: mockModel,
       config: mockConfig,
-      hasFailedCompressionAttempt: false,
+      consecutiveFailures: 0,
       originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
     });
 
@@ -1140,7 +1217,7 @@ describe('ChatCompressionService', () => {
         // force = true -> Manual trigger
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1185,7 +1262,7 @@ describe('ChatCompressionService', () => {
         // force = false -> Auto trigger
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1204,30 +1281,7 @@ describe('ChatCompressionService', () => {
         force: true,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
-        originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
-      });
-
-      expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
-      expect(mockFirePreCompactEvent).not.toHaveBeenCalled();
-    });
-
-    it('should not fire PreCompact hook when threshold is 0', async () => {
-      const history: Content[] = [
-        { role: 'user', parts: [{ text: 'msg1' }] },
-        { role: 'model', parts: [{ text: 'msg2' }] },
-      ];
-      vi.mocked(mockChat.getHistory).mockReturnValue(history);
-      vi.mocked(mockConfig.getChatCompression).mockReturnValue({
-        contextPercentageThreshold: 0,
-      });
-
-      const result = await service.compress(mockChat, {
-        promptId: mockPromptId,
-        force: true,
-        model: mockModel,
-        config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1251,7 +1305,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1296,7 +1350,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1344,7 +1398,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1386,7 +1440,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1442,7 +1496,7 @@ describe('ChatCompressionService', () => {
         // force = true -> Manual trigger
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1487,7 +1541,7 @@ describe('ChatCompressionService', () => {
         // force = false -> Auto trigger
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1528,7 +1582,7 @@ describe('ChatCompressionService', () => {
         force: true,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1575,7 +1629,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1626,7 +1680,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1669,7 +1723,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1749,7 +1803,7 @@ describe('ChatCompressionService', () => {
         // force=true (manual /compress)
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1767,12 +1821,14 @@ describe('ChatCompressionService', () => {
       expect(optionsArg.contents.length).toBe(history.length); // (history.length - 1) messages + 1 instruction
     });
 
-    it('compresses-most without orphaning when last entry is in-flight funcCall (auto-compress)', async () => {
-      // Auto-compress fires BEFORE the matching funcResponse is sent back to
-      // the model. The trailing funcCall must be retained (its response is
-      // coming); the in-flight fallback compresses everything safely before
-      // it. Pre-refactor this returned NOOP, leaving the chat to grow until
-      // it 400'd.
+    // Shared fixture for the two trailing-in-flight-funcCall scenarios below:
+    // both auto-compress (force=false) and hard-rescue (force=true,
+    // trigger='auto') see the same history snapshot — a tool loop where the
+    // last message is a model funcCall whose matching funcResponse is about
+    // to arrive in the pending userContent (not in history yet). The only
+    // thing that differs between the two tests is the `compress(...)` call
+    // options and the per-test assertions.
+    const setupInFlightFuncCallFixture = () => {
       const history: Content[] = [
         { role: 'user', parts: [{ text: 'Fix all TypeScript errors.' }] },
         {
@@ -1790,7 +1846,8 @@ describe('ChatCompressionService', () => {
             },
           ],
         },
-        // Pending funcCall: tool is currently executing, funcResponse is coming
+        // Trailing funcCall: matching funcResponse is in the pending
+        // userContent, not in history yet — active, not orphaned.
         {
           role: 'model',
           parts: [{ functionCall: { name: 'readFile', args: {} } }],
@@ -1817,12 +1874,23 @@ describe('ChatCompressionService', () => {
         generateText: mockGenerateContent,
       } as unknown as BaseLlmClient);
 
+      return { history, mockGenerateContent };
+    };
+
+    it('compresses-most without orphaning when last entry is in-flight funcCall (auto-compress)', async () => {
+      // Auto-compress fires BEFORE the matching funcResponse is sent back to
+      // the model. The trailing funcCall must be retained (its response is
+      // coming); the in-flight fallback compresses everything safely before
+      // it. Pre-refactor this returned NOOP, leaving the chat to grow until
+      // it 400'd.
+      const { mockGenerateContent } = setupInFlightFuncCallFixture();
+
       const result = await service.compress(mockChat, {
         promptId: mockPromptId,
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1839,6 +1907,39 @@ describe('ChatCompressionService', () => {
         expect(newHistory[i].role).not.toBe(newHistory[i - 1].role);
       }
     });
+
+    it('preserves trailing model+funcCall under hard-rescue (force=true + trigger=auto)', async () => {
+      // Hard-rescue fires from inside sendMessageStream() BEFORE the pending
+      // userContent (a funcResponse) is pushed onto history. At that moment
+      // the trailing model+funcCall is ACTIVE, not orphaned — its matching
+      // funcResponse is sitting in the pending message about to be appended.
+      //
+      // Pre-fix, the service's orphan-strip predicate gated on `force` alone,
+      // which meant hard-rescue (force=true, trigger='auto') was conflated
+      // with manual /compress and stripped the active funcCall — corrupting
+      // tool-call/response pairing on the next API send. Fix: gate the strip
+      // on `trigger === 'manual'` so only the explicit user-initiated
+      // /compress path performs the orphan cleanup.
+      setupInFlightFuncCallFixture();
+
+      const result = await service.compress(mockChat, {
+        promptId: mockPromptId,
+        force: true,
+        trigger: 'auto', // hard-rescue explicitly signals automatic intent
+        model: mockModel,
+        config: mockConfig,
+        consecutiveFailures: 0,
+        originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
+      });
+
+      expect(result.info.compressionStatus).toBe(CompressionStatus.COMPRESSED);
+      // The active funcCall must survive in the post-compression history so
+      // the about-to-be-pushed funcResponse has its matching functionCall.
+      const newHistory = result.newHistory!;
+      const last = newHistory[newHistory.length - 1];
+      expect(last.role).toBe('model');
+      expect(last.parts?.some((p) => p.functionCall)).toBe(true);
+    });
   });
 
   describe('tool-loop subagent absorption', () => {
@@ -1914,7 +2015,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1985,7 +2086,7 @@ describe('ChatCompressionService', () => {
         force: false,
         model: mockModel,
         config: mockConfig,
-        hasFailedCompressionAttempt: false,
+        consecutiveFailures: 0,
         originalTokenCount: uiTelemetryService.getLastPromptTokenCount(),
       });
 
@@ -1994,3 +2095,361 @@ describe('ChatCompressionService', () => {
     });
   });
 });
+
+describe('ChatCompressionService.compress sideQuery config', () => {
+  afterEach(() => {
+    vi.restoreAllMocks();
+  });
+
+  it('passes maxOutputTokens=20_000 and includeThoughts=false to runSideQuery', async () => {
+    const spy = vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>summary</state_snapshot>',
+      usage: {
+        promptTokenCount: 1000,
+        candidatesTokenCount: 500,
+        totalTokenCount: 1500,
+      },
+    } as never);
+
+    const history: Content[] = [
+      { role: 'user', parts: [{ text: 'msg1' }] },
+      { role: 'model', parts: [{ text: 'msg2' }] },
+      { role: 'user', parts: [{ text: 'msg3' }] },
+      { role: 'model', parts: [{ text: 'msg4' }] },
+    ];
+    const getHistoryMock = vi.fn().mockReturnValue(history);
+    const mockChat = {
+      getHistory: getHistoryMock,
+      getHistoryShallow: getHistoryMock,
+    } as unknown as GeminiChat;
+    const mockConfig = {
+      getChatCompression: vi.fn(),
+      getBaseLlmClient: vi.fn(),
+      getContentGeneratorConfig: vi
+        .fn()
+        .mockReturnValue({ contextWindowSize: 200_000 }),
+      getHookSystem: vi.fn().mockReturnValue({
+        fireSessionStartEvent: vi.fn().mockResolvedValue(undefined),
+        firePreCompactEvent: vi.fn().mockResolvedValue(undefined),
+        firePostCompactEvent: vi.fn().mockResolvedValue(undefined),
+      }),
+      getModel: () => 'test-model',
+      getApprovalMode: () => 'default',
+      getDebugLogger: () => ({ warn: vi.fn(), debug: vi.fn() }),
+    } as unknown as Config;
+
+    const service = new ChatCompressionService();
+    await service.compress(mockChat, {
+      promptId: 'p',
+      force: true,
+      model: 'qwen-test',
+      config: mockConfig,
+      consecutiveFailures: 0,
+      originalTokenCount: 180_000,
+    });
+
+    expect(spy).toHaveBeenCalledTimes(1);
+    const callArg = spy.mock.calls[0]![1] as {
+      config?: {
+        thinkingConfig?: { includeThoughts?: boolean };
+        maxOutputTokens?: number;
+      };
+    };
+    expect(callArg.config?.thinkingConfig?.includeThoughts).toBe(false);
+    expect(callArg.config?.maxOutputTokens).toBe(20_000);
+  });
+
+  it('returns FAILED_OUTPUT_TRUNCATED when the summary output hits the COMPACT_MAX_OUTPUT_TOKENS cap (likely truncated)', async () => {
+    // Mock the side-query to return a non-empty summary that exactly hits the
+    // 20K cap — the guard should drop the result and surface it as a failure
+    // with a status distinct from EMPTY_SUMMARY so telemetry can separate
+    // prompt-quality failures (empty) from capacity failures (truncated).
+    // (R1.1 made the breaker tick; R5.2 split the status.)
+    vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>truncated...',
+      usage: {
+        promptTokenCount: 50_000,
+        candidatesTokenCount: 20_000, // ← exactly at COMPACT_MAX_OUTPUT_TOKENS
+        totalTokenCount: 70_000,
+      },
+    } as never);
+
+    const history: Content[] = [
+      { role: 'user', parts: [{ text: 'msg1' }] },
+      { role: 'model', parts: [{ text: 'msg2' }] },
+      { role: 'user', parts: [{ text: 'msg3' }] },
+      { role: 'model', parts: [{ text: 'msg4' }] },
+    ];
+    const getHistoryMock = vi.fn().mockReturnValue(history);
+    const mockChat = {
+      getHistory: getHistoryMock,
+      getHistoryShallow: getHistoryMock,
+    } as unknown as GeminiChat;
+    const warn = vi.fn();
+    const mockConfig = {
+      getChatCompression: vi.fn(),
+      getBaseLlmClient: vi.fn(),
+      getContentGeneratorConfig: vi
+        .fn()
+        .mockReturnValue({ contextWindowSize: 200_000 }),
+      getHookSystem: vi.fn().mockReturnValue({
+        fireSessionStartEvent: vi.fn().mockResolvedValue(undefined),
+        firePreCompactEvent: vi.fn().mockResolvedValue(undefined),
+        firePostCompactEvent: vi.fn().mockResolvedValue(undefined),
+      }),
+      getModel: () => 'test-model',
+      getApprovalMode: () => 'default',
+      getDebugLogger: () => ({ warn, debug: vi.fn() }),
+    } as unknown as Config;
+
+    const result = await new ChatCompressionService().compress(mockChat, {
+      promptId: 'p',
+      force: true,
+      model: 'qwen-test',
+      config: mockConfig,
+      consecutiveFailures: 0,
+      originalTokenCount: 180_000,
+    });
+
+    expect(result.info.compressionStatus).toBe(
+      CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED,
+    );
+    expect(result.newHistory).toBeNull();
+    expect(warn).toHaveBeenCalledWith(
+      expect.stringContaining('COMPACT_MAX_OUTPUT_TOKENS'),
+    );
+  });
+});
+
+describe('ChatCompressionService.compress cheap-gate uses estimated tokens', () => {
+  afterEach(() => {
+    vi.restoreAllMocks();
+  });
+
+  // Inline helpers (Task 3): the existing file uses per-block inline
+  // mockChat/mockConfig rather than shared factories, so we follow that
+  // pattern here. getHistory(true) returns a non-empty array so the
+  // cheap-gate flow can reach the spy when the threshold is crossed.
+  function makeFakeChat(): GeminiChat {
+    const history: Content[] = [
+      { role: 'user', parts: [{ text: 'msg1' }] },
+      { role: 'model', parts: [{ text: 'msg2' }] },
+    ];
+    const getHistoryMock = vi.fn().mockReturnValue(history);
+    return {
+      getHistory: getHistoryMock,
+      getHistoryShallow: getHistoryMock,
+    } as unknown as GeminiChat;
+  }
+
+  function makeFakeConfig(opts: { contextWindowSize: number }): Config {
+    return {
+      getChatCompression: vi.fn(),
+      getBaseLlmClient: vi.fn(),
+      getContentGeneratorConfig: vi
+        .fn()
+        .mockReturnValue({ contextWindowSize: opts.contextWindowSize }),
+      getHookSystem: vi.fn().mockReturnValue({
+        fireSessionStartEvent: vi.fn().mockResolvedValue(undefined),
+        firePreCompactEvent: vi.fn().mockResolvedValue(undefined),
+        firePostCompactEvent: vi.fn().mockResolvedValue(undefined),
+      }),
+      getModel: () => 'test-model',
+      getApprovalMode: () => 'default',
+      getDebugLogger: () => ({ warn: vi.fn(), debug: vi.fn() }),
+    } as unknown as Config;
+  }
+
+  it('triggers compaction when API-reported tokens are below threshold but estimated tokens with the pending user message exceed it', async () => {
+    // 200K window, computeThresholds(200K).auto = 167K
+    // originalTokenCount = 160K (under by 7K)
+    // user message ~ 10K tokens (40K chars / 4) -> effectiveTokens = 170K, crosses 167K
+    const userMessage: Content = {
+      role: 'user',
+      parts: [{ text: 'x'.repeat(40_000) }],
+    };
+
+    const spy = vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>x</state_snapshot>',
+      usage: {
+        promptTokenCount: 100,
+        candidatesTokenCount: 50,
+        totalTokenCount: 150,
+      },
+    } as never);
+
+    const result = await new ChatCompressionService().compress(makeFakeChat(), {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      consecutiveFailures: 0,
+      originalTokenCount: 160_000,
+      pendingUserMessage: userMessage,
+    });
+
+    // cheap-gate let it through (not NOOP), so spy was called
+    expect(spy).toHaveBeenCalled();
+    expect(result.info.compressionStatus).not.toBe(CompressionStatus.NOOP);
+  });
+
+  it('NOOPs when neither originalTokenCount nor estimated total reaches threshold', async () => {
+    const spy = vi
+      .spyOn(sideQueryModule, 'runSideQuery')
+      .mockResolvedValue({ text: 's', usage: {} } as never);
+
+    const result = await new ChatCompressionService().compress(makeFakeChat(), {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      consecutiveFailures: 0,
+      originalTokenCount: 80_000,
+      pendingUserMessage: {
+        role: 'user',
+        parts: [{ text: 'short' }],
+      },
+    });
+
+    expect(spy).not.toHaveBeenCalled();
+    expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
+  });
+});
+
+describe('computeThresholds', () => {
+  it('32K window — proportional fallback for all tiers, hard degrades to auto', () => {
+    const t = computeThresholds(32_000);
+    expect(t.warn).toBe(19_200); // 0.6 * 32K
+    expect(t.auto).toBe(22_400); // 0.7 * 32K
+    expect(t.hard).toBe(22_400); // max(window-23K=9K, auto=22.4K) = auto
+    expect(t.effectiveWindow).toBe(12_000);
+  });
+
+  it('128K window — mixed (warn=pct, auto/hard=abs)', () => {
+    const t = computeThresholds(128_000);
+    expect(t.warn).toBe(76_800); // 0.6 * 128K (pct wins: 76.8K vs auto-20K=75K)
+    expect(t.auto).toBe(95_000); // abs: effectiveWindow-13K = 108-13 = 95K (abs wins: 95K vs 0.7*128K=89.6K)
+    expect(t.hard).toBe(105_000); // abs: effectiveWindow-3K = 108-3 = 105K
+    expect(t.effectiveWindow).toBe(108_000);
+  });
+
+  it('200K window — absolute takes over all tiers', () => {
+    const t = computeThresholds(200_000);
+    expect(t.warn).toBe(147_000); // abs: auto-20K (abs wins: 147K vs 0.6*200K=120K)
+    expect(t.auto).toBe(167_000); // abs: effectiveWindow-13K = 180-13 = 167K
+    expect(t.hard).toBe(177_000); // abs: effectiveWindow-3K = 180-3 = 177K
+  });
+
+  it('1M window — fully absolute', () => {
+    const t = computeThresholds(1_000_000);
+    expect(t.warn).toBe(947_000);
+    expect(t.auto).toBe(967_000);
+    expect(t.hard).toBe(977_000);
+  });
+
+  it('extreme small window (10K) does not crash; returns sane values', () => {
+    const t = computeThresholds(10_000);
+    expect(t.warn).toBeGreaterThan(0);
+    expect(t.auto).toBeGreaterThan(0);
+    expect(t.warn).toBeLessThanOrEqual(t.auto);
+    expect(t.auto).toBeLessThanOrEqual(t.hard);
+    // window < SUMMARY_RESERVE: effectiveWindow is clamped to 0, not negative.
+    // auto and warn stay positive because each is `Math.max(proportional, absolute)`
+    // and the proportional branch wins once the absolute branch goes ≤ 0; hard is floored at auto.
+    expect(t.effectiveWindow).toBe(0);
+  });
+
+  it('zero window returns effectiveWindow=0 and non-negative tiers', () => {
+    const t = computeThresholds(0);
+    expect(t.effectiveWindow).toBe(0);
+    expect(t.warn).toBe(0);
+    expect(t.auto).toBe(0);
+    expect(t.hard).toBe(0);
+  });
+
+  it('thresholds always satisfy warn <= auto <= hard', () => {
+    for (const w of [32_000, 64_000, 128_000, 200_000, 256_000, 1_000_000]) {
+      const t = computeThresholds(w);
+      expect(t.warn).toBeLessThanOrEqual(t.auto);
+      expect(t.auto).toBeLessThanOrEqual(t.hard);
+    }
+  });
+});
+
+describe('ChatCompressionService.compress cheap-gate uses computeThresholds.auto', () => {
+  afterEach(() => {
+    vi.restoreAllMocks();
+  });
+
+  function makeFakeChat(): GeminiChat {
+    const history: Content[] = [
+      { role: 'user', parts: [{ text: 'msg1' }] },
+      { role: 'model', parts: [{ text: 'msg2' }] },
+    ];
+    const getHistoryMock = vi.fn().mockReturnValue(history);
+    return {
+      getHistory: getHistoryMock,
+      getHistoryShallow: getHistoryMock,
+    } as unknown as GeminiChat;
+  }
+
+  function makeFakeConfig(opts: { contextWindowSize: number }): Config {
+    return {
+      getChatCompression: vi.fn(),
+      getBaseLlmClient: vi.fn(),
+      getContentGeneratorConfig: vi
+        .fn()
+        .mockReturnValue({ contextWindowSize: opts.contextWindowSize }),
+      getHookSystem: vi.fn().mockReturnValue({
+        fireSessionStartEvent: vi.fn().mockResolvedValue(undefined),
+        firePreCompactEvent: vi.fn().mockResolvedValue(undefined),
+        firePostCompactEvent: vi.fn().mockResolvedValue(undefined),
+      }),
+      getModel: () => 'test-model',
+      getApprovalMode: () => 'default',
+      getDebugLogger: () => ({ warn: vi.fn(), debug: vi.fn() }),
+    } as unknown as Config;
+  }
+
+  it('on a 200K window with originalTokenCount=160K, NOOPs (below auto=167K)', async () => {
+    const spy = vi
+      .spyOn(sideQueryModule, 'runSideQuery')
+      .mockResolvedValue({ text: 's', usage: {} } as never);
+
+    const result = await new ChatCompressionService().compress(makeFakeChat(), {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      consecutiveFailures: 0,
+      originalTokenCount: 160_000,
+    });
+
+    expect(spy).not.toHaveBeenCalled();
+    expect(result.info.compressionStatus).toBe(CompressionStatus.NOOP);
+  });
+
+  it('on a 200K window with originalTokenCount=168K, falls through cheap-gate (above auto=167K)', async () => {
+    const spy = vi.spyOn(sideQueryModule, 'runSideQuery').mockResolvedValue({
+      text: '<state_snapshot>summary</state_snapshot>',
+      usage: {
+        promptTokenCount: 1000,
+        candidatesTokenCount: 500,
+        totalTokenCount: 1500,
+      },
+    } as never);
+
+    const result = await new ChatCompressionService().compress(makeFakeChat(), {
+      promptId: 'p',
+      force: false,
+      model: 'qwen-test',
+      config: makeFakeConfig({ contextWindowSize: 200_000 }),
+      consecutiveFailures: 0,
+      originalTokenCount: 168_000,
+    });
+
+    // 168K > 167K (computeThresholds(200K).auto), cheap-gate lets through
+    expect(spy).toHaveBeenCalled();
+    expect(result.info.compressionStatus).not.toBe(CompressionStatus.NOOP);
+  });
+});
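Reviewer note: the expected values in the `computeThresholds` tests above can be re-derived by hand from the constants this PR introduces. The sketch below is a standalone restatement, not the shipped code: the constants are copied from this diff, while the function name `ladder` and the `Ladder` interface are hypothetical names used only for illustration.

```typescript
// Standalone sketch of the three-tier ladder. Constants are copied from the
// diff; `ladder` and `Ladder` are hypothetical names, not part of the PR.
const SUMMARY_RESERVE = 20_000; // reserved for the summary output
const AUTOCOMPACT_BUFFER = 13_000; // auto sits this far below effectiveWindow
const WARN_BUFFER = 20_000; // warn sits this far below auto
const HARD_BUFFER = 3_000; // hard sits this far below effectiveWindow
const DEFAULT_PCT = 0.7; // proportional fallback for auto
const WARN_PCT_OFFSET = 0.1; // proportional warn = 0.7 - 0.1

interface Ladder {
  warn: number;
  auto: number;
  hard: number;
  effectiveWindow: number;
}

function ladder(window: number): Ladder {
  // Clamp so windows smaller than the reserve report 0 rather than a negative.
  const effectiveWindow = Math.max(0, window - SUMMARY_RESERVE);
  // Each tier takes the larger of its proportional and absolute candidates.
  const auto = Math.max(
    DEFAULT_PCT * window,
    effectiveWindow - AUTOCOMPACT_BUFFER,
  );
  const warn = Math.max(
    (DEFAULT_PCT - WARN_PCT_OFFSET) * window,
    auto - WARN_BUFFER,
  );
  // hard never drops below auto, so the ordering warn <= auto <= hard holds.
  const hard = Math.max(effectiveWindow - HARD_BUFFER, auto);
  return { warn, auto, hard, effectiveWindow };
}

// Print the ladder for the window sizes exercised by the tests.
for (const w of [32_000, 128_000, 200_000, 1_000_000]) {
  console.log(w, ladder(w));
}
```

For a 200K window the absolute branch wins on every tier: effectiveWindow = 180K, auto = 180K - 13K = 167K, warn = 167K - 20K = 147K, hard = 180K - 3K = 177K, matching the assertions in the 200K test case.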
diff --git a/packages/core/src/services/chatCompressionService.ts b/packages/core/src/services/chatCompressionService.ts
index 97934d819e..a6e2434e1c 100644
--- a/packages/core/src/services/chatCompressionService.ts
+++ b/packages/core/src/services/chatCompressionService.ts
@@ -20,12 +20,7 @@ import {
   resolveSlimmingConfig,
   slimCompactionInput,
 } from './compactionInputSlimming.js';
-
-/**
- * Threshold for compression token count as a fraction of the model's token limit.
- * If the chat history exceeds this threshold, it will be compressed.
- */
-export const COMPRESSION_TOKEN_THRESHOLD = 0.7;
+import { estimatePromptTokens } from './tokenEstimation.js';
 
 /**
  * The fraction of the latest chat history to keep. A value of 0.3
@@ -50,6 +45,108 @@ export const MIN_COMPRESSION_FRACTION = 0.05;
  */
 export const TOOL_ROUND_RETAIN_COUNT = 2;
 
+/**
+ * Hard cap on the compression sideQuery output (summary text only, since
+ * thinking is disabled). Mirrors claude-code's MAX_OUTPUT_TOKENS_FOR_SUMMARY
+ * (autoCompact.ts:30) which is based on p99.99 of real compaction outputs.
+ */
+export const COMPACT_MAX_OUTPUT_TOKENS = 20_000;
+
+/**
+ * Default proportional auto-compaction threshold. Used as a small-window
+ * fallback / safety net inside computeThresholds — when the window is so
+ * small that the absolute branch becomes degenerate, the proportional
+ * branch keeps the trigger usable.
+ */
+export const DEFAULT_PCT = 0.7;
+
+/**
+ * Offset from DEFAULT_PCT used to position the warn tier proportionally
+ * (warn-pct = 0.7 - 0.1 = 0.6). Three-tier ladder makes warn fire
+ * meaningfully before auto on small windows where the absolute formula
+ * would otherwise compress warn flush against auto.
+ */
+export const WARN_PCT_OFFSET = 0.1;
+
+/**
+ * Token budget reserved from the window for compression output. Matches
+ * COMPACT_MAX_OUTPUT_TOKENS because thinking is disabled (see Task 1) and
+ * maxOutputTokens is therefore the hard ceiling on total summary output.
+ */
+export const SUMMARY_RESERVE = COMPACT_MAX_OUTPUT_TOKENS; // 20_000
+
+/**
+ * Distance between auto threshold and effectiveWindow. Matches claude-code's
+ * AUTOCOMPACT_BUFFER_TOKENS (autoCompact.ts:62) — empirically chosen to leave
+ * headroom for the compaction sideQuery round-trip plus a few user-message
+ * turns before the window saturates.
+ */
+export const AUTOCOMPACT_BUFFER = 13_000;
+
+/**
+ * Distance between warn threshold and auto threshold. Matches claude-code's
+ * WARNING_THRESHOLD_BUFFER_TOKENS (autoCompact.ts:63) — sized so the warn
+ * tier fires a couple of turns before auto-compaction in practice.
+ */
+export const WARN_BUFFER = 20_000;
+
+/** Distance between hard threshold and effectiveWindow (matches claude-code's MANUAL_COMPACT_BUFFER). */
+export const HARD_BUFFER = 3_000;
+
+/**
+ * Auto-compaction consecutive-failure circuit breaker. After this many
+ * consecutive failures the cheap-gate NOOPs until a successful force
+ * compress resets the counter. Co-located here with other compaction-
+ * tuning constants; the counter state itself lives on GeminiChat.
+ */
+export const MAX_CONSECUTIVE_FAILURES = 3;
+
+export interface CompactionThresholds {
+  /** Token count at which UI warn tier triggers. */
+  readonly warn: number;
+  /** Token count at which auto-compaction triggers. */
+  readonly auto: number;
+  /** Token count at which auto-compaction is force-triggered (bypasses the consecutive-failure breaker). */
+  readonly hard: number;
+  /** Window minus SUMMARY_RESERVE, clamped at 0; the prompt budget left once the summary output is reserved. */
+  readonly effectiveWindow: number;
+}
+
+/**
+ * Compute the three-tier threshold ladder for a given context window.
+ *
+ * Each tier is `max(proportional, absolute)`:
+ *   auto = max(DEFAULT_PCT * window,                       effectiveWindow - AUTOCOMPACT_BUFFER)
+ *   warn = max((DEFAULT_PCT - WARN_PCT_OFFSET) * window,   auto - WARN_BUFFER)
+ *   hard = max(effectiveWindow - HARD_BUFFER,              auto)   // hard degrades to auto for tiny windows
+ *
+ * Small windows (below ~110K for auto, where 0.7 * window exceeds
+ * window - 33K) fall back to the proportional branch. Large windows are
+ * dominated by the absolute branch, capping wasted reservation to ~33K
+ * instead of 30% of the window.
+ *
+ * Pure function — no I/O, no shared state — safe to call repeatedly.
+ */
+export function computeThresholds(window: number): CompactionThresholds {
+  // Clamp to 0 for tiny windows (window < SUMMARY_RESERVE) so the surfaced
+  // value in `/context` stays meaningful. The Math.max guards on auto/warn/hard
+  // below absorb the floor — clamping does not shift those outputs because
+  // each is `max(proportional, absolute)` and the proportional branch
+  // dominates whenever the absolute branch goes negative.
+  const effectiveWindow = Math.max(0, window - SUMMARY_RESERVE);
+
+  const absAuto = effectiveWindow - AUTOCOMPACT_BUFFER;
+  const auto = Math.max(DEFAULT_PCT * window, absAuto);
+
+  const absWarn = auto - WARN_BUFFER;
+  const warn = Math.max((DEFAULT_PCT - WARN_PCT_OFFSET) * window, absWarn);
+
+  const rawHard = effectiveWindow - HARD_BUFFER;
+  const hard = Math.max(rawHard, auto);
+
+  return { warn, auto, hard, effectiveWindow };
+}
+
 export type CompactTrigger = 'manual' | 'auto';
 
 const hasFunctionCall = (content: Content | undefined): boolean =>
@@ -170,13 +267,16 @@ export interface CompressOptions {
   model: string;
   config: Config;
   /**
-   * Whether a previous unforced compression attempt failed for this chat.
-   * Suppresses auto-compaction; manual `/compress` (force=true) overrides.
+   * Number of consecutive auto-compaction failures for this chat. When it reaches
+   * MAX_CONSECUTIVE_FAILURES, the cheap-gate stops trying until a successful
+   * force=true call resets it.
    */
-  hasFailedCompressionAttempt: boolean;
+  consecutiveFailures: number;
   /**
    * Most recent prompt token count for this chat. Compared against
-   * `threshold * contextWindowSize` for the auto-compaction gate. Callers
+   * `computeThresholds(contextWindowSize).auto` for the auto-compaction
+   * gate, optionally augmented by the pending user message's estimated
+   * token count via `estimatePromptTokens` (see Task 3 / Task 6). Callers
    * source this from the per-chat counter (main session, subagents alike) —
    * the service does not read or write any global telemetry.
    */
@@ -188,6 +288,23 @@ export interface CompressOptions {
    */
   trigger?: CompactTrigger;
   signal?: AbortSignal;
+  /**
+   * Pending user message about to be sent. When present, the cheap-gate
+   * adds its estimated token count to `originalTokenCount` (which reflects
+   * only the prior turn's API usage) so the gate sees the real prompt size.
+   * Optional for backward compatibility with callers that don't have a
+   * user message in hand (e.g. manual /compress force=true paths).
+   */
+  pendingUserMessage?: Content;
+  /**
+   * Pre-computed effective-token count from `estimatePromptTokens()`. When
+   * provided, the cheap-gate skips its own estimation pass (and the
+   * accompanying `chat.getHistoryShallow(true)` clone). Callers that already
+   * computed this value upstream — primarily `sendMessageStream` for the
+   * hard-tier rescue — pass it through to avoid duplicate work.
+   * (review #4168 R1.3 / R1.4)
+   */
+  precomputedEffectiveTokens?: number;
 }
 
 export class ChatCompressionService {
@@ -200,24 +317,25 @@ export class ChatCompressionService {
       force,
       model,
       config,
-      hasFailedCompressionAttempt,
+      consecutiveFailures,
       originalTokenCount,
       trigger,
       signal,
     } = opts;
     const compactTrigger = trigger ?? (force ? 'manual' : 'auto');
     const chatCompressionSettings = config.getChatCompression();
-    const threshold =
-      chatCompressionSettings?.contextPercentageThreshold ??
-      COMPRESSION_TOKEN_THRESHOLD;
     const slimmingConfig = resolveSlimmingConfig(chatCompressionSettings);
 
-    if (threshold <= 0 || (hasFailedCompressionAttempt && !force)) {
+    // Cheap gates first — these don't need the curated history. Forward
+    // originalTokenCount on NOOP (matching the threshold-gate branch below)
+    // so telemetry consumers can distinguish "breaker tripped at N tokens"
+    // from "session has zero tokens".
+    if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES && !force) {
       return {
         newHistory: null,
         info: {
-          originalTokenCount: 0,
-          newTokenCount: 0,
+          originalTokenCount,
+          newTokenCount: originalTokenCount,
           compressionStatus: CompressionStatus.NOOP,
         },
       };
@@ -227,7 +345,26 @@ export class ChatCompressionService {
       const contextLimit =
         config.getContentGeneratorConfig()?.contextWindowSize ??
         DEFAULT_TOKEN_LIMIT;
-      if (originalTokenCount < threshold * contextLimit) {
+      const { auto } = computeThresholds(contextLimit);
+      // Order of preference for the effective-token estimate:
+      //   1. Caller already computed it (sendMessageStream hard-tier rescue)
+      //   2. Compute it here from history + pending user message
+      //   3. Fall back to the raw API-reported count
+      // Path 1 avoids a second `getHistoryShallow(true)` clone per send when
+      // sendMessageStream already paid for one. (R1.3 / R1.4)
+      const pendingUserMessage = opts.pendingUserMessage;
+      const effectiveTokens =
+        opts.precomputedEffectiveTokens !== undefined
+          ? opts.precomputedEffectiveTokens
+          : pendingUserMessage
+            ? estimatePromptTokens(
+                chat.getHistoryShallow(true),
+                pendingUserMessage,
+                originalTokenCount,
+                slimmingConfig.imageTokenEstimate,
+              )
+            : originalTokenCount;
+      if (effectiveTokens < auto) {
         return {
           newHistory: null,
           info: {
@@ -249,8 +386,8 @@ export class ChatCompressionService {
       return {
         newHistory: null,
         info: {
-          originalTokenCount: 0,
-          newTokenCount: 0,
+          originalTokenCount,
+          newTokenCount: originalTokenCount,
           compressionStatus: CompressionStatus.NOOP,
         },
       };
@@ -270,18 +407,26 @@ export class ChatCompressionService {
       }
     }
 
-    // For manual /compress (force=true), if the last message is an orphaned model
-    // funcCall (agent interrupted/crashed before the response arrived), strip it
-    // before computing the split point. After stripping, the history ends cleanly
-    // (typically with a user funcResponse) and findCompressSplitPoint handles it
-    // through its normal logic — no special-casing needed.
+    // Only manual `/compress` (trigger='manual') performs the orphan-strip:
+    // if the chat was interrupted with a trailing model funcCall whose
+    // funcResponse never arrived, the user-initiated /compress between
+    // turns can safely drop it before computing the split point.
     //
-    // auto-compress (force=false) must NOT strip: it fires inside
-    // sendMessageStream() before the matching funcResponse is pushed onto the
+    // Both automatic paths (trigger='auto') — cheap-gate (force=false) AND
+    // hard-rescue (force=true) — must NOT strip. They fire inside
+    // sendMessageStream() BEFORE the pending funcResponse is pushed onto
     // history, so the trailing funcCall is still active, not orphaned.
+    //
+    // Gating on `trigger === 'manual'` instead of `force` disambiguates
+    // "user wants this compressed now, history can be mutated" from
+    // "automatic compression mid-turn, history snapshot is live state and
+    // must be preserved verbatim". Earlier the predicate used `force`,
+    // which is correct for manual /compress (force=true, trigger='manual')
+    // but conflated hard-rescue (force=true, trigger='auto') and silently
+    // stripped active funcCalls there.
     const lastMessage = curatedHistory[curatedHistory.length - 1];
     const hasOrphanedFuncCall =
-      force &&
+      compactTrigger === 'manual' &&
       lastMessage?.role === 'model' &&
       lastMessage.parts?.some((p) => !!p.functionCall);
     const historyForSplit = hasOrphanedFuncCall
@@ -367,9 +512,13 @@ export class ChatCompressionService {
           ],
         },
       ],
-      // Compression quality drives every subsequent main turn — keep reasoning on.
+      // Compression output is bounded by maxOutputTokens to guarantee a predictable
+      // reserve across providers (see docs/design/auto-compaction-threshold-redesign.md).
+      // Thinking is disabled because per-provider thinking-budget semantics are
+      // inconsistent (Anthropic/OpenAI count it separately, Gemini varies by model).
       config: {
-        thinkingConfig: { includeThoughts: true },
+        thinkingConfig: { includeThoughts: false },
+        maxOutputTokens: COMPACT_MAX_OUTPUT_TOKENS,
       },
       abortSignal: signal ?? new AbortController().signal,
       promptId,
@@ -392,6 +541,48 @@ export class ChatCompressionService {
       );
     }
 
+    // Defensive guard: if the side-query hit COMPACT_MAX_OUTPUT_TOKENS, the
+    // summary is likely truncated mid-content and unsafe to persist. Drop it
+    // and surface as a failure so the consecutive-failure breaker counts it —
+    // if the model consistently produces max-length summaries we want to stop
+    // trying after MAX_CONSECUTIVE_FAILURES strikes rather than burn an API
+    // call on every send. Reactive overflow still catches the catastrophic
+    // case. See docs/design/auto-compaction-threshold-redesign.md risk #2.
+    //
+    // TODO(finish_reason): the current `>= cap` check is a heuristic that
+    // false-positives on legitimate summaries that happen to land exactly at
+    // the cap. The proper signal is `finish_reason === 'length'` (OpenAI) /
+    // `MAX_TOKENS` (Gemini), but `runSideQuery` doesn't surface it today.
+    // Plumb it through and tighten this guard when that's available.
+    if (
+      !isSummaryEmpty &&
+      typeof compressionOutputTokenCount === 'number' &&
+      compressionOutputTokenCount >= COMPACT_MAX_OUTPUT_TOKENS
+    ) {
+      config
+        .getDebugLogger()
+        .warn(
+          `[chat-compression] summary output reached the ` +
+            `COMPACT_MAX_OUTPUT_TOKENS cap (${COMPACT_MAX_OUTPUT_TOKENS}); ` +
+            `dropping potentially-truncated result. This counts as a ` +
+            `compression failure for the per-chat circuit breaker.`,
+        );
+      return {
+        newHistory: null,
+        info: {
+          originalTokenCount,
+          newTokenCount: originalTokenCount,
+          // Distinct from EMPTY_SUMMARY so telemetry / logs can tell a
+          // prompt-quality failure (empty summary → tune prompt / splitter)
+          // apart from a capacity failure (output cap hit → raise cap or
+          // shrink splitter input). isCompressionFailureStatus() treats both
+          // as failures so the persistence behaviour is unchanged. (R5.2)
+          compressionStatus:
+            CompressionStatus.COMPRESSION_FAILED_OUTPUT_TRUNCATED,
+        },
+      };
+    }
+
     let newTokenCount = originalTokenCount;
     let extraHistory: Content[] = [];
     let canCalculateNewTokenCount = false;
@@ -429,7 +620,8 @@ export class ChatCompressionService {
       //
       // Note: compressionInputTokenCount includes the compression prompt and
       // the extra "reason in your scratchpad" instruction(approx. 1000 tokens), and
-      // compressionOutputTokenCount may include non-persisted tokens (thoughts).
+      // compressionOutputTokenCount reflects the summary tokens only since
+      // thinking is disabled.
       // We accept these inaccuracies to avoid local token estimation.
       if (
         typeof compressionInputTokenCount === 'number' &&
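The three-tier ladder that `computeThresholds` introduces in the hunks above can be worked through by hand. The sketch below mirrors the formula from the diff, but every constant value in it is a hypothetical placeholder chosen only for illustration (the diff does not show the real values of `SUMMARY_RESERVE`, `AUTOCOMPACT_BUFFER`, `WARN_BUFFER`, `HARD_BUFFER`, `DEFAULT_PCT`, or `WARN_PCT_OFFSET`).

```typescript
// ASSUMPTION: all six constants below are made-up illustration values,
// not the ones defined in chatCompressionService.ts.
const SUMMARY_RESERVE = 20_000;
const AUTOCOMPACT_BUFFER = 13_000;
const WARN_BUFFER = 20_000;
const HARD_BUFFER = 3_000;
const DEFAULT_PCT = 0.7;
const WARN_PCT_OFFSET = 0.1;

interface ThresholdsSketch {
  warn: number;
  auto: number;
  hard: number;
  effectiveWindow: number;
}

// Same shape as the formula in the diff: auto and warn take the larger of a
// proportional and an absolute branch; hard never drops below auto.
function computeThresholdsSketch(window: number): ThresholdsSketch {
  const effectiveWindow = Math.max(0, window - SUMMARY_RESERVE);
  const auto = Math.max(DEFAULT_PCT * window, effectiveWindow - AUTOCOMPACT_BUFFER);
  const warn = Math.max((DEFAULT_PCT - WARN_PCT_OFFSET) * window, auto - WARN_BUFFER);
  const hard = Math.max(effectiveWindow - HARD_BUFFER, auto);
  return { warn, auto, hard, effectiveWindow };
}

// Large window: the absolute branch wins on every tier.
const large = computeThresholdsSketch(1_000_000);
// Small window: the absolute branch is negative, so auto falls back to the
// proportional branch and hard degrades to auto.
const small = computeThresholdsSketch(32_000);

console.log(large, small);
```

With these placeholder constants, a 1M window triggers auto-compaction at 967,000 tokens rather than at 700,000, while a 32K window keeps the proportional trigger and its hard tier collapses onto the auto tier.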
diff --git a/packages/core/src/services/compactionInputSlimming.ts b/packages/core/src/services/compactionInputSlimming.ts
index 7f0fb9f8dd..effe5f83c2 100644
--- a/packages/core/src/services/compactionInputSlimming.ts
+++ b/packages/core/src/services/compactionInputSlimming.ts
@@ -20,7 +20,13 @@ import type { ChatCompressionSettings } from '../config/config.js';
 
 export const DEFAULT_IMAGE_TOKEN_ESTIMATE = 1600;
 
-const TOKEN_TO_CHAR_RATIO = 4;
+/**
+ * Generic char/token conversion factor (claude-code's canonical heuristic).
+ * Exported so adjacent estimators (`tokenEstimation.ts`'s `CHARS_PER_TOKEN`)
+ * stay programmatically linked — if this ever moves, both sites move
+ * together rather than drifting silently.
+ */
+export const TOKEN_TO_CHAR_RATIO = 4;
 const DEFAULT_MIME = 'application/octet-stream';
 
 /**
diff --git a/packages/core/src/services/tokenEstimation.test.ts b/packages/core/src/services/tokenEstimation.test.ts
new file mode 100644
index 0000000000..b853ffc3d1
--- /dev/null
+++ b/packages/core/src/services/tokenEstimation.test.ts
@@ -0,0 +1,91 @@
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, it, expect } from 'vitest';
+import type { Content } from '@google/genai';
+import {
+  estimateContentTokens,
+  estimatePromptTokens,
+} from './tokenEstimation.js';
+
+const textContent = (text: string): Content => ({
+  role: 'user',
+  parts: [{ text }],
+});
+
+describe('estimateContentTokens', () => {
+  it('returns 0 for empty array', () => {
+    expect(estimateContentTokens([])).toBe(0);
+  });
+
+  it('estimates plain text at ~chars/4', () => {
+    // "hello world" = 11 chars → ceil(11/4) = 3
+    expect(estimateContentTokens([textContent('hello world')])).toBe(3);
+  });
+
+  it('sums tokens across multiple messages', () => {
+    const a = textContent('aaaa'); // 4/4 = 1
+    const b = textContent('bbbbbbbb'); // 8/4 = 2
+    expect(estimateContentTokens([a, b])).toBe(3);
+  });
+
+  it('estimates inlineData via imageTokenEstimate', () => {
+    const c: Content = {
+      role: 'user',
+      parts: [{ inlineData: { mimeType: 'image/png', data: 'xxx' } }],
+    };
+    // estimateContentChars uses imageTokenEstimate * TOKEN_TO_CHAR_RATIO (4)
+    // for inlineData, so estimateContentTokens divides back by 4 → 1600
+    expect(estimateContentTokens([c], 1600)).toBe(1600);
+  });
+
+  it('estimates a positive count for functionCall (json-dense) parts', () => {
+    const c: Content = {
+      role: 'model',
+      parts: [{ functionCall: { name: 'foo', args: { a: 1, b: 2 } } }],
+    };
+    const result = estimateContentTokens([c]);
+    expect(result).toBeGreaterThan(0);
+  });
+
+  it('estimates a positive count for functionResponse (nested parts)', () => {
+    // functionResponse takes a distinct branch in estimateContentChars
+    // (nested parts walk + json-stringify fallback). Tool-heavy
+    // conversations are where context grows fastest, so locking coverage
+    // here protects the trigger from undercounting. (review #4168 R3.5)
+    const c: Content = {
+      role: 'user',
+      parts: [
+        {
+          functionResponse: {
+            name: 'tool',
+            response: { result: 'data'.repeat(100) },
+          },
+        },
+      ],
+    };
+    const result = estimateContentTokens([c]);
+    expect(result).toBeGreaterThan(0);
+  });
+});
+
+describe('estimatePromptTokens', () => {
+  const history: Content[] = [
+    textContent('older message a'),
+    textContent('older message b'),
+  ];
+  const user = textContent('current user message');
+
+  it('uses lastPromptTokenCount + user-message estimate when count > 0', () => {
+    const userEst = estimateContentTokens([user]);
+    expect(estimatePromptTokens(history, user, 5000)).toBe(5000 + userEst);
+  });
+
+  it('falls back to full estimate when lastPromptTokenCount is 0', () => {
+    const fullEst = estimateContentTokens([...history, user]);
+    expect(estimatePromptTokens(history, user, 0)).toBe(fullEst);
+  });
+});
diff --git a/packages/core/src/services/tokenEstimation.ts b/packages/core/src/services/tokenEstimation.ts
new file mode 100644
index 0000000000..cd86642192
--- /dev/null
+++ b/packages/core/src/services/tokenEstimation.ts
@@ -0,0 +1,78 @@
+/**
+ * @license
+ * Copyright 2025 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import type { Content } from '@google/genai';
+import {
+  DEFAULT_IMAGE_TOKEN_ESTIMATE,
+  TOKEN_TO_CHAR_RATIO,
+  estimateContentChars,
+} from './compactionInputSlimming.js';
+
+/**
+ * Average characters-per-token for char-based token estimation. The inputs
+ * are character counts from `estimateContentChars` (i.e. `string.length`),
+ * not byte counts — for CJK / multi-byte text the byte/char ratio differs
+ * from 1, so a "bytes" name would mislead. Programmatically aliased to
+ * compactionInputSlimming.ts's TOKEN_TO_CHAR_RATIO so the auto-compaction
+ * trigger and the compression splitter can never drift on this constant.
+ * Matches claude-code's roughTokenCountEstimation default. (review #4168 R3.1)
+ */
+export const CHARS_PER_TOKEN = TOKEN_TO_CHAR_RATIO;
+
+/**
+ * Estimate the token count of a list of Content objects via char/4.
+ *
+ * Reuses `estimateContentChars` so that inlineData / functionCall /
+ * functionResponse get the same treatment they receive when computing
+ * compression split points — keeping the two estimators in sync prevents
+ * the auto-compaction trigger and the splitter from disagreeing on size.
+ *
+ * Intended for the pre-send threshold gate only. char/4 is a rough estimate
+ * (real tokenizers vary ±30%), not a strict bound: using it to TRIGGER
+ * compaction is safe (false positive at worst); using it to SKIP is not.
+ */
+export function estimateContentTokens(
+  contents: Content[],
+  imageTokenEstimate: number = DEFAULT_IMAGE_TOKEN_ESTIMATE,
+): number {
+  let totalChars = 0;
+  for (const content of contents) {
+    totalChars += estimateContentChars(content, imageTokenEstimate);
+  }
+  return Math.ceil(totalChars / CHARS_PER_TOKEN);
+}
+
+/**
+ * Compute an effective prompt-token count for the auto-compaction gate.
+ *
+ * `lastPromptTokenCount` (from the previous turn's usage metadata) lacks
+ * two things: the current user message, and any initial value on the
+ * very first send. This helper closes both gaps via local estimation.
+ *
+ * WARNING: like estimateContentTokens, this is a rough estimate that can
+ * undercount. Use it to TRIGGER earlier, never to SKIP: the fallback path
+ * (lastPromptTokenCount === 0) returns a pure estimate with no
+ * API-authoritative anchor.
+ */
+export function estimatePromptTokens(
+  history: Content[],
+  userMessage: Content,
+  lastPromptTokenCount: number,
+  imageTokenEstimate: number = DEFAULT_IMAGE_TOKEN_ESTIMATE,
+): number {
+  if (lastPromptTokenCount > 0) {
+    return (
+      lastPromptTokenCount +
+      estimateContentTokens([userMessage], imageTokenEstimate)
+    );
+  }
+  // First-send fallback (no API data yet): estimate from `history + userMessage`
+  // only. This MISSES the system prompt (~8-15K), tool definitions (~5K),
+  // skill content, and cache headers — typically ~15-20K of under-estimate.
+  // The reactive overflow handler is the safety net if the hard-tier rescue
+  // misses for that reason. See review #4168 R3.3.
+  return estimateContentTokens([...history, userMessage], imageTokenEstimate);
+}
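The two branches of `estimatePromptTokens` can be traced with a simplified stand-in. The sketch below is an illustration only: it assumes plain strings in place of `Content` objects and a local char/4 counter in place of `estimateContentChars`, and the names `estimateTextTokens` and `estimatePromptTokensSketch` are hypothetical.

```typescript
// ASSUMPTION: messages are plain strings here; the real helper walks
// Content parts via estimateContentChars.
const CHARS_PER_TOKEN_SKETCH = 4;

function estimateTextTokens(messages: string[]): number {
  let totalChars = 0;
  for (const message of messages) {
    totalChars += message.length;
  }
  // Round up once over the summed characters, as the diff's helper does.
  return Math.ceil(totalChars / CHARS_PER_TOKEN_SKETCH);
}

function estimatePromptTokensSketch(
  history: string[],
  userMessage: string,
  lastPromptTokenCount: number,
): number {
  if (lastPromptTokenCount > 0) {
    // Anchored path: trust the API-reported count, add only the new message.
    return lastPromptTokenCount + estimateTextTokens([userMessage]);
  }
  // First-send fallback: no API count yet, so estimate everything locally.
  return estimateTextTokens([...history, userMessage]);
}

const history = ['older message a', 'older message b']; // 15 + 15 chars
const user = 'current user message'; // 20 chars

const anchored = estimatePromptTokensSketch(history, user, 5000); // 5000 + ceil(20 / 4)
const fallback = estimatePromptTokensSketch(history, user, 0); // ceil(50 / 4)

console.log({ anchored, fallback }); // { anchored: 5005, fallback: 13 }
```

The gap between the two results shows why the fallback path is the weaker signal: it sees only the conversation text, not the system prompt or tool definitions that the API-reported count already includes.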