fix(cli): stabilize flaky sticky-todo remeasure test#4416
Conversation
Replace absolute mock.calls.length assertion with mockClear() + not.toHaveBeenCalled() in the sticky todo status-only update test. The previous assertion captured the total measureElement call count after initial render, rerendered, and checked the count was unchanged. This was flaky on CI (macOS runner) because React 19's Ink test renderer can invoke useLayoutEffect a variable number of times during mount (StrictMode double-invoke, multiple reconciliation passes), making the absolute count unreliable across environments. The new approach resets the mock after initial render and asserts no new calls occur during rerender — clearly expressing the test intent and eliminating environment-dependent flakiness.
📋 Review SummaryThis PR fixes a flaky CI test in 🔍 General Feedback
🎯 Specific Feedback🔵 Low
✅ Highlights
|
pomelo-nwu
left a comment
There was a problem hiding this comment.
Thanks for the review. On the Low suggestion about expanding the mockConfig.initialize comment — the existing 4-line comment already covers the full reasoning (what throws, why React double-invokes, and how it leaks into later tests). Adding more would be redundant. Moving on.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Replace fragile measureElement call-count assertion with a behavioral assertion on availableTerminalHeight stability, wrapped in act() to flush useLayoutEffect timing. The original test asserted that measureElement was not called after rerender when only todo status changed (pending -> in_progress). This was flaky because: 1. The absolute mock.calls.length count was environment-dependent (React 19 StrictMode double-invoke, variable reconciliation passes) 2. Even with mockClear(), the useLayoutEffect fires for legitimate reasons (buffer ref, btwItem) unrelated to sticky todo status, especially on Windows CI runners 3. The controlsHeight state (useState(0)) races with useLayoutEffect's first measurement — mainControlsRef.current may be null on initial render, causing controlsHeight to settle at different times The fix: - Assert on availableTerminalHeight (the behavioral outcome exposed via UIState context) rather than measureElement call count - Wrap render + rerender in act() to ensure useLayoutEffect and setControlsHeight fully settle before capturing the baseline - Consolidate duplicate react imports
|
Pushed a follow-up commit (f6e7762) that further stabilizes the test:
The Verified: 10/10 consecutive runs pass locally, full suite (77 tests) passes. |
| // The sticky todo status change (pending → in_progress) must not alter | ||
| // the computed terminal height. This asserts on the behavioral outcome | ||
| // rather than the implementation detail of measureElement call count. | ||
| expect(capturedUIState.availableTerminalHeight).toBe(heightAfterSettle); |
There was a problem hiding this comment.
[Suggestion] The behavioral assertion can pass tautologically under this test's mock setup, so it may not catch the regression the test name describes.
With mockedMeasureElement.mockReturnValue({ width: 80, height: 4 }) returning a constant and setControlsHeight in AppContainer.tsx:2312-2316 short-circuiting on prev === measurement.height, controlsHeight is pinned at 4 regardless of how many times the layout effect fires. Combined with constant terminalHeight=24, staticExtraHeight=3, tabBarHeight=0, the computation availableTerminalHeight = max(0, 24 - 4 - 3 - 2 - 0) = 15 is mechanically forced to the same value across every rerender. A regression that re-includes status into stickyTodosLayoutKey would re-fire the useLayoutEffect (the original target of this test), but the assertion would still pass.
The trade-off documented in the follow-up commit (call-count was too sensitive to buffer/btwItem re-renders) is fair, but it can be reconciled by making the mock return a different height on subsequent invocations so that a re-measurement actually changes the computed outcome:
| expect(capturedUIState.availableTerminalHeight).toBe(heightAfterSettle); | |
| // After the settle rerender, change the mock so any further measurement | |
| // would visibly shift availableTerminalHeight. Sticky-todo status-only | |
| // updates must NOT trigger a remeasurement. | |
| mockedMeasureElement.mockReturnValue({ width: 80, height: 7 }); | |
| historyManager.history = makeTodoHistory('in_progress'); | |
| await act(async () => { | |
| view!.rerender( | |
| <AppContainer | |
| config={mockConfig} | |
| settings={mockSettings} | |
| version="1.0.0" | |
| initializationResult={mockInitResult} | |
| />, | |
| ); | |
| }); | |
| // The sticky todo status change (pending → in_progress) must not alter | |
| // the computed terminal height. With the mock height now 7, any spurious | |
| // remeasurement would shift availableTerminalHeight from 15 to 12. | |
| expect(capturedUIState.availableTerminalHeight).toBe(heightAfterSettle); |
That preserves the behavioral framing the follow-up commit moved toward, while restoring real detection power for the regression class this test is named after.
— claude-opus-4-7 via Claude Code /qreview
| />, | ||
| ); | ||
| await act(async () => { | ||
| view!.rerender( |
There was a problem hiding this comment.
[Suggestion] let view: ReturnType<typeof render>; plus view!.rerender(...) swallows the diagnostic when the first act(async () => { view = render(...) }) throws inside render itself. If render fails (e.g. a future refactor breaks a mock shape), view stays undefined and this line surfaces as TypeError: Cannot read properties of undefined (reading 'rerender'), hiding the real failure several frames up.
The rest of this file (e.g. lines 513, 950, 2360-ish in earlier tests) assigns render synchronously, which makes the assignment site itself the failure point. Consider hoisting the initial render out of act:
const view = render(
<AppContainer
config={mockConfig}
settings={mockSettings}
version="1.0.0"
initializationResult={mockInitResult}
/>,
);
// Flush the async mount effect (config.initialize → setConfigInitialized) so
// the post-settle availableTerminalHeight is a stable baseline.
await act(async () => {});
await act(async () => { view.rerender(...); });That keeps the act wrapping where it's load-bearing (flushing the async setConfigInitialized chain that the rerender drains) while making any render-time failure show up at the actual render(...) site with view typed as non-undefined.
— claude-opus-4-7 via Claude Code /qreview
LaZzyMan
left a comment
There was a problem hiding this comment.
Review
The act wrapping and the explicit settle rerender are reasonable hygiene for React 19 + Ink async effects, and identifying the cross-test Config was already initialized leak is a real diagnosis. The implementation as written has three problems worth addressing before merge.
1. The new assertion silently accepts the regression the original test was written to catch (severity: high · confidence: very high)
The mocked measureElement is hard-coded to return { width: 80, height: 4 }, and the production code that consumes it has a same-value short-circuit when setting controlsHeight. So if the underlying optimization regresses and measureElement is called again on a status-only rerender — exactly the regression the original test was added to catch — the height value comes back identical, controlsHeight doesn't change, and availableTerminalHeight stays the same. The new equality assertion passes either way. The previous test directly validated "no re-measure on status-only update"; this one validates "the optimization is invisible," but the fixture is rigged so that breaking the optimization is also invisible. Either restore a relative call-count assertion (mockClear() after the settle pass, then not.toHaveBeenCalled() after the status rerender — which is what the PR body promises), or make the mocked measureElement return distinguishable heights per call so the behavioral assertion can actually detect re-measurement.
2. The beforeEach initialize stub flips a config gate for ~75 unrelated tests (severity: high · confidence: high)
Adding vi.spyOn(mockConfig, 'initialize').mockResolvedValue(undefined) to beforeEach changes branch coverage for the entire describe('AppContainer State Management') block, not just the flaky test. Previously initialize() rejected on React's double-mount so setConfigInitialized(true) was never reached, and isConfigInitialized stayed false across all ~75 tests in the block. Now it flips to true, activating every isConfigInitialized-gated effect in AppContainer.tsx. Some of those tests likely passed previously because the gated branches were inert; they may now pass for the wrong reason, or start failing. The right scope is per-test: narrow the stub back to the one test that needs it (there's already a redundant inline stub elsewhere in the file that should also be removed), or document a deliberate decision to flip the gate for the whole block after verifying no other test silently relied on it being off.
3. PR description describes a different fix than the diff implements (severity: medium · confidence: very high)
The PR body says the change is mockClear() after initial render + not.toHaveBeenCalled() after rerender. The diff does neither — it uses await act(async) wrapping, a settle rerender pass, an availableTerminalHeight equality assertion, and a global initialize stub. Future bisectors and reviewers will be misled by the mismatch; please update the description to match what was actually shipped (especially the rationale for the global initialize stub).
Verdict
REQUEST_CHANGES — the new assertion silently accepts the regression it's meant to catch, the global initialize stub broadens scope well beyond one flaky test without justification, and the description doesn't match the diff.
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen-latest-series-invite-beta-v36 via Qwen Code /review
- Narrow the `mockConfig.initialize` stub from `beforeEach` (which flipped `isConfigInitialized` for ~75 tests in the block) back to the single test that needs it. Other tests now exercise the real init gate as before. - Strengthen the behavioral assertion: switch `measureElement`'s mocked return value between the settle phase and the status-only rerender, so any re-measurement triggered by the status change would change `controlsHeight` and break the equality assertion. Without this, the production same-value short-circuit on `setControlsHeight` made the assertion pass even when the optimization regressed. The core layout-key contract (status-only changes return the same key) is already directly covered by `todoSnapshot.test.ts` — this integration test provides layered protection on top. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. LGTM! ✅ — qwen3.7-max via Qwen Code /review
Thanks @LaZzyMan for the careful review. Addressing each point in commit 25a5a91:
(high) Assertion silently accepts the regression — Added the measureElement mock-height swap between settle and rerender, so the equality assertion now actually distinguishes "optimization works" from "optimization regresses" (in environments where useLayoutEffect would re-fire). The most direct regression protection still lives in todoSnapshot.test.ts — this test is layered protection.
(high) Global initialize stub flipped a config gate for ~75 tests — Narrowed back to per-test scope. Removed from beforeEach; added inline only in the sticky-todo test that needs it.
(medium) PR description didn't match the diff — Rewritten above to describe what was actually shipped and why.
Local maintainer validation — flake stabilized ✅Reviewed at head Environment
Results
The fix — what changedThe PR replaces a fragile spy call-count assertion with a behavioral one: Before (broken): // Record spy call count after initial render, assert no new calls after status-only rerender
const callsAfterInitialRender = mockedMeasureElement.mock.calls.length;
view.rerender(...);
expect(mockedMeasureElement).toHaveBeenCalledTimes(callsAfterInitialRender);This was flaky because After (stable):
The core contract ("status-only changes return the same layout key") is already directly unit-tested in 5× consecutive run — proof of stabilityThis is the same test that was failing intermittently across multiple recent PR validations (#4451, #4453, #4459) — zero flakes in 5 consecutive runs. Triage of the 15 unrelated CLI failures (NOT caused by this PR)
Zero failures in Reviewer recommendationSafe to merge. This is a high-quality test stabilization:
— Maintainer local validation, run on |
Fixes #4415
What
Stabilizes a flaky CI test in
AppContainer.test.tsx— "does not remeasure footer height for sticky todo status-only updates" was intermittently failing on the macOS/Windows runners.Background
A first attempt (#4415, already merged) replaced the absolute
mock.calls.lengthbaseline withmockClear()+not.toHaveBeenCalled(). That fix was still flaky on the Windows CI runner becauseuseLayoutEffectlegitimately fires on rerender for reasons unrelated to sticky-todo status (buffer ref churn,btwItemupdates), so any strict call-count assertion remains environment-dependent.Change
This PR switches to a behavioral assertion that tolerates legitimate effect runs but still catches the regression the test is meant to guard:
act()wrapping + explicit settle rerender — flushes React 19 + Ink asyncuseLayoutEffecttiming so the baseline is stable before the status-only rerender.availableTerminalHeightequality (read fromUIStateContext) instead ofmeasureElementcall count — the behavioral outcome is what actually matters.measureElementheight between settle and rerender — without this, the production same-value short-circuit onsetControlsHeightwould hide the regression. With the swap, any re-measurement during the status-only rerender changescontrolsHeightand breaks the equality assertion.mockConfig.initializestub —makeFakeConfig().initialize()rejects on React's double-mount, leaking async renders into footer-measurement timing. The stub is scoped to this one test (notbeforeEach) so other tests in the block still exercise the real init gate.The core layout-key contract ("status-only changes return the same key") is already directly covered by
todoSnapshot.test.ts:253-255. This integration test provides layered protection on top of that unit test.Verification
AppContainer.test.tsxsuite (77 tests) — all passed.Review feedback addressed
Thanks @LaZzyMan for the careful review. Addressing each point in commit 25a5a91:
measureElementmock-height swap between settle and rerender, so the equality assertion now actually distinguishes "optimization works" from "optimization regresses" (in environments whereuseLayoutEffectwould re-fire). The most direct regression protection still lives intodoSnapshot.test.ts— this test is layered protection.initializestub flipped a config gate for ~75 tests — Narrowed back to per-test scope. Removed frombeforeEach; added inline only in the sticky-todo test that needs it.中文说明
修复
AppContainer.test.tsx中 "does not remeasure footer height for sticky todo status-only updates" 测试在 CI 上的偶现失败。背景:#4415 首次尝试用
mockClear()+not.toHaveBeenCalled()替代绝对调用次数断言,但在 Windows CI 上仍 flaky ——useLayoutEffect会因为 buffer ref、btwItem 等合理原因在 rerender 时触发,任何严格的 call-count 断言都会受环境影响。改动:
act()wrapping + 显式 settle rerender —— 在 React 19 + Ink 异步环境下 flush effect 时序,让 baseline 稳定。availableTerminalHeight等值(从UIStateContext读取)替代measureElement调用次数 —— 行为结果才是关键。measureElement的返回 height —— 不切换的话,生产代码里setControlsHeight的 same-value short-circuit 会让回归隐形;切换后,status-only rerender 触发的任何重测量都会改变controlsHeight,打破等值断言。mockConfig.initialize的 stub 缩到 per-test 范围 —— 不放到beforeEach,避免污染同 describe block 下其他 ~75 个测试的isConfigInitialized分支。核心 layout-key 契约("status-only change 返回相同 key")已在
todoSnapshot.test.ts:253-255单元测试中直接覆盖,本集成测试是叠加保护。回应 @LaZzyMan 的 review(commit 25a5a91):
todoSnapshot.test.ts。