feat(core): atomic write rollout for credentials, memory, config, JSONL (closes #3681, #4095 Phase 2)#4333
feat(core): atomic write rollout for credentials, memory, config, JSONL (closes #3681, #4095 Phase 2)#4333doudouOUC wants to merge 16 commits into
Conversation
Sync mirror of atomicWriteFile for code paths that can't await (settings persistence on exit, sync config writers). Same semantics: symlink chain resolution, permission preservation, fsync via flush:true, EPERM/EACCES rename retry, EXDEV fallback to direct write. Add forceMode option on AtomicWriteFileOptions — when true, ignore the existing target's permission bits and apply options.mode regardless. Needed for credential files that must heal historically over-permissive files (e.g. a 0o644 token restored from backup must be forced to 0o600). Honored by both async and sync paths. Default false preserves existing behavior. Reuses Atomics.wait for true blocking sleep in renameWithRetrySync — no busy-wait, no extra dep. Refs: #4095 Phase 2
…ier 1) Route all OAuth credential persistence through atomicWriteFile with forceMode: true, so a process crash mid-write cannot leave the user with a half-written token file, and historically over-permissive files (e.g. 0o644 from a manual restore) are healed to 0o600 on the next write. - oauth-token-storage.ts: setCredentials, deleteCredentials - file-token-storage.ts: saveTokens (encrypted MCP token storage) - qwenOAuth2.ts: cacheQwenCredentials (also fixes missing mode — was inheriting 0o644 from umask, now forced to 0o600) - sharedTokenManager.ts: saveCredentialsToFile — drops ~15 lines of hand-rolled tmp + rename in favor of the shared helper Lock-file writes using flag: 'wx' (sharedTokenManager.ts:720) are intentionally left untouched — they rely on exclusive-create semantics that atomic write does not preserve. Tests updated to mock atomicWriteFile instead of fs.writeFile. Refs: #4095 Phase 2
…Tier 2) Route all auto-memory state persistence through atomicWriteFile so a process crash during a dream/extract/forget cycle cannot corrupt the metadata sidecar, extraction cursor, or topic body files. Touched: manager (writeDreamMetadata), extract (writeExtractCursor + bumpMetadata), indexer (rebuild), dream (bumpDreamMetadata), forget (bumpMetadata + topic body rewrite). manager.ts:362 acquireDreamLock uses flag: 'wx' for exclusive create — left untouched, atomic write does not preserve that semantic. Uses atomicWriteFile (not atomicWriteJSON) to preserve the trailing newline these files have always had. Refs: #4095 Phase 2
…4095 Tier 3a) Route the remaining state-file write paths through atomic helpers so a crash mid-write cannot corrupt config, log, or session-scoped state: - trustedFolders.ts (sync): atomicWriteFileSync — sole path that flips workspace trust, must not half-write - logger.ts (4 sites): atomicWriteFile — full-file JSON rewrites for logs.json and per-checkpoint files - tipHistory, installationManager, projectSummary, todoWrite, trustedHooks: bonus sites with the same shape (state JSON written multiple times per session) todoWrite is on the hot path — writes every time the todo list mutates — so the added rename + fsync cost is measurable (a few ms per write on SSD). Trade-off accepted to avoid a half-written todos file silently breaking the next session's resume. Export atomicWriteFile / atomicWriteFileSync from the core public index so CLI-side callers (trustedFolders, tipHistory) can reach them. Tests updated: - logger.test.ts uses vi.importActual to re-export the real helper and override per-test via vi.mocked(atomicWriteFile).mockRejectedValueOnce - trustedFolders.test.ts and todoWrite.test.ts mock the helper directly Refs: #4095 Phase 2
#3656 fixed the read side of glued '}{' JSONL records — when a process was killed mid-appendFile, the trailing '\n' was lost and the next record was concatenated. The write side was left for a follow-up (#3681). This adds flush:true (fsync) to every per-line append: - jsonl-utils.ts writeLine / writeLineSync (session transcripts, auto-titles, prompt history) - debugLogger.ts appendFile (per-session debug log) jsonl-utils.ts write() (full-file replace) now goes through atomicWriteFileSync so a crash during overwrite cannot corrupt the session transcript either. Trade-off: fsync on every append adds disk-sync latency (single-digit ms on SSD, more on spinning disk / network FS). Acceptable for a few writes per turn; the alternative is silently losing the last record of every interrupted session, which #3681 explicitly flagged. Refs: #4095 Phase 2 Tier 3b Closes: #3681
Catch up two sites where Claude Code's equivalent path is atomic but qwen-code's isn't (verified against /Users/jinye.djy/Projects/claude-code on 2026-05-19): - extension/extensionManager.ts:533, :1073 — enablement config and install metadata writes. Claude Code's plugin install-counts and zip cache use atomic temp+rename via writeFileSyncAndFlush_DEPRECATED. - lsp/NativeLspService.ts:1351 — applying an LSP edit to a user file. Claude Code's FileWriteTool/FileEditTool both route through atomic writeTextContent → writeFileSyncAndFlush_DEPRECATED. A bare writeFileSync here could half-write the user's source file if the process is killed during an LSP-driven rename or quick-fix. Also clean up stale fs.rename mock setups in sharedTokenManager.test.ts that became no-ops after Tier 1 migration (rename is no longer called by saveCredentialsToFile). The fs.writeFile mocks stay because the wx-flag lock path still uses them. Refs: #4095 Phase 2
- packages/core/src/index.ts: move atomicFileWrite export to its alphabetical position (before browser.js) - tipHistory.ts: add forceMode: true to atomicWriteFileSync for consistency with other 0o600 sites — heals legacy 0o644 files even though tips are non-critical Refs: #4095 Phase 2
Three issues caught by post-merge Codex review of the #4095 Phase 2 branch — none had user-visible symptoms yet but all were latent bugs. 1. atomicFileWrite: forceMode without mode silently downgraded perms `if (!options?.forceMode)` skipped the existing-mode stat whenever forceMode was true, regardless of whether `mode` was also supplied. Calling `atomicWriteFile(p, data, { forceMode: true })` (no mode) on an existing 0o600 file produced 0o644 (umask default) instead of preserving 0o600. Tightened the guard to also require `options.mode` to be defined; mirrored fix in atomicWriteFileSync. Added two regression tests (async + sync) that assert mode preservation. 2. logger.test.ts: vi.resetAllMocks() blanked the atomicWriteFile shim The vi.fn(actual.atomicWriteFile) factory implementation gets reset to a no-op by `vi.resetAllMocks()` in beforeEach, which would make `logger.initialize()` silently skip creating logs.json on disk. Tests passed by coincidence (file pre-existence from prior runs). Captured the real implementation at module load and re-attach it via `mockImplementation` after each reset. 3. NativeLspService.applyTextEdits: atomic write bypassed file unwritability The read catch swallowed every error and treated it as "new empty file". With atomic write (tmp + rename), an unreadable target on a writable parent could be replaced with edits applied to an empty buffer — the old fs.writeFileSync would have errored on the target permission. Now only ENOENT is treated as new-file; other read errors (EACCES, EISDIR, etc.) propagate. Refs: #4095 Phase 2
The previous fix only handled "read failed → propagate the error". Codex round 2 caught the remaining gap: a file that's readable but chmod 0444 (read-only) would still be replaced by the atomic rename, because rename only needs parent-directory write access. Add an explicit fs.accessSync(W_OK) check before the atomic write. ENOENT is allowed through so LSP can still create new files via edits. Refs: #4095 Phase 2
…nd 3)
`saveCredentialsToFile` wrapped `atomicWriteFile` in `withTimeout(5000)`.
If the call hits the 5s budget (e.g. slow NFS home, network-backed
storage, fsync added by Phase 2), withTimeout rejects but the
atomicWriteFile internal write+rename keeps running unobserved:
1. withTimeout rejects → saveCredentialsToFile throws
2. performTokenRefresh `finally` releases the refresh lock
3. Another process acquires the lock and writes newer credentials
4. The original atomicWriteFile finally completes its rename and
overwrites the newer credentials — silent token rollback
Pre-migration the code awaited the tmp write and the rename in two
separate withTimeout calls; a timed-out tmp write never reached the
rename so there was no race against the target file. The migration
collapsed both into one inseparable atomicWriteFile, which made the
timeout actively unsafe (the work cannot be cancelled after the
timeout fires — fs.rename is not abortable).
Atomic write is durable by design — accept the I/O latency. The
mkdir and stat timeouts are kept (idempotent and read-only
respectively, no corruption risk on late completion).
Refs: #4095 Phase 2
There was a problem hiding this comment.
Pull request overview
This PR expands the atomic file-write rollout across core persistence paths, focusing on credential safety, JSONL/session durability, memory metadata, config/state files, and LSP-driven edits.
Changes:
- Adds synchronous atomic write support and
forceModepermission tightening. - Migrates high-value bare writes/appends to atomic writes or flushed appends.
- Updates affected unit tests and mocks to validate the new write paths.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
packages/core/src/utils/atomicFileWrite.ts |
Adds forceMode, sync atomic write helper, and sync rename retry support. |
packages/core/src/utils/atomicFileWrite.test.ts |
Adds sync helper and forceMode regression coverage. |
packages/core/src/utils/jsonl-utils.ts |
Flushes JSONL appends and atomically rewrites full JSONL files. |
packages/core/src/utils/debugLogger.ts |
Flushes debug log appends. |
packages/core/src/utils/debugLogger.test.ts |
Updates append expectations for flushed writes. |
packages/core/src/utils/projectSummary.ts |
Uses atomic write for welcome-back state. |
packages/core/src/utils/installationManager.ts |
Uses sync atomic write for installation ID persistence. |
packages/core/src/core/logger.ts |
Uses atomic writes for log and checkpoint JSON files. |
packages/core/src/core/logger.test.ts |
Mocks atomic writes while preserving real disk behavior by default. |
packages/core/src/qwen/qwenOAuth2.ts |
Writes cached Qwen credentials atomically with restricted mode. |
packages/core/src/qwen/qwenOAuth2.test.ts |
Adds atomic write mock. |
packages/core/src/qwen/sharedTokenManager.ts |
Replaces manual temp/rename credential write with atomic helper. |
packages/core/src/qwen/sharedTokenManager.test.ts |
Updates credential save mocks. |
packages/core/src/mcp/oauth-token-storage.ts |
Stores MCP OAuth token files with atomic restricted writes. |
packages/core/src/mcp/oauth-token-storage.test.ts |
Updates tests for atomic token writes. |
packages/core/src/mcp/token-storage/file-token-storage.ts |
Stores encrypted token file with atomic restricted write. |
packages/core/src/mcp/token-storage/file-token-storage.test.ts |
Updates encrypted token write expectations. |
packages/core/src/memory/manager.ts |
Writes auto-memory metadata atomically. |
packages/core/src/memory/indexer.ts |
Writes memory index atomically. |
packages/core/src/memory/forget.ts |
Writes memory metadata and topic files atomically. |
packages/core/src/memory/extract.ts |
Writes extraction cursor and metadata atomically. |
packages/core/src/memory/dream.ts |
Writes dream metadata atomically. |
packages/core/src/tools/todoWrite.ts |
Writes todo state atomically. |
packages/core/src/tools/todoWrite.test.ts |
Updates todo write tests for atomic helper. |
packages/core/src/lsp/NativeLspService.ts |
Applies LSP edits via sync atomic write with read/write error handling. |
packages/core/src/hooks/trustedHooks.ts |
Writes trusted hooks config atomically. |
packages/core/src/extension/extensionManager.ts |
Writes extension enablement and install metadata atomically. |
packages/core/src/index.ts |
Re-exports atomic file write utilities. |
packages/cli/src/services/tips/tipHistory.ts |
Writes tip history atomically with restricted mode. |
packages/cli/src/config/trustedFolders.ts |
Writes trusted folders config atomically with restricted mode. |
packages/cli/src/config/trustedFolders.test.ts |
Updates trusted folder save expectations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…bling-cook # Conflicts: # packages/core/src/qwen/qwenOAuth2.ts
Address PR review suggestions from wenshao (via qwen-latest /review): neither renameWithRetry/Sync nor the EXDEV cross-device fallback had direct test coverage. Both paths are critical (Windows AV contention, Docker tmpfs /tmp) and a regression would degrade silently. Vitest can't spy on ESM exports of `node:fs` (`Cannot redefine property: renameSync`), so add narrow internal test seams instead: - renameWithRetry / renameWithRetrySync take an optional `_renameImpl` parameter, defaulting to fs.rename / fs.renameSync. - atomicWriteFile / atomicWriteFileSync take an optional `_testFs` parameter with `rename` and `writeFile` overrides, forwarded to the retry helper and used in the EXDEV fallback branch. The seams are underscore-prefixed and JSDoc-tagged as "Internal test seam — production callers never pass this", which keeps the public API clean while making the behavior testable. New coverage (+9 tests, 36 → 45): - renameWithRetry: retry-EPERM-then-succeed, give-up after retries, no-retry on non-retryable (ENOSPC) - renameWithRetrySync: same 3 patterns (EACCES, EPERM exhausted, EINVAL) - EXDEV fallback: async direct write + tmp cleanup, sync ditto, non-EXDEV failure propagates without fallback (rejects EIO + tmp cleanup) Refs: #4333, #4095 Phase 2
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
…sh:true CI failure on all 3 OSes (macos / ubuntu / windows): sdk.test.ts asserted `fs.appendFile` was called with `'utf8'` as the 3rd positional argument, but commit b7badc7 (#4095 Tier 3b — JSONL fsync) changed the `debugLogger.ts` appendFile call from string-form to options-form `{ encoding: 'utf8', flush: true }` to enable per-line fsync. Update the 3 assertions in the telemetry diagnostics test to match the new shape. No behavior change — debugLogger still flushes per append; only the assertion in this previously-unrelated suite needed updating. Refs: #4333, #4095 Phase 2 Tier 3b
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] qwenOAuth2.ts cacheQwenCredentials was de-migrated from atomicWriteFile back to a hand-rolled temp+chmod+rename pattern during merge conflict resolution (commit 7b1a7ae38). The PR description still lists qwenOAuth2 under Tier 1 migration and the behavior-changes table references qwen/qwenOAuth2.ts:982, but the file is no longer in the diff and line 982 now points to pollDeviceToken (the function starts at ~1060 after the merge).
The hand-rolled version uses a predictable temp path component (process.pid) versus atomicWriteFile's crypto.randomBytes(6), and lacks EPERM rename retry and EXDEV fallback. Consider either re-migrating to atomicWriteFile (consolidating onto the shared, tested helper) or moving qwenOAuth2 to the "Deliberately not in scope" section with a rationale and fixing the stale line reference.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
…4333 review) Address three review suggestions from wenshao (via qwen-latest /review), each pointing at a real coverage gap introduced by this PR: 1. NativeLspService.applyTextEdits error branches (round-2 LSP fix): the ENOENT-only read guard and the fs.accessSync(W_OK) refusal had no automated coverage. Added 3 tests accessing applyTextEdits via a typed cast (the method is private; making it public for one verification inflates API surface). Tests use chmod 0000 / chmod 0444 reproducers and assert (a) read failure propagates EACCES without silently overwriting with empty content, (b) W_OK rejects with EACCES/EPERM before the atomic rename touches the target, (c) nonexistent files are still accepted so LSP can create via edits. 2. atomicWriteFileSync non-EXDEV rename failure cleanup: the async counterpart had an explicit EIO-rename test asserting tmp cleanup; the sync variant did not. Added the mirror — injects a sync rename throwing EIO via the existing _testFs seam and asserts `readdirSync(tmpDir).length === 0`. 3. jsonl-utils writeLine / writeLineSync / write smoke tests: the three write paths are the core fix for #3681 (the PR's headline goal) but downstream callers (chatRecordingService, sessionService) mock them entirely. Without direct unit tests, a regression that dropped `flush: true` or reverted `write()` to bare writeFileSync would go undetected. Added 3 real-fs roundtrip tests. Test count delta: - NativeLspService.test.ts: 15 → 18 - atomicFileWrite.test.ts: 45 → 46 - jsonl-utils.test.ts: 22 → 25 Refs: #4333, #4095 Phase 2
|
Re #4333 (review) ( Updated the PR body to remove the stale Why not re-migrate: upstream PR #4255 added The trade-off is real: upstream's hand-roll loses EPERM retry + EXDEV fallback + symlink chain resolution vs the shared helper. That gap is worth a follow-up that extends |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
LGTM. Verified locally on macOS arm64, Node v22.17.0, against the worktree at HEAD dd549aec5.
Build & typecheck: clean (tsc --noEmit on core + cli).
Automated suites — 565/565 pass across atomic helpers, Tier 1 credentials, Tier 2 memory, Tier 3 + telemetry, and LSP / lspCommand. The one transient failure (LspConnectionFactory.test.ts > captures stderr…) is a pre-existing flaky 5s timeout from #3649, untouched by this PR — passes 1.6s in isolation.
Library-level integration script (Part 1): 17/17 pass. Confirmed atomic write basics, perm preservation under umask, forceMode round-1 regression cases (both async + sync), symlink chain resolution, JSONL kill -9 survival with 0 }{ glue, and LSP chmod 0444 refusal.
Real CLI tmux end-to-end (Part 2): launched npm run dev, streamed 444 tokens, kill -9 the leaf tsx PID mid-stream. After kill:
logs.json: parses as valid JSON array with the user entry- session JSONL: 4/4 lines parse, 0
}{glue, ends with}\n - restart +
/resume: killed session appears in picker and loads cleanly (context 7.6%)
The behavior changes called out in the PR (one-shot 0o600 healing on credential files, +few ms per JSONL append for fsync, removed 5s withTimeout around the credential atomic write) are reasonable trade-offs and clearly documented for release notes. Round-3 reasoning on withTimeout removal is correct — fs.rename isn't abortable, and the prior shape opened a real overwrite window.
Ship it.
Maintainer Verification ReportReproduced the PR's Part 1 + Part 2 verification end-to-end on a clean worktree. Reporting raw numbers so they're auditable. Environment
1. Build & static checks
2. Automated unit tests — 565 / 565 passOne transient failure observed during a parallel run: 3. Library-level integration script (PR Part 1) — 17 / 17 passRan the script verbatim from the PR description ( 4. Real CLI tmux end-to-end (PR Part 2)Reproduced the kill -9 + On-disk state immediately after kill: Restart +
5. Coverage matrix — claim → verification
VerdictAll claims in the PR description reproduced. Behavior changes (one-shot 0o600 healing, +few ms per JSONL append, removed 5s NFS withTimeout) are documented and reasonable. Approving. |
…ve failure (#4333 review) Three review fold-ins from wenshao (via qwen-latest /review): 1. atomicFileWrite: error messages reference the random `.tmp.<hex>` path, not the logical target — many callers (memory subsystem, extension manager) don't wrap the error, making debug logs unhelpful. Add `annotateWriteError(error, targetPath)` that mutates the error message in-place to prefix `atomicWriteFile("<targetPath>"): ` while preserving `code` / `errno` / `syscall` / `stack` / the prototype chain so downstream `err.code === 'ENOENT'` checks and `instanceof` narrowing keep working. Applied to both async and sync variants; only the final propagated throw (not the EXDEV fallback path) is annotated. 2. NativeLspService.test.ts: the chmod 0444 and chmod 0000 tests rely on `accessSync(W_OK)` and `readFileSync` failing — but on POSIX with UID 0 (root, including most Docker CI runners), permission bits are bypassed and `accessSync` always succeeds. The tests would silently pass even with the W_OK guard removed entirely. Add `process.getuid?.() === 0` to the skip guard on both tests. 3. sharedTokenManager.test.ts: the catch block in saveCredentialsToFile that maps disk-full / permission-denied to `TokenManagerError(FILE_ACCESS_ERROR)` was never exercised — every prior test mocked atomicWriteFile as always-successful. Added a regression test that rejects atomicWriteFile with ENOSPC and asserts the wrapped TokenManagerError surfaces with the right type and carries the original message. Refs: #4333, #4095 Phase 2
|
Thanks for the independent reproduction — matches my numbers exactly (565/565 unit + 17/17 helper + tmux kill -9 + Will leave the PR open for any other reviewer concerns. Phase 2.5 follow-up issues (atomicWriteFile + AbortSignal so cacheQwenCredentials can be consolidated; jsonl-utils batching with measurements) tracked separately once this lands. |
| */ | ||
| function annotateWriteError(error: unknown, targetPath: string): unknown { | ||
| if (error instanceof Error && !error.message.includes(targetPath)) { | ||
| error.message = `atomicWriteFile(${JSON.stringify(targetPath)}): ${error.message}`; |
There was a problem hiding this comment.
[Suggestion] annotateWriteError hardcodes "atomicWriteFile" in the error prefix regardless of whether the caller was the async or sync variant. atomicWriteFileSync (line 408) calls the same function, so a sync write failure produces atomicWriteFile("/path"): ENOSPC ... — misleading for incident response.
| error.message = `atomicWriteFile(${JSON.stringify(targetPath)}): ${error.message}`; | |
| function annotateWriteError(error: unknown, targetPath: string, fnName = 'atomicWriteFile'): unknown { | |
| if (error instanceof Error && !error.message.includes(targetPath)) { | |
| error.message = `${fnName}(${JSON.stringify(targetPath)}): ${error.message}`; | |
| } | |
| return error; | |
| } |
Then call with 'atomicWriteFileSync' from the sync variant.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
There was a problem hiding this comment.
Adopted in 7a562eb. Added optional fnName parameter to annotateWriteError (default atomicWriteFile); sync call sites now pass "atomicWriteFileSync" so incident logs identify the correct code path.
| // EXDEV: cross-device rename not supported — fall back to direct write. | ||
| if (isNodeError(error) && error.code === 'EXDEV') { | ||
| await fs.writeFile(targetPath, data, writeOptions); | ||
| await writeFileImpl(targetPath, data, writeOptions); |
There was a problem hiding this comment.
[Suggestion] The EXDEV fallback writes directly to targetPath without a try-catch. If writeFileImpl fails (e.g., ENOSPC), the error propagates without passing through annotateWriteError, breaking the function's error-shape contract. Same issue in the sync variant at line 403.
| await writeFileImpl(targetPath, data, writeOptions); | |
| if (isNodeError(error) && error.code === 'EXDEV') { | |
| try { | |
| await writeFileImpl(targetPath, data, writeOptions); | |
| await tryChmod(targetPath); | |
| } catch (fallbackError) { | |
| throw annotateWriteError(fallbackError, targetPath); | |
| } | |
| return; | |
| } |
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
There was a problem hiding this comment.
Adopted in 7a562eb. Wrapped both EXDEV fallback branches (async + sync) in try/catch and routed the fallback failure through annotateWriteError — error-shape contract is now preserved on the non-atomic cross-device path too.
| } | ||
|
|
||
| // Atomic write so a crash mid-edit can't leave the user file half-written. | ||
| atomicWriteFileSync(filePath, lines.join('\n'), { encoding: 'utf-8' }); |
There was a problem hiding this comment.
[Suggestion] applyTextEdits is declared async and all callers await it, but uses atomicWriteFileSync (blocking) along with readFileSync and accessSync. The async atomicWriteFile exists and is used by all other async callers in this PR.
The sync variant's renameWithRetrySync can block the event loop for up to 350ms via Atomics.wait on EPERM retry — problematic for the LSP edit hot path where workspace edits loop over multiple files.
Consider switching to async I/O throughout the method (fs.promises.readFile, fs.promises.access, atomicWriteFile).
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
There was a problem hiding this comment.
Adopted in 7a562eb. Switched applyTextEdits to async IO throughout: fsp.readFile for content load, fsp.access(W_OK) for the round-2 permission gate, atomicWriteFile for the write. Behavior preserved (same ENOENT-as-new-file distinction, same chmod 0444 rejection), but the LSP edit hot path no longer blocks the event loop on Atomics.wait backoff during EPERM retry. Existing tests (3 new ones from the previous review fold-in) pass unchanged since they already exercise the async entry point via typed cast.
…+ async LSP IO (#4333 review) Three review fold-ins from wenshao (via qwen-latest /review), all real correctness/consistency issues: 1. annotateWriteError hardcoded "atomicWriteFile" prefix regardless of caller. Sync write failures produced misleading `atomicWriteFile("/path"):` prefixes in incident logs. Add optional `fnName` parameter (defaults to async name) and have sync call sites pass "atomicWriteFileSync". 2. EXDEV fallback path in both async and sync variants did NOT route the inner writeFileImpl/tryChmod failures through annotateWriteError. On a cross-device write that subsequently hit ENOSPC, the propagated error had a bare syscall message without the target-path prefix — breaking the function's documented error-shape contract. Wrap both fallback branches in try/catch + annotate. 3. NativeLspService.applyTextEdits is declared `async` and all callers `await` it, but the round-2 fix mixed in sync IO: readFileSync, accessSync, atomicWriteFileSync. The sync helper's renameWithRetrySync blocks the event loop up to ~350ms under Atomics.wait EPERM backoff — particularly bad for LSP workspace edits that loop over many files. Switch to async throughout: fsp.readFile, fsp.access, atomicWriteFile. Behavior preserved (same ENOENT-vs-other distinction, same W_OK gate). Existing tests pass unchanged (they already use the async typed-cast entry point). Refs: #4333, #4095 Phase 2
Summary
Phase 2 of #4095 — replace the remaining bare
fs.writeFile/fs.writeFileSync/fs.appendFilecalls in security-sensitive and data-integrity paths with the atomic helpers introduced by Phase 1 (PR #4096). Also closes #3681 (JSONL session writer durability).A process killed mid-write —
kill -9, OOM, power loss, AV scan stalling rename, FS unmount mid-flush — could previously leave OAuth credentials, memory metadata, session transcripts, or trust-folder config half-written or with glued}{JSONL records. After this PR all of those go throughatomicWriteFile/atomicWriteFileSync(temp + fsync + rename + EXDEV fallback + EPERM retry).Ref: #4095 Phase 2. Closes: #3681.
What changed
Ten commits, each independently green:
Core migration (6 commits):
feat(core): add atomicWriteFileSync + forceMode option— new sync helper mirroring async API (flush:true, symlink chain, EXDEV fallback, EPERM retry viaAtomics.waitblocking sleep). NewforceModeoption ignores existing-target permissions and appliesoptions.moderegardless — needed for credentials so a historically over-permissive file (e.g. 0o644 restored from backup) gets healed to 0o600 on next write.refactor(core): migrate credential writes (Tier 1)—oauth-token-storage,file-token-storage,sharedTokenManager. All write with{mode: 0o600, forceMode: true}. (cacheQwenCredentialsinqwenOAuth2.tswas initially included but de-migrated during merge with upstream PR feat(serve): auth device-flow route (#4175 Wave 4 PR 21) #4255 — see "Deliberately not in scope" below.)refactor(core): migrate memory state writes (Tier 2)—memory/manager,extract,indexer,dream,forget. Trailing-newline preserved.refactor: migrate config + logger + state writes (Tier 3a)—trustedFolders(sync),logger(4 sites), plus state-file siblings the issue didn't list but match the same risk profile:tipHistory,installationManager,projectSummary,todoWrite,trustedHooks.fix(core): flush JSONL appends to disk (Tier 3b, closes #3681)—jsonl-utils.tswriteLine/writeLineSyncgetflush: true;write()full-file replace goes throughatomicWriteFileSync;debugLogger.tsappendFile getsflush: true.refactor(core): migrate extension config + LSP edit to atomic write— catch-up commit after auditing Claude Code's equivalents:extensionManager.tsenablement config + install metadata,NativeLspService.tsLSP-driven user-file edit.Cosmetic cleanup (1 commit):
7.
chore: cosmetic cleanups from PR review—index.tsexport ordering;tipHistory.tsaddsforceMode: truefor consistency with other 0o600 sites.Codex-review-driven fixes (3 commits): all three were latent bugs flagged across three Codex review rounds. None had user-visible symptoms yet — caught before merge.
8.
fix(core): address Codex review findings on Phase 2 PR— three latent bugs from round 1:modesilently downgraded perms.if (!options?.forceMode)skipped the existing-mode stat wheneverforceModewas true, regardless of whethermodewas also supplied. CallingatomicWriteFile(p, data, { forceMode: true })(no mode) on an existing 0o600 file produced 0o644 (umask default). Tightened guard to also requireoptions.mode; mirrored in sync; added regression tests that assert mode preservation.vi.resetAllMocks()blanked theatomicWriteFileshim.vi.fn(actual.atomicWriteFile)factory implementation gets reset to no-op byvi.resetAllMocks()inbeforeEach, which would makelogger.initialize()silently skip creatinglogs.json. Tests passed by coincidence (file pre-existence). Captured real implementation at module load and re-attached viamockImplementation.NativeLspService.applyTextEditsswallowed all read errors. A read failure (EACCES on read-protected file, EISDIR, etc.) was treated as "new empty file" and the atomic rename could replace the original. Now only ENOENT is treated as new-file; other errors propagate.fix(lsp): refuse LSP edits to chmod 0444 files (round 2)— even after the previous fix, a file that's readable butchmod 0444(read-only) would still be replaced by atomic rename (rename only needs parent-directory write access). Added explicitfs.accessSync(W_OK)check before the atomic write; ENOENT still allowed so LSP can create new files.fix(core): drop withTimeout around atomic credential write (round 3)—saveCredentialsToFilewrappedatomicWriteFileinwithTimeout(5000). If the call hit the budget (slow NFS, fsync stall), withTimeout would reject while the atomicWriteFile internal write+rename kept running unobserved → the refresh lock got released → another process could write newer credentials → the original atomicWriteFile finally completed its rename and silently overwrote the newer token. Atomic write is durable by design andfs.renameis not abortable — accept the I/O latency.mkdirandstattimeouts kept (idempotent and read-only respectively).mcp/oauth-token-storage.ts,mcp/token-storage/file-token-storage.ts,qwen/sharedTokenManager.tsforceMode: trueheals historically over-permissive token files to 0o600utils/jsonl-utils.tswriteLine/writeLineSyncflush: trueper append}{glued records fromkill -9sharedTokenManager.saveCredentialsToFilewithTimeoutaround the writeNativeLspService.applyTextEditsW_OKcheck before writechmod 0444files now throw EACCES (matches pre-Phase-2 behavior); LSP edits on read-protected files no longer silently overwrite with an empty bufferDeliberately not in scope
Audited the repo for other bare-write call sites; the following are intentionally left as-is. For each I checked Claude Code's equivalent path; verdicts in parentheses.
flag: 'wx':sharedTokenManager:708,memory/manager:363,memory/store:49,chatRecordingService:633,sessionService:937— exclusive-create semantics, must not become atomic. (Claude Code agrees.)qwen/qwenOAuth2.tscacheQwenCredentials— initially migrated as part of Tier 1, then de-migrated during merge with upstream PR feat(serve): auth device-flow route (#4175 Wave 4 PR 21) #4255 (the daemon device-flow registry work). Upstream addedAbortSignalthreading +SharedTokenManager.getInstance().clearCache()invalidation into the hand-rolled atomic write.atomicWriteFiledoesn't accept a signal (andfs.renameis not abortable), so re-migrating would silently drop signal cancellation — the exact race I just fixed in round 3 forsaveCredentialsToFile. Kept upstream's hand-rolled atomic, accept slightly weaker guarantees (no EPERM retry / EXDEV fallback) in exchange for correct cancellation semantics.services/gitService.ts,services/gitWorktreeService.ts— both planned for deprecation (shadow-git checkpointing is being replaced byfileHistoryServiceper PR feat(rewind): add file restoration support to /rewind command #4064; worktree path is also slated to be retired). Not worth investing atomic-write effort in code on the way out. Defer permanently.extension/extensionSettings.tsenv writes,extension/claude-converter.ts,extension/gemini-converter.ts,extension/variables.ts— Claude Code has no equivalent (no per-extension env, no converters). Defer.agents/agent-transcript.ts:127,agents/arena/ArenaManager.ts:1616— Claude Code's sub-agent metadata sidecars use bare writeFile too (write-once, re-derivable). Defer.core/prompts.ts:406prompt dump,utils/openaiLogger.ts,tools/shell.ts,utils/truncation.ts,tools/modifiable-tool.ts,core/config/config.ts:2600— transient / per-request / scratch files. Claude Code's equivalent debug-dump uses appendFile too. Defer.cli/src/serve/httpAcpBridge.ts:1274— already a hand-rolled atomic write withwxtmp + rename. Worth folding into the shared helper in a follow-up PR with BOM/encoding regression testing.cli/src/config/settings.ts:875recovery write-back — has its own backup machinery; touching it risks settings-migration regressions, defer.Test plan
Automated
npx vitest run packages/core/src/utils/atomicFileWrite.test.ts— 36 tests (19 async, 13 sync, 5 forceMode including 2 round-1 regression cases for the "no mode" edge)npx vitest run packages/core/src/utils/jsonl-utils.test.ts packages/core/src/utils/debugLogger.test.ts— 44 passlogger.test.ts52 pass —beforeEachre-binds the realatomicWriteFileaftervi.resetAllMocks()(fix from Codex round 1)trustedFolders.test.ts21,todoWrite.test.ts28,extensionManager.test.ts42, LSP suite 62 passnpx tsc --noEmit -p packages/core && npx tsc --noEmit -p packages/clicleanManual smoke verification — completed
Verified on macOS arm64, Node v24.12.0, against this branch (
worktree-ethereal-bubbling-cook). Combined library-level integration script (covers crash-safety claims that don't need a model) with a realnpm run devtmux run (proves the actual code paths users hit fire correctly under kill -9).Part 1 — library-level integration script (17/17 pass)
Direct exercise of
atomicWriteFile/atomicWriteFileSync/writeLine/writeLineSyncagainst real disk, including a kill -9 subprocess for the JSONL claim. Confirms every Codex-round fix is regression-proof at the helper level.verify-atomic-helpers.mjs — full script (click to expand)
Actual output (run on this branch):
Part 2 — real CLI end-to-end (tmux +
npm run dev)Validates the helper changes work inside an actual qwen-code session — specifically that
logger.ts(logs.json full-file rewrite) andjsonl-utils.writeLine(per-turn session transcript append) survive a hard process kill and/resumecan still read the JSONL. This is the closest reproduction of the real #3681 failure mode.Environment: API key auth (
selectedType: "openai"→ DASHSCOPE → glm-5). OAuth path is dormant for API-key users, so the high-value paths under load arelogger.tsandjsonl-utils.ts— exactly what's verified here.Commands (run from the worktree root):
Observed results:
glm-5shownkill -9 <leaf PID>mid-responseJSON.parse(logs.json)}{/resumepicker请用大约200字详细描述...hellovisibleCoverage matrix — PR claim → verification
atomicWriteFileis crash-atomic (write tmp + fsync + rename)atomicWriteFileSyncmirrors async semanticsforceMode: trueheals legacy 0o644 → 0o600forceModewithoutmodepreserves existing perms (Codex round-1 fix)flush: truesurvives kill -9 — no glued}{(closes #3681)chmod 0444(Codex round-2 fix)withTimeoutremoval eliminates token-overwrite race (Codex round-3 fix)sharedTokenManager.test.ts(31 pass)/rewind-style session resume still works after a kill/resumepicker + select)Not covered (out of scope for local verification)
~/.qwen/oauth_creds.jsonin the test environment held test fixtures, and the user's CLI uses API-key auth instead. The code path is identical to the verifiedatomicWriteFile(creds, {mode:0o600, forceMode:true})shape, just driven by the OAuth refresh callback rather than the logger.renameWithRetry/renameWithRetrySyncEPERM retry path is exercised by the existing unit tests but not under live AV pressure.Out-of-scope follow-ups
cli/src/serve/httpAcpBridge.ts:1274hand-rolled atomic write into the shared helper~/.qwen/settings.json— defense in depth, low priority🤖 Generated with Qwen Code