fix: handle incomplete multi-byte UTF-8 sequences in setEncoding() #5003

Open
joecwu wants to merge 1 commit into nodejs:main from joecwu:fix/setEncoding-utf8-multibyte

Conversation

@joecwu joecwu commented Apr 9, 2026

This relates to...

Fixes #5002

Rationale

Calling setEncoding('utf8') on an undici response body corrupts multi-byte UTF-8 characters (3-byte CJK, 4-byte emoji) that span a chunk boundary, replacing them with U+FFFD. Node.js's built-in https module handles this correctly.

Root cause

BodyReadable.setEncoding() only set _readableState.encoding directly, without initializing a StringDecoder:

// Before
setEncoding (encoding) {
  if (Buffer.isEncoding(encoding)) {
    this._readableState.encoding = encoding
  }
  return this
}

Node.js Readable.prototype.setEncoding() does two things:

  1. Creates _readableState.decoder = new StringDecoder(enc), which buffers incomplete multi-byte sequences across chunks.
  2. Runs any already-buffered chunks through the decoder (iterating _readableState.buffer), so bytes that arrived before setEncoding() was called are also decoded correctly.
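
The buffering in point 1 can be sketched with Node.js's public StringDecoder directly, without touching stream internals. The sample bytes (e5 82 b3, which encode U+50B3) and the variable names are chosen for illustration only.

```javascript
const { StringDecoder } = require('node:string_decoder')

// Illustrative input: the 3-byte sequence e5 82 b3 (U+50B3), split after its first byte.
const firstChunk = Buffer.from([0xe5])
const secondChunk = Buffer.from([0x82, 0xb3])

const decoder = new StringDecoder('utf8')

// The decoder keeps the incomplete lead byte instead of emitting U+FFFD.
const firstOutput = decoder.write(firstChunk)
// Once the remaining bytes arrive, the complete character is emitted.
const secondOutput = decoder.write(secondChunk)
// end() flushes anything still held; nothing is pending at this point.
const decoded = firstOutput + secondOutput + decoder.end()

console.log(JSON.stringify(firstOutput)) // ""
console.log(decoded === '\u50b3') // true
```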

Without the decoder, fromList() falls back to buf.toString(encoding) on each individual chunk, producing U+FFFD replacement characters whenever a multi-byte sequence spans a chunk boundary. This silently corrupts data on the streaming path (for await ... of body, body.on('data', ...)).
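
A minimal sketch of the difference, assuming two illustrative chunks where a 3-byte character is split after its lead byte. It only models per-chunk toString() against a single decode of the concatenated bytes; it does not call into undici or the stream internals.

```javascript
// Illustrative chunks: "a" followed by e5 82 b3, split after the lead byte 0xe5.
const chunks = [Buffer.from([0x61, 0xe5]), Buffer.from([0x82, 0xb3])]

// Per-chunk decoding, which is what happens when no StringDecoder is installed.
const perChunk = chunks.map((chunk) => chunk.toString('utf8')).join('')

// Decoding the concatenated bytes once, for comparison.
const whole = Buffer.concat(chunks).toString('utf8')

console.log(perChunk.includes('\ufffd')) // true
console.log(whole.includes('\ufffd')) // false
```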

Why not just super.setEncoding(encoding)?

Delegating directly to the parent implementation fixes the streaming path, but breaks the consume path (body.text(), body.json(), body.arrayBuffer(), etc.) for two reasons:

  1. super.setEncoding() iterates _readableState.buffer and rewrites it from Buffer to decoded string. consumeStart then hands those strings to chunksDecode, whose Buffer.concat(...) rejects strings.
  2. The StringDecoder holds any trailing incomplete bytes internally. Those bytes disappear from _readableState.buffer entirely, and subsequent push() calls into consumePush only see the new raw Buffers — the held bytes from the pre-setEncoding() chunks are permanently lost to the consume path.
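
Both failure modes can be reproduced in isolation with public Node.js APIs. This is a sketch under illustrative inputs, not the undici code path itself: it shows that Buffer.concat() throws on a string entry, and that a StringDecoder returns fewer bytes than it was given when a chunk ends mid-character.

```javascript
const { StringDecoder } = require('node:string_decoder')

// Failure mode 1: Buffer.concat() only accepts Buffer/Uint8Array entries.
let concatError = null
try {
  Buffer.concat(['already decoded text'])
} catch (err) {
  concatError = err
}
console.log(concatError !== null) // true

// Failure mode 2: the decoder retains the trailing incomplete byte.
const decoder = new StringDecoder('utf8')
const rawChunk = Buffer.from([0x61, 0xe5]) // "a" plus the lead byte of a 3-byte character
const decodedSoFar = decoder.write(rawChunk)

// Only "a" came out; 0xe5 is now held inside the decoder, not in the output.
console.log(decodedSoFar) // a
console.log(Buffer.byteLength(decodedSoFar), rawChunk.length) // 1 2
```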

Both failure modes are observable against the pre-existing request multibyte json/text with setEncoding tests in test/client-request.js.

Fix

Before delegating to super.setEncoding(), snapshot any raw Buffer chunks already sitting in _readableState.buffer into a private kPreservedBuffer symbol on the stream. Then teach consumeStart to prefer that snapshot over _readableState.buffer when present.

  • Streaming path (for await, on('data')) — benefits from the properly initialized StringDecoder in the parent class, decoding multi-byte sequences across chunk boundaries correctly.
  • Consume path (body.text(), body.json(), etc.) — reads the preserved raw Buffers, so all original bytes remain available for byte-level Buffer.concat(...) + final toString(encoding).

Subsequent push() calls continue to feed consumePush raw Buffers before calling super.push(), so no bytes are lost once the consume path is active.
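
The idea of keeping raw bytes available next to the decoded stream can be sketched as follows. This is a simplified stand-in, not the PR's implementation: the class SketchBody, the symbol kRawChunks and the method rawText() are hypothetical, and the sketch records raw chunks in push() rather than snapshotting _readableState.buffer, because the layout of that internal buffer is not public API.

```javascript
const { Readable } = require('node:stream')

// Hypothetical symbol for this sketch; the PR's own symbol is kPreservedBuffer.
const kRawChunks = Symbol('raw chunks')

class SketchBody extends Readable {
  constructor () {
    super({ read () {} })
    this[kRawChunks] = []
  }

  // Record the raw bytes before the parent class may decode them.
  push (chunk) {
    if (chunk !== null) {
      this[kRawChunks].push(chunk)
    }
    return super.push(chunk)
  }

  // Stand-in for a consume method: byte-level concat, then one final decode.
  rawText () {
    return Buffer.concat(this[kRawChunks]).toString('utf8')
  }
}

const body = new SketchBody()
body.push(Buffer.from([0x61, 0xe5])) // buffered before setEncoding(), ends mid-character
body.setEncoding('utf8') // the parent installs a StringDecoder and decodes the buffer
body.push(Buffer.from([0x82, 0xb3])) // remaining bytes of the character
body.push(null)

const streamed = body.read() // streaming view: decoded by the StringDecoder
const consumed = body.rawText() // consume view: rebuilt from the raw bytes

console.log(streamed === consumed, streamed.includes('\ufffd'))
```

Both views yield the same string here because the streaming side decodes through one shared StringDecoder while the consume side never sees anything but the original Buffers.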

Bug fixes

  • Fixed multi-byte UTF-8 character corruption when using setEncoding('utf8') on a response body consumed via for await ... of or on('data', ...).

Breaking changes

None. The fix aligns BodyReadable.setEncoding() with Node.js core Readable streams, and preserves the existing contract for body.text() / body.json() / etc.

Test description

Added a test in test/client-request.js:

  • setEncoding('utf8') handles 3-byte UTF-8 characters split across chunks: constructs an HTTP response where a 3-byte CJK character (傳, bytes e5 82 b3) is deliberately split across two chunks (the first chunk ends mid-sequence on 0xe5), then asserts that for await ... of body produces the original string with no U+FFFD.
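
One way to build such a split payload is sketched below. The helper splitAfterLeadByte is hypothetical and is not taken from test/client-request.js; it only shows how a body can be cut so that the first chunk ends on the lead byte of a multi-byte character.

```javascript
// Hypothetical helper: split `text` so the first chunk ends on the lead byte
// of the first multi-byte UTF-8 character found.
function splitAfterLeadByte (text) {
  const bytes = Buffer.from(text, 'utf8')
  // In valid UTF-8, lead bytes of multi-byte sequences are >= 0xc0;
  // continuation bytes are in the range 0x80..0xbf.
  const leadIndex = bytes.findIndex((byte) => byte >= 0xc0)
  if (leadIndex === -1) {
    throw new Error('text contains no multi-byte character')
  }
  return [bytes.subarray(0, leadIndex + 1), bytes.subarray(leadIndex + 1)]
}

const original = 'ab\u50b3cd' // U+50B3 encodes as e5 82 b3
const [first, second] = splitAfterLeadByte(original)

console.log(first[first.length - 1].toString(16)) // e5
console.log(Buffer.concat([first, second]).toString('utf8') === original) // true
```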

All pre-existing tests still pass, including the three request multibyte json/text with setEncoding tests that exercise the ordering body.setEncoding('utf8') → await body.json()/text() — these are what caught the naive super.setEncoding() approach.

Status

Member

mcollina commented Apr 9, 2026

Many CI failures

…coding()

BodyReadable.setEncoding() only set _readableState.encoding without
initializing a StringDecoder. As a result, Node.js's fromList() used
buf.toString(encoding) on each individual chunk, producing U+FFFD
replacement characters whenever a multi-byte sequence (3-byte CJK,
4-byte emoji, etc.) was split at a chunk boundary. This silently
corrupted response data for any consumer iterating the body with
for-await or listening to 'data' events after calling setEncoding.

Simply delegating to super.setEncoding() would fix the for-await /
data-event path, but it also re-encodes the internal buffer, converting
raw Buffer chunks to decoded strings. That breaks the consume path
(body.text(), body.json(), body.arrayBuffer(), etc.) in two ways:

  1. consumeStart reads state.buffer directly and passes its entries
     to Buffer.concat, which rejects strings.
  2. The StringDecoder holds any trailing incomplete multi-byte bytes,
     so those bytes disappear from state.buffer entirely — the consume
     path can never see them again.

Fix: before delegating to super.setEncoding(), snapshot any raw Buffer
chunks already sitting in state.buffer into a private kPreservedBuffer.
Then consumeStart prefers that snapshot over state.buffer when present,
guaranteeing the consume path always sees the original bytes. Subsequent
push() calls continue to feed consumePush raw Buffers before super.push(),
so no bytes are lost.

This fixes both the streaming path (for-await / on('data')) and keeps
the consume path (body.text(), body.json()) working correctly, including
the existing "request multibyte json/text with setEncoding" tests which
cover the setEncoding-then-consume ordering.

Closes nodejs#5002
@joecwu joecwu force-pushed the fix/setEncoding-utf8-multibyte branch from a78274e to ef881be Compare April 9, 2026 15:17
Author

joecwu commented Apr 9, 2026

Force-pushed an updated fix. The previous version (just return super.setEncoding(encoding)) broke three pre-existing tests in test/client-request.js — the request multibyte json/text with setEncoding tests — because super.setEncoding() re-encodes the internal buffer through a StringDecoder, which:

  1. Converts buffered raw Buffer chunks to strings, breaking Buffer.concat() in consumeStart/chunksDecode.
  2. Holds any trailing incomplete multi-byte bytes inside the decoder, so the consume path (body.text(), body.json(), etc.) loses access to those bytes entirely.

The updated fix snapshots raw Buffer chunks from state.buffer into a private kPreservedBuffer before calling super.setEncoding(), and consumeStart prefers that snapshot when present. This keeps both paths correct:

  • for-await / on('data') — benefits from the properly initialized StringDecoder in the parent class.
  • body.text() / body.json() / etc. — reads the preserved raw Buffers instead of the (potentially string-ified, byte-losing) state.buffer.

All 48 tests in test/client-request.js now pass, including the three pre-existing multibyte tests and the new test for the original bug (3-byte UTF-8 split at chunk boundary).

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.94%. Comparing base (a434502) to head (ef881be).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5003   +/-   ##
=======================================
  Coverage   92.93%   92.94%           
=======================================
  Files         110      110           
  Lines       35735    35780   +45     
=======================================
+ Hits        33210    33254   +44     
- Misses       2525     2526    +1     


Development

Successfully merging this pull request may close these issues.

setEncoding('utf8') on response body corrupts multi-byte UTF-8 characters at chunk boundaries

3 participants