fix: handle incomplete multi-byte UTF-8 sequences in setEncoding() #5003

joecwu wants to merge 1 commit into nodejs:main
fix: handle incomplete multi-byte UTF-8 sequences in setEncoding()
BodyReadable.setEncoding() only set _readableState.encoding without
initializing a StringDecoder. As a result, Node.js's fromList() used
buf.toString(encoding) on each individual chunk, producing U+FFFD
replacement characters whenever a multi-byte sequence (3-byte CJK,
4-byte emoji, etc.) was split at a chunk boundary. This silently
corrupted response data for any consumer iterating the body with
for-await or listening to 'data' events after calling setEncoding.
Simply delegating to super.setEncoding() would fix the for-await /
data-event path, but it also re-encodes the internal buffer, converting
raw Buffer chunks to decoded strings. That breaks the consume path
(body.text(), body.json(), body.arrayBuffer(), etc.) in two ways:
1. consumeStart reads state.buffer directly and passes its entries
to Buffer.concat, which rejects strings.
2. The StringDecoder holds any trailing incomplete multi-byte bytes,
so those bytes disappear from state.buffer entirely — the consume
path can never see them again.
Fix: before delegating to super.setEncoding(), snapshot any raw Buffer
chunks already sitting in state.buffer into a private kPreservedBuffer.
Then consumeStart prefers that snapshot over state.buffer when present,
guaranteeing the consume path always sees the original bytes. Subsequent
push() calls continue to feed consumePush raw Buffers before super.push(),
so no bytes are lost.
This fixes both the streaming path (for-await / on('data')) and keeps
the consume path (body.text(), body.json()) working correctly, including
the existing "request multibyte json/text with setEncoding" tests which
cover the setEncoding-then-consume ordering.
Closes nodejs#5002
Force-pushed an updated fix (a78274e to ef881be). The previous version (just delegating to `super.setEncoding()`) broke the consume path. The updated fix snapshots raw `Buffer` chunks into `kPreservedBuffer` before delegating. All 48 tests in `test/client-request.js` pass.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@ Coverage Diff @@
##             main    #5003   +/-   ##
=======================================
  Coverage   92.93%   92.94%
=======================================
  Files         110      110
  Lines       35735    35780     +45
=======================================
+ Hits        33210    33254     +44
- Misses       2525     2526      +1
```
This relates to...
Fixes #5002
Rationale
`setEncoding('utf8')` on an undici response body corrupts multi-byte UTF-8 characters (3-byte CJK, 4-byte emoji) at chunk boundaries, replacing them with U+FFFD. Node.js's built-in `https` module handles this correctly.

Root cause

`BodyReadable.setEncoding()` only set `_readableState.encoding` directly, without initializing a `StringDecoder`. Node.js's `Readable.prototype.setEncoding()` does two things:

1. Sets `_readableState.decoder = new StringDecoder(enc)`, which buffers incomplete multi-byte byte sequences across chunks.
2. Re-encodes any chunks already buffered (`_readableState.buffer`), so bytes that arrived before `setEncoding()` was called are also decoded correctly.

Without the decoder, `fromList()` falls back to `buf.toString(encoding)` on each individual chunk, producing U+FFFD replacement characters whenever a multi-byte sequence spans a chunk boundary. This silently corrupts data on the streaming path (`for await ... of body`, `body.on('data', ...)`).

Why not just `super.setEncoding(encoding)`?

Delegating directly to the parent implementation fixes the streaming path, but breaks the consume path (`body.text()`, `body.json()`, `body.arrayBuffer()`, etc.) for two reasons:

1. `super.setEncoding()` iterates `_readableState.buffer` and rewrites it from `Buffer` to decoded `string`. `consumeStart` then hands those strings to `chunksDecode`, whose `Buffer.concat(...)` rejects strings.
2. The `StringDecoder` holds any trailing incomplete bytes internally. Those bytes disappear from `_readableState.buffer` entirely, and subsequent `push()` calls into `consumePush` only see the new raw `Buffer`s — the held bytes from the pre-`setEncoding()` chunks are permanently lost to the consume path.

Both failure modes are observable against the pre-existing `request multibyte json/text with setEncoding` tests in `test/client-request.js`.

Fix
Before delegating to `super.setEncoding()`, snapshot any raw `Buffer` chunks already sitting in `_readableState.buffer` into a private `kPreservedBuffer` symbol on the stream. Then teach `consumeStart` to prefer that snapshot over `_readableState.buffer` when present.

- Streaming path (`for await`, `on('data')`) — benefits from the properly initialized `StringDecoder` in the parent class, decoding multi-byte sequences across chunk boundaries correctly.
- Consume path (`body.text()`, `body.json()`, etc.) — reads the preserved raw `Buffer`s, so all original bytes remain available for byte-level `Buffer.concat(...)` + final `toString(encoding)`.

Subsequent `push()` calls continue to feed `consumePush` raw `Buffer`s before calling `super.push()`, so no bytes are lost once the consume path is active.

Bug fixes
- Fixed U+FFFD corruption from `setEncoding('utf8')` on a response body consumed via `for await ... of` or `on('data', ...)`.

Breaking changes
None. The fix aligns `BodyReadable.setEncoding()` with Node.js core `Readable` streams, and preserves the existing contract for `body.text()` / `body.json()` / etc.

Test description
Added a test in `test/client-request.js`: `setEncoding('utf8') handles 3-byte UTF-8 characters split across chunks` — constructs an HTTP response where a 3-byte CJK character (傳, bytes `e5 82 b3`) is deliberately split across two chunks (the first chunk ends mid-sequence on `0xe5`), then asserts `for await ... of body` produces the original string with no U+FFFD.

All pre-existing tests still pass, including the three `request multibyte json/text with setEncoding` tests that exercise the ordering `body.setEncoding('utf8') → await body.json()/text()` — these are what caught the naive `super.setEncoding()` approach.

Status