fix: handle incomplete multi-byte UTF-8 sequences in setEncoding() #5003

joecwu wants to merge 1 commit into nodejs:main
fix: handle incomplete multi-byte UTF-8 sequences in setEncoding()
BodyReadable.setEncoding() only set _readableState.encoding without
initializing a StringDecoder. As a result, Node.js's fromList() used
buf.toString(encoding) on each individual chunk, producing U+FFFD
replacement characters whenever a multi-byte sequence (3-byte CJK,
4-byte emoji, etc.) was split at a chunk boundary. This silently
corrupted response data for any consumer iterating the body with
for-await or listening to 'data' events after calling setEncoding.
Simply delegating to super.setEncoding() would fix the for-await /
data-event path, but it also re-encodes the internal buffer, converting
raw Buffer chunks to decoded strings. That breaks the consume path
(body.text(), body.json(), body.arrayBuffer(), etc.) in two ways:
1. consumeStart reads state.buffer directly and passes its entries
to Buffer.concat, which rejects strings.
2. The StringDecoder holds any trailing incomplete multi-byte bytes,
so those bytes disappear from state.buffer entirely — the consume
path can never see them again.
Fix: before delegating to super.setEncoding(), snapshot any raw Buffer
chunks already sitting in state.buffer into a private kPreservedBuffer.
Then consumeStart prefers that snapshot over state.buffer when present,
guaranteeing the consume path always sees the original bytes. Subsequent
push() calls continue to feed consumePush raw Buffers before super.push(),
so no bytes are lost.
This fixes both the streaming path (for-await / on('data')) and keeps
the consume path (body.text(), body.json()) working correctly, including
the existing "request multibyte json/text with setEncoding" tests which
cover the setEncoding-then-consume ordering.
Closes nodejs#5002
Force-pushed an updated fix (a78274e to ef881be). The previous version (just delegating to `super.setEncoding()`) broke the consume path. The updated fix snapshots raw `Buffer` chunks into `kPreservedBuffer` before delegating. All 48 tests in `test/client-request.js` pass.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@ Coverage Diff @@
##             main    #5003   +/-   ##
=======================================
  Coverage   92.93%   92.94%
=======================================
  Files         110      110
  Lines       35735    35780     +45
=======================================
+ Hits        33210    33254     +44
- Misses       2525     2526      +1
```
This relates to...
Fixes #5002
Rationale
`setEncoding('utf8')` on an undici response body corrupts multi-byte UTF-8 characters (3-byte CJK, 4-byte emoji) at chunk boundaries, replacing them with U+FFFD. Node.js's built-in `https` module handles this correctly.

Root cause

`BodyReadable.setEncoding()` only set `_readableState.encoding` directly, without initializing a `StringDecoder`. Node.js's `Readable.prototype.setEncoding()` does two things:

1. Sets `_readableState.decoder = new StringDecoder(enc)`, which buffers incomplete multi-byte byte sequences across chunks.
2. Re-encodes any chunks already buffered (`_readableState.buffer`), so bytes that arrived before `setEncoding()` was called are also decoded correctly.

Without the decoder, `fromList()` falls back to `buf.toString(encoding)` on each individual chunk, producing U+FFFD replacement characters whenever a multi-byte sequence spans a chunk boundary. This silently corrupts data on the streaming path (`for await ... of body`, `body.on('data', ...)`).

Why not just `super.setEncoding(encoding)`?

Delegating directly to the parent implementation fixes the streaming path, but breaks the consume path (`body.text()`, `body.json()`, `body.arrayBuffer()`, etc.) for two reasons:

1. `super.setEncoding()` iterates `_readableState.buffer` and rewrites it from `Buffer` to decoded `string`. `consumeStart` then hands those strings to `chunksDecode`, whose `Buffer.concat(...)` rejects strings.
2. The `StringDecoder` holds any trailing incomplete bytes internally. Those bytes disappear from `_readableState.buffer` entirely, and subsequent `push()` calls into `consumePush` only see the new raw `Buffer`s — the held bytes from the pre-`setEncoding()` chunks are permanently lost to the consume path.

Both failure modes are observable against the pre-existing `request multibyte json/text with setEncoding` tests in `test/client-request.js`.

Fix
Before delegating to `super.setEncoding()`, snapshot any raw `Buffer` chunks already sitting in `_readableState.buffer` into a private `kPreservedBuffer` symbol on the stream. Then teach `consumeStart` to prefer that snapshot over `_readableState.buffer` when present.

- Streaming path (`for await`, `on('data')`) — benefits from the properly initialized `StringDecoder` in the parent class, decoding multi-byte sequences across chunk boundaries correctly.
- Consume path (`body.text()`, `body.json()`, etc.) — reads the preserved raw `Buffer`s, so all original bytes remain available for byte-level `Buffer.concat(...)` + final `toString(encoding)`.

Subsequent `push()` calls continue to feed `consumePush` raw `Buffer`s before calling `super.push()`, so no bytes are lost once the consume path is active.

Bug fixes
- Fixed U+FFFD corruption from `setEncoding('utf8')` on a response body consumed via `for await ... of` or `on('data', ...)`.

Breaking changes
None. The fix aligns `BodyReadable.setEncoding()` with Node.js core `Readable` streams, and preserves the existing contract for `body.text()` / `body.json()` / etc.

Test description
Added a test in `test/client-request.js`: `setEncoding('utf8') handles 3-byte UTF-8 characters split across chunks` — constructs an HTTP response where a 3-byte CJK character (傳, bytes `e5 82 b3`) is deliberately split across two chunks (the first chunk ends mid-sequence on `0xe5`), then asserts `for await ... of body` produces the original string with no U+FFFD.

All pre-existing tests still pass, including the three `request multibyte json/text with setEncoding` tests that exercise the ordering `body.setEncoding('utf8') → await body.json()/text()` — these are what caught the naive `super.setEncoding()` approach.

Status