fix: use Buffer.concat for UTF-8 response body to prevent multi-byte character corruption #364
Open
joecwu wants to merge 1 commit into elastic:main from
Conversation
fix: use Buffer.concat for UTF-8 response body to prevent multi-byte character corruption
UndiciConnection was using response.body.setEncoding('utf8') + string
concatenation to decode HTTP response bodies. Due to an upstream bug in
undici's setEncoding() implementation (nodejs/undici#5002), multi-byte
UTF-8 characters (CJK, emoji) split at chunk boundaries get replaced with
U+FFFD replacement characters, silently corrupting response data.
Switch to collecting raw Buffer chunks and performing a single
Buffer.concat().toString('utf8') at the end. This ensures the complete
byte sequence is assembled before UTF-8 decoding, sidestepping the
chunk boundary issue entirely and guaranteeing correct UTF-8 output.
Verified on production (Node v24.14.1, undici 7.15.0) querying an
Elasticsearch index with Chinese text: corruption went from ~1 FFFD
per 4KB of response to 0.
Closes elastic#363
Bug
`UndiciConnection.ts` uses `response.body.setEncoding('utf8')` + string concatenation to decode HTTP response bodies. Due to an upstream bug in undici's `setEncoding()` implementation (see nodejs/undici#5002), multi-byte UTF-8 characters (CJK — Chinese/Japanese/Korean — and emoji) that span HTTP chunk boundaries get silently replaced with U+FFFD (replacement character), corrupting response data.

Closes #363
Root cause
undici's `BodyReadable.setEncoding()` does not initialize a `StringDecoder` on `_readableState`. As a result, when iterating the response body with string encoding enabled, Node.js falls back to `buf.toString('utf8')` on each individual chunk, which cannot handle incomplete multi-byte sequences at chunk boundaries.

The fix for the upstream bug is proposed in nodejs/undici#5003, but even once that lands, transport should still avoid the fragile `setEncoding` + string concat pattern.

Fix
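The failure mode is easy to reproduce in isolation. Below is a minimal standalone sketch (not transport code) showing that a 3-byte UTF-8 character split across two chunks decodes to U+FFFD when each chunk is decoded alone, but survives when the bytes are reassembled first:

```typescript
// A 3-byte UTF-8 character split across two chunks, as can happen
// at an HTTP chunk boundary.
const bytes = Buffer.from('傳', 'utf8')  // <Buffer e5 82 b3>
const chunk1 = bytes.subarray(0, 2)      // e5 82 — incomplete sequence
const chunk2 = bytes.subarray(2)         // b3    — orphaned continuation byte

// What the broken setEncoding path effectively does: decode each chunk alone.
const corrupted = chunk1.toString('utf8') + chunk2.toString('utf8')
console.log(corrupted.includes('\uFFFD'))  // true — replacement characters

// What the fix does: assemble all bytes first, decode once.
const correct = Buffer.concat([chunk1, chunk2]).toString('utf8')
console.log(correct === '傳')              // true
```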
Replace `setEncoding('utf8')` + string concatenation with raw `Buffer[]` collection and a single `Buffer.concat().toString('utf8')` at the end. This ensures the complete byte sequence is assembled before UTF-8 decoding:

```diff
 } else {
   const payload: Buffer[] = []
   let currentLength = 0
-  response.body.setEncoding('utf8')
   for await (const chunk of response.body) {
-    currentLength += Buffer.byteLength(chunk)
+    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)
+    currentLength += buf.byteLength
     if (currentLength > maxResponseSize) {
       response.body.destroy()
       throw new RequestAbortedError(...)
     }
-    payload.push(chunk)
+    payload.push(buf)
   }
   return {
     statusCode: response.statusCode,
     headers: response.headers,
     body: Buffer.concat(payload).toString('utf8')
   }
 }
```

Tests
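For comparison, a sketch of the streaming alternative: `node:string_decoder`'s `StringDecoder` (what `setEncoding()` should wire up internally, per the upstream issue) buffers incomplete multi-byte sequences across `write()` calls and so also avoids corruption. The single `Buffer.concat()` was preferred in this PR for simplicity, but this illustrates why the upstream fix in nodejs/undici#5003 would resolve the bug too:

```typescript
import { StringDecoder } from 'node:string_decoder'

const decoder = new StringDecoder('utf8')
const bytes = Buffer.from('傳', 'utf8')

let out = ''
out += decoder.write(bytes.subarray(0, 2))  // holds back the incomplete e5 82
out += decoder.write(bytes.subarray(2))     // b3 completes the sequence
out += decoder.end()                        // flushes any remaining bytes
console.log(out === '傳')                   // true — no U+FFFD
```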
Added `UTF-8 multi-byte characters not corrupted when split across chunk boundaries` in `test/unit/undici-connection.test.ts`, which constructs an HTTP response where a 3-byte CJK character (傳, bytes `e5 82 b3`) is deliberately split across two chunks and asserts that the character is decoded intact.

All 87 existing tests in `undici-connection.test.ts` continue to pass.

Production verification
Verified on a production container (Node v24.14.1, `@elastic/elasticsearch` 9.1.1, `@elastic/transport` 9.1.2, undici 7.15.0) querying an Elasticsearch index containing Chinese text:

- Before (`setEncoding('utf8')` + string concat): ~1 U+FFFD per 4KB of response
- After (`Buffer.concat`): 0

Corruption was deterministic — same request, same FFFD positions every time.
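The corruption counts above can be measured with a trivial helper; this is an illustrative sketch, not the actual verification script (the real query and index are not shown here):

```typescript
// Count U+FFFD replacement characters in a decoded response body.
function countReplacementChars(body: string): number {
  return (body.match(/\uFFFD/g) ?? []).length
}

console.log(countReplacementChars('abc\uFFFDdef\uFFFD'))  // 2 — corrupted body
console.log(countReplacementChars('傳統'))                 // 0 — clean body
```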
CLA
I will sign the Elastic Contributor License Agreement when the CLA bot prompts.