fix: use Buffer.concat for UTF-8 response body to prevent multi-byte character corruption#364

Open
joecwu wants to merge 1 commit into elastic:main from joecwu:fix/utf8-chunked-response-corruption

Conversation

@joecwu joecwu commented Apr 9, 2026

Bug

UndiciConnection.ts uses response.body.setEncoding('utf8') + string concatenation to decode HTTP response bodies. Due to an upstream bug in undici's setEncoding() implementation (see nodejs/undici#5002), multi-byte UTF-8 characters (CJK — Chinese/Japanese/Korean — and emoji) that span HTTP chunk boundaries get silently replaced with U+FFFD (replacement character), corrupting response data.

Closes #363

Root cause

undici's BodyReadable.setEncoding() does not initialize a StringDecoder on _readableState. As a result, when iterating the response body with string encoding enabled, Node.js falls back to buf.toString('utf8') on each individual chunk — which cannot handle incomplete multi-byte sequences at chunk boundaries.

The fix for the upstream bug is being proposed in nodejs/undici#5003, but even once that lands, transport should still avoid the fragile setEncoding + string concat pattern.

Fix

Replace setEncoding('utf8') + string concatenation with raw Buffer[] collection and a single Buffer.concat().toString('utf8') at the end. This guarantees:

  1. The complete byte sequence is assembled before decoding
  2. Correct UTF-8 decoding regardless of how chunks happen to be split
  3. Robustness against both current and future undici behavior
```diff
  } else {
    const payload: Buffer[] = []
    let currentLength = 0
-   response.body.setEncoding('utf8')
    for await (const chunk of response.body) {
-     currentLength += Buffer.byteLength(chunk)
+     const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)
+     currentLength += buf.byteLength
      if (currentLength > maxResponseSize) {
        response.body.destroy()
        throw new RequestAbortedError(...)
      }
-     payload.push(chunk)
+     payload.push(buf)
    }
    return {
      statusCode: response.statusCode,
      headers: response.headers,
      body: Buffer.concat(payload).toString('utf8')
    }
  }
```
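For reference, Node's `StringDecoder` (from `node:string_decoder`) is the primitive that a correct `setEncoding()` would delegate to: it buffers an incomplete multi-byte sequence between `write()` calls. A minimal sketch showing it handles the same mid-character split (the `Buffer.concat` approach in this PR is still simpler and independent of undici's behavior):

```typescript
import { StringDecoder } from 'node:string_decoder'

const bytes = Buffer.from('傳') // UTF-8: e5 82 b3
const decoder = new StringDecoder('utf8')

// The first write() holds back the incomplete e5 82 prefix and returns '';
// the second write() completes the sequence and returns the character.
const text =
  decoder.write(bytes.subarray(0, 2)) +
  decoder.write(bytes.subarray(2)) +
  decoder.end()
// text === '傳'
```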

Tests

Added a test, `UTF-8 multi-byte characters not corrupted when split across chunk boundaries`, in test/unit/undici-connection.test.ts. It constructs an HTTP response in which a 3-byte CJK character (傳, bytes e5 82 b3) is deliberately split across two chunks and asserts:

  1. The response body contains no U+FFFD
  2. The decoded text exactly matches the original

All 87 existing tests in undici-connection.test.ts continue to pass.

Production verification

Verified on a production container (Node v24.14.1, @elastic/elasticsearch 9.1.1, @elastic/transport 9.1.2, undici 7.15.0) querying an Elasticsearch index containing Chinese text:

| Approach | U+FFFD count per ~40KB response |
| --- | --- |
| Current (`setEncoding('utf8')` + string concat) | 8–10 |
| This PR (`Buffer.concat`) | 0 |

Corruption was deterministic — same request, same FFFD positions every time.

CLA

I will sign the Elastic Contributor License Agreement when the CLA bot prompts.

