fix: use Buffer.concat for UTF-8 response body to prevent multi-byte character corruption#364

Open
joecwu wants to merge 1 commit into elastic:main from joecwu:fix/utf8-chunked-response-corruption

Conversation

@joecwu joecwu commented Apr 9, 2026

Bug

UndiciConnection.ts uses response.body.setEncoding('utf8') + string concatenation to decode HTTP response bodies. Due to an upstream bug in undici's setEncoding() implementation (see nodejs/undici#5002), multi-byte UTF-8 characters (CJK — Chinese/Japanese/Korean — and emoji) that span HTTP chunk boundaries get silently replaced with U+FFFD (replacement character), corrupting response data.

Closes #363

Root cause

undici's BodyReadable.setEncoding() does not initialize a StringDecoder on _readableState. As a result, when iterating the response body with string encoding enabled, Node.js falls back to buf.toString('utf8') on each individual chunk — which cannot handle incomplete multi-byte sequences at chunk boundaries.

The fix for the upstream bug is being proposed in nodejs/undici#5003, but even once that lands, transport should still avoid the fragile setEncoding + string concat pattern.

Fix

Replace setEncoding('utf8') + string concatenation with raw Buffer[] collection and a single Buffer.concat().toString('utf8') at the end. This guarantees:

  1. The complete byte sequence is assembled before decoding
  2. Correct UTF-8 decoding regardless of how chunks happen to be split
  3. Robustness against both current and future undici behavior
```diff
  } else {
    const payload: Buffer[] = []
    let currentLength = 0
-   response.body.setEncoding('utf8')
    for await (const chunk of response.body) {
-     currentLength += Buffer.byteLength(chunk)
+     const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)
+     currentLength += buf.byteLength
      if (currentLength > maxResponseSize) {
        response.body.destroy()
        throw new RequestAbortedError(...)
      }
-     payload.push(chunk)
+     payload.push(buf)
    }
    return {
      statusCode: response.statusCode,
      headers: response.headers,
      body: Buffer.concat(payload).toString('utf8')
    }
  }
```
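For reference, Node's `StringDecoder` (from `node:string_decoder`) is the primitive that a correct `setEncoding()` would delegate to: it buffers an incomplete multi-byte sequence between `write()` calls. A minimal sketch showing it handles the same mid-character split (the `Buffer.concat` approach in this PR is still simpler and independent of undici's behavior):

```typescript
import { StringDecoder } from 'node:string_decoder'

const bytes = Buffer.from('傳') // UTF-8: e5 82 b3
const decoder = new StringDecoder('utf8')

// The first write() holds back the incomplete e5 82 prefix and returns '';
// the second write() completes the sequence and returns the character.
const text =
  decoder.write(bytes.subarray(0, 2)) +
  decoder.write(bytes.subarray(2)) +
  decoder.end()
// text === '傳'
```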

Tests

Added a test, `UTF-8 multi-byte characters not corrupted when split across chunk boundaries`, in test/unit/undici-connection.test.ts. It constructs an HTTP response in which a 3-byte CJK character (傳, bytes e5 82 b3) is deliberately split across two chunks and asserts:

  1. The response body contains no U+FFFD
  2. The decoded text exactly matches the original

All 87 existing tests in undici-connection.test.ts continue to pass.

Production verification

Verified on a production container (Node v24.14.1, @elastic/elasticsearch 9.1.1, @elastic/transport 9.1.2, undici 7.15.0) querying an Elasticsearch index containing Chinese text:

| Approach | U+FFFD count per ~40KB response |
| --- | --- |
| Current (`setEncoding('utf8')` + string concat) | 8–10 |
| This PR (`Buffer.concat`) | 0 |

Corruption was deterministic — same request, same FFFD positions every time.

CLA

I will sign the Elastic Contributor License Agreement when the CLA bot prompts.

