Improve ergonomics - rebuild Reader around native utf-8 string types by dralley · Pull Request #963 · tafia/quick-xml

dralley · 2026-05-11T15:01:37Z

Add UTF-8 validation in Reader internals for parsing events
Change name / namespace types from &[u8] to &str
Change event types from Cow<[u8]> to Cow<str>, remove Decoder
Change attribute types from Cow<[u8]> to Cow<str>
Remove Decoder & methods from public API
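
To make the change concrete, here is a minimal sketch of the "validate once, then hand out strings" idea using only the standard library. The function `event_text` is a hypothetical illustration, not part of the quick-xml API, and the real Reader internals may be organized differently.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical helper (not quick-xml API): validate raw event bytes once
/// and hand back a borrowed string view on success.
fn event_text(raw: &[u8]) -> Result<Cow<'_, str>, Utf8Error> {
    // `from_utf8` checks the whole slice and borrows it without copying.
    std::str::from_utf8(raw).map(Cow::Borrowed)
}
```

With a `Cow<str>` in hand, callers can use ordinary `str` methods directly instead of going through a separate decoding step, and invalid input is rejected at the boundary.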

codecov-commenter · 2026-05-11T15:16:18Z

Codecov Report

❌ Patch coverage is 75.53648% with 171 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.27%. Comparing base (e00ae5c) to head (9d834a5).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
src/events/mod.rs	56.75%	64 Missing ⚠️
src/events/attributes.rs	82.17%	18 Missing ⚠️
src/name.rs	87.02%	17 Missing ⚠️
examples/custom_entities.rs	0.00%	10 Missing ⚠️
src/reader/state.rs	85.29%	10 Missing ⚠️
examples/read_nodes.rs	0.00%	9 Missing ⚠️
benches/macrobenches.rs	0.00%	8 Missing ⚠️
src/reader/buffered_reader.rs	79.31%	6 Missing ⚠️
src/reader/slice_reader.rs	77.77%	6 Missing ⚠️
examples/nested_readers.rs	0.00%	5 Missing ⚠️
... and 10 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #963      +/-   ##
==========================================
- Coverage   57.31%   56.27%   -1.04%     
==========================================
  Files          46       47       +1     
  Lines       18197    18135      -62     
==========================================
- Hits        10429    10205     -224     
- Misses       7768     7930     +162

Flag	Coverage Δ
unittests	`56.27% <75.53%> (-1.04%)`	⬇️

dralley · 2026-05-11T16:44:45Z

-                Cow::Owned(owned) => CowRef::Owned(owned),
-            },
+            Cow::Borrowed(b) => {
+                let name_str = std::str::from_utf8(&b[..start.name_len])


There will be a handful of these temporary from_utf8() calls, but they should be removable in subsequent commits as additional types are switched over.
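
As a rough illustration of what such a temporary bridge looks like, the sketch below validates only the name portion of a start tag that is still stored as raw bytes. `RawStart` is a hypothetical stand-in, not the actual BytesStart definition; only the `name_len` field name is taken from the diff above.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for a start tag stored as raw bytes, e.g. `a href="x"`,
/// with `name_len` marking where the element name ends.
struct RawStart<'a> {
    buf: &'a [u8],
    name_len: usize,
}

impl<'a> RawStart<'a> {
    /// Temporary bridge: validate only the name portion, on demand.
    fn name(&self) -> Result<&'a str, Utf8Error> {
        let buf: &'a [u8] = self.buf;
        std::str::from_utf8(&buf[..self.name_len])
    }
}
```

Once the buffer itself is stored as a string type, the slice can be taken directly and the per-call validation disappears.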

dralley · 2026-05-11T19:50:22Z

@Mingun Would you be satisfied if BinaryStream / raw byte buffers were only supported when the rest of the document apart from those buffers is UTF-8?

edit: well, that's what I implemented.

Sidenote: maybe decoded_and_normalized_value() (etc.) ought to be marked deprecated in 0.40.1, to point people in the direction of using DecodingReader

dralley · 2026-05-11T22:15:36Z

These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming.

dralley · 2026-05-12T21:27:23Z

Remaining design questions, not all of which actually need to be dealt with in this PR:

- Should Reader and / or Deserializer wrap a DecodingReader automatically if the encoding feature is enabled?
  - If we do, should we use from_utf8_unchecked() (since input is pre-validated)?
  - Should some form of built-in stream decoding or stream validation be the "default", with a special feature to disable it for access to BinaryStream / Reader::stream()?
  - Should slice_reader be converted to &str-only, with a reader over &[u8] using the buffered_reader path?
- Should BytesStart, BytesEnd, BytesText (etc.) be renamed since they are no longer raw bytes but rather guaranteed UTF-8?
  - e.g. XmlStart, XmlEnd, XmlText or just Start, End, Text (which would maybe be ambiguous given Event::Text, Event::Start etc.)
- Would introducing a Utf8ValidatingReader be worthwhile, or is the encoding_rs dependency not that big a deal?
- Should a similar built-in wrapper of the inner Reader be used to track position within the file globally (line numbers, etc.) and perform EOL normalization?
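
For the from_utf8_unchecked() question, the trade-off can be sketched as follows. `Validated` is a hypothetical type invented for this example; it only shows the shape of the invariant that would have to hold for the unchecked call to be sound.

```rust
/// Hypothetical pre-validated buffer: the constructor is the only place
/// where UTF-8 is checked.
struct Validated<'a> {
    bytes: &'a [u8],
}

impl<'a> Validated<'a> {
    /// Validates once; returns `None` if `bytes` is not well-formed UTF-8.
    fn new(bytes: &'a [u8]) -> Option<Self> {
        std::str::from_utf8(bytes).ok().map(|_| Validated { bytes })
    }

    /// Returns the whole buffer as `&str` without re-validating.
    fn as_str(&self) -> &'a str {
        let bytes: &'a [u8] = self.bytes;
        // SAFETY: `new` is the only constructor and it validated `bytes`.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}
```

A safe alternative is to store the `&str` returned by the initial validation, which makes the unchecked call unnecessary. The unsafe variant only pays off when the validated bytes and their string view cannot easily be kept together.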

dralley · 2026-05-12T21:46:57Z

Also, I can improve the commit messages and Changelog entries if needed. The initial are pretty... concise.

Mingun · 2026-05-13T17:15:48Z

First, I would prefer to keep the ability to parse non-UTF-8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding; it is enough that the encoding is XML-compatible (as are all the legacy 1-byte encodings that we support). So, is it possible to create a separate reader and event type which will always be UTF-8 encoded, and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as the default, but keep the ability to work with non-UTF-8 input without recoding.

The situation is the same as with regexes: although they are defined in terms of strings, nothing prevents a regex engine from running on top of arbitrary byte arrays. The author of the regex engine even created the bstr crate to add useful string-like methods to byte arrays.
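
The point about parsing without knowing the exact encoding can be illustrated with a deliberately simplified sketch. It assumes an ASCII-compatible single-byte encoding and ignores `>` inside attribute values, comments, and CDATA; `markup_spans` is a hypothetical function, not anything in quick-xml.

```rust
/// Returns the byte ranges of `<...>` markup in `input` without decoding it.
/// Assumes an ASCII-compatible single-byte encoding, where `<` and `>` keep
/// their ASCII byte values and never appear inside another character.
fn markup_spans(input: &[u8]) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut open = None;
    for (i, &b) in input.iter().enumerate() {
        match b {
            b'<' => open = Some(i),
            b'>' => {
                // Close the most recent unmatched `<`, if there is one.
                if let Some(start) = open.take() {
                    spans.push((start, i + 1));
                }
            }
            _ => {}
        }
    }
    spans
}
```

For the Latin-1 input `<a>\xe9</a>`, which is not valid UTF-8, this still locates both tags, because only the ASCII delimiters are inspected.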

dralley · 2026-05-13T19:27:21Z

> First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding.

IMO, it is not worth the ergonomic and maintenance costs. If you look at the major XML parsing libraries, such as libxml2, expat, encoding/xml (Go), and Jackson (Java), they all do internal transcoding and throw errors if they encounter something that can't be decoded (or replaced), with no escape hatch. I suspect that if this were a significant use case, we would not be the only ones catering to it.

e.g.

libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings
https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/encoding.html

expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected
https://libexpat.github.io/doc/expat-internals-encodings/

encoding/xml is the same as libxml2 - utf-8 only
https://pkg.go.dev/encoding/xml (search CharsetReader)

Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either.
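
As a back-of-the-envelope check on those figures, the relative cost of a separate decode pass can be computed directly. This sketch assumes decoding and parsing run one after the other with no overlap, which is a simplification.

```rust
/// Fraction of extra time added by a decode pass that runs before parsing,
/// given both throughputs in the same unit (here Gbps).
fn decode_overhead(parse_gbps: f64, decode_gbps: f64) -> f64 {
    // Time per bit is 1 / throughput, so decode time relative to parse time
    // is (1 / decode) / (1 / parse) = parse / decode.
    parse_gbps / decode_gbps
}
```

With parsing at 0.5 Gbps, a decoder running at 15 Gbps adds roughly 3.3% to the total time, and one running at 90 Gbps adds roughly 0.6%.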

I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time.

Duplicating the reader would, I think, be way way more work than it's worth.

dralley · 2026-05-13T19:57:58Z

Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard says that all XML processors must be able to handle both UTF-8 and UTF-16, and it mandates fatal decoding errors in many situations. The easiest way to satisfy the requirements is to just do what everyone does, which is to decode the document up front and build a parser against one canonical encoding.

I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value.

https://www.w3.org/TR/xml/

Section 2.2

The mechanism for encoding character code points into bit patterns may vary from entity to entity. **All XML processors MUST accept** the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Section 4.3.3 - Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.

...

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
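
In Rust terms, the "ill-formed code unit sequences" rule maps closely onto what `std::str::from_utf8` already reports. The helper below is a hypothetical sketch showing how a processor could locate the first offending byte; it is not how quick-xml surfaces the error.

```rust
/// Returns the byte offset of the first ill-formed UTF-8 sequence, if any.
fn first_invalid_offset(input: &[u8]) -> Option<usize> {
    match std::str::from_utf8(input) {
        Ok(_) => None,
        // `valid_up_to` is the length of the longest valid prefix.
        Err(e) => Some(e.valid_up_to()),
    }
}
```

For the input `<a>\xff</a>` this returns `Some(3)`, the position of the byte that can never appear in UTF-8.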

dralley · 2026-06-29T16:00:53Z

@Mingun Can you give me some idea of when you plan to review this? I will be away from my laptop for a week and a half starting this weekend.

Mingun

I started reviewing on a per-commit basis about a month ago, but finished just now using the final diff so that we can move forward. So some of my last questions may be dumb.

I still think that changing the low level from [u8] to str is not the best idea. It would be better done at the middle level, where we will also expand general entity references (#948).

Mingun · 2026-05-24T18:17:01Z

        let result = match event.into() {
            Event::Start(e) => {
-                let result = self.write_wrapped(b"<", &e, b">");
+                let result = self.write_wrapped(b"<", e.as_bytes(), b">");


Shouldn't those functions be changed to accept &str?

For the inner functions which are writing to output, it doesn't make much difference. Could be either way.

But I will change it anyways. It does clean things up slightly.
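
A `&str`-based version of such a helper could look roughly like the sketch below. This is a hypothetical free function written for illustration; the actual `write_wrapped` is a method on the writer and its signature is not shown in this thread.

```rust
use std::io::{self, Write};

/// Hypothetical free-function version of a `write_wrapped` helper that takes
/// `&str` pieces and writes them to any `io::Write` sink.
fn write_wrapped<W: Write>(out: &mut W, before: &str, value: &str, after: &str) -> io::Result<()> {
    out.write_all(before.as_bytes())?;
    out.write_all(value.as_bytes())?;
    out.write_all(after.as_bytes())
}
```

Since the sink still receives bytes, the change only moves the `as_bytes()` calls from the call sites into the helper, which matches the "could be either way" assessment.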

Mingun · 2026-06-29T17:42:31Z

    }
-
-    #[cfg(feature = "encoding")]
-    mod utf16 {


Is it not possible to keep those tests?

It's probably possible, but I don't understand what value it would provide.

The actual mechanism of this library's support for UTF-16 involves processing it before it ever reaches the parser / Deserializer code in the first place. There's not really a use case where a user could Deserialize XML markup directly from UTF-16 without being pre-decoded - and to the extent that any of these tests work, it's only because they don't involve any actual XML markup? All the nontrivial cases here already expect failure.

There's a UTF-8 copy of all of these test cases a few lines up, which covers everything a user could hit in practice. IMO it would make sense to drop or rename the module though.
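
The "decode before it reaches the parser" flow can be sketched with the standard library alone. This is an illustration under assumptions: the library's own path goes through its encoding feature rather than this code, and `utf16le_to_string` is a hypothetical helper that handles only BOM-less UTF-16LE.

```rust
/// Decodes UTF-16LE bytes (without a BOM) into a `String`.
/// Returns `None` for an odd byte count or unpaired surrogates.
fn utf16le_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Reassemble little-endian 16-bit code units from byte pairs.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

The resulting `String` is what a UTF-8-only parser or Deserializer would then consume, so the parser itself never sees UTF-16 code units.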

Mingun · 2026-06-29T17:58:12Z

Also just want to inform you, that I plan to release 0.41.0 now with the latest security fixes.

dralley · 2026-06-29T22:25:56Z

@Mingun Updated

I will wait to rebase / resolve the merge conflicts until you're done reviewing.

dralley · 2026-06-30T14:41:16Z

Low level -- current Reader
Middle level -- #948 (performs automatic expand of general entity references)
High level -- serde interface

I'm not sure I agree, but I'm open to hearing the argument in favor. I can think of some positives and some negatives.

Is this a blocking issue or something that can be worked on later?

dralley · 2026-07-02T17:27:18Z

I rebased anyway, but all the changes are in new commits, everything after and including 22c4a94 "Make DeError::UnexpectedStart carry String"

dralley force-pushed the ergonomics-str branch from 82cd028 to 6959046 Compare May 11, 2026 15:03

dralley changed the title ~~Improve ergonomics - rebuild Reader around native &str types~~ Improve ergonomics - rebuild Reader around native utf-8 string types May 11, 2026

dralley force-pushed the ergonomics-str branch 4 times, most recently from e028123 to 4a01f13 Compare May 11, 2026 19:31

dralley force-pushed the ergonomics-str branch from c8749de to 1939e95 Compare May 12, 2026 18:20

dralley force-pushed the ergonomics-str branch 3 times, most recently from a198941 to 27e09f5 Compare May 12, 2026 18:50

dralley force-pushed the ergonomics-str branch 3 times, most recently from 5474a71 to 266bf6f Compare May 12, 2026 20:58

dralley marked this pull request as ready for review May 12, 2026 21:11

dralley requested a review from Mingun May 12, 2026 21:37

dralley force-pushed the ergonomics-str branch 2 times, most recently from 4ba0d68 to cd8da9b Compare May 12, 2026 22:40

dralley mentioned this pull request May 13, 2026

Support for embedded raw binary with serde API #792

Open

dralley mentioned this pull request May 13, 2026

ergonomics & encodings #158

Open

Mingun reviewed Jun 29, 2026

dralley force-pushed the ergonomics-str branch 2 times, most recently from 7d7578b to f023822 Compare June 29, 2026 22:46

dralley added 21 commits July 2, 2026 13:19

Reader now verifies that events are UTF-8

5a1baec

Document that Reader expects UTF-8 input.

Make Name types (QName, LocalName, etc.) str-based

272e87f

Remove Decoder from Event type / Attribute structs

65a300c

Convert Event types to str

f7ad538

Bump MSRV to 1.86

02b4417

Required for const fn split_at - needed to keep trim_xml* functions const.

Change Deref<Target=..> from [u8] to &str

970048f

Remove decode() method from BytesText, etc.

97f1c6e

Make xml*_content() methods infallible as they no longer handle decoding.

Convert Attributes to Cow<str>

b1217f1

Deprecate decode_and* methods, since they no longer serve a purpose.

Remove remnants of Decoder from the Reader API

80b08e7

Fix fuzzing error

516fb89

Convert ReaderState.opened_buffer from Vec<u8> to String

5e12bc2

Moved UTF-8 validation from ReaderState::emit_* to XmlSource

0a44d8c

It is now impossible for ReaderState to receive unvalidated bytes. This avoids some redundant validation and allows making different decisions about how to validate for different types of XmlSource.

Convert NamespaceResolver::buffer from Vec<u8> to String

43ca2c8

Eliminates some duplicate validation

Make DeError::UnexpectedStart carry String

22c4a94

Derive Debug for multiple types

c2350c8

Custom impl no longer required after converting to String-based types.

Use regex::Regex instead of regex::bytes::Regex

2aed4a1

Fix a couple of signatures

762b649

BytesStart / BytesPI::attributes_raw() ought to return &str BytesStart::try_get_attribute() ought to take &str - drop the AsRef also.

Convert write_wrapped() to &str

3d1be68

It's a little cleaner, makes no practical difference otherwise.

Update Changelog

5f6a1f5

Fix docstrings

3305b05

Restore const on a few functions

9d834a5

Possible now that the MSRV is bumped to 1.86

dralley force-pushed the ergonomics-str branch from f023822 to 9d834a5 Compare July 2, 2026 17:25
