Improve ergonomics - rebuild Reader around native utf-8 string types#963

Open
dralley wants to merge 21 commits into
tafia:masterfrom
dralley:ergonomics-str

Conversation

@dralley

@dralley dralley commented May 11, 2026

Collaborator
  • Add UTF-8 validation in Reader internals for parsing events
  • Change name / namespace types from &[u8] to &str
  • Change event types from Cow<[u8]> to Cow<str>, remove Decoder
  • Change attribute types from Cow<[u8]> to Cow<str>
  • Remove Decoder & methods from public API
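
For illustration, the ergonomic difference these bullets describe can be sketched with two stand-in types. `ByteAttr`, `StrAttr`, and both `parse_width_*` functions are hypothetical names invented for this sketch; they are not the crate's actual types or API.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical stand-in for a byte-based attribute value (the old shape).
pub struct ByteAttr<'a> {
    pub value: Cow<'a, [u8]>,
}

/// Hypothetical stand-in for a string-based attribute value (the new shape).
pub struct StrAttr<'a> {
    pub value: Cow<'a, str>,
}

/// With bytes, every caller has to validate and handle a decoding error
/// before it can do anything string-like with the value.
pub fn parse_width_bytes(attr: &ByteAttr<'_>) -> Result<Option<u32>, Utf8Error> {
    let text = std::str::from_utf8(&attr.value)?;
    Ok(text.trim().parse().ok())
}

/// With `str`, validation is assumed to have happened once, upstream,
/// so the caller only deals with its own parsing concern.
pub fn parse_width_str(attr: &StrAttr<'_>) -> Option<u32> {
    attr.value.trim().parse().ok()
}
```

The second function has one fewer failure mode to propagate, which is the kind of simplification the change is aiming for.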

@codecov-commenter

codecov-commenter commented May 11, 2026


Codecov Report

❌ Patch coverage is 75.53648% with 171 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.27%. Comparing base (e00ae5c) to head (9d834a5).
⚠️ Report is 6 commits behind head on master.

Files with missing lines         Patch %   Lines
src/events/mod.rs                56.75%    64 Missing ⚠️
src/events/attributes.rs         82.17%    18 Missing ⚠️
src/name.rs                      87.02%    17 Missing ⚠️
examples/custom_entities.rs      0.00%     10 Missing ⚠️
src/reader/state.rs              85.29%    10 Missing ⚠️
examples/read_nodes.rs           0.00%     9 Missing ⚠️
benches/macrobenches.rs          0.00%     8 Missing ⚠️
src/reader/buffered_reader.rs    79.31%    6 Missing ⚠️
src/reader/slice_reader.rs       77.77%    6 Missing ⚠️
examples/nested_readers.rs       0.00%     5 Missing ⚠️
... and 10 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #963      +/-   ##
==========================================
- Coverage   57.31%   56.27%   -1.04%     
==========================================
  Files          46       47       +1     
  Lines       18197    18135      -62     
==========================================
- Hits        10429    10205     -224     
- Misses       7768     7930     +162     
Flag Coverage Δ
unittests 56.27% <75.53%> (-1.04%) ⬇️

@dralley dralley changed the title Improve ergonomics - rebuild Reader around native &str types Improve ergonomics - rebuild Reader around native utf-8 string types May 11, 2026
Comment thread src/de/key.rs Outdated
Cow::Owned(owned) => CowRef::Owned(owned),
},
Cow::Borrowed(b) => {
let name_str = std::str::from_utf8(&b[..start.name_len])

Collaborator Author

There will be a handful of these temporary from_utf8() calls, but subsequent commits should be able to remove them as additional types are switched over.

@dralley dralley force-pushed the ergonomics-str branch 4 times, most recently from e028123 to 4a01f13 Compare May 11, 2026 19:31
@dralley

dralley commented May 11, 2026

Collaborator Author

@Mingun Would you be satisfied if BinaryStream / raw byte buffers were only supported when the rest of the document apart from those buffers is UTF-8?

edit: well, that's what I implemented.

Sidenote: maybe decoded_and_normalized_value() (etc.) ought to be marked deprecated in 0.40.1, to point people in the direction of using DecodingReader

@dralley

dralley commented May 11, 2026

Collaborator Author

These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming.

Comment thread src/utils.rs Outdated
@dralley dralley force-pushed the ergonomics-str branch 3 times, most recently from a198941 to 27e09f5 Compare May 12, 2026 18:50
Comment thread src/events/attributes.rs
@dralley dralley force-pushed the ergonomics-str branch 3 times, most recently from 5474a71 to 266bf6f Compare May 12, 2026 20:58
@dralley dralley marked this pull request as ready for review May 12, 2026 21:11
@dralley

dralley commented May 12, 2026

Collaborator Author

Remaining design questions, not all of which actually need to be dealt with in this PR:

  • Should Reader and / or Deserializer wrap a DecodingReader automatically if the encoding feature is enabled?
    • If we do, should we use from_utf8_unchecked() (since input is pre-validated)?
    • Should some form of built-in stream decoding or stream validation be the "default", with a special feature to disable it for access to BinaryStream / Reader::stream()?
    • Should slice_reader be converted to &str-only, with a reader over &[u8] using the buffered_reader path?
  • Should BytesStart, BytesEnd, BytesText (etc.) be renamed since they are no longer raw bytes but rather guaranteed UTF-8?
    • e.g. XmlStart, XmlEnd, XmlText or just Start, End, Text (which would maybe be ambiguous given Event::Text, Event::Start etc.)
  • Would introducing a Utf8ValidatingReader be worthwhile, or is the encoding_rs dependency not that big a deal?
  • Should a similar built-in wrapper of the inner Reader be used to track position within the file globally (line numbers, etc.) and perform EOL normalization?
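
To make the stream-validation question more concrete, here is a minimal sketch of incremental UTF-8 validation using only the standard library. `Utf8ChunkValidator` and `InvalidUtf8` are hypothetical names for this sketch, not a proposed API. It shows the main subtlety: a multi-byte sequence can be split across two chunks and must be carried over rather than rejected.

```rust
use std::str;

/// Hypothetical incremental validator. Carries an incomplete trailing
/// sequence (at most 3 bytes) from one chunk to the next.
#[derive(Default)]
pub struct Utf8ChunkValidator {
    pending: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub struct InvalidUtf8;

impl Utf8ChunkValidator {
    /// Validates `chunk` and returns the text that is complete so far.
    pub fn feed(&mut self, chunk: &[u8]) -> Result<String, InvalidUtf8> {
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);
        let valid = match str::from_utf8(&buf) {
            Ok(_) => buf.len(),
            // `error_len() == None` means the input ended in the middle of a
            // sequence that may still become valid once more bytes arrive.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Anything else is a definitely ill-formed sequence.
            Err(_) => return Err(InvalidUtf8),
        };
        // Keep the incomplete tail for the next call.
        self.pending = buf.split_off(valid);
        // This re-validates the prefix; it is the spot where an unchecked
        // conversion could be considered once input is known to be valid.
        String::from_utf8(buf).map_err(|_| InvalidUtf8)
    }

    /// Call at end of input: leftover bytes mean a truncated sequence.
    pub fn finish(self) -> Result<(), InvalidUtf8> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            Err(InvalidUtf8)
        }
    }
}
```

A real wrapper would presumably implement `Read` or `BufRead` and avoid the extra allocation per chunk; this version favors clarity over speed.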

@dralley dralley requested a review from Mingun May 12, 2026 21:37
@dralley

dralley commented May 12, 2026

Collaborator Author

Also, I can improve the commit messages and Changelog entries if needed. The initial ones are pretty... concise.

@dralley dralley force-pushed the ergonomics-str branch 2 times, most recently from 4ba0d68 to cd8da9b Compare May 12, 2026 22:40
@Mingun

Mingun commented May 13, 2026

Collaborator

First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding; it is enough that the encoding is XML-compatible (which all the legacy 1-byte encodings that we support are). So, is it possible to create a separate reader and event type that are always UTF-8 encoded, and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as the default, but keep the ability to work with non-UTF-8 input without recoding.

Here is the same situation as for regexp -- although it is defined in terms of strings, nothing prevents it from running on top of any byte arrays. The author of regexp engine even created a bstr crate to add useful string-based methods to byte arrays.
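
As a rough illustration of the point about byte-level parsing, the sketch below finds an element name without decoding anything. `first_element_name` is a hypothetical helper written for this example and deliberately naive: it does not distinguish declarations, comments, or processing instructions. It relies only on `<`, `>`, `/`, and whitespace being single ASCII bytes, which holds for UTF-8 and for ASCII-compatible single-byte encodings.

```rust
/// Returns the raw bytes of the first element name, without decoding.
/// Hypothetical helper: only looks for ASCII markup delimiters.
pub fn first_element_name(input: &[u8]) -> Option<&[u8]> {
    // Position of the first `<` byte.
    let open = input.iter().position(|&b| b == b'<')?;
    let rest = &input[open + 1..];
    // The name ends at the first `>`, `/`, or ASCII whitespace byte.
    let end = rest
        .iter()
        .position(|&b| b == b'>' || b == b'/' || b.is_ascii_whitespace())?;
    if end == 0 {
        None
    } else {
        Some(&rest[..end])
    }
}
```

The returned slice may not be valid UTF-8 at all (for example, a name written in a legacy Cyrillic code page), so the caller still has to know the encoding before treating it as text.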

@dralley

dralley commented May 13, 2026

Collaborator Author

> First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding.

IMO, it is not worth the ergonomic and maintenance costs. If you look at the major XML parsing libraries such as libxml2, expat, encoding/xml (Go), and Jackson (Java), they all do internal transcoding and throw errors if they encounter something that can't be decoded (or replaced), with no escape hatch. I suspect that if this were a significant use case, we would not be the only ones catering to it.

e.g.

libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings
https://dev.w3.org/XInclude-Test-Suite/libxml2-2.4.24/libxml2-2.4.24/doc/encoding.html

expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected
https://libexpat.github.io/doc/expat-internals-encodings/

encoding/xml is the same as libxml2 - utf-8 only
https://pkg.go.dev/encoding/xml (search CharsetReader)

Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either.

I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time.

Duplicating the reader would, I think, be way way more work than it's worth.
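
The "decode up front, parse one canonical encoding" approach described in this comment can be sketched as follows. Both functions are hypothetical and written only for this illustration: `decode_latin1` handles ISO-8859-1 by hand (a real implementation would use a transcoding library), and `count_start_tags` is a toy stand-in for a parser that only ever sees `&str`.

```rust
/// Decodes ISO-8859-1 bytes to a `String`. Every byte maps to the Unicode
/// code point with the same value, so this conversion cannot fail.
pub fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Toy stand-in for "the parser": counts start and empty-element tags.
/// It never needs to know what the original encoding was.
pub fn count_start_tags(xml: &str) -> usize {
    xml.match_indices('<')
        // Skip end tags, processing instructions, and comments/declarations.
        .filter(|&(i, _)| {
            !xml[i + 1..].starts_with(|c: char| matches!(c, '/' | '?' | '!'))
        })
        .count()
}
```

Usage would be `count_start_tags(&decode_latin1(raw))`: the encoding concern is dealt with once at the boundary, and everything downstream works on `str`.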

@dralley

dralley commented May 13, 2026

Collaborator Author

Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard says that all XML processors must be able to handle both UTF-8 and UTF-16, and it mandates fatal decoding errors in many situations. The easiest way to satisfy those requirements is to just do what everyone does: decode the document up front and build a parser against one canonical encoding.

I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value.

https://www.w3.org/TR/xml/

Section 2.2

The mechanism for encoding character code points into bit patterns may vary from entity to entity. **All XML processors MUST accept** the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Section 4.3.3 - Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.

...

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
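
A minimal sketch of what these requirements imply for a processor, using only the standard library. `decode_entity`, `decode_utf16`, and `FatalError` are hypothetical names for this example. As a simplifying assumption, the encoding is chosen from the byte order mark alone; the encoding declaration and higher-level protocols mentioned in the spec text are ignored.

```rust
#[derive(Debug, PartialEq)]
pub enum FatalError {
    IllFormedUtf8 { valid_up_to: usize },
    IllFormedUtf16,
}

/// Strictly decodes UTF-16 code units; lone surrogates are an error.
fn decode_utf16(body: &[u8], to_u16: fn([u8; 2]) -> u16) -> Result<String, FatalError> {
    if body.len() % 2 != 0 {
        return Err(FatalError::IllFormedUtf16);
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| to_u16([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).map_err(|_| FatalError::IllFormedUtf16)
}

/// Strictly validates UTF-8; any ill-formed sequence is an error.
fn decode_utf8(body: &[u8]) -> Result<String, FatalError> {
    std::str::from_utf8(body)
        .map(str::to_owned)
        .map_err(|e| FatalError::IllFormedUtf8 { valid_up_to: e.valid_up_to() })
}

/// Assumption for this sketch: the BOM alone selects the encoding,
/// and input without a BOM is treated as UTF-8.
pub fn decode_entity(bytes: &[u8]) -> Result<String, FatalError> {
    match bytes {
        [0xFE, 0xFF, rest @ ..] => decode_utf16(rest, u16::from_be_bytes),
        [0xFF, 0xFE, rest @ ..] => decode_utf16(rest, u16::from_le_bytes),
        [0xEF, 0xBB, 0xBF, rest @ ..] => decode_utf8(rest),
        _ => decode_utf8(bytes),
    }
}
```

In both branches the error is returned instead of being replaced with U+FFFD, which is one way to read "fatal error" in the quoted text.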

@dralley

dralley commented Jun 29, 2026

Collaborator Author

@Mingun Can you give me some idea of when you plan to review this? I will be away from my laptop for a week and a half starting this weekend.

@Mingun Mingun left a comment

Collaborator

I started reviewing on a per-commit basis about a month ago, but finished now using the final diff, so that we may move forward. So some of the last questions may be dumb.

I still think that changing the low level from [u8] to str is not the better idea. It would be better done at the middle level, where we will also expand general entity references (#948).

Comment thread src/de/mod.rs
Comment thread src/de/mod.rs Outdated
Comment thread src/de/resolver.rs Outdated
Comment thread src/events/attributes.rs
Comment thread src/writer.rs Outdated
let result = match event.into() {
Event::Start(e) => {
- let result = self.write_wrapped(b"<", &e, b">");
+ let result = self.write_wrapped(b"<", e.as_bytes(), b">");

Collaborator

Shouldn't those functions be changed to accept &str?

Collaborator Author

For the inner functions which are writing to output, it doesn't make much difference. Could be either way.

Collaborator Author

But I will change it anyways. It does clean things up slightly.
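
For reference, a `&str`-accepting version of such a helper might look roughly like the sketch below. This is a hypothetical free function written for illustration; the real `write_wrapped` is a method on the writer and its actual signature is not shown in this thread.

```rust
use std::io::{self, Write};

/// Hypothetical free-function version of an inner "write wrapped" helper
/// that takes string slices instead of byte slices.
pub fn write_wrapped<W: Write>(
    out: &mut W,
    before: &str,
    value: &str,
    after: &str,
) -> io::Result<()> {
    // The output sink is still byte-oriented, so the conversion to bytes
    // happens once here instead of at every call site.
    out.write_all(before.as_bytes())?;
    out.write_all(value.as_bytes())?;
    out.write_all(after.as_bytes())
}
```

As noted above, the behavior is the same either way; the difference is only whether callers or the helper perform the `as_bytes()` call.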

Comment thread src/events/mod.rs Outdated
Comment thread src/events/mod.rs Outdated
Comment thread src/reader/mod.rs
Comment thread src/reader/mod.rs
Comment thread src/de/simple_type.rs
}

#[cfg(feature = "encoding")]
mod utf16 {

Collaborator

Is it not possible to keep those tests?

@dralley dralley Jun 29, 2026

Collaborator Author

It's probably possible but I don't understand what value it would provide

The actual mechanism of this library's support for UTF-16 involves processing it before it ever reaches the parser / Deserializer code in the first place. There's not really a use case where a user could Deserialize XML markup directly from UTF-16 without being pre-decoded - and to the extent that any of these tests work, it's only because they don't involve any actual XML markup? All the nontrivial cases here already expect failure.

There's a UTF-8 copy of all of these test cases a few lines up, which covers everything a user could hit in practice. IMO it would make sense to drop or rename the module though.

@Mingun

Mingun commented Jun 29, 2026

Collaborator

Also just want to inform you, that I plan to release 0.41.0 now with the latest security fixes.

@dralley

dralley commented Jun 29, 2026

Collaborator Author

@Mingun Updated

I will wait to rebase / resolve the merge conflicts until you're done reviewing.

@dralley dralley force-pushed the ergonomics-str branch 2 times, most recently from 7d7578b to f023822 Compare June 29, 2026 22:46
@dralley

dralley commented Jun 30, 2026

Collaborator Author

Low level -- current Reader
Middle level -- #948 (performs automatic expand of general entity references)
High level -- serde interface

I'm not sure I agree, but I'm open to hearing the argument in favor. I can think of some positives and some negatives.

Is this a blocking issue or something that can be worked on later?

dralley added 21 commits July 2, 2026 13:19
Document that Reader expects UTF-8 input.
Required for const fn split_at - needed to keep trim_xml* functions
const.
Make xml*_content() methods infallible as they no longer handle
decoding.
Deprecate decode_and* methods, since they no longer serve a purpose.
It is now impossible for ReaderState to receive unvalidated bytes.
This avoids some redundant validation and allows making different
decisions about how to validate for different types of XmlSource.
Eliminates some duplicate validation
Custom impl no longer required after converting to String-based types.
BytesStart / BytesPI::attributes_raw() ought to return &str

BytesStart::try_get_attribute() ought to take &str - drop the AsRef
also.
It's a little cleaner, makes no practical difference otherwise.
Possible now that the MSRV is bumped to 1.86
@dralley

dralley commented Jul 2, 2026

Collaborator Author

I rebased anyway, but all the changes are in new commits, everything after and including 22c4a94 "Make DeError::UnexpectedStart carry String"


3 participants