Improve ergonomics - rebuild Reader around native utf-8 string types#963
Improve ergonomics - rebuild Reader around native utf-8 string types#963dralley wants to merge 21 commits into
Conversation
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #963 +/- ##
==========================================
- Coverage 57.31% 56.27% -1.04%
==========================================
Files 46 47 +1
Lines 18197 18135 -62
==========================================
- Hits 10429 10205 -224
- Misses 7768 7930 +162
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| Cow::Owned(owned) => CowRef::Owned(owned), | ||
| }, | ||
| Cow::Borrowed(b) => { | ||
| let name_str = std::str::from_utf8(&b[..start.name_len]) |
There was a problem hiding this comment.
There will be a handful of these temporary from_utf8() calls, but they should be able to be removed by by subsequent commits as additional types are switched over.
e028123 to
4a01f13
Compare
|
@Mingun Would you be satisfied if edit: well, that's what I implemented. Sidenote: maybe |
|
These 3 particular commits are ready for review, with the caveat that there will be (probably) 6-8 additional commits coming. |
a198941 to
27e09f5
Compare
5474a71 to
266bf6f
Compare
|
Remaining design questions, not all of which actually need to be dealt with in this PR:
|
|
Also, I can improve the commit messages and Changelog entries if needed. The initial are pretty... concise. |
4ba0d68 to
cd8da9b
Compare
|
First, I would prefer to keep the ability to parse non-utf8 encoded documents without recoding. XML itself can be parsed without knowing the exact encoding, it is enough if it is XML-compatible (which is all legacy 1-byte encodings that we support). So, is it possible to create a separate reader and event which will be always UTF-8 encoded and keep the current ones for advanced usage? It is fine to promote the new UTF-8-based reader as default, but keep the ability to work with non-UTF-8 input without recoding. Here is the same situation as for regexp -- although it is defined in terms of strings, nothing prevents it from running on top of any byte arrays. The author of regexp engine even created a |
IMO, it is not worth the ergonomic and maintenance costs. If you look at all the major XML parsing libraries like e.g. libxml2 parses & handles UTF-8 only, performs a streaming decode of other encodings expat selects either UTF-8 or UTF-16 as an internal encoding at compile time, decodes to that, returns whichever type of string was selected encoding/xml is the same as libxml2 - utf-8 only Decoding is very very fast relative to XML parsing - it varies depending on encoding and the precise makeup of the document of course, but generally between 15 and 90 Gbps, whereas XML parsing is currently in the ballpark of 0.5 Gbps and often slower, so I don't really think that's a reason to avoid it either. I would maybe accept the argument that it's a huge API change and it might be warranted to support both for some time to allow a migration, but even then it would likely be easier to just maintain an older branch for a longer period of time. Duplicating the reader would, I think, be way way more work than it's worth. |
|
Also, the reason the XML libraries work that way, apart from overall simplicity, is that the XML standard effectively requires working that way. The standard actually said that all XML processors should be able to handle I'm not a complete stickler for compliance, and we do provide a handful of features catering to noncompliant XML and XML-derived document formats (which is fine), but in this case I really don't see a good reason to go out of our way to break with it. It's just more complexity for a use case of (IMO) very questionable value. Section 2.2
Section 4.3.3 - Character Encoding in Entities
...
|
|
@Mingun Can you give me some idea of when you plan to review this, I will be away from my laptop for a week and a half starting this weekend. |
Mingun
left a comment
There was a problem hiding this comment.
I started review on per-commit basis about month ago, but finished it now using the final diff, to we may move forward. So maybe some last questions may be dumb.
I still think, that changing the low level from [u8] to str is not a better idea. It would be better at middle-level, where we will also expand general entity references (#948).
| let result = match event.into() { | ||
| Event::Start(e) => { | ||
| let result = self.write_wrapped(b"<", &e, b">"); | ||
| let result = self.write_wrapped(b"<", e.as_bytes(), b">"); |
There was a problem hiding this comment.
Shouldn't those functions be changed to accept &str?
There was a problem hiding this comment.
For the inner functions which are writing to output, it doesn't make much difference. Could be either way.
There was a problem hiding this comment.
But I will change it anyways. It does clean things up slightly.
| } | ||
|
|
||
| #[cfg(feature = "encoding")] | ||
| mod utf16 { |
There was a problem hiding this comment.
Is it not possible to keep those tests?
There was a problem hiding this comment.
It's probably possible but I don't understand what value it would provide
The actual mechanism of this library's support for UTF-16 involves processing it before it ever reaches the parser / Deserializer code in the first place. There's not really a use case where a user could Deserialize XML markup directly from UTF-16 without being pre-decoded - and to the extent that any of these tests work, it's only because they don't involve any actual XML markup? All the nontrivial cases here already expect failure.
There's a UTF-8 copy of all of these tests cases a few lines up, which covers everything a user could hit in practice. IMO it would make sense to drop or rename the module though.
|
Also just want to inform you, that I plan to release 0.41.0 now with the latest security fixes. |
|
@Mingun Updated I will wait to rebase / resolve the merge conflicts until you're done reviewing. |
7d7578b to
f023822
Compare
I'm not sure I agree, but I'm open to hearing the argument in favor. I can think of some positives and some negatives. Is this a blocking issue or something that can be worked on later? |
Document that Reader expects UTF-8 input.
Required for const fn split_at - needed to keep trim_xml* functions const.
Make xml*_content() methods infalliable as they no longer handle decoding.
Deprecate decode_and* methods, since they no longer serve a purpose.
It is now impossible for ReaderState to receive unvalidated bytes. This avoids some redundant validation and allows making different decisions about how to validate for different types of XmlSource.
Eliminates some duplicitous validation
Custom impl no longer required after converting to String-based types.
BytesStart / BytesPI::attributes_raw() ought to return &str BytesStart::try_get_attribute() ought to take &str - drop the AsRef also.
It's a little cleaner, makes no practical difference otherwise.
Possible now that the MSRV is bumped to 1.86
|
I rebased anyway, but all the changes are in new commits, everything after and including 22c4a94 "Make DeError::UnexpectedStart carry String" |
Cow<[u8]>toCow<str>, remove DecoderCow<[u8]>toCow<str>