Add Unicode and escape sequence support #624
Regarding the questions:
In the linked issue I was suggesting to terminate
        return 0;
    }

    inline auto encode_utf8(uint32_t cp, auto out) -> void {
Add an assertion for the accepted input range or throw an exception.
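A minimal sketch of what such a range check could look like. The name `encode_utf8_checked` and the explicit template parameter (in place of the `auto out` parameter in the diff) are assumptions made for illustration, not code from this PR. The assertions reject values above U+10FFFF and surrogate code points before any byte is written.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical variant of encode_utf8 with an asserted input range.
// Precondition: cp is a Unicode scalar value (<= 0x10FFFF, not a surrogate).
template <class OutputIt>
void encode_utf8_checked(std::uint32_t cp, OutputIt out) {
    assert(cp <= 0x10FFFF && "code point beyond the Unicode range");
    assert((cp < 0xD800 || cp > 0xDFFF) && "surrogates are not scalar values");
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

With a `std::string s`, calling `encode_utf8_checked(0xE9, std::back_inserter(s))` appends the two bytes `0xC3 0xA9`. If invalid input can come from user-written strings, throwing instead of asserting may be the better fit, since assertions disappear in builds with `NDEBUG`.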
    if (c >= 'A' && c <= 'F') {
        return 10 + (c - 'A');
    }
    if (c >= 'a' && c <= 'f') {
I would typically turn the condition into an assertion and simply return, or throw an exception at the end.
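One way to read this suggestion, sketched under assumptions: the helper name `hex_digit_value` is hypothetical, and the digit branch for `'0'` to `'9'` is assumed since it is not visible in the diff context. The function keeps the range checks from the diff and replaces a silent fallback value with a throw at the end.

```cpp
#include <stdexcept>

// Hypothetical helper: map one hexadecimal digit to its numeric value.
inline int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'A' && c <= 'F') {
        return 10 + (c - 'A');
    }
    if (c >= 'a' && c <= 'f') {
        return 10 + (c - 'a');
    }
    // Anything else is not a hex digit; report it instead of returning 0.
    throw std::invalid_argument("not a hexadecimal digit");
}
```

Throwing at the end keeps the valid cases flat and makes it impossible for a caller to confuse an invalid character with the digit `0`.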
    void unquote(std::string_view in, auto out, bool fstring = false) {
        auto escape = '\0';
        for (auto c : in) {
        for (size_t i = 0; i < in.size(); ++i) {
Using indices here seems a bit clunky. If I cannot use ranges, I typically use iterators. I think the algorithm could be refactored to use iterators.
Since the function appends to an output stream, we could also use exceptions here and simply throw if parsing the string fails.
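A rough sketch of how an iterator-based variant with exceptions might look. Everything here is an assumption for illustration: the names `unquote_sketch`, `sketch_hex_value`, and `sketch_put_utf8` are hypothetical, the `fstring` flag is left out, the set of simple escapes is a guess, and the `\u{<hex>}` form follows the commit message rather than the actual implementation.

```cpp
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>

// Value of one hex digit; throws for any other character.
inline std::uint32_t sketch_hex_value(char c) {
    if (c >= '0' && c <= '9') { return static_cast<std::uint32_t>(c - '0'); }
    if (c >= 'A' && c <= 'F') { return static_cast<std::uint32_t>(10 + (c - 'A')); }
    if (c >= 'a' && c <= 'f') { return static_cast<std::uint32_t>(10 + (c - 'a')); }
    throw std::invalid_argument("invalid hex digit in \\u{...} escape");
}

// Write the UTF-8 encoding of an already validated code point.
template <class OutputIt>
OutputIt sketch_put_utf8(std::uint32_t cp, OutputIt out) {
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
        return out;
    }
    // Continuation byte count and lead-byte marker for the 2, 3, and 4 byte forms.
    int const extra = cp < 0x800 ? 1 : cp < 0x10000 ? 2 : 3;
    std::uint32_t const lead = extra == 1 ? 0xC0 : extra == 2 ? 0xE0 : 0xF0;
    *out++ = static_cast<char>(lead | (cp >> (6 * extra)));
    for (int i = extra - 1; i >= 0; --i) {
        *out++ = static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
    }
    return out;
}

// Hypothetical unquote: iterators instead of indices, exceptions on bad input.
template <class OutputIt>
void unquote_sketch(std::string_view in, OutputIt out) {
    auto it = in.begin();
    auto const end = in.end();
    // Return the next character, or throw if the input stops inside an escape.
    auto next = [&]() -> char {
        if (it == end) { throw std::invalid_argument("unterminated escape sequence"); }
        return *it++;
    };
    while (it != end) {
        char const c = *it++;
        if (c != '\\') {
            *out++ = c;
            continue;
        }
        switch (next()) {
            case 'n': *out++ = '\n'; break;
            case 't': *out++ = '\t'; break;
            case 'r': *out++ = '\r'; break;
            case '\\': *out++ = '\\'; break;
            case '"': *out++ = '"'; break;
            case 'u': {
                if (next() != '{') { throw std::invalid_argument("expected '{' after \\u"); }
                std::uint32_t cp = 0;
                int digits = 0;
                // Accumulate hex digits until the closing brace.
                for (char h = next(); h != '}'; h = next()) {
                    if (++digits > 6) { throw std::invalid_argument("too many hex digits"); }
                    cp = cp * 16 + sketch_hex_value(h);
                }
                bool const surrogate = cp >= 0xD800 && cp <= 0xDFFF;
                if (digits == 0 || cp > 0x10FFFF || surrogate) {
                    throw std::invalid_argument("invalid code point in \\u{...} escape");
                }
                out = sketch_put_utf8(cp, out);
                break;
            }
            default: throw std::invalid_argument("unknown escape sequence");
        }
    }
}
```

The `next` lambda is the only place that checks for end of input, so every escape branch can consume characters without repeating the bounds test that the index-based loop needs.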
7eaa216 to f7eee32
30fb4a7 to 16d0323
@FrancoisLaferriere I refactored the unquoting function a bit. Since there is no need for a separate parse_unicode_escape function, I made it an implementation detail and made it partially parse string prefixes.
16d0323 to 265f8cf
* Add `\t`, `\r`, and `\u{<hex>}` escape sequences to lexer and unquote
* Add UTF-8 decoding for `\u{<hex>}` in unquote
* Apply `unquote()` to f-string literals
* Add `\r` to `PrintQuoted` output
* Remove unused `quote` function
* Add tests for escape sequences
265f8cf to 9102445
@rkaminsk I squashed the commits. It should be ready to merge now.
Refs #123
Changes
Added support for `\uXXXX`, `\t`, and `\r` escape sequences in strings.

* `\t`, `\r`, and `\uXXXX` patterns to STRING and FLIT (f-strings)
* `\t`, `\r`, and `\uXXXX` escapes
* `quote()` `\r` escape handling

Questions

1. Should `\uXXXX` be converted to UTF-8 bytes in the internal representation (current behavior), or should it be preserved and printed back as `\uXXXX`? Currently: `"caf\u00E9"` → internal UTF-8 → output `"café"`
2. Currently, escape sequences in f-string literals are NOT processed; they pass through literally: `f"\n"` outputs `"\\n"`. Should we process escape sequences in f-string literals via `unquote()`, or keep the current behavior where they are passed through literally?
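To make the first question concrete, here is a small sketch of the printing side under the current behavior, where the internal representation already holds UTF-8 bytes. The function name `quote_sketch` is hypothetical, and the exact set of characters it escapes is an assumption; only the `\r` handling is mentioned in this PR.

```cpp
#include <string>
#include <string_view>

// Hypothetical printer: write `in` as a double-quoted literal.
// Control characters are escaped; UTF-8 bytes are passed through unchanged.
inline std::string quote_sketch(std::string_view in) {
    std::string out = "\"";
    for (char c : in) {
        switch (c) {
            case '\n': out += "\\n"; break;
            case '\t': out += "\\t"; break;
            case '\r': out += "\\r"; break;
            case '\\': out += "\\\\"; break;
            case '"': out += "\\\""; break;
            default: out += c; break;  // includes multi-byte UTF-8 sequences
        }
    }
    out += '"';
    return out;
}
```

Because non-ASCII bytes are copied verbatim, a string stored as `caf` followed by the bytes `0xC3 0xA9` prints as `"café"` rather than `"caf\u00E9"`. Preserving the original escape would instead require the printer to decode UTF-8 and re-escape it, or the internal representation to keep the escape text.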