
Add Unicode and escape sequence support#624

Merged: rkaminsk merged 1 commit into potassco:wip-20 from FrancoisLaferriere:string-escapes on Apr 20, 2026

Conversation

@FrancoisLaferriere

Refs #123

Changes

Added support for \uXXXX, \t, and \r escape sequences in strings.

  • Lexer: Added \t, \r, and \uXXXX patterns to STRING and FLIT (f-strings)
  • unquote(): Added handling for \t, \r, and \uXXXX escapes
  • Removed the unused quote() function
  • PrintQuoted: Added \r escape handling
  • Tests: Added tests for all new escape sequences

Questions

  • Should \uXXXX be converted to UTF-8 bytes in the internal representation (current behavior), or should it be preserved and printed back as \uXXXX? Currently:

    • "caf\u00E9" → internal UTF-8 → output "café"
  • Escape sequences in f-string literals are currently NOT processed; they pass through literally: f"\n" outputs "\\n".
    Should we process them via unquote(), or keep the current pass-through behavior?

@rkaminsk rkaminsk linked an issue Apr 13, 2026 that may be closed by this pull request
@rkaminsk
Member

Regarding the questions:

  1. Unicode escapes should be mapped to the UTF-8 sequences.
  2. Escapes in f-strings should also be mapped. Maybe I forgot to do this.

@rkaminsk
Member

In the linked issue I suggested terminating the escape with a semicolon, as in \uX;. This was intentional because the 4-digit representation does not cover all Unicode code points, which range up to 10FFFF. Since we are free to come up with our own escape sequences, I would suggest choosing something that simply covers them all. If you don't like the terminating semicolon, I am also fine with another escape for arbitrary-length ones, say \u{X}, where X is a hexadecimal number in the range 0-10FFFF. These days emojis are too important to leave out. 😉
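As an illustration of the \u{X} proposal, the following is a minimal sketch of parsing the braced part and enforcing the 0-10FFFF range. The name `parse_braced_code_point` and the choice of exception types are assumptions made for this example; they are not taken from the PR.

```cpp
#include <charconv>
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <system_error>

// Hypothetical helper (not from the PR): parse the text that follows "\u",
// e.g. "{1F609}", and return the code point it denotes.
inline std::uint32_t parse_braced_code_point(std::string_view s) {
    if (s.size() < 3 || s.front() != '{' || s.back() != '}') {
        throw std::invalid_argument("expected {<hex digits>}");
    }
    auto const *first = s.data() + 1;
    auto const *last = s.data() + s.size() - 1;
    std::uint32_t cp = 0;
    // Base 16 without a "0x" prefix; any number of digits is accepted.
    auto [ptr, ec] = std::from_chars(first, last, cp, 16);
    if (ec != std::errc{} || ptr != last) {
        throw std::invalid_argument("malformed or oversized hex number");
    }
    if (cp > 0x10FFFF) {
        throw std::out_of_range("code point above U+10FFFF");
    }
    return cp;
}
```

With this sketch, "{E9}" yields 0xE9 and "{1F609}" yields the code point of the winking emoji, while "{110000}" is rejected.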

return 0;
}

inline auto encode_utf8(uint32_t cp, auto out) -> void {
Member


Add an assertion for the accepted input range or throw an exception.
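One way the requested range check could look is sketched below. The body is not the PR's implementation: the name `encode_utf8_checked` is hypothetical, and rejecting surrogates (D800-DFFF) in addition to values above 10FFFF is an assumption of this example.

```cpp
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical sketch: write the UTF-8 encoding of `cp` to the output
// iterator `out`, throwing instead of emitting invalid bytes.
template <class Out>
void encode_utf8_checked(std::uint32_t cp, Out out) {
    // Assumption: surrogates are rejected along with values above U+10FFFF.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    auto put = [&out](std::uint32_t byte) { *out++ = static_cast<char>(byte); };
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
        put(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        put(0xC0 | (cp >> 6));
        put(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        put(0xE0 | (cp >> 12));
        put(0x80 | ((cp >> 6) & 0x3F));
        put(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        put(0xF0 | (cp >> 18));
        put(0x80 | ((cp >> 12) & 0x3F));
        put(0x80 | ((cp >> 6) & 0x3F));
        put(0x80 | (cp & 0x3F));
    }
}
```

Called with `std::back_inserter` on a `std::string`, code point E9 (é) produces the two bytes C3 A9, which matches the "caf\u00E9" example from the PR description.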

if (c >= 'A' && c <= 'F') {
return 10 + (c - 'A');
}
if (c >= 'a' && c <= 'f') {
Member


I would typically turn this condition into an assertion (or exception) and simply return, or else throw an exception at the end.
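The "throw at the end" variant of this suggestion might look like the sketch below. The function name `hex_digit_value` is an assumption; the PR's helper is only partially visible in the diff context above.

```cpp
#include <stdexcept>

// Hypothetical helper name: map one hex digit to its numeric value and
// throw on anything else, instead of returning a sentinel such as 0.
inline int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    if (c >= 'A' && c <= 'F') {
        return 10 + (c - 'A');
    }
    if (c >= 'a' && c <= 'f') {
        return 10 + (c - 'a');
    }
    throw std::invalid_argument("not a hexadecimal digit");
}
```

Throwing keeps a digit value of 0 distinguishable from an invalid character, which a fall-through `return 0;` would not.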

Comment thread lib/util/include/clingo/util/string.hh Outdated
void unquote(std::string_view in, auto out, bool fstring = false) {
auto escape = '\0';
for (auto c : in) {
for (size_t i = 0; i < in.size(); ++i) {
Member


Using indices here seems a bit clunky. If I cannot use ranges, I typically use iterators. I think the algorithm could be refactored to use iterators.

Since the function appends to an output stream, we could also use exceptions here and simply throw if parsing the string fails.
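A rough sketch of the iterator-plus-exceptions shape described here follows. It is not the refactored function from the PR: `unquote_simple` is a hypothetical name, the set of basic escapes is an assumption, and \u{...} handling is deliberately left out to keep the example short.

```cpp
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>

// Sketch only: walk the input with iterators and throw on malformed input.
template <class Out>
void unquote_simple(std::string_view in, Out out) {
    for (auto it = in.begin(), end = in.end(); it != end; ++it) {
        if (*it != '\\') {
            *out++ = *it;  // ordinary character: copy through
            continue;
        }
        if (++it == end) {
            throw std::invalid_argument("dangling backslash");
        }
        switch (*it) {  // assumed escape set; \u{...} omitted here
            case 'n': *out++ = '\n'; break;
            case 't': *out++ = '\t'; break;
            case 'r': *out++ = '\r'; break;
            case '\\': *out++ = '\\'; break;
            case '"': *out++ = '"'; break;
            default: throw std::invalid_argument("unknown escape sequence");
        }
    }
}
```

Because the function only appends to `out`, a caller that catches the exception can discard the partially written output; no error code has to be threaded through the loop.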

@FrancoisLaferriere FrancoisLaferriere force-pushed the string-escapes branch 2 times, most recently from 7eaa216 to f7eee32 Compare April 18, 2026 13:10
@rkaminsk
Member

@FrancoisLaferriere I refactored the unquoting function a bit. Since there is no need for a separate parse_unicode_escape function, I made it an implementation detail and made it partially parse string prefixes.

    * Add `\t`, `\r`, and `\u{<hex>}` escape sequences to lexer and unquote
    * Add UTF-8 decoding for `\u{<hex>}` in unquote
    * Apply `unquote()` to f-string literals
    * Add `\r` to `PrintQuoted` output
    * Remove unused `quote` function
    * Add tests for escape sequences
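For the output direction mentioned in the commit message, a quoting step that also escapes carriage returns could be sketched as follows. This is a hypothetical stand-in for `PrintQuoted`: the name `quote_for_output` and the exact set of escaped characters are assumptions, since the thread only states that `\r` was added.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for PrintQuoted: return `s` wrapped in double
// quotes with selected characters escaped (the set is an assumption).
inline std::string quote_for_output(std::string_view s) {
    std::string res = "\"";
    for (char c : s) {
        switch (c) {
            case '\n': res += "\\n"; break;
            case '\r': res += "\\r"; break;
            case '\\': res += "\\\\"; break;
            case '"': res += "\\\""; break;
            default: res += c; break;  // UTF-8 bytes pass through unchanged
        }
    }
    res += '"';
    return res;
}
```

Passing multi-byte UTF-8 through untouched is consistent with the decision earlier in the thread that "caf\u00E9" is printed back as "café" rather than as an escape.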
@FrancoisLaferriere FrancoisLaferriere marked this pull request as ready for review April 20, 2026 10:58
@FrancoisLaferriere
Author

@rkaminsk I squashed the commits. It should be ready to merge now.

@rkaminsk rkaminsk merged commit 7875aa0 into potassco:wip-20 Apr 20, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

Only basic escapes are supported

2 participants