Add Unicode and escape sequence support by FrancoisLaferriere · Pull Request #624 · potassco/clingo

FrancoisLaferriere · 2026-04-10T10:29:25Z

Refs #123

Changes

Added support for \uXXXX, \t, and \r escape sequences in strings.

- Lexer: Added \t, \r, and \uXXXX patterns to STRING and FLIT (f-strings)
- unquote(): Added handling for \t, \r, and \uXXXX escapes
- quote(): Removed (unused)
- PrintQuoted: Added \r escape handling
- Tests: Added tests for all new escape sequences

Questions

1. Should \uXXXX be converted to UTF-8 bytes in the internal representation (current behavior), or should it be preserved and printed back as \uXXXX? Currently:
   - "caf\u00E9" → internal UTF-8 → output "café"
2. Escape sequences in f-string literals are currently not processed; they pass through literally: f"\n" outputs "\\n". Should we process escape sequences in f-string literals via unquote(), or keep the current behavior?

rkaminsk · 2026-04-13T06:36:32Z

Regarding the questions:

1. Unicode escapes should be mapped to their UTF-8 sequences.
2. Escapes in f-strings should also be mapped. Maybe I forgot to do this.

rkaminsk · 2026-04-13T07:04:23Z

In the linked issue I suggested terminating \uX; with a semicolon. This was intentional because the 4-digit representation does not cover all Unicode code points, which range up to 10FFFF. Since we are free to come up with our own escape sequences, I would suggest choosing something that simply covers them all. If you don't like the terminating semicolon, I am also fine with another escape for arbitrary-length ones, say \u{X}, where X is a hexadecimal number in the range 0-10FFFF. These days emojis are too important to leave out. 😉
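
To make the braced form concrete, here is a minimal sketch of parsing the text that follows `\u`, assuming the `\u{X}` variant is chosen. The name `parse_braced_code_point` is hypothetical and not taken from the PR; the sketch only shows that an arbitrary-length hex body can be accepted while still rejecting values above 10FFFF.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string_view>

// Hypothetical helper (not from the PR): parse the text following "\u",
// e.g. "{1F609}", and return the code point it denotes.
// Throws std::invalid_argument on malformed or out-of-range input.
inline auto parse_braced_code_point(std::string_view in) -> std::uint32_t {
    if (in.size() < 3 || in.front() != '{' || in.back() != '}') {
        throw std::invalid_argument("expected {<hex>}");
    }
    std::uint32_t cp = 0;
    for (char c : in.substr(1, in.size() - 2)) {
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<std::uint32_t>(c - '0');
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<std::uint32_t>(10 + (c - 'A'));
        } else if (c >= 'a' && c <= 'f') {
            digit = static_cast<std::uint32_t>(10 + (c - 'a'));
        } else {
            throw std::invalid_argument("invalid hex digit");
        }
        cp = cp * 16 + digit;
        // Checking after every digit also rules out integer overflow.
        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point exceeds 10FFFF");
        }
    }
    return cp;
}
```

With this sketch, `{E9}` and `{1F609}` are both accepted, whereas `{110000}` is rejected.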

rkaminsk · 2026-04-13T07:07:38Z

+    return 0;
+}
+
+inline auto encode_utf8(uint32_t cp, auto out) -> void {


Add an assertion for the accepted input range or throw an exception.
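
A rough sketch of what such a range check could look like, using the exception variant. The name `encode_utf8_checked` is hypothetical, the body is not the PR's implementation, and rejecting surrogates (D800-DFFF) is an extra assumption that the comment above does not ask for.

```cpp
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical variant of encode_utf8 (not the PR's code): append the UTF-8
// encoding of `cp` to the output iterator `out`, throwing on invalid input.
template <class Out>
void encode_utf8_checked(std::uint32_t cp, Out out) {
    // Assumption: surrogates are rejected in addition to values > 10FFFF.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("code point out of range");
    }
    if (cp < 0x80) {  // 1 byte: 0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {  // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Called with `std::back_inserter` on a `std::string`, U+00E9 yields the two bytes C3 A9 and U+1F609 yields the four bytes F0 9F 98 89.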

rkaminsk · 2026-04-13T07:08:33Z

+    if (c >= 'A' && c <= 'F') {
+        return 10 + (c - 'A');
+    }
+    if (c >= 'a' && c <= 'f') {


I would typically turn the last condition into an assertion and simply return the value, or keep the conditions and throw an exception at the end.
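
For illustration, the "throw at the end" shape might look like the following. The name `hex_value` and the return type are assumptions; the snippet above shows only part of the PR's function.

```cpp
#include <stdexcept>

// Hypothetical helper (name and return type assumed): map a hexadecimal
// digit to its numeric value and throw instead of returning a sentinel.
inline auto hex_value(char c) -> unsigned {
    if (c >= '0' && c <= '9') {
        return static_cast<unsigned>(c - '0');
    }
    if (c >= 'A' && c <= 'F') {
        return static_cast<unsigned>(10 + (c - 'A'));
    }
    if (c >= 'a' && c <= 'f') {
        return static_cast<unsigned>(10 + (c - 'a'));
    }
    // No fallthrough return value: invalid input is reported to the caller.
    throw std::invalid_argument("not a hexadecimal digit");
}
```

Callers then no longer need to distinguish a valid digit value from an error code.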

rkaminsk · 2026-04-13T07:15:06Z

 void unquote(std::string_view in, auto out, bool fstring = false) {
    auto escape = '\0';
-    for (auto c : in) {
+    for (size_t i = 0; i < in.size(); ++i) {


Using indices here seems a bit clunky. If I cannot use ranges, I typically use iterators. I think the algorithm could be refactored to use iterators.

Since the function appends to an output stream, we could also use exceptions here and simply throw if parsing the string fails.
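
A minimal sketch of the iterator-plus-exceptions shape suggested here. It is not the refactoring that was eventually merged: the name `unquote_sketch` is hypothetical, the set of simple escapes is assumed, and `\u` handling and the f-string flag are left out for brevity.

```cpp
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical sketch (not the merged code): replace escape sequences in
// `in` and write the result to the output iterator `out`.
// Throws std::invalid_argument if the input cannot be parsed.
template <class Out>
void unquote_sketch(std::string_view in, Out out) {
    for (auto it = in.begin(), ie = in.end(); it != ie; ++it) {
        if (*it != '\\') {
            *out++ = *it;  // ordinary character: copy through
            continue;
        }
        if (++it == ie) {  // a backslash must be followed by something
            throw std::invalid_argument("dangling backslash");
        }
        switch (*it) {
            case 'n': *out++ = '\n'; break;
            case 't': *out++ = '\t'; break;
            case 'r': *out++ = '\r'; break;
            case '\\': *out++ = '\\'; break;
            case '"': *out++ = '"'; break;
            default: throw std::invalid_argument("unknown escape sequence");
        }
    }
}
```

Advancing the iterator inside the loop body removes the need for a separate `escape` state variable, and a multi-character escape such as `\u{X}` could consume its digits the same way.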

rkaminsk · 2026-04-20T10:00:37Z

@FrancoisLaferriere I refactored the unquoting function a bit. Since there is no need for a separate parse_unicode_escape function, I made it an implementation detail and made it partially parse string prefixes.

* Add `\t`, `\r`, and `\u{<hex>}` escape sequences to lexer and unquote
* Add UTF-8 decoding for `\u{<hex>}` in unquote
* Apply `unquote()` to f-string literals
* Add `\r` to `PrintQuoted` output
* Remove unused `quote` function
* Add tests for escape sequences
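
As a rough illustration of the output side, quoting with `\r` handled could look like the sketch below. The function `print_quoted_sketch` is hypothetical and only assumes which characters `PrintQuoted` escapes; the PR states only that `\r` was added. Non-ASCII bytes pass through unchanged, so UTF-8 text such as "café" is printed as is.

```cpp
#include <ostream>
#include <sstream>
#include <string_view>

// Hypothetical sketch (the escaped character set is an assumption): print
// `str` in double quotes, escaping characters that would otherwise break
// the string literal or be hard to see in the output.
inline void print_quoted_sketch(std::ostream &out, std::string_view str) {
    out << '"';
    for (char c : str) {
        switch (c) {
            case '\n': out << "\\n"; break;
            case '\r': out << "\\r"; break;
            case '\\': out << "\\\\"; break;
            case '"': out << "\\\""; break;
            default: out << c; break;  // includes UTF-8 continuation bytes
        }
    }
    out << '"';
}
```

Without the `\r` case, a carriage return would be written raw and could overwrite the start of the line when shown in a terminal.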

FrancoisLaferriere · 2026-04-20T11:44:27Z

@rkaminsk I squashed the commits. It should be ready to merge now.

rkaminsk linked an issue Apr 13, 2026 that may be closed by this pull request

Only basic escapes are supported #608

Closed

rkaminsk requested changes Apr 13, 2026

FrancoisLaferriere force-pushed the string-escapes branch 2 times, most recently from 7eaa216 to f7eee32 Compare April 18, 2026 13:10

rkaminsk force-pushed the string-escapes branch from 30fb4a7 to 16d0323 Compare April 20, 2026 09:58

rkaminsk force-pushed the string-escapes branch from 16d0323 to 265f8cf Compare April 20, 2026 10:43

FrancoisLaferriere force-pushed the string-escapes branch from 265f8cf to 9102445 Compare April 20, 2026 10:55

FrancoisLaferriere marked this pull request as ready for review April 20, 2026 10:58

rkaminsk merged commit 7875aa0 into potassco:wip-20 Apr 20, 2026
4 checks passed

rkaminsk merged 1 commit into potassco:wip-20 from FrancoisLaferriere:string-escapes
