Fix handling of Unicode characters in String theory #431

daniel-raffler · 2025-01-16T16:34:27Z

Hello,
this MR patches several minor issues remaining after #422:

- When escaping a String, we now also escape \ so that the backslash is preserved and is not captured later when unescaping again.
- We now always apply unescape first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.
- Added two more tests and fixed a bug in another.
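To illustrate why the backslash itself has to be escaped, here is a minimal, standalone sketch of an escape/unescape pair. The class `EscapeRoundTripSketch` and its methods are hypothetical and written only for this illustration; they are not the JavaSMT implementation, and the 1 to 5 hex digit limit is an assumption based on the `\u{...}` form mentioned in this PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration only; not the JavaSMT code. */
final class EscapeRoundTripSketch {
  // Matches the braced escape form with 1 to 5 hex digits (assumed limit).
  private static final Pattern ESCAPE = Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  /** Writes backslashes and everything outside printable ASCII as braced escapes. */
  static String escape(String input) {
    StringBuilder out = new StringBuilder();
    input.codePoints().forEach(cp -> {
      if (cp == '\\' || cp < 0x20 || cp > 0x7E) {
        out.append("\\u{").append(Integer.toHexString(cp)).append('}');
      } else {
        out.appendCodePoint(cp);
      }
    });
    return out.toString();
  }

  /** Resolves every braced escape sequence back into its code point. */
  static String unescape(String input) {
    Matcher m = ESCAPE.matcher(input);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(input, last, m.start());
      out.appendCodePoint(Integer.parseInt(m.group(1), 16));
      last = m.end();
    }
    return out.append(input, last, input.length()).toString();
  }
}
```

With this sketch, a Java String that literally contains the six characters `\u{41}` is escaped to `\u{5c}u{41}` and unescapes back to the same six characters. If the backslash were left alone, the escaped form would be `\u{41}`, and unescaping would turn it into `A`.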

As mentioned in #412, I think that makeString should not be translating SMTLIB escape sequences at all. However, this is an API-breaking change and can still be considered some other time.

kfriedberger

I am undecided whether our String escaping is useful.

src/org/sosy_lab/java_smt/solvers/cvc4/CVC4StringFormulaManager.java

src/org/sosy_lab/java_smt/solvers/cvc5/CVC5StringFormulaManager.java

src/org/sosy_lab/java_smt/test/StringFormulaManagerTest.java

…constant. This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.

Mixing Unicode characters with SMTLIB escape sequences leads to confusing corner cases and has therefore been removed from StringFormulaManager.makeString. Users can call unescapeUnicodeForSmtlib() from AbstractStringFormulaManager to convert a String with escape sequences before handing it off to makeString(). In the other direction, escapeUnicodeForSmtlib() can be used to convert (Unicode) Strings from the model into an escaped format that is compatible with other SMTLIB-based solvers.

baierd · 2025-07-21T15:03:09Z

@daniel-raffler what's the current state of this PR?

daniel-raffler · 2025-07-23T09:18:18Z

> @daniel-raffler what's the current state of this PR?

This is an API-breaking change and I think we still have to decide how to handle it.

The issue is that the format for the String argument in StringFormulaManager.makeString has never really been well defined. In the SMTLIB standard the Strings are Unicode, but a special encoding has to be used where all non-ASCII characters are written as \uXXXX escape sequences. On the other hand, the argument for makeString is a Java String, so one would expect Unicode characters to be allowed. It's possible to translate between the two representations, but we have to know what "kind" of String we have.

Before #422, which wasn't merged until after the last stable release, we didn't really say which characters are allowed in the String. Some of the tests expect SMTLIB escape sequences to be recognized, but this was never officially documented. Currently both Unicode characters and SMTLIB escape sequences are allowed, but this can lead to some confusing situations for developers when a valid escape sequence is created "by accident".
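A small sketch of such an accidental escape sequence, assuming a reader that resolves anything of the form `\u{...}`. The class `AccidentalEscapeSketch` and the method `lenientRead` are hypothetical names used only for this illustration and do not exist in JavaSMT.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration only; not the JavaSMT code. */
final class AccidentalEscapeSketch {
  // Braced escape form with 1 to 5 hex digits (assumed limit).
  private static final Pattern ESCAPE = Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  /** Resolves anything that looks like an escape sequence, wanted or not. */
  static String lenientRead(String input) {
    Matcher m = ESCAPE.matcher(input);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(input, last, m.start());
      out.appendCodePoint(Integer.parseInt(m.group(1), 16));
      last = m.end();
    }
    return out.append(input, last, input.length()).toString();
  }

  public static void main(String[] args) {
    // The user means these six characters literally: backslash, u, {, 4, 1, }
    String userInput = "\\u{41}";
    System.out.println(userInput.length());              // 6
    System.out.println(lenientRead(userInput).length()); // 1, the text became "A"
  }
}
```

The caller has no way to say that the six characters were meant literally, which is the ambiguity described above.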

The PR tries to sort this out by committing to Java Strings and removing support for SMTLIB escape sequences. It's still possible to handle these sequences, but developers have to do the conversion themselves by calling unescapeUnicodeForSmtlib and escapeUnicodeForSmtlib from AbstractStringFormulaManager. This shouldn't be too much of a problem, as the extra step is only needed when reading SMTLIB from an external source or when storing the model in SMTLIB format. Most developers should therefore never have to worry about it.

kfriedberger

lgtm. Thanks for preparing the PR.

daniel-raffler linked an issue Jan 16, 2025 that may be closed by this pull request

Inconsistent handling of Unicode characters in String theory #412

Closed

kfriedberger reviewed Jan 16, 2025

daniel-raffler force-pushed the 412-inconsistent-handling-of-unicode-characters-in-string-theory branch from dd9eedd to 5a21073 Compare February 22, 2025 13:15

daniel-raffler added 13 commits February 27, 2025 17:37

Strings: Added more tests for Unicode escaping

825d5f7

Strings: Patch a bug in testConstStringReplaceAll

0d2be26

The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave differently when the "matching" String is empty, and we need a special case for that.
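The difference can be sketched as follows, assuming the Java side behaves like `java.lang.String.replace` and relying on the SMT-LIB definition that `str.replace_all` returns its first argument unchanged when the pattern is empty. The helper `smtReplaceAll` is a hypothetical name for this illustration, not code from the test.

```java
/** Hypothetical sketch for illustration only; not the code from the test. */
final class ReplaceAllSketch {
  /** Mirrors str.replace_all: an empty pattern leaves the string unchanged. */
  static String smtReplaceAll(String s, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      return s; // special case: Java would insert the replacement at every position
    }
    // Literal (non-regex) replacement of all occurrences, left to right.
    return s.replace(pattern, replacement);
  }

  public static void main(String[] args) {
    System.out.println("ab".replace("", "x"));        // xaxbx
    System.out.println(smtReplaceAll("ab", "", "x")); // ab
  }
}
```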

Strings: Fix pattern for \u{...} escape sequences. The final '}' in t…

0f31415

…he pattern should also be escaped.

Strings: Escape backslashes when creating String literals.

0b8d1a3

This is needed to protect the backslash from substitution later when getting the results from the model.

Strings: Add a separate case for the escape sequence "\u{5c}" in unes…

6738564

…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
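For context, `Matcher.appendReplacement()` interprets `\` and `$` in the replacement string, so a decoded backslash cannot be passed through as-is. The sketch below shows one way to avoid that with `Matcher.quoteReplacement()`. Note that this is a different technique from the separate case for `\u{5c}` that the commit describes, and the class `AppendReplacementSketch` is a hypothetical name for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration only; not the JavaSMT code. */
final class AppendReplacementSketch {
  // Braced escape form with 1 to 5 hex digits (assumed limit).
  private static final Pattern ESCAPE = Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  static String unescape(String input) {
    Matcher m = ESCAPE.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String decoded = new String(Character.toChars(Integer.parseInt(m.group(1), 16)));
      // Without quoteReplacement, a decoded backslash or dollar sign would be
      // read as replacement syntax and appendReplacement would throw.
      m.appendReplacement(out, Matcher.quoteReplacement(decoded));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

Calling `unescape` on the text `a\u{5c}b` yields `a\b`, and `\u{24}1` yields `$1` instead of being treated as a group reference.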

Strings: Add tests for Unicode characters that are not in the BMP

c5e81ac

Strings: Disable non-BMP test for CVC5 due to a bug in the solver

2265152

Strings: Clean up the documentation

9cc096b

Strings in SMTLIB may contain Unicode characters from planes 0-2
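Planes 0-2 correspond to the code points 0x00000 to 0x2FFFF, so a Java String can contain characters that an SMTLIB String cannot. A minimal check could look like the sketch below; `SmtlibCodePointSketch` and `isRepresentable` are hypothetical names for this illustration and are not part of JavaSMT.

```java
/** Hypothetical sketch for illustration only; not the JavaSMT code. */
final class SmtlibCodePointSketch {
  // Last code point of plane 2.
  static final int MAX_CODE_POINT = 0x2FFFF;

  /** Returns true if every code point of the String lies in planes 0-2. */
  static boolean isRepresentable(String s) {
    // codePoints() combines surrogate pairs, so non-BMP characters count once.
    return s.codePoints().allMatch(cp -> cp <= MAX_CODE_POINT);
  }
}
```

For example, U+1F600 (plane 1, outside the BMP) passes the check, while U+30000 (plane 3) does not.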

Strings: Fix evaluation of Unicode String formulas in ModelEvaluation…

9aaefaf

…Test

Strings: Fix format of the hex constant when printing codepoints for …

cabdad2

…Unicode escape sequences

Strings: Update JavaDoc for StringFormulaManager.makeString

795b316

daniel-raffler force-pushed the 412-inconsistent-handling-of-unicode-characters-in-string-theory branch from de93d71 to 795b316 Compare February 27, 2025 16:44

Merge remote-tracking branch 'origin/master' into 412-inconsistent-ha…

7612625

…ndling-of-unicode-characters-in-string-theory

StringFormulaManager: improve tests

2f33842

kfriedberger approved these changes Jul 26, 2025

kfriedberger merged commit c8e5576 into master Jul 26, 2025
2 of 3 checks passed
