
Fix handling of Unicode characters in String theory #431

daniel-raffler
Contributor

@daniel-raffler daniel-raffler commented Jan 16, 2025

Hello,
this MR patches several minor issues remaining after #422:

  • When escaping a String we now also escape \ to make sure the backslash is preserved and doesn't get captured later when unescaping again
  • We now always apply unescape first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we'll have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.
  • Added two more tests and fixed a bug in another
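The first two points can be illustrated with a minimal sketch. It is not the code from this PR: the class name, the method names and the regular expression are assumptions made for the example, and it only covers the `\uXXXX` and `\u{X}` to `\u{XXXXX}` forms. The idea is that `escape` also rewrites the backslash itself, so a user String that merely looks like an escape sequence survives a round trip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the JavaSMT implementation.
final class SmtlibEscapeSketch {
  // Matches backslash-u followed by either {1..5 hex digits} or exactly 4 hex digits.
  private static final Pattern ESCAPE =
      Pattern.compile("\\\\u(?:\\{([0-9a-fA-F]{1,5})\\}|([0-9a-fA-F]{4}))");

  // Escapes everything outside printable ASCII, and the backslash itself.
  static String escape(String input) {
    StringBuilder out = new StringBuilder();
    for (int cp : input.codePoints().toArray()) {
      if (cp == '\\' || cp < 0x20 || cp > 0x7E) {
        out.append("\\u{").append(Integer.toHexString(cp)).append('}');
      } else {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  // Resolves escape sequences; anything that does not match is copied verbatim.
  static String unescape(String input) {
    Matcher m = ESCAPE.matcher(input);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(input, last, m.start());
      String hex = m.group(1) != null ? m.group(1) : m.group(2);
      out.appendCodePoint(Integer.parseInt(hex, 16));
      last = m.end();
    }
    out.append(input, last, input.length());
    return out.toString();
  }
}
```

With this sketch the six-character String `\u{41}` is escaped to `\u{5c}u{41}` and unescapes back to itself, whereas unescaping it directly would yield `A`.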

As mentioned in #412, I think that makeString should not be translating SMTLIB escape sequences at all. However, this is an API-breaking change and can still be considered some other time.

@daniel-raffler daniel-raffler linked an issue Jan 16, 2025 that may be closed by this pull request
Member

@kfriedberger kfriedberger left a comment

I am undecided whether our String escaping is useful.

@daniel-raffler daniel-raffler force-pushed the 412-inconsistent-handling-of-unicode-characters-in-string-theory branch from dd9eedd to 5a21073 Compare February 22, 2025 13:15
The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave differently when the "matching" String is empty, and we need a special case for that.
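A small sketch of that special case, assuming the Java side uses the literal `String.replace` (the helper name is made up for this example). In the SMTLIB Strings theory, str.replace_all returns the input unchanged when the pattern is empty, while Java inserts the replacement at every position.

```java
// Hypothetical reference implementation for comparing against str.replace_all.
final class ReplaceAllSketch {
  static String smtlibReplaceAll(String s, String target, String replacement) {
    if (target.isEmpty()) {
      // SMTLIB: an empty pattern leaves the String untouched.
      return s;
    }
    // Java replaces all literal, non-overlapping occurrences from left to right.
    return s.replace(target, replacement);
  }
}
```

For example, `"abc".replace("", "x")` gives `xaxbxcx` in Java, but the helper returns `abc`.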
This is needed to protect the backslash from substitution later when getting the results from the model.
…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
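The pitfall with Matcher.appendReplacement() can be reproduced with a short sketch. The class, the pattern (braced form only) and the `quote` flag are assumptions for illustration and not the code from this commit. In a replacement String a backslash escapes the next character and `$` starts a group reference, so a decoded backslash has to go through Matcher.quoteReplacement() first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the appendReplacement() pitfall.
final class AppendReplacementSketch {
  // Matches backslash-u followed by {1..5 hex digits}.
  private static final Pattern BRACED = Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  static String unescape(String input, boolean quote) {
    Matcher m = BRACED.matcher(input);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      String decoded = new String(Character.toChars(Integer.parseInt(m.group(1), 16)));
      // Without quoting, a decoded backslash or dollar sign is interpreted
      // by appendReplacement() instead of being copied literally.
      m.appendReplacement(out, quote ? Matcher.quoteReplacement(decoded) : decoded);
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

Calling `unescape("a\\u{5c}b", true)` returns `a\b`, while the unquoted variant throws an IllegalArgumentException for the same input because the replacement ends in a lone backslash.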
…constant.

This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.
Strings in SMTLIB may contain Unicode characters from planes 0-2
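A possible check for this restriction, as a sketch only (class and method names are invented here). Planes 0-2 cover the code points 0x0 to 0x2FFFF, so a Java String is representable if none of its code points lies above that bound.

```java
// Hypothetical validation helper, not part of JavaSMT.
final class PlaneCheckSketch {
  // Highest code point of plane 2.
  static final int MAX_SMTLIB_CODE_POINT = 0x2FFFF;

  // Note: unpaired surrogates are not rejected by this simple check.
  static boolean isRepresentable(String s) {
    return s.codePoints().allMatch(cp -> cp <= MAX_SMTLIB_CODE_POINT);
  }
}
```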
Mixing Unicode characters with SMTLIB escape sequences leads to confusing corner cases and has therefore been removed from StringFormulaManager.makeString. Users can call unescapeUnicodeForSmtlib() from AbstractStringFormulaManager to convert a String with escape sequences before handing it off to makeString(). In the other direction, escapeUnicodeForSmtlib() can be used to convert (Unicode) Strings from the model into an escaped format that is compatible with other SMTLIB based solvers.
@daniel-raffler daniel-raffler force-pushed the 412-inconsistent-handling-of-unicode-characters-in-string-theory branch from de93d71 to 795b316 Compare February 27, 2025 16:44
@baierd
Contributor

baierd commented Jul 21, 2025

@daniel-raffler what's the current state of this PR?

…ndling-of-unicode-characters-in-string-theory
@daniel-raffler
Contributor Author

> @daniel-raffler what's the current state of this PR?

This is an API-breaking change and I think we still have to decide how to handle it.

The issue is that the format for the String argument in StringFormulaManager.makeString has never really been well defined. In the SMTLIB standard the Strings are Unicode, but a special encoding has to be used where all non-ASCII characters are written as \uXXXX escape sequences. On the other hand, the argument for makeString is a Java String, so one would expect Unicode characters to be allowed. It's possible to translate between the two representations, but we have to know what "kind" of String we have.

Before #422, which wasn't merged until after the last stable release, we didn't really say which characters are allowed in the String. Some of the tests expect SMTLIB escape sequences to be recognized, but this was never officially documented. Currently both Unicode characters and SMTLIB escape sequences are allowed, but this can lead to some confusing situations for developers when a valid escape sequence is created "by accident".
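To make the "by accident" case concrete, here is a small sketch. The decoder below is a stand-in written for this example (four-digit form only), not the JavaSMT code. A Windows-style path with a directory named `u2024` happens to contain the valid sequence `\u2024`, so a mixed-mode makeString would silently turn six characters into one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder that treats every backslash-u plus 4 hex digits as an escape.
final class AccidentalEscapeSketch {
  private static final Pattern FOUR_DIGIT = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

  static String decode(String input) {
    // The Function-based replaceAll overload requires Java 9 or later.
    return FOUR_DIGIT
        .matcher(input)
        .replaceAll(
            mr ->
                Matcher.quoteReplacement(
                    String.valueOf((char) Integer.parseInt(mr.group(1), 16))));
  }
}
```

Here `C:\u2024\report` is decoded into a shorter String containing the code point 0x2024, while `C:\users` stays unchanged because `sers` is not a hex number.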

The PR tries to sort this out by committing to Java Strings and removing support for SMTLIB escape sequences. It's still possible to handle these sequences, but developers have to do the conversion themselves by calling unescapeUnicodeForSmtlib and escapeUnicodeForSmtlib from AbstractStringFormulaManager. This shouldn't be too much of a problem, as the extra step is only needed when reading SMTLIB from an external source or when trying to store the model in SMTLIB format. Most developers should therefore never have to worry about it.

Member

@kfriedberger kfriedberger left a comment

lgtm. Thanks for preparing the PR.

@kfriedberger kfriedberger merged commit c8e5576 into master Jul 26, 2025
2 of 3 checks passed
Development

Successfully merging this pull request may close these issues.

Inconsistent handling of Unicode characters in String theory
4 participants