Fix handling of Unicode characters in String theory #431
I am undecided whether our String escaping is useful.
src/org/sosy_lab/java_smt/solvers/cvc4/CVC4StringFormulaManager.java
src/org/sosy_lab/java_smt/solvers/cvc5/CVC5StringFormulaManager.java
dd9eedd to 5a21073
The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave differently when the "matching" String is empty, and we need a special case for that.
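The difference for an empty pattern can be reproduced in plain Java. The sketch below is only an illustration: the class `ReplaceAllSketch` and the helper `smtReplaceAll` are hypothetical names, `String.replace` is used as a stand-in for the Java side of the test, and the SMTLIB behaviour (an empty pattern leaves the input unchanged) is stated here as an assumption about `str.replace_all`.

```java
final class ReplaceAllSketch {
  /**
   * Hypothetical helper that follows the assumed str.replace_all semantics:
   * an empty pattern returns the input unchanged.
   */
  static String smtReplaceAll(String input, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      // Special case: Java would insert the replacement at every position.
      return input;
    }
    // Literal (non-regex) replacement of all occurrences.
    return input.replace(pattern, replacement);
  }

  public static void main(String[] args) {
    System.out.println("abc".replace("", "-"));           // -a-b-c-
    System.out.println(smtReplaceAll("abc", "", "-"));    // abc
    System.out.println(smtReplaceAll("abab", "ab", "x")); // xx
  }
}
```

Without the guard, the Java result for an empty pattern would contain the replacement before, between, and after all characters, so a direct comparison with the solver's result would fail.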
…he pattern should also be escaped.
This is needed to protect the backslash from substitution later when getting the results from the model.
…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
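To illustrate why the backslash matters here: `Matcher.appendReplacement()` interprets `\` and `$` in the replacement string, so a decoded character that happens to be one of them has to be protected. The following is a minimal sketch, not the code from this PR; the class name, the method `unescapeBraced`, and the restriction to the braced `\u{...}` form are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppendReplacementSketch {
  // Matches the braced escape form \u{d...} with one to five hex digits.
  private static final Pattern BRACED =
      Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  /** Hypothetical helper: decodes braced escape sequences into characters. */
  static String unescapeBraced(String input) {
    Matcher m = BRACED.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int codePoint = Integer.parseInt(m.group(1), 16);
      String decoded = new String(Character.toChars(codePoint));
      // quoteReplacement() keeps a decoded '\' or '$' from being treated
      // as an escape character or a group reference.
      m.appendReplacement(out, Matcher.quoteReplacement(decoded));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

With this guard, an input such as `a\u{5c}b` decodes to the three characters `a`, `\`, `b`. Passing the decoded backslash to `appendReplacement()` unquoted would not produce that result.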
…constant. This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.
Strings in SMTLIB may contain Unicode characters from planes 0-2
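Planes 0-2 correspond to the code points 0x0 to 0x2FFFF. A check along these lines could be used to see whether a Java String stays within that range; the class and method names are hypothetical and not part of JavaSMT.

```java
final class SmtlibCodePointRange {
  // Highest code point in plane 2.
  static final int MAX_SMTLIB_CODE_POINT = 0x2FFFF;

  /** Returns true if every code point of the String lies in planes 0-2. */
  static boolean isRepresentable(String s) {
    return s.codePoints().allMatch(cp -> cp <= MAX_SMTLIB_CODE_POINT);
  }
}
```

For example, an emoji such as U+1F600 (plane 1) passes the check, while a character from plane 3 or above, such as U+30000, does not.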
Mixing Unicode characters with SMTLIB escape sequences leads to confusing corner cases and has therefore been removed from StringFormulaManager.makeString. Users can call unescapeUnicodeForSmtlib() from AbstractStringFormulaManager to convert a String with escape sequences before handing it off to makeString(). In the other direction, escapeUnicodeForSmtlib() can be used to convert (Unicode) Strings from the model into an escaped format that is compatible with other SMTLIB-based solvers.
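The escaping direction can be pictured roughly as follows. This is a sketch under assumptions, not the implementation of escapeUnicodeForSmtlib(): the class `SmtlibEscapeSketch` and its method `escape` are hypothetical, the sketch always emits the braced `\u{...}` form, and it treats printable ASCII other than the backslash as safe to pass through.

```java
final class SmtlibEscapeSketch {
  /** Hypothetical helper: turns a Java String into an escaped ASCII form. */
  static String escape(String javaString) {
    StringBuilder out = new StringBuilder();
    javaString.codePoints().forEach(cp -> {
      if (cp >= 0x20 && cp <= 0x7E && cp != '\\') {
        // Printable ASCII (except the backslash) is copied unchanged.
        out.appendCodePoint(cp);
      } else {
        // Everything else, including the backslash itself, is written
        // as a braced escape sequence with lowercase hex digits.
        out.append("\\u{").append(Integer.toHexString(cp)).append('}');
      }
    });
    return out.toString();
  }
}
```

Escaping the backslash is what keeps a Java String that literally contains the characters `\u{41}` from being read as `A` later on: in this sketch it becomes `\u{5c}u{41}`.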
…Unicode escape sequences
de93d71 to 795b316
@daniel-raffler what's the current state of this PR?
…ndling-of-unicode-characters-in-string-theory
This is an API breaking change and I think we still have to decide how to handle it.
The issue is that the format for the String argument in … Before #422, which wasn't merged until after the last stable release, we didn't really say which characters are allowed in the String. Some of the tests expect SMTLIB escape sequences to be recognized, but this was never officially documented. Currently both Unicode characters and SMTLIB escape sequences are allowed, but this can lead to some confusing situations for developers when a valid escape sequence is created "by accident".
The PR tries to sort this out by committing to Java Strings and removing support for SMTLIB escape sequences. It's still possible to handle these sequences, but developers have to do the conversion themselves by calling …
lgtm. Thanks for preparing the PR.
Hello,
this MR patches several minor issues remaining after #422:
- \ to make sure the backslash is preserved and doesn't get captured later when unescaping again
- unescape first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we'll have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.
As mentioned in #412 I think that makeString should not be translating SMTLIB escape sequences at all. However, this is an API breaking change and can still be considered some other time.