Add support for combining characters #4
In Unicode, a combining character is a character which can be stacked on top of the character preceding it. For example:

- The character LATIN SMALL LETTER U (`u`) has the codepoint U+0075 assigned.
- The character COMBINING DIAERESIS, which looks similar to the symbol `¨` but is actually a combining character, has the codepoint U+0308 assigned.
- Writing both characters in sequence yields the letter `ü` (U+0075 U+0308), which looks just like the precomposed `ü` (U+00FC) but is actually *two* characters.

On a Mac, such combinations are very common, especially in filenames, because the HFS+ filesystem stores filenames in decomposed form.

In the `applescript-json` library however, the `encodeString` function always fails when the input contains a combining character.

In detail, the code assumes that inside the `repeat with ch` loop, `ch` will always be a single character, and that its `id` property will always return an integer. In reality, `ch` contains more than one character whenever a combining character is involved, so the `id` property returns a list instead of an integer. The code is not prepared to handle a list, which triggers the error.

This commit adds a simple test case for the “u followed by ̈” scenario described above. It also includes the expected JSON output, which would be `u\u0308`.
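To make the example above concrete, here is a small sketch in Python (not AppleScript, and not part of the library; the variable names are made up for illustration). It only shows that the decomposed and precomposed forms are different strings, and what the expected JSON escape looks like.

```python
import json
import unicodedata

# Illustrative names only; nothing here comes from applescript-json.
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS
precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS

# Both render as the same glyph, but they are different strings.
print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False

# NFC normalization composes the pair into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# The codepoints of the decomposed form, one integer per codepoint.
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0075', 'U+0308']

# A non-zero combining class marks U+0308 as a combining character.
print(unicodedata.combining("\u0308") != 0)     # True

# Python's own JSON encoder escapes the combining mark the same way
# as the expected output in the test case.
print(json.dumps(decomposed))                   # "u\u0308"
```

Python iterates strings by codepoint, so it does not reproduce the AppleScript failure itself; the sketch is only meant to show why one visible letter can carry two codepoints.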
This commit fixes the bug described in the previous commit. The trick is to fetch the `id` property on the entire input string _before_ we iterate over it. That way, `id` returns a simple list of integer codepoints for the entire string, which can be iterated safely.
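The shape of that fix can be sketched in Python as a rough analogue. This is not the library's AppleScript code: `code_points` and `encode_string` are hypothetical names, and the escaping rules are an assumption chosen to keep the example short. The point is only that the codepoints are fetched once for the whole string, so every loop item is guaranteed to be a single integer.

```python
import json


def code_points(text):
    """Rough analogue of reading `id` on a whole string: one int per codepoint."""
    return [ord(c) for c in text]


def encode_string(value):
    """Hypothetical encoder: escapes everything outside printable ASCII."""
    out = ['"']
    # Fetch all codepoints up front, then iterate over plain integers.
    for cp in code_points(value):
        if cp in (0x22, 0x5C):            # double quote, backslash
            out.append("\\" + chr(cp))
        elif 0x20 <= cp < 0x7F:           # printable ASCII passes through
            out.append(chr(cp))
        elif cp > 0xFFFF:                 # JSON needs a UTF-16 surrogate pair
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
        else:                             # everything else as \uXXXX
            out.append(f"\\u{cp:04x}")
    out.append('"')
    return "".join(out)


# "u" followed by COMBINING DIAERESIS, as in the test case.
print(encode_string("u\u0308"))                        # "u\u0308"
# The result parses back to the original two-codepoint string.
print(json.loads(encode_string("u\u0308")) == "u\u0308")  # True
```

Because the loop never sees a multi-codepoint cluster, a combining character is just another integer in the list and gets its own `\u` escape.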
Oh, and: apologies for overexplaining.
Wow. Thanks!
No need! I had no idea how applescript deals with multiple codepoints for a character. And I'm actually surprised anybody is using this library, let alone is willing to send a PR :)
Not only do I use it, I feel it’s the best solution out there to make AppleScript write machine-readable things to standard output. Right now, I’m finishing up an Alfred 3 workflow, which inexplicably failed the other night despite months of testing. Turns out one of my Terminal.app tabs was showing a directory name in decomposed form just as my AppleScript inspected that tab. That one letter then went on to crash your library. Thank you for making this!
Ah, cool, glad to hear that! I made it a while ago to export playlists from iTunes. The idea was to get the data out of applescript-land as quickly as possible and use a sane language to work with it. :)
@xilopaint Thanks but no, I can’t implement this right now.
No problem! In fact, I didn't ask you to implement the feature; I just thought you might know a workaround. Now I see that someone seems to have worked on the feature in #1.
In Unicode, a combining character is a character which can be stacked on top of the character preceding it.

Example

- The character LATIN SMALL LETTER U (`u`) has the codepoint U+0075 assigned.
- The character COMBINING DIAERESIS, which looks similar to the symbol `¨` but is actually a combining character, has the codepoint U+0308 assigned.
- Writing both characters in sequence yields the letter `ü`, which looks just like `ü` but is actually two characters.

Impact

On a Mac, such combinations are very common, especially in filenames due to an oddity in the HFS+ filesystem.

The issue

In the `applescript-json` library however, the `encodeString` function always fails when the input contains a combining character.

In detail, the code assumes that inside the `repeat with ch` loop, `ch` will always be one single character, and its `id` property will always return an integer. However, in reality `ch` will contain more than one character if a combining character is involved. Because of that, the `id` property will return a list instead of an integer. The code is not prepared to handle the list, which triggers the error.

The fix

This PR fixes the issue by fetching `id` before doing the iteration. This yields a simple list of all codepoints in the entire input string, which can be iterated safely. I have also added a simple test case for the “u followed by ̈ ” scenario described above (including the expected JSON output, which would be `u\u0308`).