Add support for combining characters #4

Merged 2 commits on Sep 8, 2017

Contributor

claui commented Sep 8, 2017

In Unicode, a combining character is a character which can be stacked on top of the character preceding it. For example:

Example

  • The character LATIN SMALL LETTER U (u) has the codepoint U+0075 assigned.
  • The character COMBINING DIAERESIS, which looks similar to the symbol ¨ but is actually a combining character, has the codepoint U+0308 assigned.
  • Writing both characters in sequence yields the letter ü, which looks just like ü but is actually two characters.
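
The example above can be reproduced outside AppleScript. The following is a minimal Python sketch, not code from applescript-json; the helper name `code_points` is made up for illustration.

```python
import unicodedata

base = "\u0075"              # LATIN SMALL LETTER U
mark = "\u0308"              # COMBINING DIAERESIS
decomposed = base + mark     # renders like ü, but is two codepoints
precomposed = "\u00fc"       # LATIN SMALL LETTER U WITH DIAERESIS, one codepoint


def code_points(text):
    """Return the codepoints of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in text]


print(code_points(decomposed))    # ['U+0075', 'U+0308']
print(code_points(precomposed))   # ['U+00FC']
print(unicodedata.name(mark))     # COMBINING DIAERESIS
```

Both strings usually render identically, which is why the difference tends to go unnoticed until code inspects individual codepoints.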

Impact

On a Mac, such combinations are very common, especially in filenames, because the HFS+ filesystem stores filenames in a decomposed form.
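
To illustrate the impact, here is a small Python sketch of how a typed filename and its decomposed counterpart can differ. It assumes standard NFD as a stand-in for what HFS+ does (HFS+ uses its own variant of decomposition), and the filename and the helper `same_name` are invented for the example.

```python
import unicodedata

typed_name = "M\u00fcller.txt"                           # precomposed ü
# Assumption: standard NFD approximates the decomposed form a filesystem returns.
stored_name = unicodedata.normalize("NFD", typed_name)


def same_name(a, b):
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(typed_name), len(stored_name))    # 10 11
print(typed_name == stored_name)            # False
print(same_name(typed_name, stored_name))   # True
```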

The issue

In the applescript-json library, however, the encodeString function always fails when the input contains a combining character.
In detail, the code assumes that inside the repeat with ch loop, ch will always be a single character and that its id property will always return an integer. In reality, ch contains more than one codepoint when a combining character is involved, so its id property returns a list instead of an integer. The code is not prepared to handle a list, which triggers the error.
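
The failure mode has a rough analogue in Python, sketched below. This is not the library's AppleScript code: `clusters` is a simplified, hypothetical grouping of each base character with the combining marks that follow it, and `naive_code_points` stands in for a loop that expects exactly one codepoint per element.

```python
import unicodedata


def clusters(text):
    """Group each base character with its trailing combining marks.

    A simplified stand-in for user-perceived characters, not a full
    grapheme cluster segmentation.
    """
    groups = []
    for c in text:
        if groups and unicodedata.combining(c):
            groups[-1] += c      # attach the mark to the preceding character
        else:
            groups.append(c)
    return groups


def naive_code_points(text):
    """Hypothetical buggy loop: assumes every element is one codepoint."""
    return [ord(ch) for ch in clusters(text)]


print(naive_code_points("abc"))   # [97, 98, 99]
try:
    naive_code_points("du\u0308nn")
except TypeError as err:
    # ord() rejects the two-codepoint element, much like the unexpected list
    print("fails:", err)
```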

The fix

This PR fixes the issue by fetching id before doing the iteration. This yields a simple list of all codepoints in the entire input string, which can be iterated safely. I have also added a simple test case for the “u followed by  ̈ ” scenario described above (including the expected JSON output, which would be u\u0308).
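
The same idea, sketched in Python rather than AppleScript: collect every codepoint of the input first, then encode them one by one. The function below is a hypothetical illustration, not the library's encodeString; its escaping rules (ASCII passed through, everything else as a \u escape) are an assumption chosen to match the expected output mentioned above.

```python
import json


def encode_string(text):
    """Hypothetical JSON string encoder that iterates over codepoints."""
    code_points = [ord(c) for c in text]   # fetch all codepoints up front
    out = ['"']
    for cp in code_points:
        if cp == 0x22:
            out.append('\\"')              # escape the double quote
        elif cp == 0x5C:
            out.append('\\\\')             # escape the backslash
        elif 0x20 <= cp < 0x7F:
            out.append(chr(cp))            # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            # Beyond the BMP, JSON escapes need a UTF-16 surrogate pair
            cp -= 0x10000
            out.append(f"\\u{0xD800 + (cp >> 10):04x}\\u{0xDC00 + (cp & 0x3FF):04x}")
    out.append('"')
    return "".join(out)


print(encode_string("u\u0308"))            # "u\u0308"
# Round trip through a real JSON parser to check the escape is valid
assert json.loads(encode_string("u\u0308")) == "u\u0308"
```

Because the loop never sees a multi-codepoint element, the combining diaeresis is simply encoded as its own escape.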

This commit adds a simple test case for the “u followed by ̈” scenario
described above. It also includes the expected JSON output, which
would be `u\u0308`.

This commit fixes the bug described in the previous commit. The trick is to fetch the `id` property on the entire input string _before_ we iterate over it.

That way, `id` returns a simple list of integer codepoints for the entire string, which can be iterated safely.

Contributor Author

claui commented Sep 8, 2017

Oh, and: apologies for overexplaining.
I just came across your slides, which indicate you probably know a lot more about denormalization and combining characters than I do. 😉

@mgax mgax merged commit ad86475 into mgax:master Sep 8, 2017
Owner

mgax commented Sep 8, 2017

Wow. Thanks!

@claui claui deleted the combining-characters branch September 8, 2017 17:03
Owner

mgax commented Sep 8, 2017

> Oh, and: apologies for overexplaining.

No need! I had no idea how applescript deals with multiple codepoints for a character. And I'm actually surprised anybody is using this library, let alone is willing to send a PR :)

Contributor Author

claui commented Sep 8, 2017

> And I'm actually surprised anybody is using this library, let alone is willing to send a PR :)

Not only do I use it, I feel it’s the best solution out there to make AppleScript write machine-readable things to standard output.

Right now, I’m finishing up an Alfred 3 workflow, which inexplicably failed the other night despite months of testing. Turns out one of my Terminal.app tabs was showing a directory name in decomposed form just as my AppleScript inspected that tab. That one letter then went on to crash your library.

Thank you for making this!

Owner

mgax commented Sep 8, 2017

Ah, cool, glad to hear that! I made it a while ago to export playlists from iTunes. The idea was to get the data out of applescript-land as quickly as possible and use a sane language to work with it. :)


xilopaint commented Sep 27, 2018

Hi @claui! I'm also trying to use applescript-json in an Alfred workflow to generate feedback in a Script Filter. Could you give me a hand in #5?

Contributor Author

claui commented Sep 28, 2018

@xilopaint Thanks but no, I can’t implement this right now.


No problem! In fact, I didn't ask you to implement the feature; I just thought that you might know a workaround. Now I see that someone seems to have worked on the feature in #1.
