Encourage effective use of Unicode #1487
Comments
I agree. One way to avoid the breaking change is to deprecate the current string API in a way that produces a message outlining the move to Also, what is the preferred way of getting the proper length in JS?
Honestly, this is not only a problem of JavaScript, but also of Rust, Swift, Java and all other languages which measure string length in, at best, code points instead of graphemes. All of these languages have exposed some newer API for iterating over graphemes. For example, Rust:
use unicode_segmentation::UnicodeSegmentation;
fn main() {
for g in "नमस्ते्".graphemes(true) { // hindi
println!("grapheme - {}", g);
}
}
JavaScript will do this (with
const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
for (const { segment: g } of segmenter.segment("नमस्ते्")) {
console.log("grapheme: ", g);
}
I don't think we should rework the existing String or expose something new which is fully Unicode-aware ( Also, in most cases we don't really care about fully compatible UTF strings. But if you really need it, you could include something like
By the way, no language supports all special cases of Unicode case mapping, just the most popular ones, like the Greek final sigma (which is context-dependent): rust-lang/rust#26035. We also support this: But
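As a rough illustration of the context-dependent final-sigma rule mentioned above, the sketch below shows how standard JavaScript `toLowerCase()` treats a capital sigma depending on its position. This is plain JS/TS behavior and says nothing about AssemblyScript internals; escapes are used so the code points are unambiguous.

```typescript
// "ΟΔΟΣ": the capital sigma is word-final and preceded by cased letters.
const word: string = "\u039F\u0394\u039F\u03A3";
const lowerWord: string = word.toLowerCase();

// A lone capital sigma has no preceding cased letter, so the final form does not apply.
const lowerLone: string = "\u03A3".toLowerCase();

console.log(lowerWord.codePointAt(3)!.toString(16)); // 3c2 (final sigma)
console.log(lowerLone.codePointAt(0)!.toString(16)); // 3c3 (regular small sigma)
```

The same input character maps to two different outputs, which is why a per-code-point lookup table alone cannot implement this mapping.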
The Unicode standard is really a mess currently: it has a lot of redundant glyphs and a really large number of special cases, which keeps growing from version to version.
Since ES6, JS also supports code points via
function reverseStringNaive(str) {
return str.split("").reverse().join("");
}
function reverseStringUnicodeAware(str) {
return Array.from(str).reverse().join("");
// or [...str].reverse().join("");
}
console.log(reverseStringNaive("foo 𝌆 bar"));
// will print:
//> rab �� oof
console.log(reverseStringUnicodeAware("foo 𝌆 bar"));
// will print:
//> rab 𝌆 oof
// calc length in O(1)
console.log("🤦🏼♂️".length); // "7"
// calc length in O(N)
console.log([..."🤦🏼♂️"].length); // "5", which now aligns with the result of Rust's "🤦🏼♂️".chars().count()
Iterating by code point (Unicode-aware):
for (const c of "foo 𝌆 bar") {
console.log(c);
}
So JS/TS/AS contain all the tools needed for handling Unicode strings and even graphemes.
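One caveat worth illustrating: iterating by code point fixes surrogate pairs, but it does not keep grapheme clusters together. A minimal sketch, with `reverseByCodePoint` as a name made up for this example:

```typescript
// Reverses a string code point by code point (same idea as reverseStringUnicodeAware above).
function reverseByCodePoint(str: string): string {
  return Array.from(str).reverse().join("");
}

// "e" + U+0301 COMBINING ACUTE ACCENT + "x": two user-perceived characters, three code points.
const input: string = "e\u0301x";
const reversed: string = reverseByCodePoint(input);

// The accent is now attached to "x" instead of "e".
console.log(reversed === "x\u0301e"); // true
```

Keeping the accent on the "e" requires segmenting by grapheme first, for example with the `Intl.Segmenter` approach shown earlier in the thread.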
Here's what the official ICU documentation says about library size:
So some languages (like Swift) either use the small_icu mode of ICU, or use their own custom implementations (Go, Rust, C#, C++) which contain only the strictly necessary Unicode property tables, compressed as static tries (three-stage indirect lookups). We follow this approach as well. |
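To make "three-stage indirect lookups" concrete, here is a toy sketch of such a table for a single boolean property. Everything in it is an assumption made for illustration: the property (`isGreekBlock`), the 9/6/6 bit split, and the build-at-startup approach. Real implementations generate the tables offline and pick their own layout; this is not AssemblyScript's actual table format.

```typescript
// Hypothetical property used only for this sketch: "is in the Greek and Coptic block".
function isGreekBlock(cp: number): boolean {
  return cp >= 0x370 && cp <= 0x3ff;
}

const LEAF_BITS = 6; // low 6 bits select an entry inside a leaf block
const MID_BITS = 6;  // next 6 bits select a leaf block inside a mid block
const LEAF_SIZE = 1 << LEAF_BITS;
const MID_SIZE = 1 << MID_BITS;
const TOP_SIZE = (0x10ffff >> (LEAF_BITS + MID_BITS)) + 1;

// Appends `block` to `pool` unless an identical block is already stored; returns its block index.
function intern(pool: number[], seen: Map<string, number>, block: number[]): number {
  const key = block.join(",");
  let index = seen.get(key);
  if (index === undefined) {
    index = pool.length / block.length;
    seen.set(key, index);
    for (const v of block) pool.push(v);
  }
  return index;
}

const stage1: number[] = []; // top bits -> mid block index
const stage2: number[] = []; // mid blocks: middle bits -> leaf block index
const stage3: number[] = []; // leaf blocks: low bits -> property value
const seenMid = new Map<string, number>();
const seenLeaf = new Map<string, number>();

for (let t = 0; t < TOP_SIZE; t++) {
  const mid: number[] = [];
  for (let m = 0; m < MID_SIZE; m++) {
    const leaf: number[] = [];
    for (let l = 0; l < LEAF_SIZE; l++) {
      const cp = (t << (LEAF_BITS + MID_BITS)) | (m << LEAF_BITS) | l;
      leaf.push(isGreekBlock(cp) ? 1 : 0);
    }
    mid.push(intern(stage3, seenLeaf, leaf));
  }
  stage1.push(intern(stage2, seenMid, mid));
}

// Three dependent array reads, no branching on the code point value.
function lookup(cp: number): boolean {
  const midBlock = stage1[cp >> (LEAF_BITS + MID_BITS)];
  const leafBlock = stage2[midBlock * MID_SIZE + ((cp >> LEAF_BITS) & (MID_SIZE - 1))];
  return stage3[leafBlock * LEAF_SIZE + (cp & (LEAF_SIZE - 1))] === 1;
}

console.log(lookup(0x3a3), lookup(0x41), lookup(0x1f926)); // true false false
console.log(stage1.length, stage2.length, stage3.length); // 272 128 192
```

Because identical blocks are stored once, the whole 0x110000-entry code space collapses to a few hundred table entries here, which is the size argument behind this kind of layout.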
Regarding string interfaces: what can we do to make them safer for users? That's a good question. I guess we could declare some methods as deprecated and not recommended for use (like
I am not sure about deprecating these, like if one wants to deal with a WTF-16 string, then these are still the way to go I think? For instance, we use them for small strings in the loader because that's faster than piping through a TextDecoder. Of course we can add more details to the documentation, with a neutral link for everyone interested to learn more? |
I meant deprecating some string methods inside AssemblyScript in the future, when
I am not so sure about that, unless JS itself officially discourages their use. Otherwise we are just creating unnecessary barriers, aren't we? |
How about deprecating it only for
Btw, Rust also requires a special counting API for code points (aka CCs/Character Codes):
println!("utf8 units (bytes): {}", "🤦🏼♂️".len()); // 17 -> 17 bytes
println!("code points: {}", "🤦🏼♂️".chars().count()); // 5
JS:
console.log("utf16 units (ushorts):", "🤦🏼♂️".length); // 7 "🤦🏼♂️".length * 2 -> 14 bytes
console.log("code points:", [..."🤦🏼♂️"].length); // 5
So Rust's
Really not sure. If the broader ecosystem goes that route, perhaps a diagnostic message about a potential pitfall, but I don't see a reason for asc to spearhead something like that. At this point in time, I'd expect that most users would complain about it. |
The preferred way is to ask a more specific question 😉 . Are you asking for the visual width, the number of user-perceived characters, the number of Unicode code points, or the number of bytes of storage used?
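Three of those four answers can be computed directly. A sketch, with helper names made up for this example; user-perceived characters need a grapheme segmenter such as `Intl.Segmenter`, and visual width depends on the font and renderer, so neither is covered here.

```typescript
// U+1F926 U+1F3FC U+200D U+2642 U+FE0F, written with escapes so the sequence is unambiguous.
const facepalm: string = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";

// Number of UTF-16 code units (what `.length` reports).
function codeUnitCount(str: string): number {
  return str.length;
}

// Number of Unicode code points (string iteration walks code points).
function codePointCount(str: string): number {
  return Array.from(str).length;
}

// Number of bytes in UTF-8, derived from code point ranges.
// Assumption: a lone surrogate is counted as 3 bytes, like the replacement character.
function utf8ByteCount(str: string): number {
  let bytes = 0;
  for (const ch of Array.from(str)) {
    const cp = ch.codePointAt(0)!;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

console.log(codeUnitCount(facepalm));  // 7
console.log(codePointCount(facepalm)); // 5
console.log(utf8ByteCount(facepalm));  // 17
```

Three different numbers for one user-perceived character is the reason the question has to be specific.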
I agree. I'm not looking to add new functionality in this issue, but just to present the current functionality in a different way.
This is a good example -- it's hard to see how this behavior helps anyone, except via bug-for-bug compatibility with JS.
I agree; deprecation feels too strong here. In particular, for functions like Renaming
We could do even better. We could actually fix |
But your fix is someone else's bug. The |
If people use
let unsafe = 'Emoji 🤖'.substr(0,7); // Emoji �
let safe = [...'Emoji 🤖'].slice(0,7).join(''); // Emoji 🤖
// or do this via regexps or third-party libraries like:
And I don't think anybody will exploit the existing broken UTF-16 behaviour for some purpose. Even if so, it would be bad practice, like exploiting UB in C++ to speed up some process "only for MSVC or ICC and Intel Core 2 Duo", for example.
And at last we could add |
This is the kind of issue where if something is going to be done, it's easier to do it sooner rather than later. So I'm posting here to make another appeal. AssemblyScript calls itself "A language made for WebAssembly". WebAssembly seeks to make sense on its own terms, rather than behaving like JavaScript for JavaScript's sake. So, I propose to rename the functions which work in terms of the underlying code-unit concept, such as |
During the last conversation on this topic, we came to the conclusion that it makes sense to create a new additional string class "Str / str" which would be completely Unicode-aware, possibly even UTF-8 encoded, and at the same time have an interface as similar as possible to classic strings, but without random access such as charCodeAt, charAt, etc.
The whole UTF-8 vs WTF-16 discussion is super unfortunate for us. AssemblyScript just so happens to be torn in between the two worlds as it both aims to be a language for WebAssembly, with a majority of stakeholders apparently trying to get rid of 16-bit unicode, and a language that looks and feels pretty much like TypeScript. As such it is based upon, and works best with, WebAPIs that are specified and designed for WTF-16, yet it compiles to WebAssembly. Our options are:
It seems there is simply no good decision we can make here, and whatever we do, we'll get ourselves into trouble. Path of least resistance might be 3., but that'll put AS at a disadvantage exactly where it currently excels, so morale to re-implement half of stdlib just to build something suboptimal isn't exactly high. As I said, it's all so unfortunate. |
There are indeed several related conversations, but I think the specific issue here can be considered in isolation. Let's forget UTF-8 vs WTF-16 here for a moment, and just focus on "encoding-independent" vs. "encoding-specific" APIs. The specific change I'm proposing here is just to rename encoding-specific functions so that they're explicit about it. It's a simple change. It can help users understand when they need to be aware of encoding-level functions, since these functions can be error-prone in a way that encoding-independent functions aren't. And it can give you more flexibility in the future, no matter what you end up deciding to do about encodings in general. |
Alright, let's play this through, using
var str = "some string";
if (str.charCodeAt(0) == 0x73) {
// ...
}
Can you give me an example of what you are envisioning with namespaces there? In particular I worry that not having a
There are probably multiple ways to do it; I was imagining something similar to the existing
var str = "some string";
if (String.JS.charCodeAt(str, 0) == 0x73) {
// ...
}
Of course, users coming from TS may find this alien or more verbose. However, this is also a great moment to point out that a better way to write this code would be:
var str = "some string";
if (str.startsWith("s")) {
// ...
}
Encoding-independent, easier to read, and more robust in the case where the string is empty 😄 . Of course, this is just a simple example, however it generalizes: a lot of seeming uses for
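For readers wondering what such a namespace could look like from the caller's side, here is a minimal sketch in plain TypeScript. `StringJS` is a hypothetical stand-in for the proposed `String.JS`; it is not an existing AssemblyScript API, and the set of members is only a guess at what would move there.

```typescript
// Hypothetical stand-in for the proposed `String.JS` namespace.
// Each member simply forwards to the existing code-unit based operation.
const StringJS = {
  // Returns the UTF-16 code unit at `index` (NaN when out of range).
  charCodeAt(str: string, index: number): number {
    return str.charCodeAt(index);
  },
  // Builds a string from a single UTF-16 code unit.
  fromCharCode(unit: number): string {
    return String.fromCharCode(unit);
  },
  // Length in UTF-16 code units.
  length(str: string): number {
    return str.length;
  },
};

const sample: string = "some string";

// Code-unit access is still available, but visibly marked as such.
console.log(StringJS.charCodeAt(sample, 0) === 0x73); // true

// The encoding-independent alternative needs no namespace at all.
console.log(sample.startsWith("s")); // true
```

The functionality is unchanged; only the call site makes the code-unit assumption visible.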
We can't break compatibility for existing strings, but we could create a new subset of string which reimplements all methods as WTF-16-aware and removes the remaining unsafe operations. It would also make it possible to convert one class of strings into another almost seamlessly, simply by changing the type declaration. Like:
var strWTF16: str = ...
var isSChar = strWTF16.charCodeAt(0) == 0x73; // compilation error

var strWTF16: str = ...
// If "str" is WTF-16, this conversion costs nothing. If it is UTF-8, it should call String.UTF8.decode
var isSChar = (strWTF16 as string).charCodeAt(0) == 0x73; // ok.
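The idea of a restricted string type can be approximated in plain TypeScript with a wrapper class. This is only a sketch under assumptions: the class name `Str`, its method set, and the `toString()` escape hatch are all hypothetical, and a real `str` would be a built-in type with its own storage rather than a wrapper.

```typescript
// Hypothetical `Str` type: exposes only operations that do not depend on code-unit indices.
class Str {
  private readonly value: string;

  constructor(value: string) {
    this.value = value;
  }

  startsWith(prefix: string): boolean {
    return this.value.startsWith(prefix);
  }

  // Code points instead of indexed code units.
  codePoints(): number[] {
    return Array.from(this.value, (ch) => ch.codePointAt(0)!);
  }

  // Explicit conversion back to a classic string, comparable to `as string` above.
  toString(): string {
    return this.value;
  }
}

const text = new Str("some string");

// text.charCodeAt(0); // would not compile: `charCodeAt` is not a member of `Str`
const isSChar: boolean = text.toString().charCodeAt(0) === 0x73;

console.log(isSChar);                // true
console.log(text.startsWith("s"));   // true
console.log(text.codePoints()[0]);   // 115
```

Code-unit access stays possible, but only after an explicit conversion that a reviewer can spot.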
But with that reasoning, wouldn't |
Yes; I'm not deeply familiar with AssemblyScript; I expect there's room for some creativity here. And @MaxGraey's idea of introducing new types looks like it would make some different options available as well. Another thing that may help is looking at what people are using
My general feeling there is that restricting or changing access to A deprecation warning on just |
That |
I think I can't quite follow how
var str = "???hello???world???";
var p1 = str.indexOf("hello");
if (~p1) {
let p2 = str.indexOf("world", p1 + "hello".length);
if (~p2) {
return str.substring(p1, p2 + "world".length);
}
}
return null;
What would be a safer but still efficient alternative to this code sample?
The conclusion drawn in #1653 (comment) may be of interest in context of what's being discussed here. |
Perhaps slightly less efficient due to doing two string concatenations instead of one, but if this is a common case, a function to do this kind of substring in the standard library could fix that. |
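One possible shape for such a standard-library helper, as a sketch: the name `sliceBetweenInclusive` and its signature are hypothetical. The indices stay internal to the helper, so callers never do arithmetic on code-unit offsets themselves.

```typescript
// Returns the part of `str` from the first occurrence of `first` through the next
// occurrence of `last` after it (both included), or null if either is missing.
function sliceBetweenInclusive(str: string, first: string, last: string): string | null {
  const p1 = str.indexOf(first);
  if (p1 === -1) return null;
  const p2 = str.indexOf(last, p1 + first.length);
  if (p2 === -1) return null;
  return str.substring(p1, p2 + last.length);
}

console.log(sliceBetweenInclusive("???hello???world???", "hello", "world")); // hello???world
console.log(sliceBetweenInclusive("???hello???", "hello", "world"));         // null
```

Internally this is the same `indexOf`/`substring` code as in the question, so it costs a single allocation; the difference is that the offsets never escape into user code.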
A more realistic scenario:
function capitalize(str: string): string {
return str.charAt(0).toUpperCase() + str.substring(1);
}
In Rust or Go, for example, which don't have random access:
pub fn capitalize(s: &str) -> String {
let mut c = s.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
So it requires using iterators, which is pretty unnatural for strings in JS/TS. I don't know how to reproduce this better with a new UTF-8 str API without using iterators. And anyway, it would be a completely different experience.
Yeah; as I said, The AssemblyScript version there doesn't work correctly in various cases. For example, And to be clear, in this issue, I'm not suggesting removing |
Yeah, that's a good point. So it could be solved implicitly via an iterator:
function capitalizeUnicode(str: string): string {
const [firstChar, ...rest] = str;
return firstChar.toUpperCase() + rest.join('');
}
capitalizeUnicode('𑣙𑣙𑣙')
// > 𑢹𑣙𑣙
So this is probably the best way we could handle UTF-8 strings without introducing some special API.
Treating Unicode strings as arrays often leads to bugs where code processes text in some languages correctly but not others. In JavaScript, it's surprising that "🤦🏼♂️".length == 7, and the advice to programmers often is: you usually don't want to look at .length, because it isn't reliably what end users think of as characters, it isn't reliably the number of codepoints, and it isn't reliably related to the display width of the string.

Similarly, function names borrowed from JS using the term "Char", such as fromCharCode, are confusing to programmers coming from non-JS languages, since code units aren't always characters.

So, what if AssemblyScript moved functions which work in terms of the underlying code-unit concept, such as charCodeAt, into a String.JS namespace, similar to the String.UTF16 namespace? They'd all be available, and easily accessible. But, they'd be visually distinguished from the other string functions, making it clear where code-unit assumptions are being made. It would also leave more conceptual room in the base String namespace for new features in the future.

Another effect of the name String.JS could be to signal to programmers that these functions won't necessarily always be optimal or natural in non-JS embeddings of Wasm, which may give AssemblyScript as a language more implementation flexibility in non-JS environments.

All that said, I don't know where AssemblyScript stands on standard library API stability at this time. If breaking changes are out of scope, perhaps some of the above goals could at least be advanced through documentation.