Commit bb64871

Rewrite Unicode section
1 parent c19fed3 commit bb64871

1 file changed

+27
-11
lines changed

src/characters/unicode.md

Lines changed: 27 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -1,21 +1,37 @@
11
# Unicode
22

3-
Most letters and symbols that are common in the English speaking world fit into
4-
a single `char`, so pretending that a `char` is always "a single
5-
letter or symbol" is generally a good enough mental model.
3+
At the lowest level, computers only work with numbers and have no comprehension of text. A Java `char` is just a number between 0 and 65535. To work with text, we need to agree on how to represent it as numbers.
64

7-
Where this falls apart is with things like emoji (👨‍🍳) which are generally considered to be one symbol, but
8-
cannot be represented in a single `char`.
5+
The good news is that there is an international standard for encoding strings as numbers, called [Unicode](https://en.wikipedia.org/wiki/Unicode), and everyone, including Java, has agreed to follow it.
96

10-
```java,no_run
11-
char chef = '👨‍🍳';
7+
The bad news is that Unicode is complicated, because human writing is complicated.
8+
9+
Unicode represents text as sequences of numbers called *code points*. As long as you only work with European languages, including English, you can pretend that a code point is just a Java `char`. For example, the letter *D* is assigned to code point 68, so the following are equivalent:
10+
11+
```java
12+
~void main() {
13+
char letterD = 'D';
14+
char alsoLetterD = 68;
15+
IO.println(letterD == alsoLetterD); // true
16+
~}
17+
```
18+
19+
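However, not all Unicode code points fit into a `char`. A `char` can only have values between 0 and 65535, but Unicode code points can have values between 0 and 1,114,111. For example, the chef emoji 👨‍🍳 cannot be represented by a single `char`: its first code point alone, 128104 (👨), is already outside the `char` range, and the full emoji is a sequence of three code points (👨, an invisible zero-width joiner, and 🍳):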
20+
21+
```java,no_run,does_not_compile
22+
char chef = '👨‍🍳'; // Does not compile
1223
```
1324

14-
`char`s are actually "utf-16 code units". Many symbols require multiple "code units" to represent.
25+
Code points that cannot fit into a single `char` are represented as two `char`s, called a *surrogate pair*, according to the rules of [UTF-16](https://en.wikipedia.org/wiki/UTF-16), the Unicode encoding that Java uses. UTF-16 specifies how to encode Unicode text as a sequence of integers between 0 and 65535 (i.e. `char` values).
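As a rough illustration of the two-`char` representation, the sketch below asks the standard `Character` class how many `char`s a code point needs and splits code point 128104 into its two halves (known as a high and a low surrogate). The class name `SurrogatePairDemo` and its helper methods are made up for this example, and it uses a classic `public static void main` so that it runs on older Java versions too.

```java
public class SurrogatePairDemo {
    // Number of char values (UTF-16 code units) needed to store one code point.
    public static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Encodes a single code point as one or two char values.
    public static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int letterD = 68;
        int man = 128104; // first code point of the chef emoji

        System.out.println(charsNeeded(letterD)); // 1
        System.out.println(charsNeeded(man));     // 2

        char[] pair = encode(man);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true

        // Combining the two halves gives back the original code point.
        System.out.println(Character.toCodePoint(pair[0], pair[1])); // 128104
    }
}
```

Neither half of the pair is a valid character on its own, which is one reason why slicing text at arbitrary `char` positions can corrupt it.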
26+
27+
Another wrinkle is that a code point does not necessarily correspond to a single character shown on screen. For example, the flag of the European Union 🇪🇺 looks like a single character on screen, but is actually composed of two code points: 🇪 and 🇺 (code points 127466 and 127482 respectively).
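To make the difference between `char`s, code points, and on-screen characters more concrete, here is a small sketch that builds the flag from its two code points and compares `String.length()` (which counts `char`s) with `String.codePointCount` (which counts code points). The class `CodePointCountDemo` and its helpers are hypothetical names for this example. The chef emoji is included as well, built from the code points 128104 (👨), 8205 (zero-width joiner) and 127859 (🍳).

```java
public class CodePointCountDemo {
    // Builds a string from a list of code points.
    public static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Counts code points rather than char values.
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String euFlag = fromCodePoints(127466, 127482);
        String chef = fromCodePoints(128104, 8205, 127859);

        // One symbol on screen, two code points, four chars.
        System.out.println(euFlag.length());        // 4
        System.out.println(codePointCount(euFlag)); // 2

        // One symbol on screen, three code points, five chars.
        System.out.println(chef.length());          // 5
        System.out.println(codePointCount(chef));   // 3
    }
}
```

Neither number matches what a reader would call "one character", so neither is a safe way to count visible symbols.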
28+
29+
Because of these gotchas, you should only work with individual `char` values if you know what you're doing. Most of the time, you will work with whole strings, which are the topic of the [next section](/strings.html).
30+
31+
To get a basic understanding of character encodings, you can read Joel Spolsky's *[The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)*.
1532

16-
For a full explanation, refer to this old Computerphile video.
33+
UTF-16, used by Java, is not the only way of encoding Unicode as a stream of bytes. Another, far more popular encoding is [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which is used by most web pages. Java uses UTF-16 to represent strings internally, but uses UTF-8 for input/output by default (since Java 18; older versions used a platform-dependent default).
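One way to see that the two encodings differ is to compare how many bytes the same text occupies in each. The following sketch (the class `EncodingSizeDemo` and its method names are invented for illustration) encodes a few one-code-point strings with `String.getBytes`. It uses `StandardCharsets.UTF_16BE` rather than `UTF_16` so that no byte order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16 (big endian, no BOM).
    public static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String letterD = "D";                                // code point 68
        String euro = "\u20AC";                              // euro sign, code point 8364
        String man = new String(Character.toChars(128104));  // needs two chars

        System.out.println(utf8Length(letterD) + " vs " + utf16Length(letterD)); // 1 vs 2
        System.out.println(utf8Length(euro) + " vs " + utf16Length(euro));       // 3 vs 2
        System.out.println(utf8Length(man) + " vs " + utf16Length(man));         // 4 vs 4
    }
}
```

UTF-8 spends between one and four bytes per code point, while UTF-16 spends either two or four.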
1734

18-
It describes "utf-8", which is 8 bits per "code unit." Java's `char`
19-
uses 16 bits, but that is the only difference.
35+
For a full explanation of UTF-8, refer to this old Computerphile video:
2036

2137
<iframe width="560" height="315" src="https://www.youtube.com/embed/MijmeoH9LT4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
