|
1 | 1 | # Unicode
|
2 | 2 |
|
3 |
| -Most letters and symbols that are common in the English speaking world fit into |
4 |
| -a single `char`, so pretending that a `char` is always "a single |
5 |
| -letter or symbol" is generally a good enough mental model. |
| 3 | +At the lowest level, computers only work with numbers and have no comprehension of text. A Java `char` is just a number between 0 and 65535. To work with text, we need to agree on how to represent strings as numbers. |
6 | 4 |
|
7 |
| -Where this falls apart is with things like emoji (👨🍳) which are generally considered to be one symbol, but |
8 |
| -cannot be represented in a single `char`. |
| 5 | +The good news is that there is an international standard for encoding strings as numbers, called [Unicode](https://en.wikipedia.org/wiki/Unicode), and everyone, including Java, has agreed to follow it. |
9 | 6 |
|
10 |
| -```java,no_run |
11 |
| -char chef = '👨🍳'; |
| 7 | +The bad news is that Unicode is complicated, because human writing is complicated. |
| 8 | + |
| 9 | +Unicode represents text as sequences of numbers called *code points*. As long as you only work with European languages, including English, you can pretend that a code point is just a Java `char`. For example, the letter *D* is assigned to code point 68, so the following are equivalent: |
| 10 | + |
| 11 | +```java |
| 12 | +~void main() { |
| 13 | +char letterD = 'D'; |
| 14 | +char alsoLetterD = 68; |
| 15 | +IO.println(letterD == alsoLetterD); // true |
| 16 | +~} |
| 17 | +``` |
| 18 | + |
| 19 | +However, not all Unicode code points fit into a `char`. A `char` can only have values between 0 and 65535, but Unicode code points can have values between 0 and 1,114,111. For example, the chef emoji 👨🍳 starts with the code point 128104 (👨), which is already too large for a single `char`, and the full emoji is made up of more than one code point: |
| 20 | + |
| 21 | +```java,no_run,does_not_compile |
| 22 | +char chef = '👨🍳'; // Does not compile |
12 | 23 | ```
|
13 | 24 |
|
14 |
| -`char`s are actually "utf-16 code units". Many symbols require multiple "code units" to represent. |
| 25 | +Code points that cannot fit into a single `char` are represented as two `char`s, according to the rules of the Unicode encoding called [UTF-16](https://en.wikipedia.org/wiki/UTF-16), which Java uses. UTF-16 specifies the rules for encoding Unicode text as sequences of integers between 0 and 65535 (i.e. `char` values). |
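
As a rough illustration of the two-`char` representation described above, the sketch below asks the standard `Character` class how many `char`s a code point needs and what those `char` values are. The class name `SurrogateDemo` and the helper `charsNeeded` are made up for this example, and it uses `System.out.println` rather than `IO.println` only so that it also runs on older JDKs.

```java
public class SurrogateDemo {
    // Hypothetical helper: how many UTF-16 code units (Java chars)
    // are needed to store one Unicode code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int letterD = 68;   // the letter D, fits in one char
        int man = 128104;   // the code point mentioned in the text

        System.out.println(charsNeeded(letterD)); // 1
        System.out.println(charsNeeded(man));     // 2

        // Character.toChars applies the UTF-16 rules and returns the char(s).
        char[] units = Character.toChars(man);
        System.out.println((int) units[0]); // 55357
        System.out.println((int) units[1]); // 56424

        // A String built from those two chars has length 2,
        // but still contains only one code point.
        String s = new String(units);
        System.out.println(s.length());       // 2
        System.out.println(s.codePointAt(0)); // 128104
    }
}
```

The two `char` values of such a pair are called *surrogates*. On their own they do not represent any character, which is one reason why slicing a string at an arbitrary `char` index can produce broken text.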
| 26 | + |
| 27 | +Another wrinkle is that a code point does not necessarily correspond to a single character shown on screen. For example, the flag of the European Union 🇪🇺 looks like a single character on screen, but is actually composed of two code points: 🇪 and 🇺 (code points 127466 and 127482 respectively). |
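
The difference between "what you see", code points, and `char`s can be checked with a small sketch like the one below. It builds the flag from the two code points given above instead of pasting the emoji into the source, so it does not depend on the source file encoding. The class name `FlagDemo` and the helper `fromCodePoints` are hypothetical names chosen for this example.

```java
public class FlagDemo {
    // Hypothetical helper: build a String from a list of code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // The two code points that together display as the EU flag.
        String flag = fromCodePoints(127466, 127482);

        // Each code point is above 65535, so each needs two chars.
        System.out.println(flag.length());                         // 4
        System.out.println(flag.codePointCount(0, flag.length())); // 2

        // Prints 127466, then 127482.
        flag.codePoints().forEach(System.out::println);
    }
}
```

So one symbol on screen is two code points and four `char`s here, which is why `length()` is not a reliable way to count visible characters.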
| 28 | + |
| 29 | +Because of these gotchas, you should only work with individual `char` values if you know what you're doing. Most of the time, you will work with whole strings, which are the topic of the [next section](/strings.html). |
| 30 | + |
| 31 | +To get a basic understanding of character encodings, you can read Joel Spolsky's *[The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)*. |
15 | 32 |
|
16 |
| -For a full explanation, refer to this old Computerphile video. |
| 33 | +UTF-16, used by Java, is not the only way of encoding Unicode. Another, far more popular encoding is [UTF-8](https://en.wikipedia.org/wiki/UTF-8), which is used by most web pages and represents each code point as one to four bytes. Java uses UTF-16 to represent strings internally, but since Java 18 it uses UTF-8 as the default charset for input/output. |
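
To make the difference between the two encodings concrete, here is a minimal sketch that encodes the same strings both ways and compares the number of bytes. The class name `EncodingDemo` and its two helper methods are assumptions made for this example. It uses `StandardCharsets.UTF_16BE` so that no byte-order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: number of bytes the string takes up in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: number of bytes in UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String letterD = "D";
        String eAcute = "\u00E9"; // é, code point 233
        String man = new String(Character.toChars(128104));

        System.out.println(utf8Length(letterD));  // 1
        System.out.println(utf16Length(letterD)); // 2

        System.out.println(utf8Length(eAcute));   // 2
        System.out.println(utf16Length(eAcute));  // 2

        System.out.println(utf8Length(man));      // 4
        System.out.println(utf16Length(man));     // 4
    }
}
```

Plain English text is half the size in UTF-8, which is part of why it is so widely used for files and web pages.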
17 | 34 |
|
18 |
| -It describes "utf-8", which is 8 bits per "code unit." Java's `char` |
19 |
| -uses 16 bits, but that is the only difference. |
| 35 | +For a full explanation of UTF-8, refer to this old Computerphile video: |
20 | 36 |
|
21 | 37 | <iframe width="560" height="315" src="https://www.youtube.com/embed/MijmeoH9LT4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
|