Playing with Unicode in Depth
The smallest unit of the text we see on screen is a character. But you may wonder:
- How is a character displayed on the screen?
- How is a character stored in memory or on disk in binary form (0s and 1s)?
Let's dive into Unicode to answer these questions.
In Unicode, a character maps to something called a code point: a magic number conventionally written in hex, like U+20AC (the euro sign, €). A code point is still an abstraction; it says nothing about how the character is drawn or stored.
| Layer | Representation |
|---|---|
| screen | glyph |
| abstraction | Unicode character |
| abstraction | Unicode code point |
| memory / disk | encoded bytes (variable length, e.g. 1 to 4 bytes in UTF-8) |
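Python exposes this mapping directly: `ord()` returns a character's code point as an integer, and `chr()` goes the other way. A quick sketch:

```python
# Mapping between a character and its code point with Python built-ins.
euro = "€"

print(ord(euro))       # 8364, the code point as a decimal integer
print(hex(ord(euro)))  # 0x20ac, conventionally written as U+20AC
print(chr(0x20AC))     # back from the code point to the character: €
```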
How is that code point represented in memory or on disk?
UTF-8, UTF-16, and UTF-32 are encodings that translate Unicode code points into sequences of 8-bit bytes, which can be saved to disk or sent over a network.
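To see the difference concretely, here is a short sketch that encodes the same code point, U+20AC, with all three (the big-endian variants are used so the byte-order mark stays out of the output):

```python
# Encoding one code point (U+20AC) with the three common Unicode encodings.
euro = "€"

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = euro.encode(encoding)
    print(f"{encoding:>9}: {len(data)} bytes -> {data.hex(' ')}")

# Output:
#     utf-8: 3 bytes -> e2 82 ac
# utf-16-be: 2 bytes -> 20 ac
# utf-32-be: 4 bytes -> 00 00 20 ac
```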
UTF-8, which encodes each code point as 1 to 4 bytes, has become the de facto standard across almost all systems and applications.
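To get a feel for how UTF-8 reaches its variable length, here is a hand-rolled sketch of its three-byte pattern (used for code points U+0800 through U+FFFF), checked against Python's built-in encoder:

```python
# U+20AC falls in the three-byte UTF-8 range, whose bit layout is
# 1110xxxx 10xxxxxx 10xxxxxx, filled with the code point's 16 bits.
cp = 0x20AC

byte1 = 0b11100000 | (cp >> 12)          # leading byte: top 4 bits
byte2 = 0b10000000 | ((cp >> 6) & 0x3F)  # continuation byte: middle 6 bits
byte3 = 0b10000000 | (cp & 0x3F)         # continuation byte: low 6 bits

print(bytes([byte1, byte2, byte3]))  # b'\xe2\x82\xac'
print("€".encode("utf-8"))           # b'\xe2\x82\xac', matches the built-in encoder
```

One-byte code points (U+0000 through U+007F) are stored as plain ASCII, which is why ASCII text is already valid UTF-8 and a big reason the encoding spread so widely.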
