Playing with Unicode in Depth
The smallest unit of the text we see on screen is a character. But you may wonder:
- How is a character displayed on the screen?
- How is a character stored in memory or on disk in binary form (as 0s and 1s)?
Let's dive into Unicode to answer these questions.
In Unicode, a character maps to something called a code point: a number conventionally written in hex, such as U+20AC. A code point is still just an abstraction; it says nothing about how the character is drawn or stored.
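As a quick illustration in Python (any Python 3 interpreter), the built-ins ord() and chr() convert between a character and its code point:

```python
# A character and its code point are two views of the same thing.
code_point = ord("€")    # 8364, i.e. 0x20AC
print(hex(code_point))   # 0x20ac
print(chr(0x20AC))       # €
```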
| Layer | Representation |
|---|---|
| screen | glyph |
| abstraction | Unicode character |
| abstraction | Unicode code point |
| disk | variable-length bytes (1 to 4 bytes) |
How is that code point represented in memory or on disk? UTF-8, UTF-16, and UTF-32 are encodings that translate Unicode code points into sequences of 8-bit bytes, which can be saved to disk or transported over the network.
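Here is a sketch in Python showing the same code point under each of the three encodings (the `-le` variants are used so no byte order mark is prepended):

```python
s = "€"  # U+20AC
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    data = s.encode(enc)
    print(f"{enc:10} {len(data)} bytes: {data}")
# utf-8      3 bytes: b'\xe2\x82\xac'
# utf-16-le  2 bytes: b'\xac '
# utf-32-le  4 bytes: b'\xac \x00\x00'
```

Note that each encoding produces a different number of bytes for the same code point.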
UTF-8 is a variable-length encoding that maps each code point to 1 to 4 bytes, and it is the de facto standard across almost all systems and applications.
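A small sketch of that variable length, again in Python, with one character for each possible byte count:

```python
# ASCII takes 1 byte; higher code points take 2, 3, or 4 bytes in UTF-8.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")
# U+0041 -> 1 byte(s): b'A'
# U+00E9 -> 2 byte(s): b'\xc3\xa9'
# U+20AC -> 3 byte(s): b'\xe2\x82\xac'
# U+1F600 -> 4 byte(s): b'\xf0\x9f\x98\x80'
```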