Playing with Unicode in Depth
The smallest unit of the text we see on screen is a character. But you may wonder:
- How is a character displayed on the screen?
- How is a character stored in memory or on disk in binary form (as 0s and 1s)?
Let's dive into Unicode to answer these questions.
In Unicode, a character maps to something called a code point: a number conventionally written in hex, such as U+20AC. A code point is still just an abstraction; it says nothing about how the character is drawn or stored.
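As a quick illustration in Python (any Python 3 interpreter), the built-ins ord() and chr() convert between a character and its code point:

```python
# A character and its code point are two views of the same thing.
code_point = ord("€")    # 8364, i.e. 0x20AC
print(hex(code_point))   # 0x20ac
print(chr(0x20AC))       # €
```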
| Layer | Representation |
|---|---|
| screen | glyph |
| abstraction | Unicode character |
| abstraction | Unicode code point |
| disk | variable-length bytes (1 to 4 bytes) |
How is that code point represented in memory or on disk? UTF-8, UTF-16, and UTF-32 are encodings that translate Unicode code points into sequences of 8-bit bytes, which can be saved to disk or transported over the network.
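Here is a sketch in Python showing the same code point under each of the three encodings (the `-le` variants are used so no byte order mark is prepended):

```python
s = "€"  # U+20AC
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    data = s.encode(enc)
    print(f"{enc:10} {len(data)} bytes: {data}")
# utf-8      3 bytes: b'\xe2\x82\xac'
# utf-16-le  2 bytes: b'\xac '
# utf-32-le  4 bytes: b'\xac \x00\x00'
```

Note that each encoding produces a different number of bytes for the same code point.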
UTF-8 is a variable-length encoding that maps each code point to 1 to 4 bytes, and it is the de facto standard across almost all systems and applications.
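A small sketch of that variable length, again in Python, with one character for each possible byte count:

```python
# ASCII takes 1 byte; higher code points take 2, 3, or 4 bytes in UTF-8.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")
# U+0041 -> 1 byte(s): b'A'
# U+00E9 -> 2 byte(s): b'\xc3\xa9'
# U+20AC -> 3 byte(s): b'\xe2\x82\xac'
# U+1F600 -> 4 byte(s): b'\xf0\x9f\x98\x80'
```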