Playing with Unicode in Depth

· 6 min read
Frank Chen
Backend & Applied ML Engineer

The smallest unit of any text we see on the screen is a single character. But you may wonder:

  1. How is a character displayed on the screen?
  2. How is a character stored in memory or on disk in binary form (0s and 1s)?

Let's dive into Unicode to answer these questions.

In Unicode, each character maps to a code point: a number conventionally written in hexadecimal with a U+ prefix, such as U+20AC (the euro sign, €). A code point is still an abstraction; it says nothing about how the character is stored as bytes.
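To make the character-to-code-point mapping concrete, here is a minimal Python sketch. It relies only on the built-in `ord` and `chr`; the helper name `code_point_label` is made up for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    `code_point_label` is a hypothetical helper name, not a standard API.
    """
    # ord() gives the code point as an integer; format it as
    # uppercase hex, zero-padded to at least four digits.
    return f"U+{ord(ch):04X}"


euro = "\u20ac"                  # the euro sign
print(code_point_label(euro))    # U+20AC
print(chr(0x20AC) == euro)       # True: chr() maps the number back to the character
```

Note that nothing here involves bytes yet: `ord` and `chr` work purely at the abstract code point level.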

| Layer | Representation |
| --- | --- |
| screen | glyph |
| abstraction | Unicode character |
| abstraction | Unicode code point |
| disk | variable-length bytes (1 to 4 bytes) |

How is that code point represented in memory or on disk?

UTF-8, UTF-16, and UTF-32 are encodings that translate Unicode code points into binary data: sequences of 8-bit bytes that can be saved to disk or sent over a network.
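As a rough illustration, the sketch below encodes the same code point, U+20AC, with each of the three encodings using Python's built-in `str.encode`. The helper name `encode_hex` is an assumption for this example, and the big-endian codec variants (`utf-16-be`, `utf-32-be`) are chosen so that no byte order mark is prepended.

```python
def encode_hex(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as a hex string.

    `encode_hex` is a hypothetical helper name for this example.
    """
    return text.encode(encoding).hex()


euro = "\u20ac"  # one code point, U+20AC

print(encode_hex(euro, "utf-8"))      # e282ac   (3 bytes)
print(encode_hex(euro, "utf-16-be"))  # 20ac     (2 bytes)
print(encode_hex(euro, "utf-32-be"))  # 000020ac (4 bytes)
```

The code point is the same in all three cases; only the byte sequence produced by the encoding differs.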

UTF-8 is a variable-length encoding that maps each code point to between 1 and 4 bytes, and it is the standard encoding across almost all systems and applications.
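The variable length is easy to observe. The following sketch, which assumes nothing beyond Python's built-in `str.encode`, measures how many bytes UTF-8 needs for characters from different ranges; `utf8_length` is a name invented for this example.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for a single character.

    `utf8_length` is a hypothetical helper name for this example.
    """
    return len(ch.encode("utf-8"))


samples = {
    "A": "U+0041, basic Latin",
    "\u00e9": "U+00E9, Latin small e with acute",
    "\u20ac": "U+20AC, euro sign",
    "\U0001F600": "U+1F600, grinning face emoji",
}

for ch, description in samples.items():
    # Prints 1, 2, 3, and 4 bytes respectively for the samples above.
    print(f"{description}: {utf8_length(ch)} byte(s)")
```

Lower code points get shorter encodings, so plain ASCII text takes exactly one byte per character in UTF-8.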