Wiki Unicode

The smallest unit of the text we see on a screen is a character. But you may wonder:

  1. How is a character displayed on the screen?
  2. How is a character stored in memory or on disk in binary form (0s and 1s)?

Let's dive into Unicode to answer these questions.

In Unicode, a character maps to something called a code point, a number written in hex such as U+20AC. A code point is still just an abstraction; it says nothing yet about bytes.

| Layer | Representation |
| --- | --- |
| screen | glyph |
| abstraction | Unicode character |
| abstraction | Unicode code point |
| disk | variable-length bytes (1 to 4 bytes) |

How is that code point represented in memory or on disk?

UTF-8, UTF-16, and UTF-32 translate Unicode code points into 8-bit bytes that can be saved to disk or transported over a network.

UTF-8, a character-to-bytes (1 to 4 bytes per character) encoding, is the de facto standard across almost all systems and applications.
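
For example, the same code point U+20AC (€) becomes a different byte sequence under each encoding; a quick check in a Python REPL:

>>> '€'.encode("utf-8")
b'\xe2\x82\xac'
>>> '€'.encode("utf-16-le")   # 2 bytes; 0x20 prints as a space
b'\xac '
>>> '€'.encode("utf-32-le")   # always 4 bytes
b'\xac \x00\x00'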

FAQ

How is a character displayed on the screen?

Software maps each character to its glyph (a grid of pixels) in a font and draws those pixels onto the screen.

How can I find out whether a file uses UTF-8, ASCII, or another encoding?

It's not always foolproof because there is no universal mandate that a file declare its encoding, so tools can only guess from the byte patterns. One way to make the encoding of a UTF-8 file explicit is to add a BOM (Byte Order Mark) at its beginning.
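
As a rough illustration of how a tool such as file guesses, here is a minimal sketch in Python; guess_encoding is a hypothetical helper of our own, not a standard API:

import codecs

def guess_encoding(path):
    """Naive guess: check BOMs first, then try strict decoding."""
    raw = open(path, "rb").read()
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    return "unknown"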

Can I set UTF-16 as the locale encoding in Linux?

No, you cannot. Linux locales use ASCII-compatible encodings such as UTF-8; UTF-16 cannot be a locale encoding because even ASCII characters pick up NUL bytes, which break C-style strings.
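
A quick way to see the problem in a Python REPL:

>>> 'A'.encode("utf-8")       # ASCII-compatible: same single byte
b'A'
>>> 'A'.encode("utf-16-le")   # contains a NUL byte, which ends a C string early
b'A\x00'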

What happens when you print a UTF-16 file in a Linux terminal?

# >>> '€'.encode("utf16") -> b'\xff\xfe\xac '
$ echo -n -e \\xff\\xfe\\xac\\x20 > a.txt
$ hexdump -C a.txt
00000000 ff fe ac 20 |... |
00000004
$ file a.txt
a.txt: Unicode text, UTF-16, little-endian text, with no line terminators
$ cat a.txt
���
$ iconv -f UTF-16LE -t UTF-8 a.txt
€

cat dumps the raw UTF-16 bytes, which a UTF-8 terminal cannot decode, so it prints replacement characters; converting to UTF-8 with iconv first makes the € display correctly.

How can I check whether a UTF-8 file has a BOM?

First, create a file without a BOM:

>>> '€'.encode()
b'\xe2\x82\xac'
>>> b'\xe2\x82\xac'.decode()
'€'
>>> bom=b"\xef\xbb\xbf"
>>> f=open("a.txt", "wb+")
>>> f.write(b'\xe2\x82\xac')
3
>>> f.flush()
$ file a.txt
a.txt: Unicode text, UTF-8 text, with no line terminators

Then create a file with a BOM prepended:

>>> f.seek(0)
0
>>> f.truncate(0)
0
>>> f.write(b'\xef\xbb\xbf\xe2\x82\xac')
6
>>> f.flush()
$ file a.txt
a.txt: Unicode text, UTF-8 (with BOM) text, with no line terminators
$ hexdump -C a.txt
00000000 ef bb bf e2 82 ac |......|
00000006
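
You can also check for the BOM directly by comparing the file's first three bytes; a minimal sketch:

>>> open("a.txt", "rb").read(3) == b'\xef\xbb\xbf'
True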

Why can we copy and paste Unicode characters into a shell?

When we copy text from the screen, we're copying the character's UTF-8 encoded bytes held in memory (via the clipboard), not the code point.

# b'\xe2\x82\xac'.decode() -> '€'
$ echo -n -e \\xe2\\x82\\xac | xclip -selection clipboard

Then you can right-click to paste into the shell, and you will see €.

How is a string stored in memory while Python is running?
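
In CPython (PEP 393, Python 3.3+), a str is not stored as UTF-8; the interpreter picks the narrowest fixed-width form that fits every character in the string: 1 byte (Latin-1), 2 bytes (UCS-2), or 4 bytes (UCS-4) per character. A rough way to observe this, assuming CPython:

>>> import sys
>>> sys.getsizeof("aa") - sys.getsizeof("a")     # 1 byte per extra Latin-1 char
1
>>> sys.getsizeof("你你") - sys.getsizeof("你")   # 2 bytes per BMP char
2
>>> sys.getsizeof("🤨🤨") - sys.getsizeof("🤨")   # 4 bytes per non-BMP char
4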

Unicode in JSON

JSON is natively a text format, and a Unicode character in a JSON string may be escaped or not. When escaped, the character is replaced with its code point in a \uXXXX escape, which takes 6 ASCII characters (6 bytes), or 12 for characters outside the BMP, which need a surrogate pair. When not escaped, the character appears as itself and occupies 1 to 4 bytes in UTF-8.

Escaping costs more storage but stays compatible with ASCII-only environments, because it forces every byte to be ASCII.
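
For example, a non-BMP character such as 🤨 escapes to a surrogate pair of two \u escapes:

>>> import json
>>> json.dumps("🤨")
'"\\ud83e\\udd28"'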

Case 1: Characters escaped,

>>> import json
>>> b=b'{"text": "\\u4f60\\u597d"}'
>>> json.loads(b)
{'text': '你好'}
>>> json.dumps(json.loads(b))
'{"text": "\\u4f60\\u597d"}'
>>> json.dumps(json.loads(b), ensure_ascii=False)
'{"text": "你好"}'
>>> f=open("a.txt", "w+")
>>> f.write(json.dumps(json.loads(b)))
24
>>> f.flush()
$ cat a.txt
{"text": "\u4f60\u597d"}#
$ hexdump -C a.txt
00000000 7b 22 74 65 78 74 22 3a 20 22 5c 75 34 66 36 30 |{"text": "\u4f60|
00000010 5c 75 35 39 37 64 22 7d |\u597d"}|
00000018

Case 2: Characters not escaped,

>>> f.seek(0)
0
>>> f.truncate(0)
0
>>> f.write(json.dumps(json.loads(b), ensure_ascii=False))
14
>>> f.flush()
$ cat a.txt
{"text": "你好"}#
$ hexdump -C a.txt
00000000 7b 22 74 65 78 74 22 3a 20 22 e4 bd a0 e5 a5 bd |{"text": "......|
00000010 22 7d |"}|
00000012

Base64

Base64 is a binary-to-text encoding scheme that represents arbitrary bytes as ASCII characters, so binary data can pass safely through text-only channels.

Python
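
A quick demonstration with the standard base64 module; the € sign's three UTF-8 bytes come out as four ASCII characters:

>>> import base64
>>> base64.b64encode('€'.encode())
b'4oKs'
>>> base64.b64decode(b'4oKs').decode()
'€'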

C application

Let's have a look at how Unicode is represented in a compiled C executable.

  1. char

cat > unicode.c << EOL
#include <stdio.h>

int main(){
    printf("Hello, World 你好🤨!\n");
    return 0;
}
EOL
gcc unicode.c -o unicode.out
$ hexdump -C unicode.out | grep -A3 "Hello, World"
00002000 01 00 02 00 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 |....Hello, World|
00002010 20 e4 bd a0 e5 a5 bd f0 9f a4 a8 21 00 00 00 00 | ..........!....|
00002020 01 1b 03 3b 34 00 00 00 05 00 00 00 00 f0 ff ff |...;4...........|
00002030 68 00 00 00 20 f0 ff ff 90 00 00 00 30 f0 ff ff |h... .......0...|

View each character's UTF-8 encoding:

# utf-8 encoding bytes
>>> "你".encode()
b'\xe4\xbd\xa0'
>>> "好".encode()
b'\xe5\xa5\xbd'
>>> "🤨".encode()
b'\xf0\x9f\xa4\xa8'

So char in the compiled file stores the UTF-8 encoded bytes, not the code points.

  2. wide char

cat > unicode.c << EOL
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "C.UTF-8");
    wchar_t* msg = L"Hello, World 你好🤨!";
    printf("%ls\n", msg);
    return 0;
}
EOL
gcc unicode.c -o unicode.out
$ hexdump -C unicode.out
00002000 01 00 02 00 00 00 00 00 43 2e 55 54 46 2d 38 00 |........C.UTF-8.|
00002010 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 |H...e...l...l...|
00002020 6f 00 00 00 2c 00 00 00 20 00 00 00 57 00 00 00 |o...,... ...W...|
00002030 6f 00 00 00 72 00 00 00 6c 00 00 00 64 00 00 00 |o...r...l...d...|
00002040 20 00 00 00 60 4f 00 00 7d 59 00 00 28 f9 01 00 | ...`O..}Y..(...|
# unicode code point
>>> hex(ord("H"))
'0x48'
>>> hex(ord("你"))
'0x4f60'
>>> hex(ord("好"))
'0x597d'
>>> hex(ord("🤨"))
'0x1f928'

wchar_t in the compiled file stores the code points (4 bytes each on Linux, in little-endian order), not the UTF-8 encoded bytes.

In conclusion, char and wchar_t lead to different encodings in C: char string literals hold UTF-8 bytes, while wchar_t literals hold raw code points.
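
The same distinction is easy to reproduce from Python, where the code-point count and the UTF-8 byte count of the test string differ:

>>> s = "Hello, World 你好🤨!"
>>> len(s)              # code points (what wchar_t stores, one per 4 bytes)
17
>>> len(s.encode())     # UTF-8 bytes (what char stores)
24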

Resources

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

Pragmatic Unicode | Ned Batchelder