Wiki Assembly
Why Assembly
language is important to to learn?
- The most low-level language that is closely tied to the hardware such as
CPU
.Assembly
code implements a symbolic (human-readable) representation of the binary machine code.Assembly
language is written to follow theCPU
execution logic directly.
Assembly
language facilities a deeper understanding howCPU
actually do its job.
Why Assembly
language is critical elementary foundation to other higher-level language, E.g. C
?
Assembly
code is the important medium for compiling C
code to machine code.
When C
program being compiled to an binary object file, the GCC
compiler will do following:
C
code will be compiled intoAssembly
codeAssembly
code will be translated into machine code
Is Assembly
language cross-platform?
No, Assembly
language is specific in the specific platform. E.g. X86
CPU-architecture has its own Assembly
instruction sets as well as the arm
CPU.
Assembly
is CPU-dependent as machine code is CPU-dependent, while C
language is CPU-independent for cross-platform.
Is the first version of GCC
written in Assembly
?
No, C
started with the BCPL
language,
Assembler
- GNU assembler (GAS)
- x86-64 GNU assembler
- AT&T syntax
- aarch64 GNU assembler
- aarch64/arm64 syntax
- x86-64 GNU assembler
- Clang Assembler
- external Assembler
- GNU Assembler
- LLVM’s integrated assembler
- https://clang.llvm.org/docs/Toolchain.html#assemble
- external Assembler
- Netwide assembler (NASM)
- Intel syntax
- x86-64
- macOS
- linux
- windows
- MASM
- Intel syntax
A example NASM assembly file print.asm
to print message
; print.asm
; nasm -f elf64 print.asm && ld print.o && ./a.out ; echo $?
; objdump -d a.out
section .data
message db, "Welcome, to, Segmentation, Faults!, "
section .text
global _start
_printMessage:
mov rax, 4 ; System call number for sys_write
mov rbx, 1 ; File descriptor 1 is stdout
mov rcx, message ; Pointer to the message string
mov rdx, 32 ; Length of the message
int 0x80 ; Call kernel
ret ; Return from the function
_exit:
mov rax, 1 ; System call number for sys_exit
mov rbx, 0 ; Exit code 0
int 0x80 ; Call kernel
_start:
call _printMessage ; Call the print message function
mov rax, 1 ; System call number for sys_exit
mov rbx, 1 ; Exit code 0
int 0x80 ; Call kernel
; sum.asm
; nasm -f elf64 sum.asm && ld sum.o && ./a.out ; echo $?
; objdump -d a.out
section .text
global _start
; Function to calculate the sum of two integers
sum:
mov rax, rdi ; Move the first argument (a) to rax
add rax, rsi ; Add the second argument (b) to rax
ret ; Return with the result in rax
_start:
; Example usage of the sum function
mov rdi, 5 ; First argument: a = 5
mov rsi, 7 ; Second argument: b = 7
call sum ; Call the sum function
; The result is now in rax
; It can be used or printed, depending on the context
mov rdi, rax ; Exit code 0
; Exit the program
mov rax, 60 ; System call number for sys_exit
syscall ; Make the system call
Memory Layout
The structure of an assembly file generally consists of serval section:
.text
section:.text
section is generally read-only. It is typically used for storing executable code, and it is not intended to be modified during program execution..text
section contains the machine code instructions that the processor will execute..text
section contains global constant data.
.data
section:.data
section is writable. It is used for storing initialized data that can be modified during the execution of the program..data
section contains global variable data.
.bss
section:- It's mostly the same with
.data
section except it's used for storing uninitialized data
- It's mostly the same with
.rodata
section:- It is used for read-only data, such as constant strings.
Here's a simple example illustrating the use of these sections:
.section .text
.global _start
_start:
// Code goes here
.section .data
my_data:
.word 42 // Initialized data
.section .bss
my_uninitialized_data:
.skip 4 // Uninitialized data, occupies 4 bytes
.section .rodata
my_string:
.asciz "Hello, World!" // Read-only data
A compiled program's memory layout consists of these segments.
A running program's memory layout consists of these segments, and also heap
and stack
memory.
Memory Layout of a Running Program
A running program typically consists of serval segments or sections, each serving a specific purpose but common sections include:
Stack
:- Stores local variables and function call information.
- Memory is automatically allocated and de-allocated as functions are called and return.
- Register(
sp
inarm64
, stack pointer) is used to manage and point to the stack memory. - Size is limited(may lead to stack overflow).
- set via
ulimit -s
in linux.
- set via
Heap
:- Dynamic memory managed by programmer at runtime.
- Memory is allocated and deallocated explicitly using functions like
malloc
/free
inC
, andnew
/delete
inC++
,brk
system call inassembly
etc. - Store dynamic data that can be shared across functions. Data lifecycle is not bound to functions.
- Size is much larger than the stack,
Data
(.data
,.bss
):- Stores global variables/constants, separated into initialized and uninitialized
Text
(.text
):- Stores the code being executed
CS 225 | Stack and Heap Memory
Label
Label
Instruction
Assembly instructions are human readable representation of the machine code as CPU can only understand the machine code.
Instruction: Opcode + Oprand
Opcode
Intel x86 Assembler Instruction Set Opcode Table
Oprand
Data area
Register Operand
Register Operand
mov rdi, rsi
Immediate Operand
Immediate Operand
mov rdi, 0x21
mov rdi, 5
mov edi, 0x21314151
In aarch64
, the immediate value is subject to:
- Arithmetic instructions (
add{s}
,sub{s}
,cmp
,cmn
) take a 12-bit unsigned immediate with an optional 12-bit left shift. - Move instructions (
movz
,movn
,movk
) take a 16-bit immediate optionally shifted to any 16-bit-aligned position within the register. - Address calculations (
adr
,adrp
) take a 21-bit signed immediate, although there's no actual syntax to specify it directly - to do so you'd have to resort to assembler expression trickery to generate an appropriate "label". - Logical instructions (
and{s}
,orr
,eor
,tst
) take a "bitmask immediate".
Memory Operand
mov rdi, [sdi]
Instruction Encoding
Assembler will encode the human-readable instruction into machine code.
- In
aarch64
, the encoding instruction is fixed-size(4 bytes) machine code. - In
x86_64
, the encoding instruction is non-fixed-size(up to 16 bytes) machine code.
Encoding Real x86 Instructions
Let's have a glimpse on the impact of the fixed/non-fixed encoding.
In order to load 32-bit integer, x86_64
need only one instruction while more instructions are needed for aarch64
to do that.
Load a 32-bit integer 0x1a2b3c4d
in x86_64
,
mov rid, 0x1a2b3c4d
Load a 32-bit integer 32-bit 0x1a2b3c4d
in aarch64
(the immediate value in mov
must be in the range of 16-bit
, so it needs two instructions),
movz x1, 0x3c4d
movk x1, 0x1a2b, lsl 16
NASM x86_64 cheat sheet
NASM x86_64 cheat sheet
GAS aarch64 cheat sheet
GAS aarch64 cheat sheet
Assembly's Role in Compiler
In the compiling process, a compiler such as GCC will translate C
code into Assembly
code for different CPU architectures, then use its corresponding Assembler to translate the Assembly
code to the machine code which is CPU dependent.
The Assembly
plays intermediate role in the compiler, while higher language like C
sits upfront and machine code runs at the bottom.
I have another writing to introduce my understanding of the compiler from the practice more than the theoretical point of view, and how to write a compiler.
Resources
Notes on x86-64 Assembly and Machine Code · GitHub
Assembly 1: Basics – CS 61 2018
Assembly Language & Computer Architecture Lecture (CS301)
The Hub of Heliopolis - Getting Started with x86-64 Assembly on Linux