BDL: Part 9 (Designing an intermediate assembly language)
Introduction
In the previous post we went through a number of updates to the compiler and introduced the semantic analysis phase for type checking and symbol table generation. Here we will present the idea of linear intermediate representations and discuss the design we want to use for our language.
As the name indicates, an intermediate representation (IR) is just that, some passing data structure being used between the beginning and the end of the compilation process. The parsing tree we have used so far is an example of a graphical IR. Trees and graphs allow us to have a high-level overview of the source code and can be easily traversed for things like type checking or semantic analysis.
The end result of this compiler is to output bytecode, assembly code or a fully linked binary. These require us to linearize the source graph so that the instructions can be executed in a sequential order. Instead of generating assembly or bytecode directly, we can output a low-level representation of this linear code, which can be more easily translated to other architectures. This is our next IR, which we will lovingly name BASM (bdl assembly).
To stack or not to stack
The two most popular linear IRs are “one-address codes” (OAC) and “three-address codes” (TAC). The former is reminiscent of stack machines and the later more closely models RISC processors.
OAC IRs have operations that mostly deal with stack manipulation and require fewer registers. These IRs operators can be succinctly stored and make use of at most one parameter. On the other hand TAC contains a destination, an operator and at most two source parameters.
It will be much more clear with an example, so let’s imagine the following code:
(+ 1 2 (- 10 2 1))
A potential OAC translation (and the stack status after running this operation) could be:
CODE | STACK
----------|---------
push 1 | [1]
push 2 | [1 2]
add | [3]
push 10 | [3 10]
push 2 | [3 10 2]
sub | [3 8]
push 1 | [3 8 1]
sub | [3 7]
add | [10] <- final result is on the top of the stack
An equivalent TAC translation, without using the stack:
t1 = 1 + 2
t2 = 10 - 2
t3 = t2 - 1
t4 = t1 + t3 <- final result is on t4
Another TAC example, targeting an infinite registers machine. Note how this is much closer to the final assembly representation. In fact, this mimics more closely what a RISC processor might do:
load r1, 1
load r2, 2
add r3, r1, r2
load r4, 10
load r5, 2
sub r6, r4, r5
load r7, 1
sub r8, r6, r7
add r9, r3, r8 <- final result is on r9
In practice, real hardware have a limited number of registers so a register allocation step is needed for the final compilation. However, this low level representation could be easily compiled to a VM that could allocate as many registers as needed during execution. This is the form I’m interested in exploring as a target for linear IR and bytecode. Note that the generation of these instructions generates a new register as a destination for each operation, but this could have also be written as:
load r1, 1
load r2, 2
add r1, r1, r2
load r2, 10
load r3, 2
sub r2, r2, r3
load r3, 1
sub r2, r2, r3
add r1, r1, r2 <- final result is on r1
The number of registers was reduced to 3 in this form. This could be generated directly or wait until a later optimization stage. For example note how we load the numeric constant 1
on r1
and r7
in the first form and r1
and r3
in the second one. Since the first form keeps distinct names for each register, the load operations could be removed. This is not possible on the second form, since r1
is modified by the add
operation and the initial context is lost. A smart compiler could reduce the first form to remove the number of load/store operations, since memory access tend to be one of the most expensive parts of a program.
load r1, 1
load r2, 2
add r3, r1, r2
load r4, 10
sub r5, r4, r2
sub r6, r5, r1
add r7, r3, r6
Of course there are many optimizations available that could be used to improve the quality of the generated code, and this is why TAC IRs are used in practice. In the following section we will draft the details of the proposed TAC IR.
BASM
As mentioned, we are going to target a low-level linear IR that could be interpreted in a bytecode VM, which we will also build alongside the IR. This way, we will effectively be targeting a VM with infinite registers. To keep things simple, we will start with only integer values, leaving floating point registers and operations for later. In addition of an infinite number of integer registers (denoted as r1, r2, ... rn
), we have a number of special registers:
rsp
(stack pointer register): Points to the top of the stack.rar
(activation record register): Points to the activation record for the current procedure.
Each instruction is composed of an operator and a maximum of three operands. The operator tell us what kind of instruction it is (load
, store
, add
, jump
, etc.) whereas the operands can indicate source/destination registers, constants or labels. We will not include a large number of operations, so we can store the operator into a single byte. Constants are signed or unsigned numbers and must fit into a 64 byte register. Labels are temporary points in our program that can be used primarily for jumping, allowing us to perform IR translation in one go without having to backtrack to calculate offsets.
typedef enum OperandType {
OP_TYPE_REG,
OP_TYPE_CONST,
OP_TYPE_LABEL,
} OperandType;
typedef struct Operand {
OperandType type;
union {
struct {
size_t num;
} reg;
struct {
size_t num;
} label;
struct {
union {
u64 uval;
s64 sval;
};
} const;
};
} Operand;
typedef struct Instruction {
Operator op;
Operand dst;
Operand src_a;
Operand src_b;
} Instruction;
A program, is just a series of instructions sequentially executed. For debugging purposes we can still keep track of the line/column number from which the instruction originates. This debugging information is kept on a separate array so that the interpreter don’t get bogged down by cache performance.
typedef struct LineInfo {
size_t line;
size_t col;
} LineInfo;
typedef struct ProgramBASM {
Instruction *inst;
LineInfo *lines;
} ProgramBASM;
Let us now discuss the different types of operations in our ASM-like language.
Arithmetic
Arithmetic operations deal exclusively with numbers. The destination operand (dst
) as well as the first source register (src_a
) can only be of register type. The second source register (src_b
) can be either a constant or another register. Floating point operations will behave just the same (addf, subf...
) but will use floating point registers instead.
OP DST SRC_A SRC_B
add r1 r2 r3 -> r1 = r3 + r2
sub r1 r2 r3 -> r1 = r3 - r2
mul r1 r2 r3 -> r1 = r3 * r2
div r1 r2 r3 -> r1 = r3 / r2
mod r1 r2 r3 -> r1 = r3 % r2
add r1 r2 c1 -> r1 = c1 + r2
sub r1 r2 c1 -> r1 = c1 - r2
mul r1 r2 c1 -> r1 = c1 * r2
div r1 r2 c1 -> r1 = c1 / r2
mod r1 r2 c1 -> r1 = c1 % r2
Memory (Load/store)
We need a way of loading and saving data between memory and registers. This allow us to interact with local variables, the stack, files, etc. First of all, load operations (ldX
) move a value from memory into a register. Since integer types can have different sizes (8, 16, 32 and 64 bits) we need the corresponding memory operations. The dst
operand must be of register type and indicates the register that will store the value loaded from memory. The src_a
and src_b
operands can be either constants or registers, however, if src_a
is a constant, src_b
will be ignored. This mode is used to load a constant value into a register. When src_a
is a register, the value loaded into memory corresponds to MEMORY[src_a + src_b]
. In other words, src_a
contains the base address from which to load the data and src_b
is the offset from that address. When src_b
is a constant of value 0
we consider it an immediate load.
OP DST SRC_A SRC_B
ld8 r1 c1 -> (u8)c1
ld8 r1 r2 c1 -> u8 *p; r1 = p[r2 + c1]
ld8 r1 r2 r3 -> u8 *p; r1 = p[r2 + r3]
ld16 r1 c1 -> (u16)c1
ld16 r1 r2 c1 -> u16 *p; r1 = p[r2 + c1]
ld16 r1 r2 r3 -> u16 *p; r1 = p[r2 + r3]
ld32 r1 c1 -> (u32)c1
ld32 r1 r2 c1 -> u32 *p; r1 = p[r2 + c1]
ld32 r1 r2 r3 -> u32 *p; r1 = p[r2 + r3]
ld64 r1 c1 -> (u64)c1
ld64 r1 r2 c1 -> u64 *p; r1 = p[r2 + c1]
ld64 r1 r2 r3 -> u64 *p; r1 = p[r2 + r3]
Similarly, for store operations (stX
), the dst
value will be stored in MEMORY[src_a + src_b]
. Note that there is not a immediate constant store operation.
OP DST SRC_A SRC_B
st8 r1 r2 c1 -> u8 *p; p[r2 + c1] = r1
st8 r1 r2 r3 -> u8 *p; p[r2 + r3] = r1
st16 r1 r2 c1 -> u16 *p; p[r2 + c1] = r1
st16 r1 r2 r3 -> u16 *p; p[r2 + r3] = r1
st32 r1 r2 c1 -> u32 *p; p[r2 + c1] = r1
st32 r1 r2 r3 -> u32 *p; p[r2 + r3] = r1
st64 r1 r2 c1 -> u64 *p; p[r2 + c1] = r1
st64 r1 r2 r3 -> u64 *p; p[r2 + r3] = r1
Finally, we can copy data register to register without having to go through memory. Both src_a
and dst
must be registers, where src_a
is the register to copy and dst
is the destination of said copy. Depending on the type of copy, it is equivalent of using an and
mask on src_a
as shown below.
OP DST SRC_A SRC_B
cp8 r1 r2 -> r1 = r2 & 0xFF
cp16 r1 r2 -> r1 = r2 & 0xFFFF
cp32 r1 r2 -> r1 = r2 & 0xFFFFFFFF
cp64 r1 r2 -> r1 = r2 & 0xFFFFFFFFFFFFFFFF
Jumps and control flow
We implement logic using branching. We have access to conditional and unconditional jump operations. Jumping effectively sets the program counter (PC) to the given memory location. All jump operations take a label in the dst
, which is the target location for the jump. Conditional jump operations compare src_a
and src_b
and will execute the jump if the condition is met.
OP DST SRC_A SRC_B
jmp l1 -> jump to `l1` unconditionally
jeq l1 r1 r2 -> jump to `l1` if r1 == r2
jneq l1 r1 r2 -> jump to `l1` if r1 != r2
jgt l1 r1 r2 -> jump to `l1` if r1 > r2
jlt l1 r1 r2 -> jump to `l1` if r1 < r2
jge l1 r1 r2 -> jump to `l1` if r1 >= r2
jle l1 r1 r2 -> jump to `l1` if r1 <= r2
In order to serialize the labels in our code, we make use of a special instruction that will not be present in the compiled bytecode or final compilation form. The label instruction is effectively a NOP operation used to annotate the next non label instruction to be executed. The dst
operand contains the numeric identifier for this label.
OP DST SRC_A SRC_B
lab l1
Here is a rough example of an if-else statement after IR serialization:
; BDL
(if (= 2 1) 3 4)
; BASM
ld64 r1 2
ld64 r2 1
jeq l1 r1 r2
ld64 r3 4
jmp l2
lab l1
ld64 r3 3
lab l2
When compiling to bytecode, each instruction has a fixed size so we can substitute all labels for the equivalent instruction offset.
; BASM (after label to offset substitution)
ld64 r1 2
ld64 r2 1
jeq 3 r1 r2
ld64 r3 4
jmp 2
ld64 r3 3
Bitwise ops
Even though we haven’t added these to our language, it is useful to consider bitwise operations. These are things like shifting/rotating bits left/right, and bitwise and, or, not and xor operations. All of these require a register for dst
and src_a
. src_b
can be either a constant or another register depending on the operation. Left and right shifts (Xshif
) move the bit sequence left or right, inserting zeros right or left respectively. The original signedness of src_a
doesn’t matter, meaning that a negative number will still pad the left side with zeroes when shifting right.
OP DST SRC_A SRC_B
lshif r1 r2 r3 -> r1 = r2 << r3
lshif r1 r2 c1 -> r1 = r2 << c1
rshif r1 r2 r3 -> r1 = r2 >> r3
rshif r1 r2 c1 -> r1 = r2 >> c1
Examples:
ld8 r2 0b11011101
lshif r1 r2 2 -> r1 = 0b01110100
rshif r1 r2 3 -> r1 = 0b00011011
Rotating bits in a N bit architecture (by default our VM will target N == 64) performs the same left/right bit shifting as above, but instead of inserting zeroes, the bit sequence to the right/left respectively is inserted into the hole.
OP DST SRC_A SRC_B
lrot r1 r2 r3 -> r1 = (r2 << r3) & (r2 >> (N - r3))
lrot r1 r2 c1 -> r1 = (r2 << c1) & (r2 >> (N - c1))
rrot r1 r2 r3 -> r1 = (r2 >> r3) & (r2 << (N - r3))
rrot r1 r2 c1 -> r1 = (r2 >> c1) & (r2 << (N - c1))
Examples (N = 8 for demonstration purposes):
ld8 r2 0b11011101
lrot r1 r2 2 -> r1 = 0b01110111
rrot r1 r2 3 -> r1 = 0b10111011
The logical operations are pretty self explanatory. Note that the logical not
doesn’t need src_b
, since it just inverts the bit sequence in src_a
.
OP DST SRC_A SRC_B
not r1 r2 -> r1 = ~r2
and r1 r2 r3 -> r1 = r2 & r3
and r1 r2 c1 -> r1 = r2 & c1
or r1 r2 r3 -> r1 = r2 | r3
or r1 r2 c1 -> r1 = r2 | c1
xor r1 r2 r3 -> r1 = r2 ^ r3
xor r1 r2 c1 -> r1 = r2 ^ c1
Examples:
ld8 r2 0b11011101
ld8 r3 0b00001011
not r1 r2 -> r1 = 0b00100010
and r1 r2 r3 -> r1 = 0b00001001
or r1 r2 r3 -> r1 = 0b11011111
xor r1 r2 r3 -> r1 = 0b11010110
Conclusion
In this article we have introduced a low-level assembly like language to be used both as an intermediate representation for optimization and code generation and as input for a virtual machine with infinite registers. We haven’t yet explored how to perform things like procedure calls or access to local or scoped variables, which require the definition of a calling convention and an explanation of activation records. This will be the topic of a future article, but for now we have a base to start compiling some basic bytecode programs and to develop a simple virtual machine to test our programs. Until next time!