badd10de.dev

BDL: Part 9 (Designing an intermediate assembly language)


Introduction

In the previous post we went through a number of updates to the compiler and introduced the semantic analysis phase for type checking and symbol table generation. Here we will present the idea of linear intermediate representations and discuss the design we want to use for our language.

As the name indicates, an intermediate representation (IR) is just that, some passing data structure being used between the beginning and the end of the compilation process. The parsing tree we have used so far is an example of a graphical IR. Trees and graphs allow us to have a high-level overview of the source code and can be easily traversed for things like type checking or semantic analysis.

The end result of this compiler is to output bytecode, assembly code or a fully linked binary. These require us to linearize the source graph so that the instructions can be executed in a sequential order. Instead of generating assembly or bytecode directly, we can output a low-level representation of this linear code, which can be more easily translated to other architectures. This is our next IR, which we will lovingly name BASM (bdl assembly).

To stack or not to stack

The two most popular linear IRs are “one-address codes” (OAC) and “three-address codes” (TAC). The former is reminiscent of stack machines and the later more closely models RISC processors.

OAC IRs have operations that mostly deal with stack manipulation and require fewer registers. These IRs operators can be succinctly stored and make use of at most one parameter. On the other hand TAC contains a destination, an operator and at most two source parameters.

It will be much more clear with an example, so let’s imagine the following code:

(+ 1 2 (- 10 2 1))

A potential OAC translation (and the stack status after running this operation) could be:

CODE      | STACK
----------|---------
push 1    | [1]
push 2    | [1 2]
add       | [3]
push 10   | [3 10]
push 2    | [3 10 2]
sub       | [3 8]
push 1    | [3 8 1]
sub       | [3 7]
add       | [10]     <- final result is on the top of the stack

An equivalent TAC translation, without using the stack:

t1 = 1 + 2
t2 = 10 - 2
t3 = t2 - 1
t4 = t1 + t3         <- final result is on t4

Another TAC example, targeting an infinite registers machine. Note how this is much closer to the final assembly representation. In fact, this mimics more closely what a RISC processor might do:

load r1, 1
load r2, 2
add  r3, r1, r2
load r4, 10
load r5, 2
sub  r6, r4, r5
load r7, 1
sub  r8, r6, r7
add  r9, r3, r8      <- final result is on r9

In practice, real hardware have a limited number of registers so a register allocation step is needed for the final compilation. However, this low level representation could be easily compiled to a VM that could allocate as many registers as needed during execution. This is the form I’m interested in exploring as a target for linear IR and bytecode. Note that the generation of these instructions generates a new register as a destination for each operation, but this could have also be written as:

load r1, 1
load r2, 2
add  r1, r1, r2
load r2, 10
load r3, 2
sub  r2, r2, r3
load r3, 1
sub  r2, r2, r3
add  r1, r1, r2      <- final result is on r1

The number of registers was reduced to 3 in this form. This could be generated directly or wait until a later optimization stage. For example note how we load the numeric constant 1 on r1 and r7 in the first form and r1 and r3 in the second one. Since the first form keeps distinct names for each register, the load operations could be removed. This is not possible on the second form, since r1 is modified by the add operation and the initial context is lost. A smart compiler could reduce the first form to remove the number of load/store operations, since memory access tend to be one of the most expensive parts of a program.

load r1, 1
load r2, 2
add  r3, r1, r2
load r4, 10
sub  r5, r4, r2
sub  r6, r5, r1
add  r7, r3, r6

Of course there are many optimizations available that could be used to improve the quality of the generated code, and this is why TAC IRs are used in practice. In the following section we will draft the details of the proposed TAC IR.

BASM

As mentioned, we are going to target a low-level linear IR that could be interpreted in a bytecode VM, which we will also build alongside the IR. This way, we will effectively be targeting a VM with infinite registers. To keep things simple, we will start with only integer values, leaving floating point registers and operations for later. In addition of an infinite number of integer registers (denoted as r1, r2, ... rn), we have a number of special registers:

Each instruction is composed of an operator and a maximum of three operands. The operator tell us what kind of instruction it is (load, store, add, jump, etc.) whereas the operands can indicate source/destination registers, constants or labels. We will not include a large number of operations, so we can store the operator into a single byte. Constants are signed or unsigned numbers and must fit into a 64 byte register. Labels are temporary points in our program that can be used primarily for jumping, allowing us to perform IR translation in one go without having to backtrack to calculate offsets.

typedef enum OperandType {
    OP_TYPE_REG,
    OP_TYPE_CONST,
    OP_TYPE_LABEL,
} OperandType;

typedef struct Operand {
    OperandType type;
    union {
        struct {
            size_t num;
        } reg;

        struct {
            size_t num;
        } label;

        struct {
            union {
                u64 uval;
                s64 sval;
            };
        } const;
    };
} Operand;

typedef struct Instruction {
    Operator op;
    Operand dst;
    Operand src_a;
    Operand src_b;
} Instruction;

A program, is just a series of instructions sequentially executed. For debugging purposes we can still keep track of the line/column number from which the instruction originates. This debugging information is kept on a separate array so that the interpreter don’t get bogged down by cache performance.

typedef struct LineInfo {
    size_t line;
    size_t col;
} LineInfo;

typedef struct ProgramBASM {
    Instruction *inst;
    LineInfo *lines;
} ProgramBASM;

Let us now discuss the different types of operations in our ASM-like language.

Arithmetic

Arithmetic operations deal exclusively with numbers. The destination operand (dst) as well as the first source register (src_a) can only be of register type. The second source register (src_b) can be either a constant or another register. Floating point operations will behave just the same (addf, subf...) but will use floating point registers instead.

OP    DST   SRC_A SRC_B
add   r1    r2    r3    -> r1 = r3 + r2
sub   r1    r2    r3    -> r1 = r3 - r2
mul   r1    r2    r3    -> r1 = r3 * r2
div   r1    r2    r3    -> r1 = r3 / r2
mod   r1    r2    r3    -> r1 = r3 % r2

add   r1    r2    c1    -> r1 = c1 + r2
sub   r1    r2    c1    -> r1 = c1 - r2
mul   r1    r2    c1    -> r1 = c1 * r2
div   r1    r2    c1    -> r1 = c1 / r2
mod   r1    r2    c1    -> r1 = c1 % r2

Memory (Load/store)

We need a way of loading and saving data between memory and registers. This allow us to interact with local variables, the stack, files, etc. First of all, load operations (ldX) move a value from memory into a register. Since integer types can have different sizes (8, 16, 32 and 64 bits) we need the corresponding memory operations. The dst operand must be of register type and indicates the register that will store the value loaded from memory. The src_a and src_b operands can be either constants or registers, however, if src_a is a constant, src_b will be ignored. This mode is used to load a constant value into a register. When src_a is a register, the value loaded into memory corresponds to MEMORY[src_a + src_b]. In other words, src_a contains the base address from which to load the data and src_b is the offset from that address. When src_b is a constant of value 0 we consider it an immediate load.

OP    DST   SRC_A SRC_B
ld8   r1    c1          -> (u8)c1
ld8   r1    r2    c1    -> u8  *p; r1 = p[r2 + c1]
ld8   r1    r2    r3    -> u8  *p; r1 = p[r2 + r3]
ld16  r1    c1          -> (u16)c1
ld16  r1    r2    c1    -> u16 *p; r1 = p[r2 + c1]
ld16  r1    r2    r3    -> u16 *p; r1 = p[r2 + r3]
ld32  r1    c1          -> (u32)c1
ld32  r1    r2    c1    -> u32 *p; r1 = p[r2 + c1]
ld32  r1    r2    r3    -> u32 *p; r1 = p[r2 + r3]
ld64  r1    c1          -> (u64)c1
ld64  r1    r2    c1    -> u64 *p; r1 = p[r2 + c1]
ld64  r1    r2    r3    -> u64 *p; r1 = p[r2 + r3]

Similarly, for store operations (stX), the dst value will be stored in MEMORY[src_a + src_b]. Note that there is not a immediate constant store operation.

OP    DST   SRC_A SRC_B
st8   r1    r2    c1    -> u8  *p; p[r2 + c1] = r1
st8   r1    r2    r3    -> u8  *p; p[r2 + r3] = r1
st16  r1    r2    c1    -> u16 *p; p[r2 + c1] = r1
st16  r1    r2    r3    -> u16 *p; p[r2 + r3] = r1
st32  r1    r2    c1    -> u32 *p; p[r2 + c1] = r1
st32  r1    r2    r3    -> u32 *p; p[r2 + r3] = r1
st64  r1    r2    c1    -> u64 *p; p[r2 + c1] = r1
st64  r1    r2    r3    -> u64 *p; p[r2 + r3] = r1

Finally, we can copy data register to register without having to go through memory. Both src_a and dst must be registers, where src_a is the register to copy and dst is the destination of said copy. Depending on the type of copy, it is equivalent of using an and mask on src_a as shown below.

OP    DST   SRC_A SRC_B
cp8   r1    r2          -> r1 = r2 & 0xFF
cp16  r1    r2          -> r1 = r2 & 0xFFFF
cp32  r1    r2          -> r1 = r2 & 0xFFFFFFFF
cp64  r1    r2          -> r1 = r2 & 0xFFFFFFFFFFFFFFFF

Jumps and control flow

We implement logic using branching. We have access to conditional and unconditional jump operations. Jumping effectively sets the program counter (PC) to the given memory location. All jump operations take a label in the dst, which is the target location for the jump. Conditional jump operations compare src_a and src_b and will execute the jump if the condition is met.

OP    DST   SRC_A SRC_B
jmp   l1                -> jump to `l1` unconditionally
jeq   l1    r1    r2    -> jump to `l1` if r1 == r2
jneq  l1    r1    r2    -> jump to `l1` if r1 != r2
jgt   l1    r1    r2    -> jump to `l1` if r1 >  r2
jlt   l1    r1    r2    -> jump to `l1` if r1 <  r2
jge   l1    r1    r2    -> jump to `l1` if r1 >= r2
jle   l1    r1    r2    -> jump to `l1` if r1 <= r2

In order to serialize the labels in our code, we make use of a special instruction that will not be present in the compiled bytecode or final compilation form. The label instruction is effectively a NOP operation used to annotate the next non label instruction to be executed. The dst operand contains the numeric identifier for this label.

OP    DST   SRC_A SRC_B
lab   l1

Here is a rough example of an if-else statement after IR serialization:

; BDL
(if (= 2 1) 3 4)

; BASM
ld64   r1   2
ld64   r2   1
jeq    l1   r1   r2
ld64   r3   4
jmp    l2
lab    l1
ld64   r3   3
lab    l2

When compiling to bytecode, each instruction has a fixed size so we can substitute all labels for the equivalent instruction offset.

; BASM (after label to offset substitution)
ld64   r1   2
ld64   r2   1
jeq    3    r1   r2
ld64   r3   4
jmp    2
ld64   r3   3

Bitwise ops

Even though we haven’t added these to our language, it is useful to consider bitwise operations. These are things like shifting/rotating bits left/right, and bitwise and, or, not and xor operations. All of these require a register for dst and src_a. src_b can be either a constant or another register depending on the operation. Left and right shifts (Xshif) move the bit sequence left or right, inserting zeros right or left respectively. The original signedness of src_a doesn’t matter, meaning that a negative number will still pad the left side with zeroes when shifting right.

OP    DST   SRC_A SRC_B
lshif r1    r2    r3    -> r1 = r2 << r3
lshif r1    r2    c1    -> r1 = r2 << c1
rshif r1    r2    r3    -> r1 = r2 >> r3
rshif r1    r2    c1    -> r1 = r2 >> c1

Examples:

ld8   r2 0b11011101
lshif r1 r2 2 -> r1 = 0b01110100
rshif r1 r2 3 -> r1 = 0b00011011

Rotating bits in a N bit architecture (by default our VM will target N == 64) performs the same left/right bit shifting as above, but instead of inserting zeroes, the bit sequence to the right/left respectively is inserted into the hole.

OP    DST   SRC_A SRC_B
lrot  r1    r2    r3    -> r1 = (r2 << r3) & (r2 >> (N - r3))
lrot  r1    r2    c1    -> r1 = (r2 << c1) & (r2 >> (N - c1))
rrot  r1    r2    r3    -> r1 = (r2 >> r3) & (r2 << (N - r3))
rrot  r1    r2    c1    -> r1 = (r2 >> c1) & (r2 << (N - c1))

Examples (N = 8 for demonstration purposes):

ld8   r2 0b11011101
lrot  r1 r2 2 -> r1 = 0b01110111
rrot  r1 r2 3 -> r1 = 0b10111011

The logical operations are pretty self explanatory. Note that the logical not doesn’t need src_b, since it just inverts the bit sequence in src_a.

OP    DST   SRC_A SRC_B
not   r1    r2          -> r1 = ~r2
and   r1    r2    r3    -> r1 = r2 & r3
and   r1    r2    c1    -> r1 = r2 & c1
or    r1    r2    r3    -> r1 = r2 | r3
or    r1    r2    c1    -> r1 = r2 | c1
xor   r1    r2    r3    -> r1 = r2 ^ r3
xor   r1    r2    c1    -> r1 = r2 ^ c1

Examples:

ld8   r2 0b11011101
ld8   r3 0b00001011
not   r1 r2    -> r1 = 0b00100010
and   r1 r2 r3 -> r1 = 0b00001001
or    r1 r2 r3 -> r1 = 0b11011111
xor   r1 r2 r3 -> r1 = 0b11010110

Conclusion

In this article we have introduced a low-level assembly like language to be used both as an intermediate representation for optimization and code generation and as input for a virtual machine with infinite registers. We haven’t yet explored how to perform things like procedure calls or access to local or scoped variables, which require the definition of a calling convention and an explanation of activation records. This will be the topic of a future article, but for now we have a base to start compiling some basic bytecode programs and to develop a simple virtual machine to test our programs. Until next time!