badd10de.dev

BDL: Part 8 (A simple type system)


Introduction/Language update

Hello there! Some of the ideas described in the previous post have been slowly percolating in the background. The rewrite of BDL to adopt some of those ideas is going well. The lexer hasn’t changed much; it is still quite simple, only requiring the detection of numeric tokens (but not yet parsing them). A numeric token is marked when it starts with -/+ followed by a numeric character (or when it simply starts with a numeric character). Additionally, some token types have been added for builtin functions or constants:

static const Keyword keywords[] = {
    KEYWORD("nil",    TOKEN_NIL),
    KEYWORD("true",   TOKEN_TRUE),
    KEYWORD("false",  TOKEN_FALSE),
    KEYWORD("lambda", TOKEN_LAMBDA),
    KEYWORD("if",     TOKEN_IF),
    KEYWORD("def",    TOKEN_DEF),
    KEYWORD("set",    TOKEN_SET),
    KEYWORD("fun",    TOKEN_FUN),
    KEYWORD("struct", TOKEN_STRUCT),
    KEYWORD("+",      TOKEN_ADD),
    KEYWORD("-",      TOKEN_SUB),
    KEYWORD("*",      TOKEN_MUL),
    KEYWORD("/",      TOKEN_DIV),
    KEYWORD("%",      TOKEN_MOD),
    KEYWORD("not",    TOKEN_NOT),
    KEYWORD("and",    TOKEN_AND),
    KEYWORD("or",     TOKEN_OR),
    KEYWORD("=",      TOKEN_EQ),
    KEYWORD("<",      TOKEN_LT),
    KEYWORD(">",      TOKEN_GT),
    KEYWORD("<=",     TOKEN_LE),
    KEYWORD(">=",     TOKEN_GE),
};
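
Going back to the numeric tokens mentioned above, the detection boils down to a check like the following (a minimal sketch: the actual lexer works on its own scanner state, and the helper name is made up here):

static bool
is_numeric_start(const char *s, size_t len) {
    if (len == 0) {
        return false;
    }
    // Starts directly with a digit.
    if (s[0] >= '0' && s[0] <= '9') {
        return true;
    }
    // Starts with a sign followed by a digit.
    return (s[0] == '-' || s[0] == '+') && len > 1 && s[1] >= '0' && s[1] <= '9';
}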

The parser’s task is to interpret the sequence of tokens and generate a series of root expressions/subtrees according to the grammar of the language. For example, the following generates three root expressions (a variable definition, a function definition and a function call):

(def a:s64 1)

(fun fib (n: s64): s64
    (if (<= n 2)
        1
        (+ (fib (- n 1)) (fib (- n 2)))))

(fib a)

Currently we require type annotations for symbols in variable declarations and for function parameters/return types, but more on that later. Each of these trees is composed of a number of nodes, not too dissimilar from what we used in previous iterations of the language. Nodes are tagged-union structs that keep track of the line/column of the expression that generated them, plus the relevant information for each node type:

typedef enum NodeType {
    NODE_BUILTIN,
    NODE_NUMBER,
    NODE_BOOL,
    NODE_STRING,
    NODE_SYMBOL,
    NODE_TYPE,
    NODE_DEF,
    NODE_SET,
    NODE_FUN,
    NODE_BLOCK,
    NODE_IF,
    NODE_FUNCALL,
} NodeType;

typedef struct Node {
    size_t id;
    NodeType type;
    size_t line;
    size_t col;
    struct Type *expr_type;

    union {
        // Numbers.
        struct {
            bool negative;
            size_t integral;
            size_t fractional;
        } number;

        // String/symbol/type.
        StringView string;

        // Boolean.
        bool boolean;

        // Builtin primitive.
        struct {
            TokenType type;
            struct Node **args;
        } builtin;

        // Function call.
        struct {
            struct Node *name;
            struct Node **args;
        } funcall;

        // Variable definition.
        struct {
            struct Node *symbol;
            struct Node *value;
            struct Node *type;
        } def;

        // Variable assignment.
        struct {
            struct Node *symbol;
            struct Node *value;
        } set;

        // Function definition.
        struct {
            struct Node *name;
            struct Node **param_names;
            struct Node **param_types;
            struct Node *return_type;
            struct Node *body;
        } fun;

        // Block statements.
        struct {
            struct Node **expr;
        } block;

        // If statement.
        struct {
            struct Node *cond;
            struct Node *expr_true;
            struct Node *expr_false;
        } ifexpr;
    };
} Node;
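
Since the lexer only marks numeric tokens without parsing them, the parser is the one that fills in the number fields. A minimal sketch of that conversion (the helper below is hypothetical and works on a plain character buffer rather than the real token type):

// Hypothetical helper: split a literal such as "-12.5" into the sign,
// integral and fractional parts stored in Node.number.
static void
fill_number(Node *node, const char *s, size_t len) {
    size_t i = 0;
    node->number.negative = false;
    node->number.integral = 0;
    node->number.fractional = 0;
    if (i < len && (s[i] == '+' || s[i] == '-')) {
        node->number.negative = (s[i] == '-');
        i++;
    }
    // Integral part.
    while (i < len && s[i] >= '0' && s[i] <= '9') {
        node->number.integral = node->number.integral * 10 + (size_t)(s[i] - '0');
        i++;
    }
    // Optional fractional part.
    if (i < len && s[i] == '.') {
        i++;
        while (i < len && s[i] >= '0' && s[i] <= '9') {
            node->number.fractional = node->number.fractional * 10 + (size_t)(s[i] - '0');
            i++;
        }
    }
}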

Parsing is still performed via recursive descent, like so:

Node *
parse_next(Parser *parser) {
    Token tok = peek_token(parser);
    switch (tok.type) {
        case TOKEN_NUMBER: { return parse_number(parser); } break;
        case TOKEN_STRING: { return parse_string(parser); } break;
        case TOKEN_SYMBOL: { return parse_symbol(parser); } break;
        case TOKEN_TRUE:
        case TOKEN_FALSE: { return parse_bool(parser); } break;
        case TOKEN_LPAREN: { return parse_paren(parser); } break;
        case TOKEN_EOF: { return NULL; } break;
        case TOKEN_LCURLY: { return parse_block(parser); } break;
        default: {
            push_error(ERR_TYPE_PARSER, ERR_UNKNOWN_TOK_TYPE, tok.line, tok.col);
            return NULL;
        } break;
    }
}

Each of the relevant token types is handled by an individual function, and these functions can call each other recursively. Note that there are currently four special function calls that need to be handled separately (def, set, fun, if), and we now have the concept of builtin functions, like addition, division, or boolean comparisons. Builtin functions can take any number of arguments, whereas this is not currently allowed in user-defined functions:

Node *
parse_paren(Parser *parser) {
    next_token(parser); // Skip paren.

    Token tok = peek_token(parser);

    switch (tok.type) {
        // Builtin functions.
        case TOKEN_ADD:
        case TOKEN_SUB:
        case TOKEN_MUL:
        case TOKEN_DIV:
        case TOKEN_MOD:
        case TOKEN_NOT:
        case TOKEN_AND:
        case TOKEN_OR:
        case TOKEN_EQ:
        case TOKEN_LT:
        case TOKEN_GT:
        case TOKEN_LE:
        case TOKEN_GE: { return parse_builtin(parser); } break;
        // Special functions.
        case TOKEN_DEF: { return parse_def(parser); } break;
        case TOKEN_SET: { return parse_set(parser); } break;
        case TOKEN_FUN: { return parse_fun(parser); } break;
        case TOKEN_IF: { return parse_if(parser); } break;
        default: break;
    }

    return parse_funcall(parser);
}

One recent addition to the language is the inclusion of expression blocks, defined by curly braces containing a number of expressions: { e1 e2 ... eN }. A block returns the result of its last expression, eN. Blocks can be considered syntactic sugar for something like (begin e1 e2 ... eN) in Scheme, and they also introduce a new scope, which enables us to do something like:

(def a:str "lexical scoping allows variable shadowing")

(def example:s64
    (if true {
        (def a:s64 2)
        (def b:s64 4)
        (+ a b)
    } {
        (def a:s64 8)
        (+ a 42)
    }))

This is a pretty silly example, but hopefully the usage is clear. This also affects function definitions: functions now take a single expression as a body, and curly braces can be used to recover the previous behaviour:

(fun foo (c: s64):s64 {
    (def a:s64 2)
    (def b:s64 4)
    (+ a b c)
})

(foo 10) ; -> returns 16

I think this chimeric mixture of C-like and Lisp languages is pretty cool, and it standardizes the behaviour of blocks and lexical scoping. Bear in mind that a function body creates a single scope whether it is a single expression or a block expression; the latter is just a special case of block usage.

In the next section I will discuss the type system and semantic analysis in their current state, but before that I would like to briefly mention that there are some new options in the compiler that allow us to visualize the intermediate data structures:

Usage: bdl [options] <filename filename ...>

        -h              Show usage.
        -p [l|p|s|t]    Print mode for [l]exing, [p]arsing, [s]emantic analysis, symbol [t]ables.

The compiler generates plain text files that, when applicable, can be fed to Graphviz to generate nice plots that help us debug and understand our code. For example, for the initial code snippet with the Fibonacci function:

(def a:s64 1)

(fun fib (n: s64): s64
    (if (<= n 2)
        1
        (+ (fib (- n 1)) (fib (- n 2)))))

(fib a)

Lexing:

[   1:1    ] LPAREN
[   1:2    ] DEF
[   1:6    ] SYMBOL -> a
[   1:7    ] COLON
[   1:8    ] SYMBOL -> s64
[   1:12   ] NUMBER -> 10
[   1:14   ] RPAREN
[   3:1    ] LPAREN
[   3:2    ] FUN
[   3:6    ] SYMBOL -> fib
[   3:10   ] LPAREN
[   3:11   ] SYMBOL -> n
[   3:12   ] COLON
[   3:14   ] SYMBOL -> s64
[   3:17   ] RPAREN
[   3:18   ] COLON
[   3:20   ] SYMBOL -> s64
[   4:5    ] LPAREN
[   4:6    ] IF
[   4:9    ] LPAREN
[   4:10   ] LE
[   4:13   ] SYMBOL -> n
[   4:15   ] NUMBER -> 2
[   4:16   ] RPAREN
[   5:9    ] NUMBER -> 1
[   6:9    ] LPAREN
[   6:10   ] ADD
[   6:12   ] LPAREN
[   6:13   ] SYMBOL -> fib
[   6:17   ] LPAREN
[   6:18   ] SUB
[   6:20   ] SYMBOL -> n
[   6:22   ] NUMBER -> 1
[   6:23   ] RPAREN
[   6:24   ] RPAREN
[   6:26   ] LPAREN
[   6:27   ] SYMBOL -> fib
[   6:31   ] LPAREN
[   6:32   ] SUB
[   6:34   ] SYMBOL -> n
[   6:36   ] NUMBER -> 2
[   6:37   ] RPAREN
[   6:38   ] RPAREN
[   6:39   ] RPAREN
[   6:40   ] RPAREN
[   6:41   ] RPAREN
[   8:1    ] LPAREN
[   8:2    ] SYMBOL -> fib
[   8:6    ] SYMBOL -> a
[   8:7    ] RPAREN
[   9:1    ] EOF

Parsing:

Parse tree example

Semantic analysis tree:

Parse tree example after semantic analysis/type annotation

Semantic analysis symbol tables:

Symbol table example

The idea for these visualizations came from this Twitter thread and I am so happy I discovered it. These visualizations have already helped me find a number of bugs during development and they were pretty easy to implement.
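
To give an idea of what those plain text files contain, a stripped-down Graphviz emitter for the parse tree could look roughly like this (a sketch only; the labels, file handling and traversal in the actual compiler are more involved):

// Sketch: one DOT node per parse tree node, one edge per parent/child link.
static void
emit_dot_node(FILE *f, Node *node, Node *parent) {
    if (node == NULL) {
        return;
    }
    fprintf(f, "    node_%zu [label=\"%d\"];\n", node->id, node->type);
    if (parent != NULL) {
        fprintf(f, "    node_%zu -> node_%zu;\n", parent->id, node->id);
    }
    // Recurse into the children for the given node->type here
    // (e.g. ifexpr.cond, ifexpr.expr_true, ifexpr.expr_false, ...).
}

Wrapping the output in a digraph { ... } block is enough for dot to render it.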

Semantic analysis

Sometimes semantic analysis is bundled into the parser step, but I like the idea of keeping the phases separated. The lexer generates a token list (tokens), the parser uses said list to generate a number of expression trees (grammar), and the semantic analyzer annotates these trees, generates symbol tables and performs type checking (semantics). The two main things we want to do in the semantic analysis stage are to ensure that symbols are defined before usage and that function and variable types match the expected behaviour. We perform these steps simultaneously, since both involve traversing the expression trees. We do, however, allow function definitions in the global scope to appear anywhere, to avoid having to forward declare them. In the future we may consider splitting this into two passes, but the bulk of the work is on the type checking side, so it is fine for now.

The management of symbol tables should be familiar if you have been following along with the rest of the series. We achieve lexical scoping by having a hierarchical list of hash tables (remember Environments?).

typedef struct Scope {
    size_t id;
    struct Scope *parent;
    HashTable *symbols;
    HashTable *types;
} Scope;

As you can see, we also keep track of defined types, which we populate with default types (u16, u8, s64, str, bool, etc.). At the time of writing, structs, arrays and references have not been fully implemented yet, so there are no user-defined types for now, but there will be in the future.
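
The scope allocation itself is not shown here, but conceptually it just chains a new pair of empty tables onto the parent, along these lines (a sketch under assumptions: the hash table constructor and the id assignment are illustrative, not the actual implementation):

// Sketch: a child scope starts with empty tables and a pointer back to its
// parent, which is what gives us lexical scoping during lookups.
Scope *
alloc_scope(Scope *parent) {
    Scope *scope = malloc(sizeof(Scope));
    scope->id = parent == NULL ? 0 : parent->id + 1; // Illustrative id scheme.
    scope->parent = parent;
    scope->symbols = ht_init(); // Hypothetical hash table constructor.
    scope->types = ht_init();
    return scope;
}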

For type checking, we are going to use a very simple monomorphic (i.e. not polymorphic) approach. That is, types need to match exactly and there are no generic types. Constants such as "hello" or true have the implicit types str and bool respectively. Expressions like def or fun don’t return anything, and thus have type void. Numbers are a bit more complicated, but for now we follow a very simple approach: if a number has a fractional part its type is f64, and if not, s64. This will likely change in the future, especially once we allow type annotations for numeric constants, but it will do for now.
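
Inside the type resolver shown further below, the number rule is essentially a one-liner. A sketch (find_type is a hypothetical lookup of a builtin type by name in the scope's type table, and the fractional check glosses over literals like 1.0):

        case NODE_NUMBER: {
            // Fractional part present -> f64, otherwise s64 (for now).
            if (node->number.fractional != 0) {
                node->expr_type = find_type(scope, "f64");
            } else {
                node->expr_type = find_type(scope, "s64");
            }
        } break;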

If expressions return a value, so when the expression contains the optional else branch, both the true and false expressions need to return values of the same type. When there is no else branch, the true expression must be of type void. Furthermore, the conditional expression must have type bool.
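
A sketch of how the NODE_IF case might enforce these rules (types_equal, bool_type and void_type are stand-ins for however types are compared and the builtin types retrieved; error reporting is elided):

        case NODE_IF: {
            if (!resolve_type(ast, scope, node->ifexpr.cond) ||
                !resolve_type(ast, scope, node->ifexpr.expr_true)) {
                return false;
            }
            // The condition must be a bool.
            if (!types_equal(node->ifexpr.cond->expr_type, bool_type)) {
                return false;
            }
            if (node->ifexpr.expr_false == NULL) {
                // No else branch: the true branch must not produce a value.
                if (!types_equal(node->ifexpr.expr_true->expr_type, void_type)) {
                    return false;
                }
                node->expr_type = void_type;
            } else {
                if (!resolve_type(ast, scope, node->ifexpr.expr_false)) {
                    return false;
                }
                // Both branches must agree on the resulting type.
                if (!types_equal(node->ifexpr.expr_true->expr_type,
                                 node->ifexpr.expr_false->expr_type)) {
                    return false;
                }
                node->expr_type = node->ifexpr.expr_true->expr_type;
            }
        } break;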

In the case of def expressions, we ensure that the given annotation and the type of the value expression match. Similarly, we check calls to user-defined functions to ensure that the arguments match the formal parameters and that the return type meets the expectation.
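
The def case then looks roughly like this (again a sketch; types_equal and void_type are illustrative helpers, and registering the symbol is deferred to insert_symbol, shown later):

        case NODE_DEF: {
            if (!resolve_type(ast, scope, node->def.type) ||
                !resolve_type(ast, scope, node->def.value)) {
                return false;
            }
            // The annotation and the value expression must agree.
            if (!types_equal(node->def.type->expr_type,
                             node->def.value->expr_type)) {
                return false;
            }
            // def itself produces no value.
            node->expr_type = void_type;
        } break;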

Builtin functions are treated specially: for example, we ensure that all arguments given to not, and, and or are of type bool, and that numeric expressions (+, -, <=, etc.) have numeric arguments. Here we run into an issue, though, because certain numeric types can be safely converted to other types. For example, (+ a:u8 b:u64) should perform a u64 summation, since a, of type u8, safely fits inside a u64. We do, however, disallow mixing floating-point values with integers, or signed integers with unsigned ones. In the future we will likely have casting functions, but this requires thinking about some considerations I’m not ready to tackle right now. This type system may be scrapped in the future in favour of a more powerful polymorphic one, so I want to keep things simple for now.
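
One possible shape for that widening rule, as a sketch (the Type fields used here are assumptions about its representation, not the actual struct):

// Sketch: pick the wider of two numeric types of the same "kind", rejecting
// float/integer and signed/unsigned mixes outright.
static struct Type *
widen_numeric(struct Type *a, struct Type *b) {
    if (a->is_float != b->is_float || a->is_signed != b->is_signed) {
        return NULL; // Mixing is not allowed; the caller reports a type error.
    }
    return a->size_bytes >= b->size_bytes ? a : b;
}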

We traverse the tree by recursively calling the resolve_type function, which returns false if an error occurred at any stage and true otherwise.

bool
resolve_type(ParseTree *ast, Scope *scope, Node *node) {
    if (node->expr_type != NULL) {
        return true;
    }
    switch (node->type) {
        case NODE_BUILTIN: {
        ...

The current scope is passed around and, as previously mentioned, a new scope is instantiated for block nodes and function definitions:

        case NODE_BLOCK: {
            scope = alloc_scope(scope);
            array_push(ast->scopes, scope);
            for (size_t i = 0; i < array_size(node->block.expr); ++i) {
                Node *expr = node->block.expr[i];
                if (!resolve_type(ast, scope, expr)) {
                    return false;
                }
            }
            Node *last_expr = node->block.expr[array_size(node->block.expr) - 1];
            node->expr_type = last_expr->expr_type;
        } break;

In symbol tables we currently store the following Symbol value:

typedef enum SymbolType {
    SYMBOL_VAR,
    SYMBOL_PAR,
    SYMBOL_FUN,
} SymbolType;

typedef struct Symbol {
    Node *name;
    SymbolType type;

    union {
        struct {
            Node *type;
        } var;

        struct {
            Node **param_types;
            Node *return_type;
        } fun;
    };
} Symbol;

That is, a Symbol can store a variable (declared with def), a function parameter, or a function, which also keeps track of its parameter types as well as its return type. We insert a symbol with the following function, which only allows the insertion if the current scope doesn’t already have a symbol with the same name:

bool
insert_symbol(Scope *scope, Node *symbol, Symbol *val) {
    // Check if symbol already exists.
    HashTable *symbols = scope->symbols;
    if (ht_lookup(symbols, symbol) != NULL) {
        push_error(ERR_TYPE_PARSER, ERR_SYMBOL_REDEF, symbol->line, symbol->col);
        return false;
    }
    ht_insert(symbols, symbol, val);
    return true;
}
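
For example, the def handler registers its variable along these lines (a sketch; allocation and error propagation are simplified):

    // Sketch: registering a variable declared with def in the current scope.
    Symbol *sym = malloc(sizeof(Symbol));
    sym->name = node->def.symbol;
    sym->type = SYMBOL_VAR;
    sym->var.type = node->def.type;
    if (!insert_symbol(scope, node->def.symbol, sym)) {
        return false; // Redefinition within the same scope.
    }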

When looking up symbols, however, we traverse all parent scopes to see if the symbol was defined previously.

Symbol *
find_symbol(Scope *scope, Node *node) {
    while (scope != NULL) {
        Symbol *val = ht_lookup(scope->symbols, node);
        if (val != NULL) {
            return val;
        }
        scope = scope->parent;
    }
    push_error(ERR_TYPE_PARSER, ERR_UNKNOWN_SYMBOL, node->line, node->col);
    return NULL;
}

All in all, the semantic analysis is pretty compact, clocking in at roughly 500 lines (not including the hash table implementation). It could use some improvements, and there are some considerations currently marked as TODOs, but hopefully it is solid enough for now. Type inference would also be nice to have, and it is certainly not difficult to add to def expressions if we just drop the need for explicit annotations, but it will probably not be implemented for function parameters/return types until a more sophisticated type system is put in place.

Conclusion

It’s been a wild ride so far, and I have a big sense of déjà vu from rewriting the same compiler modules over and over, but I think this has allowed me to understand my preferred way of implementing these things. I’m pretty happy with the current lexer and parser implementation, but the type system will likely change as we iterate over some new ideas. Likewise, I doubt the symbol tables contain sufficient information for further compilation stages, but the framework is in place and can easily be extended once the need arises. As mentioned, there are still some important features missing; however, advancing through the different compilation stages typically yields new insights that may change the way we structure previous steps, so there is a case for gradual progress. For the next post, I’ll attempt to design a low-level linear IR that could be used for optimization and bytecode/assembly code generation. Stay tuned!