badd10de.dev

BDL: Part 2 (Object types and parsing)


Introduction

After generating a series of tokens, we can now start the parsing process. The goal of the parser is to create an abstract syntax tree (AST) storing a representation of the current expression. An interesting property of Lisp languages is that the AST can be used directly, since lists are stored internally as linked objects. This lends itself well to a recursive implementation for parsing, displaying and evaluating the code.

The goal for our interpreter today is to build and AST (reporting errors as necessary) from the list of tokens, and echo back a string representation of the given input:

BDL REPL (Press Ctrl-D or Ctrl-C to exit)
bdl> (+ 1 a b ("c d" false) ())
(:+ 1 :a :b ("c d" false) ())

This doesn’t look very impressive, but under the hood the aforementioned transformations will be performed and the same mechanism used for displaying the output will later be used for expression evaluation.

Objects and object types

Our language is dynamically typed, and we need a way of identifying each entity at runtime. We will be calling the base type Object, though it has nothing to do with Object Oriented Programming. Maybe Datum would be a better name, but I’m afraid it may be more confusing for people. In any case, objects can be of several types, for now we consider the following:

typedef enum ObjectType {
    OBJ_TYPE_FIXNUM,
    OBJ_TYPE_BOOL,
    OBJ_TYPE_NIL,
    OBJ_TYPE_SYMBOL,
    OBJ_TYPE_STRING,
    OBJ_TYPE_PAIR,
    OBJ_TYPE_PROCEDURE,
    OBJ_TYPE_ERR,
} ObjectType;

We talked about most of these in the previous article, with the exception of procedure. We will talk more in depth about procedures in the future, but for now we consider these to be referring to built-in primitive functions, like + or -. To create polymorphism in C, there are a couple of alternatives. Here we will use the concept of a tagged union for clarity, but we will revisit the internal object representation in the future when its time to think about optimization. The Object struct contains a ObjectType field followed by an area of memory that could be interpreted differently depending on the type tag:

typedef struct Object {
    ObjectType type;
    union {
        // OBJ_TYPE_FIXNUM
        ssize_t fixnum;

        // OBJ_TYPE_STRING
        struct {
            char *string;
            size_t string_n;
        };

        // OBJ_TYPE_PAIR
        struct {
            struct Object *car;
            struct Object *cdr;
        };

        // OBJ_TYPE_SYMBOL
        struct {
            char *symbol;
            size_t symbol_n;
        };

        // OBJ_TYPE_PROCEDURE
        struct Object *(*proc)(struct Object *args);
    };
} Object;

The total size of each allocated object will be the size of the type tag and the maximum size of each possible type plus any necessary padding. In the previous example sizeof(Object) == 24. The largest value in the union is composed of two pointers, which in a x86_64 occupy 8 bytes each. ObjectType only requires 4 bytes, but our processors like memory to be aligned to specific boundaries, so we have 4 extra bytes of padding. This is an enormous amount of memory to use for every single entity in our code, but making things clear is more important right now than maximum speed. In fact, tree based interpreters like the one we are building are already slow, so we probably address this point when we wish to optimize the code further.

In any case, astute readers may have noticed that there is OBJ_TYPE_BOOL or OBJ_TYPE_NIL in the structure. This is because we will treat those objects as singletons, which will make it so that we don’t have to allocate memory for each of those and the empty list (nil) and company are always the same for all cases. We also keep track of a special object for denoting errors.

// Singletons.

Object *obj_nil;
Object *obj_true;
Object *obj_false;
Object *obj_err;

Next, we create a number of constructors for the relevant types and a helper function to append a string type.

Object *
alloc_object(ObjectType type) {
    Object *obj = malloc(sizeof(Object));
    obj->type = type;
    return obj;
}

Object *
make_fixnum(ssize_t num) {
    Object *obj = alloc_object(OBJ_TYPE_FIXNUM);
    obj->fixnum = num;
    return obj;
}

Object *
make_procedure(Object *(*proc)(struct Object *args)) {
    Object *obj = alloc_object(OBJ_TYPE_PROCEDURE);
    obj->proc = proc;
    return obj;
}

Object *
make_pair(Object *car, Object *cdr) {
    Object *obj = alloc_object(OBJ_TYPE_PAIR);
    obj->car = car;
    obj->cdr = cdr;
    return  obj;
}

Fixnums, procedures and pairs are simple. Symbols require us to copy the symbol name internally, and strings are initialized as empty and need to be appended a given string.

Object *
make_symbol(StringView sv) {
    Object *obj = alloc_object(OBJ_TYPE_SYMBOL);
    obj->symbol = malloc(sizeof(char) * sv.n);
    memcpy(obj->symbol, sv.start, sv.n);
    obj->symbol_n = sv.n;
    return obj;
}

Object *
make_string(void) {
    Object *obj = alloc_object(OBJ_TYPE_STRING);
    obj->string = NULL;
    obj->string_n = 0;
    return obj;
}

void
append_string(Object *obj, const StringView sv) {
    assert(obj != NULL);
    assert(obj->type == OBJ_TYPE_STRING);
    if (sv.n == 0) {
        return;
    }
    obj->string = realloc(obj->string, (obj->string_n + sv.n) * sizeof(char));
    memcpy(obj->string + obj->string_n, sv.start, sv.n);
    obj->string_n += sv.n;
}

We also need to create an initialization function that will instantiate the singleton objects. It should be ran exactly once at the beginning of our program.

void
init(void) {
    // Initialize singletons.
    obj_nil = alloc_object(OBJ_TYPE_NIL);
    obj_true = alloc_object(OBJ_TYPE_BOOL);
    obj_false = alloc_object(OBJ_TYPE_BOOL);
    obj_err = alloc_object(OBJ_TYPE_ERR);
}

We also shouldn’t forget about freeing our objects. We will be building a tree, so we must be able to traverse it and free everything inside. We have to be careful not to free the singleton objects:

void
free_objects(Object *root) {
    switch (root->type) {
        case OBJ_TYPE_BOOL: break;
        case OBJ_TYPE_NIL: break;
        case OBJ_TYPE_ERR: break;
        case OBJ_TYPE_PROCEDURE:
        case OBJ_TYPE_FIXNUM: {
            free(root);
        } break;
        case OBJ_TYPE_SYMBOL: {
            if (root->symbol != NULL) {
                free(root->symbol);
            }
            free(root);
        } break;
        case OBJ_TYPE_STRING: {
            if (root->string != NULL) {
                free(root->string);
            }
            free(root);
        } break;
        case OBJ_TYPE_PAIR: {
            if (root->car != NULL) {
                free_objects(root->car);
            }
            if (root->cdr != NULL) {
                free_objects(root->cdr);
            }
            free(root);
        } break;
    }
}

Finally we will use the same recursive technique to display our objects. We have a special case for printing pairs, since when the pairs contain a non nil object in the cdr they are typically represented separated by a dot ..

void display(Object *root);

void
display_pair(Object *root) {
    display(root->car);
    if (root->cdr->type == OBJ_TYPE_PAIR) {
        printf(" ");
        display_pair(root->cdr);
    } else if (root->cdr == obj_nil) {
        return;
    } else {
        printf(" . ");
        display(root->cdr);
    }
}

void
display(Object *root) {
    switch (root->type) {
        case OBJ_TYPE_FIXNUM: {
            printf("%zd", root->fixnum);
        } break;
        case OBJ_TYPE_BOOL: {
            if (root == obj_true) {
                printf("true");
            } else {
                printf("false");
            }
        } break;
        case OBJ_TYPE_NIL: {
            printf("()");
        } break;
        case OBJ_TYPE_STRING: {
            printf("\"%.*s\"", (int)root->string_n, root->string);
        } break;
        case OBJ_TYPE_SYMBOL: {
            printf(":%.*s", (int)root->symbol_n, root->symbol);
        } break;
        case OBJ_TYPE_PAIR: {
            printf("(");
            display_pair(root);
            printf(")");
        } break;
        case OBJ_TYPE_PROCEDURE: {
            printf("#{procedure}");
        } break;
        case OBJ_TYPE_ERR: {
            printf("#{error}");
        } break;
    }
    return;
}

Parsing tokens

We will be using a similar approach as we did with our character scanner. First, we build some helper functions to visit tokens without altering the original input:

typedef struct Visitor {
    Tokens tokens;
    size_t current;
} Visitor;

Token
peek_token(const Visitor *visitor) {
    return visitor->tokens.buf[visitor->current];
}

Token
next_token(Visitor *visitor) {
    return visitor->tokens.buf[visitor->current++];
}

bool
has_next_token(const Visitor *visitor) {
    return visitor->current < visitor->tokens.size;
}

Let’s setup the base parsing function, which is just a giant switch statement.

Object *
parse_tree(Visitor *vs) {
    Token tok = next_token(vs);
    switch (tok.type) {
        case TOKEN_FIXNUM: {
            return parse_fixnum(tok);
        } break;
        case TOKEN_TRUE: {
            return obj_true;
        } break;
        case TOKEN_FALSE: {
            return obj_false;
        } break;
        case TOKEN_RPAREN: {
            error_push((Error){
                .type = ERR_TYPE_PARSER,
                .value = ERR_UNBALANCED_PAREN,
                .line = tok.line,
                .col = tok.column,
            });
            return obj_err;
        } break;
        case TOKEN_QUOTE: {
            Object *quote_sym = make_symbol((StringView){"quote", 5});
            Object *next_obj = parse_tree(vs);
            if (next_obj == obj_err) {
                free_objects(quote_sym);
                return obj_err;
            }
            return make_pair(quote_sym, make_pair(next_obj, obj_nil));
        } break;
        case TOKEN_LPAREN: {
            Object *obj = parse_list(vs);
            if (obj == obj_err) {
                error_push((Error){
                    .type = ERR_TYPE_PARSER,
                    .value = ERR_UNBALANCED_PAREN,
                    .line = tok.line,
                    .col = tok.column,
                });
            }
            return obj;
        } break;
        case TOKEN_STRING: {
            Object *obj = make_string();
            append_string(obj, tok.value);
            return obj;
        } break;
        case TOKEN_SYMBOL: {
            return make_symbol(tok.value);
        } break;
        case TOKEN_NIL: {
            return obj_nil;
        } break;
        default: {
            break;
        } break;
    }
    error_push((Error){
        .type = ERR_TYPE_PARSER,
        .value = ERR_EOF_REACHED,
        .line = tok.line,
        .col = tok.column,
    });
    return obj_err;
}

We handle the fixnum parsing on a separate function for simplicity.

Object *
parse_fixnum(Token tok) {
    ssize_t num = 0;
    int sign = 1;
    for (size_t i = 0; i < tok.value.n; i++) {
        char c = tok.value.start[i];
        if (c == '-') {
            sign = -1;
            continue;
        }
        num = num * 10 + (c - '0');
    }
    return make_fixnum(num * sign);
}

The list parsing also happens in a separate function and it allow us to create a recursive descent parser by calling the base function for each subsequent token.

Object *
parse_list(Visitor *vs) {
    Token tok = peek_token(vs);
    if (tok.type == TOKEN_EOF) {
        return obj_err;
    }
    Object *next_obj = parse_tree(vs);
    if (next_obj == obj_err) {
        return obj_err;
    }
    Object *root = make_pair(next_obj, obj_nil);
    Object *list = root;
    while (has_next_token(vs)) {
        Token tok = peek_token(vs);
        if (tok.type == TOKEN_RPAREN) {
            next_token(vs);
            break;
        }
        if (tok.type == TOKEN_EOF) {
            free_objects(root);
            return obj_err;
        }
        next_obj = parse_tree(vs);
        if (next_obj == obj_err) {
            free_objects(root);
            return obj_err;
        }
        list->cdr = make_pair(next_obj, obj_nil);
        list = list->cdr;
    }
    return root;
}

We have to be careful when finding errors and free the memory if its going to go out of scope. We don’t want to be leaking memory right? We also added a bunch of new errors:

static const char* error_msgs[] = {
    [ERR_UNKNOWN] = "error: something unexpected happened",
    [ERR_UNMATCHED_STRING] = "error: unmatched string delimiter",
    [ERR_UNBALANCED_PAREN] = "error: unbalanced parentheses",
    [ERR_NOT_IMPLEMENTED] = "error: not implemented",
    [ERR_EOF_REACHED] = "error: EOF reached",
    [ERR_UNKNOWN_TOKEN] = "error: unknown token",
};

Conclusion

To hook everything up we only need to adjust the process_source function as follows.

    Visitor visitor = (Visitor){
        .tokens = tokens,
        .current = 0,
    };
    while (has_next_token(&visitor) && peek_token(&visitor).type != TOKEN_EOF) {
        Object *root = parse_tree(&visitor);
        if (root == obj_err || errors_n != 0) {
            free_objects(root);
            break;
        }
        display(root);
        printf("\n");
        free_objects(root);
    }

The parse_tree function returns the output of a single statement. If our program has multiple statements we want them to be executed one after another, for example:

(+ 1 2 3 4)

(* 2 3 4)

This should evaluate the first expression 10 and then the second 24. In many Scheme implementations, even when all expressions are evaluated only the results from the last one will be shown. We will talk about that very soon, but for now you can test the implementation with different edge cases and see if we are generating the right AST.

In the next article we will talk about expression evaluation, introduce the concept of “environments” and add our first primitive procedures. As usual the code at this point in time can be found on tag v0.3. See you soon!