
BDL: Part 1 (Lexing and error handling)


Introduction

In the previous post we created the boilerplate for our interpreter. Today we are going to create a “lexer” for our language. The goal of a lexer is to traverse the source code and produce a list of tokens. Each token is composed of a token type and the actual “lexeme” (value) of the token. We will also store some extra metadata, such as the line and column number, in each token to help with error reporting. Speaking of error reporting, we will start building an error reporting infrastructure inspired by the one described in Crafting Interpreters. But first, let’s talk about the language.

Scheme, lists and parentheses, oh my!

Scheme is a simple, dynamically typed, Lisp-based programming language. Pretty much the entire language can be summarized as:

(operator operand0 operand1 ...)

In broad strokes, the first element of a list (represented by parentheses) is a procedure or binding we want to evaluate, and the operands are the arguments for that procedure. Of course, this is not entirely true: in Lisp-based languages we can also represent data as lists. In other words, the data is the code and the code is the data; it just depends on the context and how we interpret it.

Lists can be nested and the evaluation will happen recursively, so for example:

(+ 1 2 (* 2 3 4 (- 100 95)))

This evaluates to the sum 1 + 2 + (2 * 3 * 4 * (100 - 95)) in infix notation, that is, 1 + 2 + 120 = 123.

One powerful tool in Scheme is “quotation”. Using quotes we can denote a piece of code as data. So while (+ 1 2) evaluates to 3, '(+ 1 2) evaluates to (+ 1 2). We will talk about quoting later, but for now this should be enough to give you a brief understanding.

To start, we will support the following primitive objects: booleans (true and false), fixnums (integers), strings, symbols and nil (the empty list, ()).

Tokens

With all of this knowledge, we can represent the entirety of the language with the following token types, which include the parentheses, the single quote and the aforementioned primitive types. At the end of the scan we will also push an EOF token:

typedef enum TokenType {
    TOKEN_UNKNOWN = 0,
    TOKEN_LPAREN,
    TOKEN_RPAREN,
    TOKEN_QUOTE,
    TOKEN_TRUE,
    TOKEN_FALSE,
    TOKEN_NIL,
    TOKEN_FIXNUM,
    TOKEN_SYMBOL,
    TOKEN_STRING,
    TOKEN_EOF,
} TokenType;

As mentioned before, for each token we store its type and its string value in the form of a StringView. This allows us to avoid unnecessary string allocations for the different lexemes. We also keep track of the starting line and column of the token for easier error reporting:

typedef struct Token {
    TokenType type;
    StringView value;
    size_t line;
    size_t column;
} Token;
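
For illustration, this is what the token for the symbol foo in the source (foo) would look like (the concrete values here are an assumption based on the definitions above, not code from the original program):

// Hypothetical token for "foo" in the source "(foo)". The value is just a
// view into the original source buffer, so nothing is copied or allocated.
const char *source = "(foo)";
Token tok = (Token){
    .type = TOKEN_SYMBOL,
    .value = (StringView){ .start = &source[1], .n = 3 },
    .line = 1,
    .column = 2,
};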

To store a list of tokens, we will use a dynamic array, as explained before for the run_stdin function. We will also create a helper function for adding tokens to the buffer:

typedef struct Tokens {
    Token *buf;
    size_t size;
    size_t cap;
} Tokens;

#define TOK_BUF_CAP 256

void
token_push(Tokens *tokens, Token tok) {
    if (tokens->buf == NULL) {
        tokens->size = 0;
        tokens->cap = TOK_BUF_CAP;
        tokens->buf = malloc(tokens->cap * sizeof(Token));
    } else if (tokens->size == tokens->cap) {
        tokens->cap *= 2;
        tokens->buf = realloc(tokens->buf, tokens->cap * sizeof(Token));
    }
    // Note the post-increment: the size has to grow with each push,
    // otherwise every token would overwrite the first slot.
    tokens->buf[tokens->size++] = tok;
}
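
Since the buffer is allocated lazily on the first push, a Tokens struct only needs to be zero-initialized before use. A minimal usage sketch (the tokens here are made up for illustration):

Tokens tokens = (Tokens){0};
token_push(&tokens, (Token){ .type = TOKEN_LPAREN, .line = 1, .column = 1 });
token_push(&tokens, (Token){ .type = TOKEN_RPAREN, .line = 1, .column = 2 });
// ... tokens.buf[0] through tokens.buf[tokens.size - 1] are now valid ...
free(tokens.buf);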

We also create a Scanner structure and associated procedures to help us traverse the source code while keeping track of the line and column numbers and the character offset from the beginning of the string.

typedef struct Scanner {
    StringView current;
    size_t line_number;
    size_t col_number;
    size_t offset;
} Scanner;

char
scan_next(Scanner *scanner) {
    char c = sv_next(&scanner->current);
    if (c == '\n') {
        scanner->line_number++;
        scanner->col_number = 1;
    } else {
        scanner->col_number++;
    }
    scanner->offset++;
    return c;
}

char
scan_peek(const Scanner *scanner) {
    return sv_peek(&scanner->current);
}

bool
scan_has_next(const Scanner *scanner) {
    return scanner->current.n != 0;
}
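
The sv_next and sv_peek helpers belong to the StringView utilities from the previous post and are not repeated here. A minimal sketch of the behavior the lexer assumes from them (consume or inspect the next character, returning '\0' when the view is empty) could look like this:

char
sv_next(StringView *sv) {
    // Consume and return the next character, shrinking the view.
    if (sv->n == 0) {
        return '\0';
    }
    char c = *sv->start;
    sv->start++;
    sv->n--;
    return c;
}

char
sv_peek(const StringView *sv) {
    // Look at the next character without consuming it.
    if (sv->n == 0) {
        return '\0';
    }
    return *sv->start;
}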

We also create a couple of helper functions to skip whitespace, test whether a given character is a delimiter, and find the primitive type of a StringView value:

void
skip_whitespace(Scanner *scanner) {
    while (scan_has_next(scanner)) {
        char c = scan_peek(scanner);
        switch (c) {
            case ' ':
            case '\f':
            case '\n':
            case '\r':
            case '\t':
            case '\v': {
                scan_next(scanner);
            } break;
            default: {
                return;
            } break;
        }
    }
}

bool
is_delimiter(char c) {
    switch (c) {
        case EOF:
        case '\0':
        case ';':
        case '"':
        case '\'':
        case '(':
        case ')':
        case ' ':
        case '\f':
        case '\n':
        case '\r':
        case '\t':
        case '\v': {
            return true;
        } break;
    }
    return false;
}

TokenType
find_primitive_type(StringView value) {
    bool is_fixnum = true;
    for (size_t i = 0; i < value.n; i++) {
        char c = value.start[i];
        if (i == 0 && c == '-' && value.n > 1) {
            continue;
        }
        if (!(c >= '0' && c <= '9')) {
            is_fixnum = false;
            break;
        }
    }
    if (is_fixnum) {
        return TOKEN_FIXNUM;
    }
    if (sv_equal(&value, &(StringView){"true", 4})) {
        return TOKEN_TRUE;
    }
    if (sv_equal(&value, &(StringView){"false", 5})) {
        return TOKEN_FALSE;
    }
    return TOKEN_SYMBOL;
}
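
For example, these are the classifications we would expect, written as assert checks (requires <assert.h>; note that a lone minus sign is a symbol, not a number):

assert(find_primitive_type((StringView){"123", 3}) == TOKEN_FIXNUM);
assert(find_primitive_type((StringView){"-42", 3}) == TOKEN_FIXNUM);
assert(find_primitive_type((StringView){"-", 1}) == TOKEN_SYMBOL);
assert(find_primitive_type((StringView){"true", 4}) == TOKEN_TRUE);
assert(find_primitive_type((StringView){"add-two", 7}) == TOKEN_SYMBOL);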

With this we can create a new procedure to tokenize the input. Note how the Scanner copies the original StringView value to avoid modifying it:

Tokens
tokenize(const StringView *sv) {
    Tokens tokens = (Tokens){0};
    Scanner scanner = (Scanner){
        .current = *sv,
        .line_number = 1,
        .col_number = 1,
    };

    while (scan_has_next(&scanner)) {
        skip_whitespace(&scanner);
        // Skipping whitespace may have consumed the rest of the input.
        if (!scan_has_next(&scanner)) {
            break;
        }
        size_t line = scanner.line_number;
        size_t col = scanner.col_number;
        size_t offset = scanner.offset;
        char c = scan_next(&scanner);
        switch (c) {
            case ';': {
                // Comments.
            } break;
            case '"': {
                // Strings.
            } break;
            case '\'': {
                // Quotes.
            } break;
            case '(': {
                // Open parenthesis and nil.
            } break;
            case ')': {
                // Close parenthesis.
            } break;
            default: {
                // Booleans, fixnums and symbols.
            } break;
        }
    }

    // Mark the end of the token stream.
    token_push(&tokens, (Token){
        .type = TOKEN_EOF,
        .line = scanner.line_number,
        .column = scanner.col_number,
    });
    return tokens;
}

The easiest case to deal with is comments (;). We just have to skip everything from the comment character until a newline character is encountered.

    case ';': {
        while ((c = scan_next(&scanner)) != '\n' && c != '\0') {}
    } break;

The TOKEN_QUOTE, TOKEN_RPAREN, TOKEN_LPAREN and TOKEN_NIL tokens don’t require any value and can be generated directly when we encounter them:

    case '\'': {
        Token token = (Token){
            .type = TOKEN_QUOTE,
            .line = line,
            .column = col,
        };
        token_push(&tokens, token);
    } break;
    case '(': {
        if (scan_peek(&scanner) == ')') {
            scan_next(&scanner);
            Token token = (Token){
                .type = TOKEN_NIL,
                .line = line,
                .column = col,
            };
            token_push(&tokens, token);
        } else {
            Token token = (Token){
                .type = TOKEN_LPAREN,
                .line = line,
                .column = col,
            };
            token_push(&tokens, token);
        }
    } break;
    case ')': {
        Token token = (Token){
            .type = TOKEN_RPAREN,
            .line = line,
            .column = col,
        };
        token_push(&tokens, token);
    } break;

If we find a string, we need to traverse the source until we find a non-escaped closing double quote. This is the only instance in which our lexer can generate an error. We will leave the error reporting for the next section:

    char prev = c;
    bool found = false;
    size_t n = 0;
    while (scan_has_next(&scanner)) {
        c = scan_next(&scanner);
        if (c == '"' && prev != '\\') {
            found = true;
            break;
        }
        prev = c;
        n++;
    }
    if (!found) {
        // TODO: Report error: couldn't find the closing quotes.
    }
    Token token = (Token){
        .value = (StringView){
            .start = &sv->start[offset + 1],
            .n = n,
        },
        .type = TOKEN_STRING,
        .line = line,
        .column = col,
    };
    token_push(&tokens, token);

Finally, we consider the rest of the primitive identifiers (true/false, fixnums and symbols). We keep reading characters until we find a delimiter, then build the token and find the corresponding type with the helper function find_primitive_type. Note that we only peek at the delimiter instead of consuming it, so that a closing parenthesis right after a symbol is still tokenized on the next iteration:

    size_t n = 1;
    while (scan_has_next(&scanner)) {
        // Peek instead of consuming: a delimiter like ')' must be
        // tokenized on the next iteration of the main loop.
        if (is_delimiter(scan_peek(&scanner))) {
            break;
        }
        c = scan_next(&scanner);
        n++;
    }
    if (c == EOF || c == '\0') {
        break;
    }
    Token token = (Token){
        .value = (StringView){
            .start = &sv->start[offset],
            .n = n,
        },
        .type = TOKEN_SYMBOL,
        .line = line,
        .column = col,
    };
    token.type = find_primitive_type(token.value);
    token_push(&tokens, token);

We can now test the tokenizer by modifying the process_source procedure in the main file (don’t forget to free the token buffer!):

void
process_source(const StringView *source) {
    Tokens tokens = tokenize(source);

    // Print tokens.
    for (size_t i = 0; i < tokens.size; i++) {
        Token tok = tokens.buf[i];
        print_token(tok);
    }

    free(tokens.buf);
}

Note that we implemented a simple print_token function to test that the lexer is working as expected. Try running it with the files from the example folder or by using the REPL.
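
The exact print_token implementation is left out; a minimal sketch, assuming we just want to display the position, type and lexeme of each token, could be:

void
print_token(Token tok) {
    // Printable names for each token type; this table is just one
    // possible choice.
    static const char *token_names[] = {
        [TOKEN_UNKNOWN] = "UNKNOWN",
        [TOKEN_LPAREN]  = "LPAREN",
        [TOKEN_RPAREN]  = "RPAREN",
        [TOKEN_QUOTE]   = "QUOTE",
        [TOKEN_TRUE]    = "TRUE",
        [TOKEN_FALSE]   = "FALSE",
        [TOKEN_NIL]     = "NIL",
        [TOKEN_FIXNUM]  = "FIXNUM",
        [TOKEN_SYMBOL]  = "SYMBOL",
        [TOKEN_STRING]  = "STRING",
        [TOKEN_EOF]     = "EOF",
    };
    printf("%zu:%zu\t%s\t%.*s\n",
           tok.line, tok.column, token_names[tok.type],
           (int)tok.value.n,
           tok.value.start == NULL ? "" : tok.value.start);
}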

Errors

Let’s talk about errors. First of all, there are different types of errors that can occur: lexing errors, parsing errors and runtime errors, for example. For the first two, we want to report the line and column of the error, plus the file name if applicable, in addition to the error message. Most text editors have a way of parsing error messages to aid in code navigation, for example:

filename.c:208:37: error: ‘an_example_var’ undeclared (first use in this function)

We want to output a similar error format, but when the source comes from stdin or REPL commands, the errors will look slightly different.

To keep things simple, we will have a limited number of possible error types, each with an associated base text string. An error is then composed of an error type and error value:

typedef enum ErrorType {
    ERR_TYPE_LEXER,
    ERR_TYPE_PARSER,
    ERR_TYPE_RUNTIME,
} ErrorType;

typedef enum ErrorValue {
    ERR_UNKNOWN = 0,
    ERR_UNMATCHED_STRING,
} ErrorValue;

typedef struct Error {
    ErrorType type;
    ErrorValue value;
    size_t line;
    size_t col;
} Error;

static const char* error_msgs[] = {
    [ERR_UNKNOWN] = "error: something unexpected happened",
    [ERR_UNMATCHED_STRING] = "error: unmatched string delimiter",
};

Even if there are errors, we may be able to continue gracefully with our program execution. For this reason, we want to store a number of errors in a buffer and handle them on request. Since we don’t want to spam users with thousands of error messages just because an error happened at the beginning, we use a small static buffer that caps how many errors can be in the pipe. For now, we can say that the maximum number of errors to display at any given time is 16:

#define ERR_MAX_NUMBER 16
static Error errors[ERR_MAX_NUMBER];
static size_t errors_n = 0;

void
error_push(Error error) {
    if (errors_n < ERR_MAX_NUMBER) {
        errors[errors_n++] = error;
    }
}

Let’s go back and add error reporting for the unmatched string delimiter in our lexer:

    ...

    if (!found) {
        error_push((Error){
                .type = ERR_TYPE_LEXER,
                .value = ERR_UNMATCHED_STRING,
                .line = line,
                .col = col,
        });
        return tokens;
    }

    ...

We can now handle the case where errors_n != 0 in each of the possible ways of running our program:

void
process_source(const StringView *source) {
    Tokens tokens = tokenize(source);
    if (errors_n != 0) {
        if (tokens.buf != NULL) {
            free(tokens.buf);
        }
        return;
    }

...

void
run_repl(void) {
    printf("BDL REPL (Press Ctrl-D or Ctrl-C to exit)\n");
    while (true) {
        printf(REPL_PROMPT);
        StringView sv = read_line();
        if (sv.start == NULL) {
            return;
        }
        process_source(&sv);

        // Check if there were any errors.
        if (errors_n != 0) {
            for (size_t i = 0; i < errors_n; i++) {
                Error err = errors[i];
                for (size_t j = 0; j < err.col + sizeof(REPL_PROMPT) - 2; j++) {
                    putchar(' ');
                }
                printf("|\n");
                for (size_t j = 0; j < err.col + sizeof(REPL_PROMPT) - 2; j++) {
                    putchar(' ');
                }
                printf("%s\n", error_msgs[err.value]);
            }
            errors_n = 0;
            continue;
        }

...

void
run_file(char *file_name) {
    ...
    process_source(&sv);

    // Check if there were any errors.
    if (errors_n != 0) {
        for (size_t i = 0; i < errors_n; i++) {
            Error err = errors[i];
            printf("%s", file_name);
            if (err.line != 0) {
                printf(":%ld:%ld", err.line, err.col);
            }
            printf("%s\n", error_msgs[err.value]);
        }
        errors_n = 0;
    }

...

void
run_stdin(void) {
    ...
    process_source(&sv);

    // Check if there were any errors.
    if (errors_n != 0) {
        for (size_t i = 0; i < errors_n; i++) {
            Error err = errors[i];
            printf("stdin");
            if (err.line != 0) {
                printf(":%ld:%ld: ", err.line, err.col);
            }
            printf("%s\n", error_msgs[err.value]);
        }
        errors_n = 0;
    }

...
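
With this in place, a hypothetical file containing an unmatched string delimiter would produce an error line such as:

examples/strings.bdl:3:15: error: unmatched string delimiter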

For files and stdin we output errors in a format that text editors can parse, but on the REPL we get a bit fancy and display an indicator of where the error happened, since we only ever deal with a single line:

BDL REPL (Press Ctrl-D or Ctrl-C to exit)
bdl> (this is a "test for unterminated strings
                |
                error: unmatched string delimiter
bdl>

Conclusion

That is all for now! You can fetch the source code at this point with the v0.2 tag. Hope you are enjoying it so far! Next up: Parsing.