badd10de.dev

BDL: A Scheme inspired programming language - Part 0


Introduction

I’ve recently started the implementation of a new programming language for fun and to better understand the different moving parts involved. I’ve chosen to implement a Scheme inspired language due to its power, flexibility and simplicity. With that said, I don’t expect this language to be compatible with existing implementations, but instead I’ll implement primitives, built-in functions and library functions as I need them. I’ve added some resources at the bottom of this page to read people that are more qualified than me to talk about compilers and programming languages, but maybe some of the concepts I’ll be writing here are of interest to you.

For the bootstrap implementation, I’ll be building a tree-walking interpreter that can read programs from standard input, from a given file or list of files or interactively via a read-eval-print-loop (REPL). This initial implementation can then be used to generate a self-hosted compiler, but we will talk about that once we have a fully fledged interpreter. This bootstrap interpreter will be written in C99, use no external libraries and it will start a bit rough around the edges, but hopefully it will get better as we go.

The code is freely available and at the moment of this writing I’ve already implemented a good part of the interpreter. However, I’ve pushed a new, mostly empty branch so that you can follow along with this writing. This gives me the benefit of hindsight and allows me to perform some light refactoring that will get merged back into the main branch.

Who bootstraps the bootstrappers?

If you download the code from the v0.0 branch you should have a simple Makefile, a src/bootstrap/main.c file with the program skeleton and some other directories (examples, tests) that you can just ignore for now. They just contain some example code for the language and the expected results of running the examples for testing purposes.

We will be using a unity build compilation approach, which means we will really need to compile and link the main.c file. The Makefile is a convenience, but you should be able to compile the code throughout the project with:

cc src/bootstrap/main.c -o bdl

where cc is your compiler of choice and bdl is the output binary. For now I’m just interested in supporting Linux and macOS, but I don’t know, it may also work on Windows (haven’t tried to compile C code there in a long while).

The skeleton file uses the POSIX getopts to parse command line parameters and flags. We want the following behaviour:

  1. Interpret input files in the given order: bdl file1.bdl file2.bdl ...
  2. Run an interactive REPL: bdl -i
  3. Read the standard input if no input files were specified: cat file1.bdl | bdl or echo "(+ 1 2 3)" | bdl

For now, the goal is to echo the given input to stdout for the three supported usages. For this we will need a way of reading commands interactively, with optional support for line editing (arrows or chords) and history. But since we will be working with strings for the first stage, and C is infamous for the way it handles string manipulation, we will define a little utility to help us work more cleanly.

The string view (StringView)

In C, strings are simply a series of raw bytes followed by the null terminator character ('\0'). This have some advantages, since the memory footprint of a string is very compact, but operations on strings can perform poorly or be difficult to handle. For example calling the library function strlen requires iterating over each byte until the null terminator is found. Additionally, we want to avoid performing unnecessary heap allocations, so in later stages, when we want to split the source string into tokens, we want a way of storing references to the original string rather than copying parts of it around. Enter…the StringView:

typedef struct StringView {
    char *start;
    size_t n;
} StringView;

The StringView contains a pointer to a memory location denoting the starting point of the substring (start) and the number of characters it contains (n). With this simple structure, we can define a couple of functions that allow us to iterate over it or peek the next character:

char
sv_next(StringView *sv) {
    if (sv->n == 0) {
        return '\0';
    }
    char c = sv->start[0];
    sv->start++;
    sv->n--;
    return c;
}

char
sv_peek(const StringView *sv) {
    if (sv->n == 0) {
        return '\0';
    }
    return sv->start[0];
}

We also include a couple of helper methods to compare two StringViews or write them to stdout or an output file:

bool
sv_equal(const StringView *a, const StringView *b) {
    if (a->n != b->n) {
        return false;
    }
    for (size_t i = 0; i < a->n; i++) {
        if (a->start[i] != b->start[i]) {
            return false;
        }
    }
    return true;
}

void
sv_write(const StringView *sv, FILE *file) {
    for (size_t i = 0; i < sv->n; i++) {
        putc(sv->start[i], file);
    }
}

Processing files

The simplest way of processing files is to read each of the files fully into memory. Computers these days have enough memory that this shouldn’t be a problem, and if is, we can always break down big files into small ones. The way to do this in C is a bit ugly, but here is the updated run_file function:

void
run_file(char *file_name) {
    FILE *file = fopen(file_name, "r");
    if (!file) {
        fprintf(stderr, "error: couldn't open input file: %s\n", file_name);
        exit(EXIT_FAILURE);
    }

    // Read entire file into memory.
    fseek(file, 0, SEEK_END);
    size_t file_size = ftell(file);
    fseek(file, 0, SEEK_SET);

    char *source = malloc(file_size + 1);
    fread(source, 1, file_size, file);
    source[file_size] = 0;

    StringView sv = (StringView){
        .start = source,
        .n = file_size,
    };

    process_input(&sv);

    free(source);
    fclose(file);
}

We also make our process_source function take a StringView and for now we just write its contents to stdout:

void
process_source(const StringView *source) {
    sv_write(source, stdout);
}

To try it out you can use it to dump the main function to file:

make
./build/bdl src/bootstrap/main.c

Reading from stdin

Reading the input from stdin is a bit more tricky, if only because we don’t know in advance how long the input will be. For this purpose we will be using the concept of dynamic arrays by allocating a small buffer, and if the number of bytes read before reaching EOF fills the buffer, we would reallocate the buffer with double its previous capacity:

#define STDIN_BUF_SIZE 16

void
run_stdin(void) {
    size_t buf_size = 0;
    size_t buf_cap = STDIN_BUF_SIZE;
    char *source = malloc(sizeof(char) * buf_cap);

    char c;
    while ((c = getchar()) != EOF) {
        if (buf_size == buf_cap) {
            buf_cap *= 2;
            source = realloc(source, buf_cap * sizeof(char));
        }
        source[buf_size] = c;
        buf_size++;
    }

    StringView sv = (StringView){
        .start = source,
        .n = buf_size,
    };

    process_source(&sv);

    free(source);
}

Interactive execution (REPL)

To handle the REPL commands we need a small buffer space (who is going to write more than 1024 characters directly on the REPL?) that can be statically allocated. Every time we call read_line, we clear the buffer and progressively fill up the buffer until we encounter a newline. Note that we only accept ascii characters as input, and if the EOF character is found we want to exit the REPL gracefully.

#define RL_BUF_SIZE 1024
static char readline_buf[RL_BUF_SIZE];

StringView
read_line(void) {
    // Clear buffer.
    for (size_t i = 0; i < RL_BUF_SIZE; i++) {
        readline_buf[i] = 0;
    }

    // Barebones readline implementation.
    size_t n = 0;
    char c;
    while ((c = getchar()) != '\n') {
        if (c == '\b') {
            readline_buf[n] = '\0';
            n--;
        } else if (c == EOF || c == '\0') {
            return (StringView){ .start = NULL, .n = 0 };
        } else if ((c >= ' ' && c <= '~') && n < RL_BUF_SIZE) {
            readline_buf[n] = c;
            n++;
        }
    }

    StringView sv = (StringView){
        .start = (char *)&readline_buf,
        .n = n,
    };
    return sv;
}

The resulting StringView can be passed to the process_source function as with the other methods. Our new run_repl function is as follows:

void
run_repl(void) {
    printf("BDL REPL (Press Ctrl-D or Ctrl-C to exit)\n");
    while (true) {
        printf(REPL_PROMPT);
        StringView sv = read_line();
        if (sv.start == NULL) {
            return;
        }
        process_source(&sv);
        if (sv.n != 0) {
            printf("\n");
        }
    }
}

Conclusion

And with that we finish the initial setup! You can find the final source code with the v0.1 tag. Next up we will talk a bit about the Scheme language and build our lexer that will extract the tokens from the source code.

Resources