badd10de.dev

BDL: Part 7 (Designing a language)


Introduction

It’s been a while since the last post and in the meantime I’ve been testing a lot of things I was interested in, including the implementation of a bytecode VM and x86_64 code generation via NASM. The former was aided by the Crafting Interpreters book and the later was a nice experiment in stack-based native compilation, learning some more details of x86 assembly as I moved around. Both of these experiments are at different levels of completion, and the native codegen got to a point of implementing tail recursive function calls. They have been valuable learning experiences, but I discovered some shortcomings in my approach and a lack of design and scope in the language. I also started reading more compiler books, I found Engineering: A compiler very interesting, even if a bit dense at times. As I read this book, I’m confronted with new questions about type systems, intermediate representations and potential for optimization.

I did some soul searching about what I want this language to be, for this reason I believe is time to thing about this project in a more structured way. I’ll be using this post as a sketchboard of ideas and wishes for the language based on my actual preferences instead of just as a learning exercise (which is still is!). If you have been following along, you can check the most recent commit at the time of this writing or the one before the intermediate representation refactoring. Bear in mind that a lot has changed since the last entry, so feel free to check the bdl repo history at your own leisure.

A departure from expectations

The first thing I want to consider is that this is currently a toy language and I’m not sure I want or need it to have any sort of adoption. I have to design the language for me, because I want to actually use it for my projects and have fun with it!

Syntax

Possibly the least important aspect to consider at the start, in my opinion. It’s not that is not useful to design a comfortable syntax, after all, it is the UX of your language. The issue is that people get caught up in minute details of syntax instead of focusing on other important language design considerations. For now I don’t mind to have a parenthesised language. Lisps are easy to parse and understand. Yes the closing parentheses can be annoying sometimes, but that’s why we use text editors to help us navigate them. Maybe this will change in the future, but for now we will be keeping the Lisp nature of the language with a few extensions discussed below. Some syntax examples will be shown below, but this doesn’t express a final decision on the finished language.

Symbolic computations

The first departure from a traditional Lisp is the removal of symbolic computations. Lisp was created with this purpose in mind, and I can’t argue the power of quoting, code as data and the homoiconicity of the language. For my use case, at least for now, this property seems a bit wasted. I’m not sure if I want a language that is easily extensible, as this power creates dissonance when reading unfamiliar (or forgotten) code snippets. Furthermore, this adds some extra complexity to the language that I’m not ready to address. This is not to say that it can’t be implemented, but right now I choose to forgo thinking about quoting and eval in favor of a more structured approach to the language.

The type system

Lisps are typically dynamically typed languages. There are some dialects that implement type hints or optional typing, but not being statically typed by default have some implications, specially when thinking about code generation. Additionally, type checking may be necessary at runtime, depending on compiler optimizations, which also can hinder performance.

Personally I like the ergonomics of dynamic typing, but I certainly prefer static typing, specially when refactoring a large codebase. I want this language to be statically typed but with a strong inference system. Potentially a Hindley–Milner type system could be set in place, but at first I think is best to use a simple type system without parametric polymorphism.

Colon syntax

The type of a variable or primitive can be specified with colon syntax. The colon character : starts a type definition. The next non-whitespace character starts the type name, which continues until a non whitespace character is found, meaning :u64 and : u64 are all equivalent.

For example, for defining a u16 variable foo containing the value 10:

(def foo:u16 10)

Once the variable has been defined, updating the value doesn’t require the type annotation. Additionally, omitting the type information is legal as long as the type can be inferred:

(def foo 7)

For functions and lambdas the arguments and return value types can be specified:

(fun foo (a: u16 b: u32): u32
    ...
    )

(lambda(a: u16 b: u32):u32 ...)

Base types

The language should support differently sized signed and unsigned integers (s8, s16, s32, s64, u8, u16, u32, u64). Similarly, IEEE floating point values should also be included (f32, f64). Other numeric types could also be supported in the future, such as big numbers, complex or rationals. The full numeric tower as described by Scheme should be considered, and it may be implemented at a later date, but it has low priority for the initial implementation stages. It may be desirable to include some shorthand for signed or unsigned integers that fit best the architecture (akin to size_t or ssize_t in C). When parsing numeric types by default base 10 is used, but we should support hexadecimal 0xaabb and binary 0b11001101 notation. Potentially scientific notation as well (1e10) but with lower priority.

The language should support UTF-8 encoded characters, so we may need some sort of rune type. The initial implementation will just use the ASCII range for characters using u8 as the base type.

Booleans are to be included, either as a single type that can take only two values or as two distinct singleton types true and false.

The null type nil may also be included to express the lack of value for different operations.

Arrays

Arrays are extremely important, specially in modern CPUs, where cache locality have a massive impact on performance. It makes sense to support an array type that is ergonomic to use. We could support multi-dimensional arrays, but we should start the implementation from 1D arrays and go from there. We can differentiate between static arrays (arr) and dynamic arrays (darr). Static arrays contain exclusively the data, whereas dynamic arrays also require a current size and capacity and may grow if pushing elements when full.

struct DArray {
    size_t size;
    size_t capacity;
    u8 *data;
};

struct Array {
    u8 data[array_size];
};

We should be able to easily access array elements using bracket syntax with an index both for reading (my-array[10]) or updating values ((set! my-array[10] value)). Arrays can be created with a primitive procedure that includes the array length: (make-array u64 10)

Strings

The string type should be immutable and store their length along with the string data. We may want to consider separating strings into long strings and short strings. Short strings could be stored in a single 64bit register, whereas long strings require a pointer indirection. Strings should store their length with the data to avoid having to traverse the string to find it. In broad strokes a string implementation in C could be defined as:

struct String {
    size_t length;
    u8 *data;
};

struct ShortString {
    // This is overkill, since for ASCII the maximum length is 8. Bear in mind
    // this is only an example, a more sophisticated implementation of short
    // strings can be used.
    u8 length;
    u8 data[7];
};

Structures

We should allow the user to define new types derived from base types or previously defined user types. When defining a struct, all fields should have a type annotation. For example to create the user type my-type with a u16 filed named a and a f32 field named b:

(struct my-type
    a: u16
    b: f32)

To zero initialize a struct type and assign it to variable x:

(def x:my-type {})

If you want to manually initialize one field while zero initializing the rest:

(def x:my-type {.b 0.8})

To initialize all fields:

(def x:my-type {.a 10 .b 30.8})

We may want to add a way of specifying default values for fields instead of zero initializing them, for example:

(struct my-type
    a: u16 ! 42
    b: f32 ! 0.1)

; or

(struct my-type
    a 42: u16
    b 0.1: f32)

I haven’t settled yet on the best approach for this, so it will have lower priority on the initial implementation.

We also should be able to print structs by default and accessing fields uses dot notation:

(print x)
; -> {.a 10 .b 30.8}

(print x.a)
; -> 10

We may want to allow unions as well as structs but this may change depending on how we evolve the type system.

Pointers

I would like to include pointers for direct memory manipulation, but would like to have a mechanism for avoiding out of bounds access. I’m not sure yet what is the best approach for this or if this is even realistic. For the time being, I think we should be able to create a pointer by taking the address of an existing variable using the @ operator.

(def foo:u32 42)
(def bar @foo) ; bar is now a pointer to u32

To specify a pointer with colon syntax we use the @ symbol before the type. In the previous example, bar is of type :@u32. Without inference:

(def bar:@u32 @foo)

Setting a value:

(print foo) ; -> 42
(set! bar 10)
(print foo) ; -> 10

If we want to set the value at index * sizeof(type):

(def a (make-array u16 5))
(def baz @a) ; Note that it is pointing to the data section, not the array object.
(set! baz[2] 10)
(print a) ; -> [0 0 10 0 0]

The way of operating with pointers is subject to change, since I’m still figuring things out.

Execution models (VM vs native and the “computer in a box”)

Ideally the compiler should support compilation to bytecode (to be executed by a virtual machine) and generation of native code for x86_64 and arm/aarch64. VMs are flexible, can be easily ported and are easier to debug than native code, but they incur a performance overhead and may make more difficult the distribution of standalone executables. I think that having VM that includes a graphics/input/io system is very valuable for quick sketchs and/or utilities. Something akin to uxn and the vavara VM is very interesting to me.

The choice of intermediate representations may be informed by the type of execution model. In the end I would like to have an optimizing compiler, and it seems the use of three-address code (TAC) is widely used for this purpose. On the other hand I’m more familiar with stack-based VMs and code generation and have already written some prototypes in this form for bdl. Should I implement a stack VM or register based? Seems like going from TAC IR to stack bytecode is simpler than the contrary, but TAC maps better for native compilation. Hopefully I’ll make up my mind soon as to what to do for this topic. I need to finish some more books and find more resources with pros and cons for both approaches.

To close or not to close

Implementing closures have been a pain, but also satisfying. I’m still debating if their inclusion is justified for my usual programming workflow. I floated this question on the fedi and people smarter than me suggested that they should be free to implement if we are performing escape analysis, which we may have to do anyway when using pointers. Specifically, the question, is if we should solve the “upwards-funarg problem”. Ultimately, I may choose not to implement closures, but once the rest of the infrastructure is in place I could reconsider.

Memory models

Seems like all memory models have their disadvantages. I’m still debating if I should opt for memory management, tracing garbage collection or automatic reference counting. Seems like all of them have their disadvantages, but I also don’t want to be worrying about memory in the same way as I do with C. Maybe a hybrid model would be best, in which GC or ARC is enabled by default but we could chose to disable it in favor of manual memory management.

Conclusion

This entry is more of a casual design document than an actual tutorial of any kind, but I hope I can use it as a way of clarifying my thoughts on this language and what I want or am prepared to tackle. If you have some comments on this article, please hit me up on the fedi, I would love to hear your opinion. Until next time!