badd10de.dev

BDL: Part 6 (Dynamic arrays and hash tables)


Introduction

If you are following along, we just finalized a simple garbage collector for our language. At this point we have a fully functional language, so I decided to have some fun and implement a (slow) version of the rule 110 cellular automaton:

(def sequence (list 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1))

(fun print-line (seq)
    (if (or (nil? seq) (= 0 (car seq)))
        (print "  ")
        (print "##"))
    (if (not (nil? seq)) (print-line (cdr seq))))

(fun rule-110 (seq n)
    (fun new-generation (partial-seq)
        (def a (if (nil? partial-seq) 0 (car partial-seq)))
        (def b (if (or (nil? partial-seq)
                    (nil? (cdr partial-seq)))
                    0
                    (car (cdr partial-seq))))
        (def c (if (or (nil? partial-seq)
                    (nil? (cdr partial-seq))
                    (nil? (cdr (cdr partial-seq))))
                    0
                    (car (cdr (cdr partial-seq)))))
        (def number (+ (* a 2 2) (* b 2) c))
        (def cell (cond ((= number 0) 0)
                        ((= number 1) 1)
                        ((= number 2) 1)
                        ((= number 3) 1)
                        ((= number 4) 0)
                        ((= number 5) 1)
                        ((= number 6) 1)
                        ((= number 7) 0)))
        (if (nil? partial-seq)
            ()
            (cons cell (new-generation (cdr partial-seq)))))
    (print-line seq)
    (newline)
    (def seq (cons 0 (new-generation seq)))
    (if (= n 0)
        ()
        (rule-110 seq (- n 1))))

This function should print something like this:

                                 ##
                               ####
                             ######
                           ####  ##
                         ##########
                       ####      ##
                     ######    ####
                   ####  ##  ######
                 ##############  ##
               ####          ######
             ######        ####  ##
           ####  ##      ##########
         ##########    ####      ##
       ####      ##  ######    ####
     ######    ########  ##  ######
   ####  ##  ####    ##########  ##

During this implementation I realized that there were a couple of bugs in our code. First of all, we want to be able to pass the empty list nil as an argument to functions. This is a small change, but another issue appeared: When we have nested functions, we can’t shadow the parameter names or things don’t work properly. For example:

(fun rule-110 (seq n)
    (fun new-generation (seq)
...

In this case, if we use seq as the name for the new-generation parameter, the recursive call (cons cell (new-generation (cdr seq))) will not resolve the binding properly, using the seq variable from the previous call instead. I decided not to worry about this bug for now, but we should be aware that it is an issue. Soon we will start working on a second interpreter, moving from a tree-walking interpreter to a bytecode one, so this issue is probably better handled at that stage.

In preparation for that part, I decided to clean things up a little by adding header files for all C files, since we have a number of circular references and I don’t want to be forward declaring functions a million times. I don’t have much to say about that, since it just consists of adding some declarations and comments and moving some things around. Instead, today we are going to work on a couple of reusable data structures in C that we can use to clean up the current interpreter and to prepare for what comes next.

A generic dynamic array

You are already familiar with these: growable arrays that we can use for storing stacks, lists of tokens, etc. We have a bunch of type-specific implementations scattered around, but I would like to unify them to avoid duplicating the same functions all over. First, let’s talk about usage. I want an ergonomic way of using this structure, as it is very common. The easier it is to use, the more I will use it when appropriate. Here is an example:

#include "darray.h"

typedef struct Person {
    size_t age;
    size_t height;
} Person;

int main(void) {
    // Initialization.
    Person *person = NULL;
    array_init(person, 1);

    // Adding elements.
    Person p1 = {40, 180};
    Person p2 = {10, 130};
    Person p3 = {82, 176};
    array_push(person, p1);
    array_push(person, p2);
    array_push(person, p3);

    // Accessors.
    printf("size: %ld\n", array_size(person));
    printf("cap: %ld\n", array_cap(person));
    printf("p1: %ld %ld\n", person[0].age, person[0].height);
    printf("p2: %ld %ld\n", person[1].age, person[1].height);
    printf("p3: %ld %ld\n", person[2].age, person[2].height);

    // Pop.
    Person p4 = array_pop(person);
    printf("p4: %ld %ld\n", p4.age, p4.height);
    for (size_t i = 0; i < array_size(person); i++) {
        printf("i: %ld age: %ld\n", i, person[i].age);
    }

    // Free memory.
    array_free(person);
    return 0;
}

As you can see, we have a struct of an arbitrary size. A pointer to that struct is initialized as NULL and then passed to the array_init function along with the number of initial elements we want to store. We can then add any number of elements with array_push and the array will grow as needed. We can access individual array elements with the arr[i] notation. The array_free function is used to properly release the memory; note that we can’t use free directly, for reasons we will discuss in the implementation details. Finally, we have array_pop to use the structure as a stack, plus the accessor functions array_size and array_cap.

There are a couple of ways of achieving this. For example, we could have an array type that contains the size, the capacity, the type size, and an array of bytes, like so:

typedef struct Array {
    size_t size;
    size_t cap;
    size_t type_size;
    char *bytes;
} Array;

But this would force all our array types to be described in that manner, so instead of:

typedef struct Thing {
    Array foo;
    Array bar;
    Array baz;
} Thing;

I prefer to have something like this, since it gives you some more information at first glance:

typedef struct Thing {
    size_t *foo;
    bytes *bar;
    OtherThing *baz;
} Thing;

A small trick can be used to get a better API. We can embed the size and capacity at the beginning of the allocation, and have the user-facing pointer point to the data stored immediately after them:

bytes
|0|1|2|3|4|5|6|7|8|9|a|b|c|d|e|f| | | | | | | | | | | | | |
 ^size           ^capacity       ^data-pointer

We need to be careful to keep our data properly memory aligned, and when freeing the data we need to free the memory from the beginning of the allocation (the size field), not from the data pointer. We start by defining the array header and some macros to access the fields of an array stored this way:

typedef struct ArrayHeader {
    size_t size;
    size_t cap;
} ArrayHeader;

#define array_head(ARR) ((ArrayHeader *)((char *)(ARR) - sizeof(ArrayHeader)))
#define array_size(ARR) ((ARR) ? array_head(ARR)->size : 0)
#define array_cap(ARR)  ((ARR) ? array_head(ARR)->cap : 0)

Pretty self-explanatory. Let’s add some more macros for the rest of the interface functions:

#define array_init(ARR,N)  ((ARR) = _array_reserve(N, sizeof(*(ARR))))
#define array_push(ARR, T) \
    ((ARR) = _array_maybe_grow(ARR, sizeof(T)), \
     (ARR)[array_head(ARR)->size++] = (T))
#define array_pop(ARR)   (ARR)[--array_head(ARR)->size]
#define array_free(ARR) ((ARR) ? free(array_head(ARR)), (ARR) = NULL : 0)

This can be a bit confusing to read, but hopefully it is clear enough. The key is to avoid passing the element size explicitly at every call site, since that can get quite tedious, so these macros give us a cleaner interface. The only things left are the initialization function and _array_maybe_grow, which grows the array if needed and otherwise leaves it alone.

static inline void *
_array_reserve(size_t num_elem, size_t type_size) {
    char *p = malloc(num_elem * type_size + sizeof(ArrayHeader));
    p += sizeof(ArrayHeader);
    array_head(p)->size = 0;
    array_head(p)->cap = num_elem;
    return p;
}

static inline void *
_array_maybe_grow(void *arr, size_t type_size) {
    ArrayHeader *head = array_head(arr);
    if (head->cap == head->size) {
        if (head->cap == 0) {
            head->cap++;
        } else {
            head->cap *= 2;
        }
        head = realloc(head, head->cap * type_size + sizeof(ArrayHeader));
    }
    arr = (char *)head + sizeof(ArrayHeader);
    return arr;
}

We are also going to add a way of inserting an arbitrary number of bytes into a dynamic array. This is helpful for strings, for example:

#define array_insert(ARR, SRC, N) \
    ((ARR) = _array_insert(ARR, SRC, N, sizeof(*(ARR))))

static inline char *
_array_insert(char *arr, const char *src, size_t n_bytes, size_t type_size) {
    ArrayHeader *head = array_head(arr);
    size_t new_size = n_bytes + head->size;
    if (new_size >= head->cap * type_size) {
        if (head->cap == 0) {
            head->cap = 1;
        }
        while (new_size >= head->cap * type_size) {
            head->cap *= 2;
        }
        head = realloc(head, head->cap * type_size + sizeof(ArrayHeader));
    }
    arr = (char *)head + sizeof(ArrayHeader);
    memcpy((arr + head->size), src, n_bytes);
    head->size = new_size;
    return arr;
}
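
Here is a small usage sketch of array_insert building up a string byte by byte. The buffer name and byte counts are just for illustration; note that the second insert also copies the string literal's terminating NUL, so the buffer can be printed as a regular C string:

#include <stdio.h>

#include "darray.h"

int main(void) {
    char *buf = NULL;
    array_init(buf, 8);
    array_insert(buf, "hello, ", 7);
    array_insert(buf, "world!", 7);  // 7 bytes: includes the terminating NUL.
    printf("%s\n", buf);             // -> hello, world!
    array_free(buf);
    return 0;
}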

And that is all! This shouldn’t look much different from what you are used to seeing thus far. We can add this file to the project and do a small refactor to remove redundant code. For example:

if (env->buf == NULL) {
    env->size = 0;
    env->cap = ENV_BUF_CAP;
    env->buf = malloc(env->cap * sizeof(EnvEntry));
} else if (env->size == env->cap) {
    env->cap *= 2;
    env->buf = realloc(env->buf, env->cap * sizeof(EnvEntry));
}
env->buf[env->size++] = (EnvEntry){symbol, value};

becomes:

array_push(env->entries, entry);

Hash tables for objects

In some cases we have been using fairly inefficient data structures. For example, environments are better stored as hash tables than as dynamic arrays. We care about lookup speed; with small arrays this is fine, but as we add more and more variables or have to perform more lookups, it becomes important to go from O(n) to O(1). Hash tables are the perfect structure for this kind of associative relationship, mapping symbols to values. In C we don’t have a readily available hash table implementation, but soon this will change, as we will make one ourselves.

Much ink has been spilled over different advances in hash tables for many a year. Here I’m not going to give a very detailed explanation of them, just some basic concepts and a simple implementation.

As previously mentioned, a hash table maps a key to a value in an efficient way. But what constitutes a key or a value can have different meanings. For example, we could map int to string, string to int, string to string, etc. For our purposes we could make it Object -> Object or Object * -> Object *. However, we should be able to build a generic hash table that takes any type as either key or value, symbolically K -> V, provided a type-specific hash function is given.

The first decision we need to make is whether we are going to store things by value or by reference. That is, do we keep an internal copy of the key-value pairs, or pointers to the original values? For simplicity we will use the latter approach, since it’s mostly how we have been approaching things in this project. Be aware that this can have some performance benefits, but also dangerous drawbacks if the pointers ever get invalidated or freed.

A hash table can be thought of as a number of buckets (or slots) in which to store our key-value pairs, plus a hash function that tells us which bucket to use. For example, let’s say that our keys and values are both integers between 1 and 20 and our table contains 10 buckets. If we use the modulo operation as the hash function, we would have these assigned slots for the following sequence of insertions:

OPERATION          HASH       SLOT
insert({9, V})  -> 9 % 10  -> slot: 9
insert({13, V}) -> 13 % 10 -> slot: 3
insert({16, V}) -> 16 % 10 -> slot: 6
insert({2, V})  -> 2 % 10  -> slot: 2

I’m betting you can already spot one issue with this: what happens if we try to insert both 9 and 19? With that hashing function they will both go to slot number 9! We could use a different hashing function of course, but inevitably we will run into this issue as long as the number of buckets is smaller than the number of possible keys. Since allocating 2^64 slots is far from ideal, what we actually need is some way of dealing with hash collisions.

The two main strategies for handling collisions are “chaining” and “open addressing”. Briefly, chaining deals with collisions by storing a linked list of elements in each bucket. With open addressing, if a bucket is full, we try a different slot using some heuristic and insert the item there instead. You can read more about these strategies on the Wikipedia page for hash tables.

For now we will implement a simple open addressing scheme with linear probing, which can later be extended with optimizations such as Robin Hood hashing, one of the contenders for the fastest general-purpose hash tables.
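
To make the linear probing idea concrete, here is what inserting 19 into the earlier example would look like: slot 9 is already taken, so we probe forward, wrapping around to the beginning of the table:

OPERATION          HASH       SLOT
insert({19, V}) -> 19 % 10 -> slot: 9 (occupied) -> probe -> slot: 0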

One more issue we have to consider is that even with linear probing, the backing array will be full eventually. We will handle this by growing the table if the load factor (size / capacity) is above a certain threshold.

API

For now let’s imagine we have the following use case:

int main(void) {
    // Initialize GC.
    gc_init();

    // Initialize key-value objects.
    Object *k1 = MAKE_SYM("Alice");
    Object *k2 = MAKE_SYM("Bob");
    Object *k3 = MAKE_SYM("Dog");
    Object *v1 = make_fixnum(10);
    Object *v2 = make_fixnum(49);
    Object *v3 = make_fixnum(333);

    // Initialize hash table.
    HashTable *table = ht_init();

    // Add some key-value pairs.
    ht_insert(table, k1, v1);
    ht_insert(table, k2, v2);
    ht_insert(table, k3, v3);

    // Test lookups.
    Object *alice_val = ht_lookup(table, k1);
    Object *bob_val = ht_lookup(table, MAKE_SYM("Bob"));
    Object *dog_val = ht_lookup(table, k3);

    // Verify things work as expected.
    if (v1 == alice_val) {
        printf("Alice match!\n");
    }
    if (v2 == bob_val) {
        printf("Bob match!\n");
    }
    if (v3 == dog_val) {
        printf("Dog match!\n");
    }

    ht_free(table);
    return 0;
}

We can perform object insertions and lookups in our table. Later we will talk about deletion as well. Remember that our table doesn’t store a copy of the key-value objects, only a reference, so if an object gets freed/garbage collected, we may have a dangling reference issue. With the current default GC parameters this should still be fine, however. Note how we test the lookups against the same object, but also with a key that semantically has the same value yet functionally is a different object in the heap: Object *bob_val = ht_lookup(table, MAKE_SYM("Bob"));.

Implementation

We can start with defining the structure of the table and key value pairs:

typedef struct HashTablePair {
    Object *key;
    Object *value;
} HashTablePair;

typedef struct HashTable {
    HashTablePair *pairs;
} HashTable;

We will later add a couple of things to our HashTable struct, so if it looks barren or unnecessary right now, that is why. We initialize the table as follows, making sure the pairs start out as NULL values.

HashTable *
ht_init(void) {
    HashTable *table = malloc(sizeof(HashTable));
    table->pairs = NULL;
    array_init(table->pairs, HT_MIN_CAP);
    for (size_t i = 0; i < array_cap(table->pairs); i++) {
        table->pairs[i] = (HashTablePair){NULL, NULL};
    }
    return table;
}
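
HT_MIN_CAP is simply the initial number of slots. A reasonable assumed definition is shown below; it should be a power of two so that the capacity stays a power of two after doubling, which the hashing scheme later on relies on:

// Assumed initial capacity. Keep it a power of two so that doubling the
// table preserves the power-of-two capacity needed by the hash below.
#define HT_MIN_CAP 32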

We create a lookup function that obtains a position from a hashing function ht_hash. We assume that the hashing function will always give us a value that fits within the pairs array, never out of bounds. We iteratively check if the key actually matches the object stored in the given position. If the hash doesn’t correspond to an initialized key, or if we traverse the entire array without finding a proper match, we return NULL. Otherwise, the stored reference to the value is returned.

Object *
ht_lookup(const HashTable *table, const Object *key) {
    size_t position = ht_hash(table, key);
    size_t probe_position = position;

    // Verify the key in that position is the same. If not perform linear
    // probing to find it.
    HashTablePair *pairs = table->pairs;
    while (true) {
        if (pairs[probe_position].key == NULL) {
            return NULL;
        }
        if (obj_eq(pairs[probe_position].key, key)) {
            break;
        }
        if (probe_position == array_cap(pairs) - 1) {
            probe_position = 0;
        } else {
            probe_position++;
        }
        if (probe_position == position) {
            return NULL;
        }
    }
    return pairs[probe_position].value;
}

The insertion is divided into two parts: checking if we must grow the hash table, and actually inserting the given pair. The insertion follows the same pattern as the lookup; we just need to make sure that we increase the table size when inserting a new element, not when updating an existing one.

void
_ht_insert(HashTable *table, const Object *key, const Object *value) {
    size_t position = ht_hash(table, key);
    size_t probe_position = position;

    // Verify the key in that position is free. If not, use linear probing to
    // find the next free slot.
    HashTablePair *pairs = table->pairs;
    while (true) {
        if (pairs[probe_position].key == NULL) {
            array_head(pairs)->size++;
            break;
        }
        if (obj_eq(pairs[probe_position].key, key)) {
            break;
        }
        if (probe_position == array_cap(pairs) - 1) {
            probe_position = 0;
        } else {
            probe_position++;
        }
    }
    pairs[probe_position].key = (Object *)key;
    pairs[probe_position].value = (Object *)value;
}

void
ht_insert(HashTable *table, const Object *key, const Object *value) {
    _ht_maybe_grow(table);
    _ht_insert(table, key, value);
    return;
}

Growing the table happens when the load factor surpasses the selected threshold. A new array is allocated with double the capacity of the previous one. It is then cleared, and the key-value pairs from the previous array get rehashed into the updated table. Finally, the previous array is freed.

void
_ht_maybe_grow(HashTable *table) {
    HashTablePair *pairs = table->pairs;
    if (ht_load_factor(table) < HT_LOAD_THRESHOLD) {
        return;
    }

    // Create a new array with 2x capacity.
    table->pairs = NULL;
    array_init(table->pairs, array_cap(pairs) * 2);
    for (size_t i = 0; i < array_cap(table->pairs); i++) {
        table->pairs[i] = (HashTablePair){NULL, NULL};
    }

    // Hash everything in the table for the new array capacity.
    for (size_t i = 0; i < array_cap(pairs); i++) {
        if (pairs[i].key != NULL) {
            _ht_insert(table, pairs[i].key, pairs[i].value);
        }
    }

    // Free the old array.
    array_free(pairs);
}
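
The load factor check relies on a small helper and a threshold constant. Here is a minimal sketch of both; the exact value of HT_LOAD_THRESHOLD is an assumption, with anything around 0.5-0.8 being a common choice:

// Assumed threshold: grow once the table is roughly 70% full.
#define HT_LOAD_THRESHOLD 0.7

double
ht_load_factor(const HashTable *table) {
    return (double)array_size(table->pairs) / (double)array_cap(table->pairs);
}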

The last thing we need to provide is a way of freeing the table and all memory within.

void
ht_free(HashTable *table) {
    if (table == NULL) {
        return;
    }
    if (table->pairs != NULL) {
        array_free(table->pairs);
    }
    free(table);
}

But what about the hash function? Here is an example of the simplest hash function we can use:

uint64_t
ht_hash(const HashTable *table, const Object *key) {
    return 0;
}

Yeah, this is pretty dumb, since every single thing we look up or insert will collide (except the first element). However, with this function we can check that the test code works properly. If we initialize a table with a capacity of 2, we can also test whether growing the table works properly after inserting 3 elements. But we can obviously do better!

A lot of people invested a lot of time trying to find fast, well behaved and uniform hash functions. For a hash table there are two things we need:

  1. A hash function that outputs a large random-looking number (possibly u64).
  2. A way of mapping a large number into a smaller value (constrained to the capacity of the hash table).

For the first point, a lot of options are available, but personally I like to use a simple xor + circular shift hash function for large values, and the identity function for data that fits in 64 bits. More on that later.

Mapping a large number into a small value is typically done with the modulo operator, which can be far from ideal since it can be slow. Alternatively, if the table capacity is a power of two, we can just mask off the upper bits of the hash with a bitwise AND (hash & (cap - 1)), which is equivalent to the modulo operation. I learned of a different method in an article by Malte Skarupke, which points out that the Fibonacci hashing proposed by Donald Knuth in The Art of Computer Programming can be used for this purpose. I like this approach: multiplicative Fibonacci hashing is not very good by itself, but it can be used as a finalizer for a simpler hash function (like the previously mentioned xor-shift function), resulting in comparable insertion and lookup performance.
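
As a quick sanity check of the masking trick, take an arbitrary hash value and a power-of-two capacity:

hash = 1234, cap = 16 = 2^4
1234 % 16       = 2
1234 & (16 - 1) = 1234 & 0b1111 = 2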

I’ve tested this technique for different types of data in other hash table implementations and performance tends to be excellent, which is why I’m also using it here. Someday I may post a number of benchmarks comparing this method to alternative solutions in C and C++, but that will have to wait. This places the restriction that our hash table has to grow by powers of two, so be aware of that when initializing the table. Here is the Fibonacci hash:

static inline uint64_t
_fibonacci_hash(uint64_t hash, size_t shift_amount) {
    return (hash * UINT64_C(11400714819323198485)) >> (64 - shift_amount);
}

The shift_amount is the base-2 exponent of the capacity. For example, if the capacity is 1024:

cap := 1024 = 2 ^ 10, shift_amount := 10

We don’t want to be calculating this for every hash, so we store the result in the hash table struct:

typedef struct HashTable {
    HashTablePair *pairs;
    uint8_t shift_amount;
} HashTable;

Since we always double the capacity, we just need to increment the shift amount by one every time we grow the table:

    ...
    // Create a new array with 2x capacity.
    table->pairs = NULL;
    array_init(table->pairs, array_cap(pairs) * 2);
    for (size_t i = 0; i < array_cap(table->pairs); i++) {
        table->pairs[i] = (HashTablePair){NULL, NULL};
    }
    table->shift_amount++;
    ...
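
For this to work, ht_init also needs to seed shift_amount so that it matches HT_MIN_CAP. A minimal sketch, assuming HT_MIN_CAP is a power of two:

    // In ht_init: derive the initial shift_amount from HT_MIN_CAP.
    // For HT_MIN_CAP == 32 this yields shift_amount == 5.
    table->shift_amount = 0;
    for (size_t cap = HT_MIN_CAP; cap > 1; cap >>= 1) {
        table->shift_amount++;
    }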

The only thing left to do is to write the hash function for Objects. As mentioned, we use the xor-shift hash for strings and symbols, and the identity for numbers. For other types we use the pointer address instead. The constant in the string hash is just a bunch of random digits I got from a random number generator; it is not terribly important, and you could just as well use 0xbadd10debadd10de :), though the random digits should work better, I guess. Remember we are not trying to get the best, most uniform hash function, just a simple one that compiles to few instructions while doing its job of scrambling the bits around.

static inline uint64_t
_xor_shift_hash(const char *key, size_t n) {
    uint64_t hash = 0x65d9d65f6a19574f;
    char *last = (char *)key + n;
    while (key != last) {
        hash ^= (uint64_t)*key++;
        hash = (hash << 8) | (hash >> (64 - 8));
    }
    return hash;
}

The hash function for Objects is as follows:

uint64_t
ht_hash(const HashTable *table, const Object *key) {
    uint64_t hash;
    switch (key->type) {
        case OBJ_TYPE_FIXNUM: {
            hash = key->fixnum;
        } break;
        case OBJ_TYPE_STRING: {
            hash = _xor_shift_hash(key->string, array_size(key->string));
        } break;
        case OBJ_TYPE_SYMBOL: {
            hash = _xor_shift_hash(key->symbol, array_size(key->symbol));
        } break;
        case OBJ_TYPE_BOOL:
        case OBJ_TYPE_NIL:
        case OBJ_TYPE_PAIR:
        case OBJ_TYPE_LAMBDA:
        case OBJ_TYPE_PROCEDURE:
        case OBJ_TYPE_ERR: {
            hash = (uintptr_t)key;
        } break;
    }
    hash = _fibonacci_hash(hash, table->shift_amount);
    return hash;
}

And with that we have a fully functioning hash table, although we are currently not supporting deletion. Most of this hash table could be made generic by using void pointers and by storing in the table an equality function and the main hash function:

typedef uint64_t (*HashFunction)(const struct HashTable *table, const void *data);
typedef bool (*EqFunction)(const void *a, const void *b);

typedef struct HashTablePair {
    void *key;
    void *value;
} HashTablePair;

typedef struct HashTable {
    HashTablePair *pairs;
    HashFunction hash_function;
    EqFunction eq_function;
    uint8_t shift_amount;
} HashTable;
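
As a sketch of what the user-provided functions could look like, here is a hypothetical pair for NUL-terminated string keys (str_hash and str_eq are made-up names, reusing the helpers from before); they would be stored in the table at initialization time and used in place of obj_eq and the Object-specific ht_hash:

#include <stdbool.h>
#include <string.h>

// Hypothetical user-supplied functions for string keys.
static uint64_t
str_hash(const HashTable *table, const void *data) {
    const char *s = data;
    return _fibonacci_hash(_xor_shift_hash(s, strlen(s)), table->shift_amount);
}

static bool
str_eq(const void *a, const void *b) {
    return strcmp(a, b) == 0;
}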

We will not go that far yet, but it is a possibility for the future. We can now substitute our environments with these hash tables, which I did for the current version. However! Using hash tables for environments proved to actually be slower than the arrays of the previous implementation, probably because of cache locality, though we may be able to speed things up with other tricks later. I’ll leave it as is for now as a pedagogical example: an O(1) algorithm that is actually slower than O(n)!

Conclusion

In this entry we have focused on a couple of basic data structures implemented in C rather than on language features. Having these in place, however, will make development moving forward much more flexible. In the next entry we will start the skeleton for a bytecode VM interpreter. As usual, you can download the code at this point in time by going to tag v0.7. See you around!