badd10de.dev

Game Boy Advance Programming


Architecture

The GBA contains a 16.78 MHz ARM7TDMI CPU and has 96kB of video memory available. There is no hardware support for floating point or division operations. The GBA is a little endian machine and expects data to be aligned to its natural size (16 bit values on 2 byte boundaries, 32 bit values on 4 byte boundaries). The CPU uses a 3 stage pipeline (fetch, decode and execute), so up to 3 instructions are in flight at once.

Memory sections

area        start       end         length  port-size  description
System ROM  0000:0000h  0000:3FFFh  16kb    32 bit     BIOS memory. You can execute it, but not read it (i.o.w, touch, don't look).
EWRAM       0200:0000h  0203:FFFFh  256kb   16 bit     External work RAM. Is available for your code and data. If you're using a multiboot cable, this is where the downloaded code goes and execution starts (normally execution starts at ROM). Due to the 16bit port, you want this section's code to be THUMB code.
IWRAM       0300:0000h  0300:7FFFh  32kb    32 bit     This is also available for code and data. The 32-bit bus and the fact that it's embedded in the CPU make this the fastest memory section. The 32bit bus means that ARM instructions can be loaded at once, so put your ARM code here.
IO RAM      0400:0000h  0400:03FFh  1kb     16 bit     Memory-mapped IO registers. These have nothing to do with the CPU registers you use in assembly so the name can be a bit confusing. Don't blame me for that. This section is where you control graphics, sound, buttons and other features.
PAL RAM     0500:0000h  0500:03FFh  1kb     16 bit     Memory for two palettes containing 256 entries of 15-bit colors each. The first is for backgrounds, the second for sprites.
VRAM        0600:0000h  0601:7FFFh  96kb    16 bit     Video RAM. This is where the data used for backgrounds and sprites are stored. The interpretation of this data depends on a number of things, including video mode and background and sprite settings.
OAM         0700:0000h  0700:03FFh  1kb     32 bit     Object Attribute Memory. This is where you control the sprites.
PAK ROM     0800:0000h  var         var     16 bit     Game Pak ROM. This is where the game is located and execution starts, except when you're running from a multiboot cable. The size is variable, but the limit is 32 MB. It's a 16bit bus, so THUMB code is preferable over ARM code here.
Cart RAM    0E00:0000h  var         var     8 bit      This is where saved data is stored. Cart RAM can be in the form of SRAM, Flash ROM or EEPROM. Programmatically they all do the same thing: store data. The total size is variable, but 64kb is a good indication.

Source: https://www.coranac.com/tonc/text/hardware.htm#sec-memory

When saving to the Cart RAM (SRAM), we probably want to avoid using the first and last bytes, since they can become corrupted when changing cartridge or powering up.

Putting code and data in the wrong locations can have a dramatic performance impact or crash the program entirely. Also:

Q: How do I put large (> 16 KB) arrays into a GBA program without it crashing?

The linker script included with devkitARM puts arrays and other variables into
IWRAM unless you tell it otherwise. Trouble is IWRAM is only 32 KiB. For arrays
that you don't plan to modify, use the keyword const, which will instruct the
linker to put the entire array into ROM (or EWRAM for .mb programs). For arrays
that you do plan to modify, put them into EWRAM using a section attribute on the
array's definition:

    __attribute__((section (".sbss"))) char foo[8192];
    Puts the variable in EWRAM and initializes it to zero at program start.
    (Initializer values are ignored.)

    __attribute__((section (".ewram"))) char foo[8192] = {3, 4, /*... */ };
    Puts the variable in EWRAM and initializes it to the given values at program
    start. (This uses space in the binary even if initializer values are not
    given.)

Source: https://forum.gbadev.org/viewtopic.php?t=418

Normally we want to put hot code and data into the fastest memory possible (IWRAM in most cases). We can do this with compiler macros, but we should always check that we have enough space available and that the code and data is actually going to the section we wanted. This is a quick one-liner to check where functions and data are located with objdump:

arm-none-eabi-objdump -x [file.elf] | sort | less

For 16bit data, the VRAM should be as fast as the IWRAM, so if needed we can store 16bit LUTs to reap some performance benefits.

Video

The screen is 240x160 pixels and can display 32768 different colors. Colors are 15 bits (xbbbbbgggggrrrrr), with 5 bits for each RGB component (32 shades of red, green and blue, or a range of 0–31). The 16th bit (x) is not used by the display hardware; when working with palettes, transparency is signalled by palette index 0 instead. The screen refresh rate is 59.73 Hz. We can use 4 background layers and one sprite layer. We have 96kb of video memory available (0x06000000–0x06017FFF), plus palette memory (0x05000000–0x050003FF) and OAM memory (0x07000000–0x070003FF).
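As a minimal sketch of the color layout described above, the following helper packs three 5-bit components into a single 15-bit color. The name rgb15 matches the one used in the comments later in these notes, but this particular implementation is only an assumption of how such a helper might look:

```c
#include <stdint.h>

// Pack three 5-bit components (0-31 each) into a 15-bit GBA color.
// Layout: xbbbbbgggggrrrrr (red in the low bits, blue in the high bits).
static inline uint16_t
rgb15(uint32_t red, uint32_t green, uint32_t blue) {
    return (uint16_t)((red & 0x1F) | ((green & 0x1F) << 5) | ((blue & 0x1F) << 10));
}
```

With this layout pure red is rgb15(31, 0, 0), which packs to 0x001F, and pure blue is rgb15(0, 0, 31), which packs to 0x7C00.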

A scanline is composed of the HDraw, where a row of 240 pixels is written to the screen, followed by a pause (HBlank). After drawing 160 scanlines (VDraw), there is another pause (VBlank). To avoid glitchy effects, we want to update the position of sprites during the VBlank pause. The following table shows the timings for the different screen drawing phases:

subject   length        cycles
pixel     1             4
HDraw     240px         960
HBlank    68px          272
scanline  Hdraw+Hbl     1232
VDraw     160*scanline  197120
VBlank    68*scanline   83776
total     VDraw+Vbl     280896

Source: https://www.coranac.com/tonc/text/video.htm
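The numbers in the timing table can be derived from just three values: 4 cycles per pixel, 308 pixels per scanline (240 + 68) and 228 scanlines per frame (160 + 68). The sketch below reproduces the total and the refresh rate from them. It assumes a CPU clock of exactly 2^24 Hz (about 16.78 MHz), and all the names are made up for illustration:

```c
#include <stdint.h>

// Assumed CPU clock: 2^24 Hz (about 16.78 MHz).
#define CPU_HZ            16777216u
#define CYCLES_PER_PIXEL  4u
#define PIXELS_PER_LINE   (240u + 68u)   // HDraw + HBlank
#define LINES_PER_FRAME   (160u + 68u)   // VDraw + VBlank

// Number of CPU cycles needed to draw one full frame.
static inline uint32_t
cycles_per_frame(void) {
    return CYCLES_PER_PIXEL * PIXELS_PER_LINE * LINES_PER_FRAME;
}

// Refresh rate in millihertz (truncated), using integer math only since
// the GBA has no floating point hardware.
static inline uint32_t
refresh_rate_millihertz(void) {
    return (uint32_t)(((uint64_t)CPU_HZ * 1000u) / cycles_per_frame());
}
```

cycles_per_frame() gives the 280896 cycles from the table, and the refresh rate comes out at 59727 mHz, which rounds to the 59.73 Hz mentioned earlier.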

Two palettes are available, one for backgrounds (0x05000000–0x050001FF) and one for sprites (0x05000200–0x050003FF). Each palette has 256 entries of u16 colors for a total of 512 bytes per palette. There are two different ways of using palettes: the first is to use the entire palette for a 256 color range, and the second is to have 16 sub-palettes (palette banks) of 16 colors each.

We can draw to the screen via bitmaps, tiled backgrounds or sprites. If using a bitmap mode (like Mode 3) we can draw pixels to the screen by writing to ((vu16*) MEM_VRAM), but trying to fill the screen with this method is too slow for practical purposes. Normally it is better to use tiled backgrounds, where 8x8 tiles are copied to one part of the video memory and then a tile map containing the indexes of the tiles for the screen is sent to a different region. In tiled background modes, we send a maximum of a 30x20 map of numbers each frame. Sprites are 8x8 to 64x64 objects that are meant to be used in conjunction with tiled backgrounds or bitmaps, and they can be transformed independently of each other.

Note that the GBA’s screen may require gamma correction to display its colors correctly, since the LCD tends to be quite dark. A simple formula for this is the “affine brightness correction”, which can be applied on each R, G, and B component with:

component = component * 3 / 4 + 8
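A minimal sketch of how the formula could be applied to a full 15-bit color, one component at a time. The function names are hypothetical. Note that the result never overflows the 5-bit range, since the maximum input of 31 maps to 31 * 3 / 4 + 8 = 31:

```c
#include <stdint.h>

// Apply component * 3 / 4 + 8 to a single 5-bit component (0-31).
static inline uint32_t
brighten_component(uint32_t component) {
    return component * 3 / 4 + 8;
}

// Apply the correction to each component of a xbbbbbgggggrrrrr color.
static inline uint16_t
brighten_color(uint16_t color) {
    uint32_t red   = brighten_component(color & 0x1F);
    uint32_t green = brighten_component((color >> 5) & 0x1F);
    uint32_t blue  = brighten_component((color >> 10) & 0x1F);
    return (uint16_t)(red | (green << 5) | (blue << 10));
}
```

We would probably want to run this once over a palette when loading it, rather than per pixel.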

Display Control Register

The display control register (DISP_CTRL) located at 0x04000000 controls the properties of the screen and selects different video modes.

bits  name          description
0-2   Mode          Sets video mode. 0, 1, 2 are tiled modes; 3, 4, 5 are bitmap modes.
3     GB            Is set if cartridge is a GBC game. Read-only.
4     PS            Page select. Modes 4 and 5 can use page flipping for smoother animation. This bit selects the displayed page (and allowing the other one to be drawn on without artifacts).
5     HB            Allows access to OAM in an HBlank. OAM is normally locked in VDraw. Will reduce the amount of sprite pixels rendered per line.
6     OM            Object mapping mode. Tile memory can be seen as a 32x32 matrix of tiles. When sprites are composed of multiple tiles high, this bit tells whether the next row of tiles lies beneath the previous, in correspondence with the matrix structure (2D mapping, OM=0), or right next to it, so that memory is arranged as an array of sprites (1D mapping OM=1). More on this in the sprite chapter.
7     FB            Force a screen blank.
8-C   BG0-BG3, Obj  Enables rendering of the corresponding background and sprites.
D-F   W0-OW         Enables the use of windows 0, 1 and Object window, respectively. Windows can be used to mask out certain areas (like the lamp did in Zelda:LTTP).

Source: https://www.coranac.com/tonc/text/video.htm
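As a rough sketch, a DISP_CTRL value can be composed by OR-ing the video mode with the desired flags. The macro and function names below are hypothetical. The bit positions are the ones documented for this register, with the background enables in bits 8 to 11 and the object enable in bit 12:

```c
#include <stdint.h>

// Hypothetical flag names for the DISP_CTRL bits.
#define DISP_PAGE_SELECT  (1u << 4)
#define DISP_OBJ_1D       (1u << 6)
#define DISP_FORCE_BLANK  (1u << 7)
#define DISP_BG(n)        (1u << (8 + (n)))   // n = 0..3
#define DISP_OBJ          (1u << 12)

// Combine a video mode (0-5) with a set of flags.
static inline uint16_t
disp_ctrl_value(unsigned mode, uint16_t flags) {
    return (uint16_t)((mode & 0x7) | flags);
}
```

For example, disp_ctrl_value(3, DISP_BG(2)) selects mode 3 with background 2 enabled, which is the usual setup for drawing bitmaps.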

Display Status Register

The display status register (DISP_STATUS) located at 0x04000004 can be used to obtain information about the current state of the screen or to request certain interrupts.

bits  name  description
0     VbS   VBlank status, read only. Will be set inside VBlank, clear in VDraw.
1     HbS   HBlank status, read only. Will be set inside HBlank.
2     VcS   VCount trigger status. Set if the current scanline matches the scanline trigger ( DISP_VCOUNT == DISP_STATUS{8-F} )
3     VbI   VBlank interrupt request. If set, an interrupt will be fired at VBlank.
4     HbI   HBlank interrupt request.
5     VcI   VCount interrupt request. Fires interrupt if current scanline matches trigger value.
8-F   VcT   VCount trigger value. If the current scanline is at this value, bit 2 is set and an interrupt is fired if requested.

Source: https://www.coranac.com/tonc/text/video.htm
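To illustrate how the trigger value and the request bits fit together, here is a small sketch that computes a new DISP_STATUS value requesting a VCount interrupt at a given scanline. The names are hypothetical, and the function only computes the value; writing it to the register is left to the caller:

```c
#include <stdint.h>

// Hypothetical names for the interrupt request bits of DISP_STATUS.
#define DISP_STAT_VBLANK_IRQ  (1u << 3)
#define DISP_STAT_HBLANK_IRQ  (1u << 4)
#define DISP_STAT_VCOUNT_IRQ  (1u << 5)

// Return a DISP_STATUS value that requests a VCount interrupt at the given
// scanline, keeping whatever was already set in the low byte.
static inline uint16_t
disp_status_with_vcount(uint16_t status, uint8_t scanline) {
    return (uint16_t)((status & 0x00FFu)            // Drop the old trigger.
                      | ((uint32_t)scanline << 8)   // Trigger in bits 8-F.
                      | DISP_STAT_VCOUNT_IRQ);
}
```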

Scanline Counter Register

The scanline counter register (DISP_VCOUNT), located at 0x04000006, stores in its lower bits the scanline that is currently being processed. Note that the count goes from 0 to 227, since it covers the VBlank as well (scanlines 0-159 are drawn to the screen, whereas 160-227 fall inside the VBlank).

This can be used to implement a locked framerate by waiting for the VBlank before we perform updates:

static inline void
wait_vsync() {
    while(DISP_VCOUNT >= 160);
    while(DISP_VCOUNT < 160);
}

This method is discouraged, since it busy-waits, wasting CPU cycles and thus battery life.

Bitmap modes (mode 3, 4 and 5)

If we select one of the bitmap modes and enable DISPLAY_BG_2, we can blit to the screen directly by writing data to MEM_VRAM. Here is an example of using a macro for easy access to the x and y coordinates of the screen in mode 3:

// The GBA in mode 3 expects rgb15 colors in the VRAM, where each component
// (RGB) has a 0--31 range. For example, pure red would be rgb15(31, 0, 0).
typedef u16 Color;

// We can treat the screen as a HxW matrix. With the following macro we can
// write a pixel to the screen at the (x, y) position using:
//
//     FRAMEBUFFER[y][x] = color;
//
typedef Color Scanline[SCREEN_WIDTH];
#define FRAMEBUFFER ((Scanline*)MEM_VRAM)

The difference between the bitmap modes is that they have different width, height, bits per pixel (bpp), and the possibility of page-flipping. Modes 3 and 4 have a screen resolution of 240x160, with respective bpp of 16 and 8. Mode 4 supports page-flipping, whereas mode 3 doesn’t. Mode 5 has a 160x128 resolution at 16bpp, with page-flipping support. Essentially the size of the bitmap (in bytes) is width * height * bpp/8.
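The size formula also explains which modes can page-flip. A quick sketch, assuming that a page has to fit in the 0xA000 bytes that precede the second page at 0x0600A000 (function and macro names are hypothetical):

```c
#include <stddef.h>

// Bytes available before the second page starts (0x0600A000 - 0x06000000).
#define PAGE_SIZE_BYTES  0xA000u

// Size of a bitmap in bytes: width * height * bpp / 8.
static inline size_t
bitmap_size_bytes(size_t width, size_t height, size_t bpp) {
    return width * height * bpp / 8;
}

// Nonzero if a bitmap of the given dimensions fits in a single page.
static inline int
fits_in_page(size_t width, size_t height, size_t bpp) {
    return bitmap_size_bytes(width, height, bpp) <= PAGE_SIZE_BYTES;
}
```

Mode 3 needs 76800 bytes, so it can't fit two pages, whereas mode 4 (38400 bytes) and mode 5 (40960 bytes) both fit.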

In Mode 4 the buffer uses 8 bits per pixel instead of 16. We can’t write the color directly; instead each pixel is an index into the palette memory at MEM_PAL, where the colors are stored. Note that in this mode MEM_PAL[0] is the background color. Because VRAM doesn’t support single-byte writes, we can’t write a u8 into it directly; instead we need to read a u16 halfword, mask and OR the corresponding bits, and write the updated u16 back.

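static inline void
put_pixel_m4(int x, int y, u8 col_index) {
    // Two 8-bit pixels share each u16 entry of the buffer.
    int buffer_index = (y * SCREEN_WIDTH + x) / 2;
    u16 *destination = &SCREEN_BUFFER[buffer_index];
    int odd = x & 0x1;
    if(odd) {
        // Odd pixels live in the high byte: keep the low byte.
        *destination = (*destination & 0xFF) | (col_index << 8);
    } else {
        // Even pixels live in the low byte: keep the high byte.
        *destination = (*destination & ~0xFF) | col_index;
    }
}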

Other limitations of bitmap modes are that they can only use one background layer, they have no hardware scrolling, and the bitmap memory overlaps with the sprite tile memory (starting at 0x06010000), which means only sprite tiles 512 to 1023 are available in modes 3-5.

When talking about page-flipping, the concept is similar to that of double buffering. However, in double buffering, the buffer is copied to the screen, whereas in page-flipping, the “backbuffer” is made the new buffer instead, without copying anything.

The second page in Mode 4 for page flipping is located at 0x0600A000. To flip the page, we need to toggle the page select bit (bit 4) of DISP_CTRL.
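A possible way of organizing page flipping is sketched below: one helper toggles the page select bit of a DISP_CTRL value, and another returns the address of the page that is currently hidden, which is the one we can safely draw on. The names are hypothetical, and the helpers work on plain values so that the actual register access stays in the caller:

```c
#include <stdint.h>

#define DISP_PAGE_SELECT  (1u << 4)
#define VRAM_PAGE_0       0x06000000u
#define VRAM_PAGE_1       0x0600A000u

// Toggle the page select bit of a DISP_CTRL value.
static inline uint16_t
flip_page(uint16_t disp_ctrl) {
    return (uint16_t)(disp_ctrl ^ DISP_PAGE_SELECT);
}

// Address of the page that is NOT being displayed right now.
static inline uint32_t
back_page_address(uint16_t disp_ctrl) {
    return (disp_ctrl & DISP_PAGE_SELECT) ? VRAM_PAGE_0 : VRAM_PAGE_1;
}
```

A typical frame would draw into back_page_address(DISP_CTRL), wait for the VBlank and then write flip_page(DISP_CTRL) back to the register.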

Tiled modes

If we want good performance for displaying graphics to the screen, it makes a lot of sense to use the built-in hardware capabilities of the GBA. Sprites and tiled backgrounds make use of the hardware and can be configured in different ways depending on the use case.

Tiled backgrounds have sizes between 128x128 and 1024x1024 pixels (16x16 and 128x128 tiles respectively). Sprites go from 8x8 to 64x64 pixels, and we can make use of 128 of them.

The first step for using these modes is setting up the right parameters in the display register DISP_CTRL to configure the mode, sprite and background behaviour, etc. There are other registers available to further configure each individual background layer (0x04000008 to 0x0400000F). Each of the 128 sprites has three attributes for controlling and mapping them, starting at 0x07000000.

With background tiles and sprites, we need to distinguish between the “tile data” (the location in memory where each tile’s color information is stored) and the “tile map”, sometimes called “screenblock entries” (indicating which tiles are used in the screen at the moment) with positional information and other attributes (bit depth, horizontal or vertical flipping, palette being used, etc). Sprites and tiled backgrounds can make use of the affine transformation matrix for scaling and rotation.

Tiles

Sprites and tiled backgrounds are composed of tiles. A basic tile is an 8x8 bitmap either at the default 4 bits per pixel (bpp) (16 color / 16 palettes / 32 bytes) or 8 bpp (256 colors / 1 palette / 64 bytes).

To store tiles into the VRAM, each tile is stored as a continuous bit stream. As an example, let’s imagine that each tile is represented by a number. For brevity, each tile will be 2x2.

Tiles:

| 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 |
| 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 |
| 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 |
| 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 |

VRAM:

| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 |

The tiles are stored in charblocks (tile blocks). Each charblock is 16kb, thus the VRAM (96kb) holds 6 charblocks. In one charblock we can store 512 tiles at 4 bpp (32x16 tiles) or 256 at 8 bpp (32x8 tiles). Of the 6 charblocks in the VRAM, 4 are available to backgrounds (0-3) and 2 to sprites (4-5).
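Since charblocks are just 16kb slices of the VRAM, the address of any tile can be computed from the charblock number, the tile index and the bit depth. A minimal sketch with hypothetical names:

```c
#include <stdint.h>

#define MEM_VRAM_BASE   0x06000000u
#define CHARBLOCK_SIZE  0x4000u   // 16kb per charblock

// Address of a tile inside a charblock, for bpp = 4 or bpp = 8.
static inline uint32_t
tile_address(uint32_t charblock, uint32_t tile, uint32_t bpp) {
    uint32_t tile_bytes = 8 * 8 * bpp / 8;   // 32 bytes at 4bpp, 64 at 8bpp
    return MEM_VRAM_BASE + charblock * CHARBLOCK_SIZE + tile * tile_bytes;
}
```

For instance, tile_address(4, 0, 4) is 0x06010000, which is where the sprite charblocks begin.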

In terms of palettes, there are two: one for backgrounds (0x05000000) and another for sprites (0x05000200). Each contains 256 entries of 15 bit colors.

Sprites

The 3 basic steps for using sprites are:

  1. Loading the graphics/palette to the VRAM
  2. Configure OAM attributes
  3. Enable objects in DISP_CTRL and set the mapping mode.

Sprite tiles may be stored in 2D mode where big sprites are stored directly as a matrix or 1D mode, in which tiles from the sprite are sequential. Programming in 1D mode may be nicer to work with. For example if we have tiles with sprites 1 and 2:

Sprites 1 and 2:

|1|1|1|
|1|1|1|
|1|1|1|

|2|2|
|2|2|

2D mode:

|1|1|1|2|2|0|0|...
|1|1|1|2|2|0|0|...
|1|1|1|0|0|0|0|...

1D mode:

|1|1|1|1|1|1|1|1|1|2|2|2|2|...

The OAM memory is located at 0x07000000 and can store 1024 bytes. Its contents are divided into object attributes (OBJ_ATTR_x) and the elements of the affine matrices (OBJ_AFFINE_Px). In memory, the streams for attributes and affine matrices are interleaved: we have the three attributes for sprite 0, the PA for affine 0, the three attributes for sprite 1, the PB for affine 0, and so on. This gives us space for 128 OBJ_ATTR and 32 OBJ_AFFINE entries.

The TONC tutorial accesses the OAM parameters via aligned structs. Before messing with that, I decided to try using direct access with the following macros. If this ends up being slower I’ll upgrade, but it feels simpler than having to deal with structs full of filler fields.

// Using macros instead of aligned structs for setting up OAM attributes and
// affine parameters.
#define OBJ_ATTR_0(N)    *((vu16*)(MEM_OAM + 0 + 8 * (N)))
#define OBJ_ATTR_1(N)    *((vu16*)(MEM_OAM + 2 + 8 * (N)))
#define OBJ_ATTR_2(N)    *((vu16*)(MEM_OAM + 4 + 8 * (N)))
#define OBJ_AFFINE_PA(N) *((vs16*)(MEM_OAM + 6 + 8 * 0 + 8 * 4 * (N)))
#define OBJ_AFFINE_PB(N) *((vs16*)(MEM_OAM + 6 + 8 * 1 + 8 * 4 * (N)))
#define OBJ_AFFINE_PC(N) *((vs16*)(MEM_OAM + 6 + 8 * 2 + 8 * 4 * (N)))
#define OBJ_AFFINE_PD(N) *((vs16*)(MEM_OAM + 6 + 8 * 3 + 8 * 4 * (N)))

The first attribute, OBJ_ATTR_0, sets the y coordinate, type of rendering, blending/effects, color mode and sprite shape.

bits  name  description
0-7   Y     Y coordinate. Marks the top of the sprite.
8-9   OM    (Affine) object mode. Use to hide the sprite or govern affine mode.
                00. Normal rendering.
                01. Sprite is an affine sprite, using affine matrix specified by attr1{9-D}
                10. Disables rendering (hides the sprite)
                11. Affine sprite using double rendering area. See affine sprites for more.
A-B   GM    Gfx mode. Flags for special effects.
                00. Normal rendering.
                01. Enables alpha blending.
                10. Object is part of the object window. The sprite itself isn't rendered, but serves as a mask for bgs and other sprites. (I think, haven't used it yet)
                11. Forbidden.
C     Mos   Enables mosaic effect.
D     CM    Color mode. 16 colors (4bpp) if cleared; 256 colors (8bpp) if set.
E-F   Sh    Sprite shape. This and the sprite's size (attr1{E-F}) determines the sprite's real size.

The second attribute, OBJ_ATTR_1, sets the x coordinate and horizontal/vertical flipping and, together with OBJ_ATTR_0, determines the size of the sprite.

bits  name    description
0-8   X       X coordinate. Marks left of the sprite.
9-D   AID     Affine index. Specifies the OAM_AFF_ENTY this sprite uses. Valid only if the affine flag (attr0{8}) is set.
C-D   HF, VF  Horizontal/vertical flipping flags. Used only if the affine flag (attr0) is clear; otherwise they're part of the affine index.
E-F   Sz      Sprite size. Kinda. Together with the shape bits (attr0{E-F}) these determine the sprite's real size.

Here is an additional table showing the different size configurations.

shape/size  00    01     10     11
00          8x8   16x16  32x32  64x64
01          16x8  32x8   32x16  64x32
10          8x16  8x32   16x32  32x64

Finally, OBJ_ATTR_2 sets the base tile index, Z priority and palette bank when in 16-color mode.

bits  name  description
0-9   TID   Base tile-index of sprite. Note that in bitmap modes this must be 512 or higher.
A-B   Pr    Priority. Higher priorities are drawn first (and therefore can be covered by later sprites and backgrounds). Sprites cover backgrounds of the same priority, and for sprites of the same priority, the higher OBJ_ATTRs are drawn first.
C-F   PB    Palette-bank to use when in 16-color mode. Has no effect if the color mode flag (attr0{C}) is set.

Source: https://www.coranac.com/tonc/text/regobj.htm
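Putting the three tables together, the attribute values for a regular (non-affine) sprite could be built with helpers like the ones below. All names are hypothetical, and affine sprites are left out to keep the sketch short:

```c
#include <stdint.h>

// Hypothetical names for the shape bits of attr0.
#define OBJ_SHAPE_SQUARE  0u
#define OBJ_SHAPE_WIDE    1u
#define OBJ_SHAPE_TALL    2u

// attr0: y in bits 0-7, color mode in bit D, shape in bits E-F.
static inline uint16_t
obj_attr0(uint32_t y, uint32_t shape, int is_8bpp) {
    return (uint16_t)((y & 0xFF) | ((is_8bpp ? 1u : 0u) << 13)
                      | ((shape & 0x3) << 14));
}

// attr1: x in bits 0-8, h/v flip in bits C-D, size in bits E-F.
static inline uint16_t
obj_attr1(uint32_t x, uint32_t size, int hflip, int vflip) {
    return (uint16_t)((x & 0x1FF) | ((hflip ? 1u : 0u) << 12)
                      | ((vflip ? 1u : 0u) << 13) | ((size & 0x3) << 14));
}

// attr2: tile index in bits 0-9, priority in A-B, palette bank in C-F.
static inline uint16_t
obj_attr2(uint32_t tile, uint32_t priority, uint32_t palette_bank) {
    return (uint16_t)((tile & 0x3FF) | ((priority & 0x3) << 10)
                      | ((palette_bank & 0xF) << 12));
}
```

A 16x16 sprite (shape 00, size 01) at position (100, 50) using tile 512 and palette bank 2 would then use obj_attr0(50, OBJ_SHAPE_SQUARE, 0), obj_attr1(100, 1, 0, 0) and obj_attr2(512, 0, 2).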

Since the OAM can’t be modified during the VDraw period, it is probably better to have a buffer where we update the attributes and affine parameters and copy it to MEM_OAM during the VBlank. This is probably why the TONC tutorial uses structs instead of macros for setting OBJ parameters:

OBJ_ATTR obj_buffer[128];
OBJ_AFFINE *const obj_aff_buffer= (OBJ_AFFINE*)obj_buffer;

We probably want to hide all sprites on initialization, otherwise they will be rendered at (0, 0) with the same color/tile as the zeroth tile.
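A sketch of how hiding every sprite might look when working on a buffer rather than on the OAM directly. The ObjAttr struct here is a hypothetical stand-in for the 8 bytes of each OAM entry (it is not TONC's OBJ_ATTR definition), and hiding is done by setting the object mode bits of attr0 to 10:

```c
#include <stddef.h>
#include <stdint.h>

#define OBJ_COUNT  128
#define OBJ_HIDE   (1u << 9)   // attr0 object mode = 10 (rendering disabled)

// Hypothetical shadow copy of one OAM entry: three attributes plus the
// slot that holds one affine parameter.
typedef struct {
    uint16_t attr0;
    uint16_t attr1;
    uint16_t attr2;
    int16_t  affine;
} ObjAttr;

// Set the object mode of every entry to "hidden", leaving other bits alone.
static inline void
hide_all_sprites(ObjAttr *buffer, size_t count) {
    for (size_t i = 0; i < count; i++) {
        buffer[i].attr0 = (uint16_t)((buffer[i].attr0 & ~0x0300u) | OBJ_HIDE);
    }
}
```

After calling this on the buffer at startup, the whole buffer still has to be copied to the OAM during the VBlank for the change to take effect.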

Tiled backgrounds

We have 4 backgrounds available that can be individually configured. Their sizes go from 128x128 to 1024x1024 pixels and they operate on 8x8 tiles. The tilemaps are used to index the tiles to display. The tiles and tilemaps are stored in the VRAM, which is divided in charblocks and screenblocks, which overlap in memory. For example, the memory at 0x06000000 holds the first charblock and/or up to 8 screenblocks (0-7). Each screenblock is 2048 bytes long, and we can configure which ones to use via registers. It is our responsibility to avoid overwriting charblock data with screenblock data and vice versa. This is the overview for using tilemaps:

  1. Load graphics (tiles into charblocks and palette in palette memory).
  2. Load a map in one or more screenblocks.
  3. Set DISP_CTRL to a tiled mode and enable the desired background.
  4. Set background attributes with the BG_CTRL_x register.

Depending on the mode, backgrounds can be used as regular or as affine backgrounds:

mode  BG0  BG1  BG2  BG3
0     reg  reg  reg  reg
1     reg  reg  aff  -
2     -    -    aff  aff

There are 3 registers for each background layer. BG_CTRL_x is the primary register, located at 0x04000008 + 2 * x (where x goes from 0 to 3) and is used to set background parameters.

bits  name  description
0-1   Pr    Priority. Determines drawing order of backgrounds.
2-3   CBB   Character Base Block. Sets the charblock that serves as the base for character/tile indexing. Values: 0-3.
6     Mos   Mosaic flag. Enables mosaic effect.
7     CM    Color Mode. 16 colors (4bpp) if cleared; 256 colors (8bpp) if set.
8-C   SBB   Screen Base Block. Sets the screenblock that serves as the base for screen-entry/map indexing. Values: 0-31.
D     Wr    Affine Wrapping flag. If set, affine background wrap around at their edges. Has no effect on regular backgrounds as they wrap around by default.
E-F   Sz    Background Size. Regular and affine backgrounds have different sizes available to them.

The effect of the size bits depends on the type of background (regular or affine):

Regular:

Sz-flag  (tiles)  (pixels)
00       32x32    256x256
01       64x32    512x256
10       32x64    256x512
11       64x64    512x512

Affine:

Sz-flag  (tiles)  (pixels)
00       16x16    128x128
01       32x32    256x256
10       64x64    512x512
11       128x128  1024x1024

The write-only registers BG_H_SCROLL_x and BG_V_SCROLL_x located at 0x04000010 + 4 * x and 0x04000012 + 4 * x respectively can be used to control the horizontal and vertical displacement. They wrap around, thus they act as the modulo of the map size. These registers set the top left position of the screen from the top left position of the map. Furthermore to scroll the map left we would increase the H displacement and vice versa.

The screenblocks data is composed of a number of entries corresponding with each tile in the screen. Each screenblock entry also contains h/v flip information as well as the palette bank to use in 4bpp mode:

bits  name    description
0-9   TID     Tile-index of the SE.
A-B   HF, VF  Horizontal/vertical flipping flags.
C-F   PB      Palette bank to use when in 16-color mode. Has no effect for 256-color bgs (REG_BGxCNT{7} is set).
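As an illustration of the bit layout above, a screenblock entry can be composed with a small helper. The function name is hypothetical:

```c
#include <stdint.h>

// Build a screenblock entry: tile index in bits 0-9, horizontal flip in
// bit A, vertical flip in bit B and palette bank in bits C-F.
static inline uint16_t
screen_entry(uint32_t tile, int hflip, int vflip, uint32_t palette_bank) {
    return (uint16_t)((tile & 0x3FF) | ((hflip ? 1u : 0u) << 10)
                      | ((vflip ? 1u : 0u) << 11)
                      | ((palette_bank & 0xF) << 12));
}
```

For example, screen_entry(5, 1, 0, 0) is tile 5 flipped horizontally, which gives 0x0405.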

Each screenblock can have 32x32 screen entries (equivalent to a 256x256 pixel map). For larger map sizes multiple screenblocks will be used, and depending on the size they will be nested in the following way (starting at screenblock index 0):

32x32  64x32  32x64  64x64
  0     0 1     0     0 1
                1     2 3

To find the screen entry index for a given tile x and tile y position we can use the following function:

size_t se_index(size_t tile_x, size_t tile_y, size_t map_width) {
    size_t sbb = ((tile_x >> 5) + (tile_y >> 5) * (map_width >> 5));
    return sbb * 1024 + ((tile_x & 31) + (tile_y & 31) * 32);
}

Source: https://www.coranac.com/tonc/text/regbg.htm

Bit packing

To save memory and speed up loading assets like sprites or background tiles we can use bitpacking, which will group the data into u32 in an efficient way. This is particularly useful for 1bpp assets like fonts or simple interface elements, since we can store an 8x8 tile in two u32 words. We need to bear in mind that because of the little endian nature of the GBA, we need to pack the bits in both little bit and little byte order.

Here is a custom (probably slow) bit unpacking routine that can be used for unpacking 1bpp tiles:

u32
unpack_1bb(u8 hex) {
    const u32 conversion_u32[16] = {
        0x00000000, 0x00000001, 0x00000010, 0x00000011,
        0x00000100, 0x00000101, 0x00000110, 0x00000111,
        0x00001000, 0x00001001, 0x00001010, 0x00001011,
        0x00001100, 0x00001101, 0x00001110, 0x00001111,
    };
    u8 low = hex & 0xF;
    u8 high = (hex >> 4) & 0xF;
    return (conversion_u32[high] << 16) | conversion_u32[low];
}

It takes an 8bit hex value to generate an unpacked u32 so it will need to be called in the following way to unpack each row:

// Unpack a single tile from memory.
size_t counter = 0;
u32 a = data[i++];
u32 b = data[i++];
tile_mem[counter++] = unpack_1bb((a >> 24) & 0xFF);
tile_mem[counter++] = unpack_1bb((a >> 16) & 0xFF);
tile_mem[counter++] = unpack_1bb((a >> 8) & 0xFF);
tile_mem[counter++] = unpack_1bb(a & 0xFF);
tile_mem[counter++] = unpack_1bb((b >> 24) & 0xFF);
tile_mem[counter++] = unpack_1bb((b >> 16) & 0xFF);
tile_mem[counter++] = unpack_1bb((b >> 8) & 0xFF);
tile_mem[counter++] = unpack_1bb(b & 0xFF);

Apparently we can also use a BIOS function called BitUnpack for this purpose but I have yet to test it.

Input handling

The GBA has access to 10 buttons/keys. We can see which inputs are being used by reading the KEY_INPUTS register at 0x04000130 bits 0-9. The bits are active when they have a zero value, so a value of 1 means that the button/key is unpressed.

// Memory address for key input register
#define KEY_INPUTS  *((vu16*) 0x04000130)

// Alias for key pressing bits.
#define KEY_A      (1 << 0)
#define KEY_B      (1 << 1)
#define KEY_SELECT (1 << 2)
#define KEY_START  (1 << 3)
#define KEY_RIGHT  (1 << 4)
#define KEY_LEFT   (1 << 5)
#define KEY_UP     (1 << 6)
#define KEY_DOWN   (1 << 7)
#define KEY_R      (1 << 8)
#define KEY_L      (1 << 9)

// Check if the given key/button is currently pressed.
#define KEY_PRESSED(key) (~(KEY_INPUTS) & (key))
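Checking the register directly only tells us whether a key is down right now. To detect the moment a key goes down or up we can compare the current reading with the one from the previous frame. The following is a minimal sketch of that idea with hypothetical names. It takes the raw readings as parameters (where 0 means pressed), so the caller is responsible for sampling KEY_INPUTS once per frame:

```c
#include <stdint.h>

#define KEY_MASK  0x03FFu   // Bits 0-9 hold the 10 keys.

// Keys from `keys` that are down now but were up in the previous reading.
static inline uint16_t
keys_tapped(uint16_t curr, uint16_t prev, uint16_t keys) {
    uint16_t down_now    = (uint16_t)(~curr & KEY_MASK);
    uint16_t down_before = (uint16_t)(~prev & KEY_MASK);
    return (uint16_t)(down_now & ~down_before & keys);
}

// Keys from `keys` that are up now but were down in the previous reading.
static inline uint16_t
keys_released(uint16_t curr, uint16_t prev, uint16_t keys) {
    uint16_t down_now    = (uint16_t)(~curr & KEY_MASK);
    uint16_t down_before = (uint16_t)(~prev & KEY_MASK);
    return (uint16_t)(~down_now & down_before & keys);
}
```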

We also have access to a second register, KEY_CTRL (0x04000132). This register can be used for setting and controlling interrupts that will be triggered by button presses:

bits  name   description
0-9   keys   keys to check for raising a key interrupt.
E     I      Enables keypad interrupt
F     Op     Boolean operator used for determining whether to raise a key interrupt or not. If clear, it uses an OR (raise if any of the keys of bits 0-9 are down); if set, it uses an AND (raise if all of those keys are down).
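A short sketch of how a KEY_CTRL value could be put together from the table above. The names are hypothetical, and the key bits are the same ones used for the input register:

```c
#include <stdint.h>

#define KEY_IRQ_ENABLE  (1u << 14)   // Bit E: enable the keypad interrupt.
#define KEY_IRQ_AND     (1u << 15)   // Bit F: require all keys (AND).

// Build a KEY_CTRL value that raises an interrupt for the given keys,
// either when any of them is down or only when all of them are.
static inline uint16_t
key_ctrl_value(uint16_t keys, int require_all) {
    return (uint16_t)((keys & 0x03FFu) | KEY_IRQ_ENABLE
                      | (require_all ? KEY_IRQ_AND : 0u));
}
```

For instance, requesting an interrupt when both A and B are held would be key_ctrl_value(KEY_A | KEY_B, 1), assuming the key aliases defined earlier.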

Direct Memory Access (DMA)

We can use DMA to copy data quickly in our application by taking advantage of the existing hardware for that purpose. When the DMA controller is active, the CPU is halted until the transfer is finished. There are 4 DMA channels of decreasing priority. Channel 0 has the highest priority, used with the internal RAM exclusively. Channels 1 and 2 are used for audio to copy data to sound buffers. Channel 3 has the lowest priority and can be used for general purpose memory copies.

We can control the DMA with 3 u32 registers. For channel N, we will put the source and destination addresses in DMA_SRC(N) (0x040000B0 + 12 * N) and DMA_DST(N) (0x040000B4 + 12 * N) respectively. We can set the parameters of the DMA channel with DMA_CTRL(N) (0x040000B8 + 12 * N):

bits   name  description
00-0F  N     Number of transfers (Number of CHUNKS, not bytes!).
15-16  DA    Destination adjustment.
                00: increment after each transfer (default)
                01: decrement after each transfer
                10: none; address is fixed
                11: increment the destination during the transfer, and reset it so that repeat DMA will always start at the same destination.
17-18  SA    Source Adjustment. Works just like the two bits for the destination. Note that there is no DMA_SRC_RESET; code 3 for source is forbidden.
19     R     Repeats the copy at each VBlank or HBlank if the DMA timing has been set to those modes.
1A     CS    Chunk Size. Sets DMA to copy by halfword (if clear) or word (if set).
1C-1D  TM    Timing Mode. Specifies when the transfer should start.
                00: start immediately.
                01: start at VBlank.
                10: start at HBlank.
                11: Never used it so far, but here's how I gather it works. For DMA1 and DMA2 it'll refill the FIFO when it has been emptied. Count and size are forced to 1 and 32bit, respectively. For DMA3 it will start the copy at the start of each rendering line, but with a 2 scanline delay.
1E     I     Interrupt request. Raise an interrupt when finished.
1F     En    Enable the DMA transfer for this channel.

Note that we can’t write to the ROM, for obvious reasons, and we can’t access addresses above 0x10000000, so we can only use 28 bits for DMA_SRC and 27 for DMA_DST.

If we are performing several copies using the same DMA register, we may want to stop the previous transfer before starting the new one:

DMA_CTRL(N) = 0;
DMA_SRC(N) = src_addr;
DMA_DST(N) = dst_addr;
DMA_CTRL(N) = count | attrs;

Source: https://www.coranac.com/tonc/text/dma.htm
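To make the control bits more concrete, here is a sketch of how the value written to DMA_CTRL(N) could be composed. Keep in mind that the bit numbers in the table are hexadecimal (for example, bit 1A is bit 26). All macro and function names are hypothetical:

```c
#include <stdint.h>

#define DMA_DST_FIXED  (2u << 21)   // Bits 15-16 (hex) = 10
#define DMA_SRC_FIXED  (2u << 23)   // Bits 17-18 (hex) = 10
#define DMA_REPEAT     (1u << 25)   // Bit 19 (hex)
#define DMA_CHUNK_32   (1u << 26)   // Bit 1A (hex)
#define DMA_AT_VBLANK  (1u << 28)   // Bits 1C-1D (hex) = 01
#define DMA_AT_HBLANK  (2u << 28)   // Bits 1C-1D (hex) = 10
#define DMA_IRQ        (1u << 30)   // Bit 1E (hex)
#define DMA_ENABLE     (1u << 31)   // Bit 1F (hex)

// Control value for an immediate copy of `bytes` bytes in 32 bit chunks.
static inline uint32_t
dma_ctrl_copy32(uint32_t bytes) {
    uint32_t count = (bytes / 4) & 0xFFFF;   // Number of chunks, not bytes.
    return count | DMA_CHUNK_32 | DMA_ENABLE;
}

// Control value for filling `bytes` bytes with a single 16 bit value, by
// keeping the source address fixed.
static inline uint32_t
dma_ctrl_fill16(uint32_t bytes) {
    uint32_t count = (bytes / 2) & 0xFFFF;
    return count | DMA_SRC_FIXED | DMA_ENABLE;
}
```

Copying a 64 byte buffer would then use dma_ctrl_copy32(64), which requests 16 transfers of 32 bits each.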

It is important to note that the DMA is a separate piece of hardware, and we need to be careful about the scope of the variables we send to it. For example, the following will not work if called from a function:

void
init_sprite_pal(size_t starting_index, Color col) {
    Color colors[16] = {
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
    };

    dma_copy(&PAL_BUFFER_SPRITES[starting_index], colors, 16 * sizeof(Color), 3);
}

Instead, we need to make the colors array static/global so that the address doesn’t go out of scope or use an address from the main function scope:

void
init_sprite_pal(size_t starting_index, Color col) {
    static Color colors[16] = {
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
        0x1FFF, 0x1FFF, 0x1FFF, 0x1FFF,
    };

    dma_copy(&PAL_BUFFER_SPRITES[starting_index], colors, 16 * sizeof(Color), 3);
}

BIOS Calls

BIOS calls work by means of software interrupts. The GBA BIOS has 43 available software interrupts (0x00 to 0x2A) for a variety of purposes. To call software interrupts, we need to make use of the swi N assembly instruction. This is a list of all available interrupts:

id    Name
0x00  SoftReset
0x01  RegisterRamReset
0x02  Halt
0x03  Stop
0x04  IntrWait
0x05  VBlankIntrWait
0x06  Div
0x07  DivArm
0x08  Sqrt
0x09  ArcTan
0x0A  ArcTan2
0x0B  CPUSet
0x0C  CPUFastSet
0x0D  BiosChecksum
0x0E  BgAffineSet
0x0F  ObjAffineSet
0x10  BitUnPack
0x11  LZ77UnCompWRAM
0x12  LZ77UnCompVRAM
0x13  HuffUnComp
0x14  RLUnCompWRAM
0x15  RLUnCompVRAM
0x16  Diff8bitUnFilterWRAM
0x17  Diff8bitUnFilterVRAM
0x18  Diff16bitUnFilter
0x19  SoundBiasChange
0x1A  SoundDriverInit
0x1B  SoundDriverMode
0x1C  SoundDriverMain
0x1D  SoundDriverVSync
0x1E  SoundChannelClear
0x1F  MIDIKey2Freq
0x20  MusicPlayerOpen
0x21  MusicPlayerStart
0x22  MusicPlayerStop
0x23  MusicPlayerContinue
0x24  MusicPlayerFadeOut
0x25  MultiBoot
0x26  HardReset
0x27  CustomHalt
0x28  SoundDriverVSyncOff
0x29  SoundDriverVSyncOn
0x2A  GetJumpList

Detailed information about what these functions do can be found on the gbatek reference.

To use these BIOS functions we need the following directives written in assembly:

  1. Where to put the code (e.g. .text) and what type of code (.code 16 for THUMB instructions).
  2. Word alignment. We align each function to a word (4 byte) boundary with either .align 2 or .balign 4, which are equivalent here. The directive only applies to the function that follows it, so it must be set for each one of them.
  3. Scope. We normally will want these things to be on the global scope. For example, .global foobar will create a function that we can call from C with void foobar(args).
  4. Thumb indicator. Despite having .code 16 we also must specify .thumb_func.
  5. Label. Indicates where the function starts.
  6. BIOS call with the swi instruction (For example swi 0x06 for the division function call).
  7. Return to the caller with the bx lr instruction.

Here is the full example from the TONC tutorial for a division BIOS call:

@ In tonc_bios.s

@ at top of your file
    .text           @ aka .section .text
    .code 16        @ aka .thumb

@ for each swi (like division, for example)
    .align 2        @ aka .balign 4
    .global Div
    .thumb_func
Div:
    swi     0x06
    bx      lr

It seems that using inline assembly with asm volatile("swi 0x06") doesn’t work anymore. For this reason, the assembly code should be fully compiled separately and linked afterwards.

Interrupts

There are three interrupt registers: the master interrupt control register IRQ_CTRL (0x04000208), the interrupt enable register IRQ_ENABLE (0x04000200), and IRQ_ACK (0x04000202), which is used for checking whether an interrupt has occurred and for acknowledging that it is being handled.

The IRQ_CTRL register must be set to 1 to enable interrupts, otherwise they will be ignored. In addition to enabling specific interrupts with IRQ_ENABLE, we will usually need to set a bit in the register of the hardware that raises them. For example, the IRQ_VBLANK interrupt also needs bit 0x3 of DISP_STATUS to be set. Here is the table for both IRQ_ENABLE and IRQ_ACK with the corresponding requirements, adapted from TONC:

bits  name  description
0     Vbl   VBlank interrupt. Also requires DISP_STATUS{3}
1     Hbl   HBlank interrupt. Also requires DISP_STATUS{4}. Occurs after the HDraw, so that things done here take effect in the next line.
2     Vct   VCount interrupt. Also requires DISP_STATUS{5}. The high byte of DISP_STATUS gives the VCount at which to raise the interrupt. Occurs at the beginning of a scanline.
3-6   Tm    Timer interrupt, 1 bit per timer. Also requires TIMER_CTRL_x{6}. The interrupt will be raised when the timer overflows.
7     Com   Serial communication interrupt. Apparently, also requires REG_SCCNT{E}. To be raised when the transfer is complete. Or so I'm told, I really don't know squat about serial communication.
8-B   Dma   DMA interrupt, 1 bit per channel. Also requires DMA_CTRL(N){1E}. Interrupt will be raised when the full transfer is complete.
C     K     Keypad interrupt. Also requires KEY_CTRL{E}. Raised when any or all of the keys specified in KEY_CTRL are down.
D     C     Cartridge interrupt. Raised when the cart is removed from the GBA.

Source: https://www.coranac.com/tonc/text/interrupts.htm
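As a rough sketch of the VBlank case described above, the three writes could look like this. The helper takes pointers instead of touching the registers directly so that it can also be tried off-target. The names DISP_VBLANK_IRQ, IRQ_VBLANK_BIT and enable_vblank_irq are assumptions made for illustration, not part of the code in these notes.

```c
#include <stdint.h>

typedef uint16_t u16;

// Hypothetical bit names, following the table above.
#define DISP_VBLANK_IRQ  (1 << 3)  // DISP_STATUS{3}
#define IRQ_VBLANK_BIT   (1 << 0)  // IRQ_ENABLE{0}

// On hardware the arguments would be the registers themselves, e.g.
// (volatile u16 *)0x04000004 for DISP_STATUS, 0x04000200 for IRQ_ENABLE
// and 0x04000208 for IRQ_CTRL.
void
enable_vblank_irq(volatile u16 *disp_status,
                  volatile u16 *irq_enable,
                  volatile u16 *irq_ctrl) {
    *irq_ctrl = 0;                    // Keep interrupts off while configuring.
    *disp_status |= DISP_VBLANK_IRQ;  // Ask the display to raise VBlank IRQs.
    *irq_enable |= IRQ_VBLANK_BIT;    // Let the VBlank IRQ through.
    *irq_ctrl = 1;                    // Master enable.
}
```

Other interrupts follow the same pattern, with the sender-side bit taken from the "Also requires" column.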

To acknowledge that we are handling an interrupt, we need to write a 1 to that interrupt's bit in IRQ_ACK, even though the bit already reads as set. This register is used both for checking whether an interrupt has been raised and for clearing it in the aforementioned way. Note that we should only write the bit we are interested in, so we must do IRQ_ACK = IRQ_x, not IRQ_ACK |= IRQ_x, since the latter writes back every bit that is currently set and therefore clears all of the active interrupts. If we use BIOS routines that rely on interrupts, we also need to acknowledge those in IRQ_ACK_BIOS (0x03007FF8), which follows the same bit structure as IRQ_ACK.
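To see why a plain assignment is required, here is a small model of the register's write-one-to-clear behaviour. irq_ack_write is a hypothetical stand-in for the hardware (it is not a real function), and the bit names are assumptions. It only mimics what a write does to the pending bits.

```c
#include <stdint.h>

typedef uint16_t u16;

// Hypothetical bit names for the first two interrupts.
#define IRQ_VBLANK_BIT (1 << 0)
#define IRQ_HBLANK_BIT (1 << 1)

// Model of a write to IRQ_ACK: every bit written as 1 clears the matching
// pending bit, bits written as 0 are left alone.
u16
irq_ack_write(u16 pending, u16 written) {
    return pending & (u16)~written;
}

// IRQ_ACK = IRQ_x: only the requested bit is written.
u16
ack_with_assign(u16 pending, u16 irq) {
    return irq_ack_write(pending, irq);
}

// IRQ_ACK |= IRQ_x: the register is read first, so every pending bit is
// written back along with the requested one.
u16
ack_with_or(u16 pending, u16 irq) {
    return irq_ack_write(pending, pending | irq);
}
```

With both VBlank and HBlank pending, ack_with_assign leaves the HBlank bit pending for its own handler, while ack_with_or clears both and the HBlank interrupt is lost.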

Unfortunately, the interrupt process is not very friendly from the C point of view. In principle, when an interrupt is triggered the BIOS jumps to the address stored at 0x03007FFC, so we could theoretically store a function pointer at that location and handle the interrupts from there. However, that function must run in ARM mode instead of THUMB. We also have to switch the CPU from interrupt mode to system mode and back, as well as save the necessary registers, which can only be done in assembly. From our point of view we should:

  1. Disable IRQs to avoid nesting and unpleasant complexity.
  2. Switch from IRQ to SYSTEM execution mode, for which we need to store the current stack pointer, r0-r3, ip and lr registers first.
  3. Run the interrupt service routine (ISR) for the interrupt.
  4. Return from SYSTEM to IRQ execution modes.
  5. Restore the saved registers.
  6. Acknowledge that the interrupt has been dealt with.
  7. Re-enable interrupts.

This is quite complicated to address, but with a bit of elbow grease I managed to create my own version of an interrupt handler. It uses this table to store function pointers to the different interrupt types.

IrsFunc irs_table[] = {
    [IRQ_VBLANK ] = NULL,
    [IRQ_HBLANK ] = NULL,
    [IRQ_VCOUNT ] = NULL,
    [IRQ_TIMER_0] = NULL,
    [IRQ_TIMER_1] = NULL,
    [IRQ_TIMER_2] = NULL,
    [IRQ_TIMER_3] = NULL,
    [IRQ_SERIAL ] = NULL,
    [IRQ_DMA_0  ] = NULL,
    [IRQ_DMA_1  ] = NULL,
    [IRQ_DMA_2  ] = NULL,
    [IRQ_DMA_3  ] = NULL,
    [IRQ_KEYPAD ] = NULL,
    [IRQ_GAMEPAK] = NULL,
};

Interrupts are enabled with the irq_init() function. The function irs_set(IrqIndex ids, IrsFunc func) can be used to enable all required bits in the different registers for each interrupt. If the function pointer is NULL, the interrupt will be disabled instead. To use interrupts that don’t require explicit handling (such as the BIOS VSync), a stub can be passed instead:

irq_init();
irs_set(IRQ_VBLANK, irs_stub);
irs_set(IRQ_HBLANK, irs_hblank_func);

A custom irs_main function was written in ARM assembly to handle all different interrupts and acknowledgements.
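The dispatch part of such a handler can be sketched in C. This is not the irs_main mentioned above (which is ARM assembly and also takes care of the mode switching and register saving). It only illustrates how the pending bits could be mapped to the entries of a table like irs_table. The IrsFunc signature, IRQ_COUNT and the name irs_dispatch are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t u16;
typedef void (*IrsFunc)(void);  // Assumed signature.

#define IRQ_COUNT 14  // Bits 0x0-0xD of IRQ_ENABLE / IRQ_ACK.

// Calls the handler for the lowest-numbered interrupt that is both enabled
// and pending, and returns the bit that must be written to IRQ_ACK (and to
// IRQ_ACK_BIOS) to acknowledge it. Returns 0 if nothing is pending.
u16
irs_dispatch(IrsFunc table[], u16 enabled, u16 pending) {
    u16 active = enabled & pending;
    for (size_t i = 0; i < IRQ_COUNT; i++) {
        u16 bit = (u16)(1u << i);
        if (active & bit) {
            if (table[i] != NULL) {
                table[i]();
            }
            return bit;
        }
    }
    return 0;
}
```

Scanning from bit 0 upwards gives VBlank the highest priority, which is a design choice of this sketch rather than a hardware requirement.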

Sound

Sound on the GBA can be quite a hairy process, and there are a ton of registers to address. Here is a modified table with the nomenclature of the different registers, to which I added my preferred naming scheme, since I don’t mind a bit more verbosity in exchange for readability:

offset  function                         old          new           tonc           mine
60h     channel 1 (sqr) sweep            REG_SG10     SOUND1CNT_L   REG_SND1SWEEP  SOUND_SQUARE1_SWEEP
62h     channel 1 (sqr) len, duty, env   REG_SG10     SOUND1CNT_H   REG_SND1CNT    SOUND_SQUARE1_CTRL
64h     channel 1 (sqr) freq, on         REG_SG11     SOUND1CNT_X   REG_SND1FREQ   SOUND_SQUARE1_FREQ
68h     channel 2 (sqr) len, duty, env   REG_SG20     SOUND2CNT_L   REG_SND2CNT    SOUND_SQUARE2_CTRL
6Ch     channel 2 (sqr) freq, on         REG_SG21     SOUND2CNT_H   REG_SND2FREQ   SOUND_SQUARE2_FREQ
70h     channel 3 (wave) mode            REG_SG30     SOUND3CNT_L   REG_SND3SEL    SOUND_WAVE_MODE
72h     channel 3 (wave) len, vol        REG_SG30     SOUND3CNT_H   REG_SND3CNT    SOUND_WAVE_CTRL
74h     channel 3 (wave) freq, on        REG_SG31     SOUND3CNT_X   REG_SND3FREQ   SOUND_WAVE_FREQ
78h     channel 4 (noise) len, vol, env  REG_SG40     SOUND4CNT_L   REG_SND4CNT    SOUND_NOISE_CTRL
7Ch     channel 4 (noise) freq, on       REG_SG41     SOUND4CNT_H   REG_SND4FREQ   SOUND_NOISE_FREQ
80h     DMG master control               REG_SGCNT0   SOUNDCNT_L    REG_SNDDMGCNT  SOUND_DMG_MASTER
82h     DSound master control            REG_SGCNT0   SOUNDCNT_H    REG_SNDDSCNT   SOUND_DSOUND_MASTER
84h     sound status                     REG_SGCNT1   SOUNDCNT_X    REG_SNDSTAT    SOUND_STATUS
88h     bias control                     REG_SGBIAS   SOUNDBIAS     REG_SNDBIAS    SOUND_BIAS

Source: https://www.coranac.com/tonc/text/sndsqr.htm

To get sound out we have to configure the main sound control registers SOUND_DMG_MASTER (0x04000080), SOUND_DSOUND_MASTER (0x04000082) and SOUND_STATUS (0x04000084).

The SOUND_DMG_MASTER controls DMG left and right volumes as well as the channels enabled.

bits  name   description
0-2   LV     Left volume
4-6   RV     Right volume
8-B   L1-L4  Channels 1-4 on left
C-F   R1-R4  Channels 1-4 on right
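As an illustration of the layout above, a value for SOUND_DMG_MASTER could be assembled like this. The DMG_* masks and the sound_dmg_master helper are hypothetical names, and the mapping of mask bits to channels simply follows the L1-L4 / R1-R4 order of the table.

```c
#include <stdint.h>

typedef uint16_t u16;

// Channel masks for the L1-L4 / R1-R4 fields (hypothetical names).
#define DMG_SQUARE1 (1 << 0)
#define DMG_SQUARE2 (1 << 1)
#define DMG_WAVE    (1 << 2)
#define DMG_NOISE   (1 << 3)

// Builds a SOUND_DMG_MASTER value. Volumes are 0-7, the channel arguments
// are 4-bit masks made of the DMG_* flags above.
u16
sound_dmg_master(u16 left_vol, u16 right_vol, u16 left_chans, u16 right_chans) {
    return (u16)((left_vol & 0x7)
               | ((right_vol & 0x7) << 4)
               | ((left_chans & 0xF) << 8)
               | ((right_chans & 0xF) << 12));
}
```

For example, sound_dmg_master(7, 7, DMG_SQUARE1, DMG_SQUARE1) enables the first square channel on both speakers at full DMG volume.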

The SOUND_DSOUND_MASTER controls direct sound channels, and how to mix it with the DMG ones.

bits  name            description
0-1   DMGV            DMG Volume ratio.
                          00: 25%
                          01: 50%
                          10: 100%
                          11: forbidden
2     AV              DSound A volume ratio. 50% if clear; 100% if set
3     BV              DSound B volume ratio. 50% if clear; 100% if set
8-9   AR, AL          DSound A enable. Enable DS A on right and left speakers
A     AT              DSound A timer. Use timer 0 (if clear) or 1 (if set) for DS A
B     AF              FIFO reset for DSound A. When using DMA for Direct sound, this will cause DMA to reset the FIFO buffer after it's used.
C-F   BR, BL, BT, BF  As bits 8-B, but for DSound B
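A possible set of flags for the DSound A half of this register, and one configuration built from them, is sketched below. All names are made up for illustration. The chosen configuration (full volume, both speakers, timer 0, FIFO reset) is only one of many.

```c
#include <stdint.h>

typedef uint16_t u16;

// Hypothetical flag names for the fields in the table above.
#define DSOUND_DMG_VOL_25   0x0000
#define DSOUND_DMG_VOL_50   0x0001
#define DSOUND_DMG_VOL_100  0x0002
#define DSOUND_A_VOL_100    (1 << 2)
#define DSOUND_A_RIGHT      (1 << 8)
#define DSOUND_A_LEFT       (1 << 9)
#define DSOUND_A_TIMER1     (1 << 10)
#define DSOUND_A_FIFO_RESET (1 << 11)

// DSound A at full volume on both speakers, driven by timer 0 (AT left
// clear), with a FIFO reset, mixed with the DMG channels at 100%.
u16
dsound_a_stereo_config(void) {
    return (u16)(DSOUND_DMG_VOL_100 | DSOUND_A_VOL_100
               | DSOUND_A_RIGHT | DSOUND_A_LEFT
               | DSOUND_A_FIFO_RESET);
}
```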

Finally, the lower 4 bits of SOUND_STATUS can be read to find out which DMG channels are currently playing, and we need to set bit 7 if we want to output any kind of sound. Sound must be enabled before we can interact with any of the other registers.

bits  name   description
0-3   1A-4A  Active channels. Indicates which DMG channels are currently playing. They do not enable the channels; that's what REG_SNDDMGCNT (SOUND_DMG_MASTER) is for.
7     MSE    Master Sound Enable. Must be set if any sound is to be heard at all. Set this before you do anything else: the other registers can't be accessed otherwise, see GBATek for details.

Square wave (Channels 1-2)

Both DMG square channels are controlled in the same way, but the first channel has access to a frequency sweep. The SOUND_SQUAREx_CTRL registers adjust the length, envelope (Attack-Sustain-Decay or ASD) and duty cycle of the signal. Note that the frequency set in SOUND_SQUAREx_FREQ is not given in Hertz (Hz); instead, it is the “rate” of the signal. The rate goes from 0 to 2047 and can be calculated with the following formula: rate = 2048 - 2^17 / freq. This means that we can achieve frequencies between 64 Hz and 131 kHz. Here is the table describing the control registers for square DMG channels:

bits  name  description
0-5   L     Sound Length. This is a write-only field and only works if the channel is timed (REG_SNDxFREQ{E}). The length itself is actually (64−L)/256 seconds for a [3.9, 250] ms range.
6-7   D     Wave duty cycle. Ratio between on and off times of the square wave. Looking back at eq 18.2 (in TONC), this comes down to D=h/T. The available cycles are 12.5%, 25%, 50%, and 75% (one eighth, quarter, half and three quarters).
8-A   EST   Envelope step-time. Time between envelope changes: Δt = EST/64 s.
B     ED    Envelope direction. Indicates if the envelope decreases (default) or increases with each step.
C-F   EIV   Envelope initial value. Can be considered a volume setting of sorts: 0 is silent and 15 is full volume. Combined with the direction, you can have fade-in and fade-outs; to have a sustaining sound, set initial volume to 15 and an increasing direction. To vary the real volume, remember REG_SNDDMGCNT.
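The fields above can be packed with a small helper. This is only a sketch: square_ctrl is a hypothetical name, and the duty argument assumes that the values 0-3 select 12.5%, 25%, 50% and 75% in that order.

```c
#include <stdint.h>

typedef uint16_t u16;

// Builds a SOUND_SQUAREx_CTRL value from its fields:
//   length       0-63  (only used if the channel is timed)
//   duty         0-3   (assumed order: 12.5%, 25%, 50%, 75%)
//   env_step     0-7   (time between envelope changes, in 1/64 s)
//   env_increase 0 = decreasing envelope, 1 = increasing
//   env_init     0-15  (initial volume)
u16
square_ctrl(u16 length, u16 duty, u16 env_step, u16 env_increase, u16 env_init) {
    return (u16)((length & 0x3F)
               | ((duty & 0x3) << 6)
               | ((env_step & 0x7) << 8)
               | ((env_increase & 0x1) << 11)
               | ((env_init & 0xF) << 12));
}
```

For example, square_ctrl(0, 2, 7, 0, 12) describes a 50% duty square wave that starts at volume 12 and fades out slowly.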

The frequency is set as follows:

bits  name  description
0-A   R     Sound rate. Well, initial rate. That's rate, not frequency. Nor period. The relation between rate and frequency is f = 2^17/(2048-R). Write-only field.
E     T     Timed flag. If set, the sound plays for as long as the length field (REG_SNDxCNT{0-5}) indicates. If clear, the sound plays forever. Note that even if a decaying envelope has reached 0, the sound itself would still be considered on, even if it's silent.
F     Re    Sound reset. Resets the sound to the initial volume (and sweep) settings. Remember that the rate field is in this register as well and due to its write-only nature a simple ‘|= SFREQ_RESET’ will not suffice (even though it might on emulators).
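The conversion from a frequency in Hz to the rate field, using rate = 2048 - 2^17 / freq, can be written without floating point or the BIOS. The helper names and the two flag macros are assumptions. Out-of-range frequencies are clamped to the representable range, which is a choice of this sketch.

```c
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

// Hypothetical names for bits E and F of SOUND_SQUAREx_FREQ.
#define SQUARE_FREQ_TIMED (1 << 14)
#define SQUARE_FREQ_RESET (1 << 15)

// rate = 2048 - 2^17 / freq, rounded to the nearest integer and clamped to
// the 11-bit field (64 Hz gives 0, anything above ~87 kHz gives 2047).
u16
square_rate(u32 freq_hz) {
    if (freq_hz < 64) {
        freq_hz = 64;
    }
    u32 period = ((1u << 17) + freq_hz / 2) / freq_hz;
    if (period < 1) {
        period = 1;
    }
    return (u16)(2048 - period);
}

// Full value for SOUND_SQUAREx_FREQ. The rate field is write-only, so the
// whole register is written in one go instead of OR-ing in the reset bit.
u16
square_freq_reg(u32 freq_hz, int timed) {
    u16 value = square_rate(freq_hz) | SQUARE_FREQ_RESET;
    if (timed) {
        value |= SQUARE_FREQ_TIMED;
    }
    return value;
}
```

For a 440 Hz tone this gives a rate of 1750, so square_freq_reg(440, 0) is 0x86D6.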

The sweep register:

bits  name  description
0-2   N     Sweep number. Not the number of sweeps; at each step the rate changes by R/2^N (see TONC for the full discussion).
3     M     Sweep mode. The sweep can take the rate either up (default) or down (if set).
4-6   T     Sweep step-time. The time between sweeps is measured in units of 1/128 s (128 Hz, not kHz!): Δt = T/128 s ≈ 7.8T ms; if T=0, the sweep is disabled.

Wave (Channel 3)

Sound channel 3 can play samples from a 64 sample pattern or two different 32 sample patterns, depending on the configuration of SOUND_WAVE_MODE:

bits  name  description
5     BM    Bank mode (0 = 2x32, 1 = 1x64)
6     BS    Bank select. Two banks are available if BM is set to 0. This also
            controls the writing mode of each bank. If BS is set to 0, bank 0
            will play, and we can then write to the bank 1 wave RAM, and vice versa.
7     E     Enable the output for wave mode.

The other wave control register is the SOUND_WAVE_CTRL register, which can be used to control the sound length and output volume.

bits  name  description
0-7   SL    Sound length. The sound plays for (256 - SL) / 256 seconds, for
            a 1 second maximum and 3.9 ms minimum.
D-F   V     Output level (0x0 = 0%, 0x1 = 100%, 0x4 = 75%, 0x2 = 50%, 0x3 = 25%).

DirectSound

DirectSound gives us access to the two 8-bit DACs on the GBA. It plays 8-bit signed PCM samples from FIFO queues starting at 0x040000A0. It requires the use of timers 0 and/or 1 to control the sampling frequency and can work in DMA or interrupt mode. DMA is more efficient but may cause some issues in multiplayer games. In DMA mode, the sample queues will be filled by the DMA controller automatically. In interrupt mode, an interrupt will be used to manually load samples into the queue. Both DirectSound channels can use the same timer, which is normally desired for audio mixing. We should always reset the FIFO with the SOUND_DSOUND_MASTER register (0x04000082) before starting a new sample playback.

To play samples in DMA mode:

  1. Set DirectSound output/volume
  2. Set one timer to 0xFFFF - round(cpu_frequency/sample_frequency). For a 16 kHz sample, cpu_frequency = 2^24 and sample_frequency = 16000.
  3. Set another timer to count the played samples and stop the sound when it overflows. For example, if timer 0 is used for playing the samples, timer 1 can be set to cascade mode with its IRQ enabled and an initial count of 0xFFFF - sample_count.
  4. Point the DMA channel source to the sample memory address and the destination to the desired FIFO queue.
  5. Set the DMA start timing to mode 11 (special) so that the FIFO is refilled when it runs empty.
  6. Set the DMA repeat bit and 32-bit transfer mode, and set the increment mode for destination and source.
  7. Enable timer.

Source: http://belogic.com/gba/
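The timer value from step 2 can be computed with integer arithmetic only. Note that a timer overflows one tick after reaching 0xFFFF, so this sketch subtracts from 0x10000 rather than 0xFFFF; the two differ by a single CPU cycle per sample. CPU_FREQUENCY and dsound_timer_reload are hypothetical names.

```c
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

#define CPU_FREQUENCY (1u << 24)  // 16.78 MHz

// Reload value for the sample timer, assuming a prescaler of 1 cycle. The
// timer overflows once per sample, i.e. every round(cpu / sample) cycles.
// Assumes a sample rate of at least 256 Hz so the count fits in 16 bits.
u16
dsound_timer_reload(u32 sample_frequency) {
    u32 cycles_per_sample =
        (CPU_FREQUENCY + sample_frequency / 2) / sample_frequency;
    return (u16)(0x10000 - cycles_per_sample);
}
```

For a 16 kHz sample this yields 1049 cycles per sample and a reload value of 0xFBE7.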

Timers

In addition to using the VBlank as a timer, the GBA has access to 4 clock timers. These timers are based on the CPU frequency (16.78 MHz), in which one clock cycle takes 59.6 ns. We can configure the timers to tick at 4 different intervals: every 1, 64, 256 or 1024 cycles. Mixing and matching these timers we can create a wide variety of frequencies. Two registers are used for each of the 4 timers: TIMER_DATA_x, starting at 0x04000100, used for accessing the timer output, and TIMER_CTRL_x, starting at 0x04000102, used for configuring the timers. Each of these stores a u16, and to access timer number x we need to add 0x4 * x to those base addresses:

#define TIMER_DATA_0  *((volatile u16*)(0x04000100 + 0x04 * 0))
#define TIMER_DATA_1  *((volatile u16*)(0x04000100 + 0x04 * 1))
#define TIMER_DATA_2  *((volatile u16*)(0x04000100 + 0x04 * 2))
#define TIMER_DATA_3  *((volatile u16*)(0x04000100 + 0x04 * 3))
#define TIMER_CTRL_0  *((volatile u16*)(0x04000102 + 0x04 * 0))
#define TIMER_CTRL_1  *((volatile u16*)(0x04000102 + 0x04 * 1))
#define TIMER_CTRL_2  *((volatile u16*)(0x04000102 + 0x04 * 2))
#define TIMER_CTRL_3  *((volatile u16*)(0x04000102 + 0x04 * 3))

TIMER_CTRL_x uses only the lower 8 bits, despite being stored as u16. Here is the reference table:

bits  name  description
0-1   Fr    Timer frequency. 0-3 for 1, 64, 256, or 1024 cycles, respectively.
2     CM    Cascade mode. When the counter of the preceding (x−1) timer overflows (REG_TM(x-1)D= 0xffff), this one will be incremented too. A timer that has this bit set does not count on its own, though you still have to enable it. Obviously, this won't work for timer 0. If you plan on using it make sure you understand exactly what I just said; this place is a death-trap for the unwary.
6     I     Raise an interrupt on overflow.
7     En    Enable the timer.

Source: https://www.coranac.com/tonc/text/timers.htm

When reading the u16 value from TIMER_DATA_x we obtain the current value of the counter. Writing to that address does not set the counter directly; instead, it sets the initial value that the timer is loaded with when it is enabled or when it overflows. Since a timer is reset to the value set in TIMER_DATA_x every time it is enabled, if we want to pause and then resume the timers, we can enable cascade mode instead and then disable it when resuming.
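Because the value written to TIMER_DATA_x is the starting point of the count, the time until the next overflow is determined by how far that value is from 0x10000. The two helpers below (hypothetical names) make that relation explicit.

```c
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

// Reload value that makes a timer overflow after `ticks` increments
// (valid for 1 to 65536 ticks).
u16
timer_reload(u32 ticks) {
    return (u16)(0x10000 - ticks);
}

// CPU cycles between two overflows for a given reload value and timer
// frequency setting (1, 64, 256 or 1024 cycles per tick).
u32
timer_period_cycles(u16 reload, u32 cycles_per_tick) {
    return (0x10000 - (u32)reload) * cycles_per_tick;
}
```

For example, a reload of 0xC000 at 1024 cycles per tick overflows every 2^24 cycles, which is once per second.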

Using timers we can create a simple profiling function that counts the number of cycles between the profile_start() and profile_stop() function calls.

// We use timers 2 and 3 to count the number of cycles since profile_start()
// was called. Don't use them if the code we are trying to profile makes use
// of these timers.
static inline
void profile_start() {
    TIMER_DATA_2 = 0;
    TIMER_DATA_3 = 0;
    TIMER_CTRL_2 = 0;
    TIMER_CTRL_3 = 0;
    // Timer 3 counts the overflows of timer 2, giving a 32 bit counter.
    TIMER_CTRL_3 = TIMER_CTRL_ENABLE | TIMER_CTRL_CASCADE;
    TIMER_CTRL_2 = TIMER_CTRL_ENABLE;
}

static inline
u32 profile_stop() {
    // Stopping timer 2 also freezes timer 3, since it only counts overflows.
    TIMER_CTRL_2 = 0;
    return ((u32)TIMER_DATA_3 << 16) | TIMER_DATA_2;
}
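The cycle count returned by profile_stop() can be turned into wall-clock time without a division, since the clock runs at 2^24 Hz. cycles_to_us is a hypothetical helper and assumes the timers were running at 1 cycle per tick, as in the code above.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

// Converts a cycle count to microseconds (rounded down), assuming a
// 2^24 Hz clock: us = cycles * 10^6 / 2^24.
u32
cycles_to_us(u32 cycles) {
    return (u32)(((u64)cycles * 1000000u) >> 24);
}
```

Using a shift instead of a division matters here because the GBA has no hardware divider.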

Resources