Fixed-Point Numbers


Introduction

Modern computers have access to dedicated floating point units, but older machines (such as the GBA) do not. On such hardware, if we require fractional numbers in our code we have to represent them with integers. Used directly, integers cause a loss of precision, which is why fixed-point numbers are used.

Fixed-point numbers are integer representations of fractional numbers. Normally you need to know how many bits are used for the integer part and how many for the fractional part. For example, if we want to represent the number 123.4567 we need 7 bits for the integer part (to hold 123) and enough fractional bits for the accuracy we are after: with f fractional bits the smallest step we can represent is 2^-f, so more fractional bits means a closer approximation. We normally denote the number of bits used for the integer part (i) and the fractional part (f) as (i.f). For example, if we are using 32 bit numbers and we want to have 10 bits of fractional precision, the notation to describe these numbers would be (22.10). Note that in some cases signed numbers are described as (1.i.f), but this doesn't mean that we have a separate sign bit; we are simply using ordinary two's complement integers.
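To make this concrete, here is a minimal sketch of how a (22.10) value could be built from a real constant in C (the names fp_t, FP_SHIFT, FP and fp_to_double are only illustrative, not from any particular library):

    #include <stdint.h>

    typedef int32_t fp_t;            /* (22.10): 10 fractional bits */
    #define FP_SHIFT 10

    /* Scale a real constant by 2^FP_SHIFT, rounding to nearest. */
    #define FP(x) ((fp_t)((x) * (1 << FP_SHIFT) + ((x) >= 0 ? 0.5 : -0.5)))

    /* Convert back to a double, e.g. for debugging or printing. */
    static inline double fp_to_double(fp_t a) {
        return (double)a / (1 << FP_SHIFT);
    }

For example, FP(123.4567) evaluates to 126420, and fp_to_double(126420) gives back roughly 123.457, which shows the small error introduced by only having 10 fractional bits.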

It is a bit difficult to see this representation with decimal numbers, but using hex it should be clear. For example, the 32 bit number 0xAABBCCDD at (20.12) would have 0xAABBC as the integer part and 0xCDD as the fractional part. This means building a fixed-point number is as simple as shifting the integer part left and adding the fractional part to it: fixed_num = (0xAABBC << 12) | 0xCDD. Likewise, we can get the fractional part with an AND mask and the integer part with a shift down: integer_part = 0xAABBCCDD >> 12;, fractional_part = 0xAABBCCDD & 0xFFF;. This only applies to unsigned numbers, since with signed values the sign has to be taken into account and simply masking the low bits would effectively destroy the sign.
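Putting the example above into code, a rough C sketch (the names FRAC_BITS, FRAC_MASK and fixed_num are made up for illustration) could look like this:

    #include <stdint.h>

    enum { FRAC_BITS = 12, FRAC_MASK = (1 << FRAC_BITS) - 1 };      /* (20.12) */

    void example(void) {
        uint32_t fixed_num       = (0xAABBCu << FRAC_BITS) | 0xCDDu; /* 0xAABBCCDD */
        uint32_t integer_part    = fixed_num >> FRAC_BITS;           /* 0xAABBC */
        uint32_t fractional_part = fixed_num & FRAC_MASK;            /* 0xCDD */
    }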

One important consideration with fixed-point numbers is that the chosen integer precision limits the largest number we can represent. With that said, nothing stops us from shifting the fractional point around as needed in our algorithms.
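As a small sketch of moving the fractional point, converting a value between two precisions is just a shift (the function names are made up for illustration; shifting signed values like this relies on arithmetic shifts, which is what GCC and typical GBA toolchains provide):

    #include <stdint.h>

    /* (20.12) -> (24.8): drop 4 fractional bits to gain 4 bits of integer range. */
    static inline int32_t fp12_to_fp8(int32_t a) { return a >> 4; }

    /* (24.8) -> (20.12): restore the format (the dropped precision is lost). */
    static inline int32_t fp8_to_fp12(int32_t a) { return a << 4; }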

Math

Working with fixed-point numbers requires certain care to ensure the correctness of arithmetic operations. For addition and subtraction, the numbers must have the same fixed-point representation, that is, the same number of bits used for the fractional part. When multiplying fixed-point numbers, the fractional scales also multiply, which translates to a right shift after the multiplication (fpa * fpb = (A * B) >> f). To keep the highest accuracy when dividing, we shift the scale up before we divide (fpa / fpb = (A << f) / B).
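A short sketch of these operations in C, assuming a single format with FP_SHIFT fractional bits (the function names are my own; the 64-bit intermediates are there to avoid overflowing the 32-bit result, and the shifts on signed values again rely on the usual GCC behaviour):

    #include <stdint.h>

    enum { FP_SHIFT = 8 };   /* example format: (24.8) */

    /* Addition and subtraction work directly when both operands share the format. */
    static inline int32_t fp_add(int32_t a, int32_t b) { return a + b; }
    static inline int32_t fp_sub(int32_t a, int32_t b) { return a - b; }

    /* (A * 2^f) * (B * 2^f) = (A * B) * 2^(2f), so shift back down by f. */
    static inline int32_t fp_mul(int32_t a, int32_t b) {
        return (int32_t)(((int64_t)a * b) >> FP_SHIFT);
    }

    /* Shift the dividend up by f first so the quotient keeps its fractional bits. */
    static inline int32_t fp_div(int32_t a, int32_t b) {
        return (int32_t)(((int64_t)a << FP_SHIFT) / b);
    }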

Resources