Whether you like it or not eventually you will have to apply some mathematics in your programs. Maybe you’re writing a game and have to simulate the player moving around in the world. You may have to convert between different currencies in a financial application, sample large datasets, or simply get HTML elements to appear where you want them on your website.

While you have probably studied both maths and physics in school it is easy to forget and while learning to program it is not always obvious how a computer deals with numbers.

I have personally never been good at math or physics, most likely because I never liked it growing up. I think this was because I didn’t understand the point of it. I had nothing to apply what I learned, so I was never interested. However, as I learned more about computer science and programming I started to find areas where math could be applied, and just like programming, I found that there is a system behind mathematics and it can be not only useful but fun if you invest some time into it.

This series is going to act as a journal for when I dive into getting better at mathematics and physics as a programmer. I am going to start from the basics and also apply what I learn through programming all the way.

With that said let’s start.

## Numbers

What are numbers in the eyes of a computer and what can they do for us?

Let us start with the thought that there are different types of numbers. These types are divided into several subsets where a set is a collection of elements where each element within the set must be unique.

The first set that most people are taught is the set of natural numbers which is represented by the symbol \({\displaystyle \mathbb {N} }\). The numbers in this set are the positive numbers starting from one (1, 2, 3, 4, 5 …). Some mathematicians include 0 in this set.

Adding the set of negative numbers gives us the set of integers that we as programmers know as an `int`

in most programming languages. The symbol for this set is denoted as \({\displaystyle \mathbb {Z} }\).

Negative numbers are interesting, I learned to think about negative numbers in terms of transactions or trades. So for example, if you have 5 apples and you want to give 2 of them to Johnny it is the same as Johnny giving you -2 apples. This idea of two approaches can be extended to work for multiplication as well. If Johnny gives you -2 apples -2 times it is the opposite of him doing it 2 times so it is the opposite of -4 giving you 4.

It might take some time to let that sink in. If you don’t get it right away then you shouldn’t get discouraged because it took centuries for people to accept this.

Beyond the set of integers, there is also the set of rational numbers using the symbol \({\displaystyle \mathbb {Q} }\) which includes the previous sets we talked about as well as the set of fractions.

Moving on we also have irrational numbers which are numbers that cannot be expressed by the quotient of two integers, a good example is the square root of 2. Rational and irrational numbers form the set of real numbers using the symbol \({\displaystyle \mathbb {R} }\). These are the sets that we will focus on for now.

The symbols for the different sets are something you can forget because we most likely won’t use them beyond this point. However, if you are planning some math on paper or discussing it with a mathematician it can be useful to speak the same language.

### Representing numbers

When we’re talking about the number 4 for example there are different ways of representing this number. You can represent it with a geometric figure as a rectangle but you can also just draw 4 dots on paper or simply just use the symbol “4”. Different representations have their advantages and disadvantages, sometimes a problem is easier solved if you can see it visually.

Different representations of the number 4.

There is no universal symbol for each number, instead, we use a limited set of symbols that we can combine to represent any given number. This is done by using a base system, the numbers in our daily life are written in base 10 but nothing is stopping us from using any other base.

To any number N in any base B you can apply an algorithm like the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14

function NumberToBase(N, B)
{
ConvertedNumber = [];
Index = 0;
while (N != 0)
{
ConvertedNumber[index] = N % B;
N = N / B;
++Index;
}
// Use the ConvertedNumber array to build a number or string
}

An actual example of a C implementation may look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

char *NumberTable[] =
{
"0", // 0
"1", // 1
"2", // 2
"3", // 3
"4", // 4
"5", // 5
"6", // 6
"7", // 7
"8", // 8
"9", // 9
"A", // 10
"B", // 11
"C", // 12
"D", // 13
"E", // 14
"F", // 15
};
char *
NumberToBase(int Number, int Base)
{
char *Result = calloc(1, 1024);
int At = 0;
while (Number != 0)
{
int ConvertedNumber = Number % Base;
Number = Number / Base;
PrependString(Result, NumberTable[ConvertedNumber]);
}
return Result;
}

How these algorithm works are quite simple, if you were to enter the numbers 462 for the Number argument and 10 for the Base, the remainder when dividing Number and Base would be 2. Then by dividing the Number by 10 we get the number 62 which is used to repeat the process giving us the number 462 when done.

The same operation can be done in reverse:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

int
BaseStringToValue2(char *Digits, int Base)
{
int StringLength = strlen(Digits);
if (StringLength == 0)
{
return 0;
}
int Result = 0;
int Power = 0;
while (StringLength != 0)
{
char C = Digits[StringLength - 1];
int Number = ToNumber(C);
Result += pow(Base, Power) * Number;
StringLength--;
Power++;
}
return Result;
}

If you feed the string “3572” and base 10 this algorithm will go through the number starting from 2 and yield \(2 * 1 + 7 * 10 + 5 * 100 + 3 * 1000\).

There is not too much special about using 10 as a base for your numbers other than the fact that we have 10 fingers to count with. Different bases have their advantages and can be seen used in different areas, for example, computers use base 2 because of how it maps well into the transistors within the hardware. Base 12 is used when talking about feet and inches, the clock often uses base 60 and when talking about degrees we are using a base of 360.

## How computers represent numbers

As we mentioned earlier, computers represent numbers in binary. This is ideal as down to the hardware the computer is built on the concept of “switches” that can be either turned on or off, this is represented as 1 or 0 respectively. Each switch represents one bit of information, the word bit here is short for “Binary digit”. With 8 bits you can represent any number from 0 to 255 while with 32 bits you can represent any number up to 4294967295 and with 64 bits you can represent numbers up to 18446744073709551615. The numbers 32 and 64-bit are something you may recognize, CPUs some years ago used to support 32-bit and you would have a 32-bit version of Windows installed while nowadays you most likely have a 64-bit CPU which means its natively working with 64-bit integers instead of 32-bit ones.

This does not mean that every single bit is used, usually, one bit is set aside to represent if the number is positive or negative. This means that a 32-bit computer would only deal with numbers from -2147483647 to 2147483647 instead of only a positive range up to 4294967295.

Binary addition works simply by adding every digit in the same column and if both are 1s you use a carry digits that cary over, example of this is when you perform addition like \(1 + 1 = 10\).

Subtraction works almost the same but you borrow from the next column if the column contains a 0 instead of a carry like with addition, for example: \(10 - 1 = 1\) and \(101 - 10 = 11\).

Arithmetic operations on binary numbers are quite trivial and are usually implemented using logical operators AND, OR, and NOT. The tables below show the different combinations of bits for the different operations.

### AND

The AND operator yields 1 only if both operands are 1

A | B | Result |
---|---|---|

1 | 0 | 0 |

0 | 1 | 0 |

1 | 1 | 1 |

0 | 0 | 0 |

### OR

The OR operator yields one if one or the other operand is 1.

A | B | Result |
---|---|---|

0 | 1 | 1 |

1 | 0 | 1 |

1 | 1 | 1 |

0 | 0 | 0 |

### NOT

The NOT operation yields 1 if the bit is set to 0 vice versa.

Bit | Result |
---|---|

0 | 1 |

1 | 0 |

## Floating point

At some point reading this you may have asked yourself how floating point numbers fit into all of this that we’ve talked about. If computers only know binary numbers 0 and 1 how can we represent a value that is less than 1 but not 0? To do this we do just like we do in real life, we use a marker known as the decimal point (also known as the radix point).

In base 10 the numbers after the decimal point represents tenths, hundreds. thousands and so on. In binary they represent halves, quarters and so on. Knowing this the number \(1\frac{3}{8}\) is represented in binary as \(1.011 (1+\frac{1}{4}+\frac{1}{8})\). While this seems sane there is a fatal problem with this and that is not all fractions can be represented this way. For example in decimal its hard to represent \(\frac{1}{3}\) because it would consist of an infinite series of \(\frac{3}{10} + \frac{3}{100} + \frac{3}{1000} + ...\) hence we oten just write \(0.333...\)

In binary the situation is even worse because you cannot represent any fraction whose denominator is divisible by any number other than 2. An example of this is \(\frac{1}{5}\) which in decimal equals to exactly \(0.2\) but in binary it ends up as an indefinitely repeating number: \(0.001100110011...\) Because theese types of number doesn’t have a clear definition when to stop computers have troubles with them.

There are ways around this that have been tried in the past. One way is to forget the radix point and simply represent the numbers as fractions, some programming languages have experimented with this. While every number can be represented exactly there are some problems when adding incompatible fractions resulting in increasing the complexity of the fraction with each step in a computation, easily reaching the maximum size for integers, and even if you don’t reach the maximum size the computations will be slow because of the size of the fractions.

Another problem is when you’re working with irrational numbers which cannot be expressed exactly as a fraction, instead you are forced to approximate the number. Computers use the base system to represent non-integers and round them to the nearest digit.

There may be problems with this as well but other problems can occur if for example fixed-point representations are used instead. What this means is that only a fixed amount of digits are used after the radix point.

Since you only have a limited number of bits for each number, each bit added to the right of the radix point means one bit less to use at the left. As a result, you are limiting how large and how small numbers can be. Instead what most programming languages do is allow the radix point to be moved.

### Scientific notation

Similar to this moving radix point is scientific notation, a representation which uses the first significant (non-zero) digit followed with as many digits you need after that, then you specify the position of the radix. For example the number \(305.57\) could be represented as \(30557;3\). Using scientific notation this is commonly represented as \(3.0557 e^2\). If you follow the convention you start with the radix after the first digit and then go on from there, \(e^2\) in this case is the exponent of the number.

### IEEE Standard

IEEE which stands for the Institute of Electrical and Electronic Engineers, is an organization that has created many standards which are commonly used by computer manafacturers and software developers. Theese standards are important for creating a common ground between different hardware, programming languages and systems.

An IEEE standard was established for using floating point numbers that specified that if you are working with 32 bits you should use 1 bit for the sign (positive or negative) and 8 bits for the position of the radix point (from -127 to +127) which gives a total of 255. The value of the actual number can go as high as just below \(2^{128}\) the remaining bits (23) represent the mantissa, which is the actual digits of the number. This simply works because the first significant digit of a binary number is always 1, there is no need to include this digit in the computer’s representation of the number, it is simply always implicitly there.

A great visualization for this can be seen as:

**Source: https://en.wikipedia.org/wiki/IEEE_754**

You can think of the exponent indicating the size of the number while the mantissa or fraction in the above image determines the precision of the number.

A fun little trick you can do in C++ to actually see this in action is with this code snippet:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

#include <stdio.h>
struct FloatBits
{
unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
unsigned int exp:8; /**< Value is 2^(exp-127) */
unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union FloatUnion
{
float f;
FloatBits b;
};
int main()
{
FloatUnion s;
s.f=8.0;
printf("%f = sign %d exp %d fract %d\n", s.f, s.b.sign, s.b.exp, s.b.fraction);
return 0;
}

How this snippet works is that we simply define a struct for the bits just as how we specified in the image and description above, this can be done using C++ bit fields. The order of the fields is backward to what you might have imagined, this is because the machine we are working on is little-endian.

We then define a union for the float primitive as well as our FloatBits struct, this allows us to create a primitive float and see the same representation in our data type.

The calculation of the number looks like: \(value = (-1) * sign * 2 (exponent-127) * 1.fraction\)

For example the value \(8.0\) specified in the example above would be stored with the sign 0, exponent 130 \((3+127)\) and the mantissa would just be zeroes because the leading 1 is implicit.

Once again problems arise because no matter how precise we make these floating point numbers they will rarely be exact. This means that errors in calculations can happen, while this is not as common anymore it can still easily happen. These are mostly because of rounding errors where a number cannot be represented exactly and is rounded to the nearest number. Performing heavy math and scientific calculations will be a major problem. Sometimes it can help to move things around in your equations but the problem is still there. Nowadays however compilers tend to take care of this for us so rounding errors are less common.

Another problem is that floating point calculations use more expensive instructions than integers hence can result in worse performance than just using integers. However modern processors are becoming better and better at doing this as well but it is something to keep in mind.

If you really want to dive into floating point and its ups an downs I can heavily recommend this article: What Every Computer Scientist Should Know About Floating-Point Arithmetic