Computer Memory Part 4

Jun 15, 2022

In this article, we will finally get our hands dirty by writing some code. We’ve now learned about how the CPU works and how it deals with memory so now it is time to discuss what programmers can do with this knowledge.

Programming in a way that utilizes the CPU cache efficiently is a whole subject on its own. I will for sure not be able to cover everything in this subject but hopefully, at least get you started.

Data-oriented design

This is a subject that is quite popular within the game development space where you focus on the data layout of your application. Some may think that the compiler is smart enough to do this for them but this is not the case as compilers lack the bigger picture it needs to optimize the data access and data flow of the application.

Optimize for size

One part of the data-oriented design is to optimize your data structures for size. Why this can be important because if your data structure is small then more of them can fit within a single cache line.

An example of how this may look in practice is pretty simple. The compiler has explicit alignment for primitive types and their structures.

This means for example that an int can never be on address 0x15 (unless you specifically tell it to have a different allignment) because it is aligned to 4 bytes so the only aligned memory addresses would be: 0, 4, 8, 0xC, 0x10, 0x14, 0x18 …

Putting that same int on any misaligned address would be undefined behavior and the code would potentially not work at all. With that in mind look at the following code snippet:

struct example
{
  char  A;
  uint32_t B;
};

Since a char is 1 byte the 32-bit integer here would be misaligned if the compiler did not introduce padding between the two fields. To see what the compiler does behind the scenes there is a flag for MSVC that allows you to print out the layout of your data structure called /d1reportSingleClassLayout which if we use we get the following output:

class example   size(8):
    +---
 0  | A
    | <alignment member> (size=3)
 4  | B
    +---
Compiler returned: 0

This padding can lead to the structure being larger than it needs to be based on in which order you place your fields in that struct. Let’s take another example:

struct example
{
    uint32_t A;
    uint64_t B;
    uint32_t C;
};

/*
class example   size(24):
    +---
 0  | A
    | <alignment member> (size=4)
 8  | B
16  | C
    | <alignment member> (size=4)
    +---
Compiler returned: 0
*/

This would result in a structure of 24 bytes due to the padding being introduced between A and B as well as after C. However if we re-order our fields like this:

struct example
{
    uint64_t B;
    uint32_t A;
    uint32_t C;
};

/*
class example size(16):
    +---
 0  | B
 8  | A
12  | C
    +---
Compiler returned: 0
*/

And completely get rid of the padding and get a smaller resulting data structure which in turn we can fit more of in the cache.

Spatial locality

Another common concept in data-oriented design is to start thinking about how your data is accessed. By now you should understand that processing data sequentially in memory is more efficient that having to jump around to different memory locations. A simple way to show this is when organizing data in terms of structures of arrays (SOA) versus arrays of structures (AOS). Data structure adopting this mindset looks like this:

#define SIZE 16384

typedef struct entityaos_
{
    uint64_t Power;
    uint64_t Health;
    uint64_t Speed;
} entityAOS;

typedef struct entitysoa_
{
    uint64_t Power[SIZE];
    uint64_t Health[SIZE];
    uint64_t Speed[SIZE];
} entitySOA;

What happens here is that instead of allocating an array of entities we have a data structure that has all the different fields an entity can have in its arrays. The implications these two methods have on memory access are very different and can be easily spotted when visualized.

Now instead of processing each entity separately, we would instead process their fields. This allows us to deal with the specific field as a contiguous block of memory which helps us with fitting more of that property into the cache. I wrote a small piece of code to showcase this, here is two dummy functions for the two structs presented above which go over each entity and sum their power level.

#define SIZE 16384

typedef struct entity1_
{
    uint64_t Power;
    uint64_t Health;
    uint64_t Speed;
} entityAOS;

typedef struct entity2_
{
    uint64_t Power[SIZE];
    uint64_t Health[SIZE];
    uint64_t Speed[SIZE];
} entitySOA;

void InitEntityAOS(entityAOS *E)
{
    E->Power = rand();
    E->Health = rand();
    E->Speed = rand();
}

void InitEntitySOA(entitySOA *E)
{
    for (int i = 0; i < SIZE; ++i)
    {
        E->Power[i] = rand();
        E->Health[i] = rand();
        E->Speed[i] = rand();
    }
}

uint64_t SumEntitiesAOS(entityAOS *Entities, int Number)
{
    uint64_t Sum = 0;
    
    for (int i = 0; i < Number; ++i)
    {
        Sum += Entities[i].Power;
    }
    
    return (Sum);
}

uint64_t SumEntitiesSOA(entitySOA *Entities, int Number)
{
    uint64_t Sum = 0;
    
    for (int i = 0; i < Number; ++i)
    {
        Sum += Entities->Power[i];
    }
    
    return (Sum);
}

The result of benchmarking this can be seen in the following graph where the lower bar means faster. Here processing the SOA-based entities is about 3,5 times faster than processing them as AOS. A small disclaimer here is that entity is just an example and probably not something I would write like this in the real world as it would get tedious.

Intrinsics

Moving on we will be using something called intrinsics. What this is is a specific compiler function not part of any library but the compiler itself. These intrinsic functions can be thought of almost like an inline function that the compiler exposes to you when programming. We will specifically use intrinsics that are directly mapped to the x86 single instruction multiple data (SIMD) instruction set. SIMD is a way of operating on multiple data at once. For example in the previous section when we are adding all the power values for the entities instead of looping over each one we can loop over four at a time and add all of them together increasing the performance even further. However this type of optimization I would expect the compiler to try and do for us in this specific example due to its simplicity.

Figure from H. Jeong, S. Kim, W. Lee, and S.-H. Myung, “Performance of sse and avx instruction sets,” arXiv preprint arXiv:1211.0820, 2012.

There are two main SIMD instruction sets available called SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) which offer 128-bit XMM registers and 256-bit YMM registers respectively.

Initially, SSE offered a set of 70 instructions that number was later increased with the releases of SSE2, SSE3, SSSE3, and SSE4. The XMM registers are displayed in the figure above and as you can see this instruction set allows for operating on four values at once since each register is 128-bit (16 bytes).

The AVX instruction set uses the YMM registers which can fit double the size of an XMM register which means that eight values can be stored in one register which is 256-bit (32 bytes) in size. In addition to its size, the AVX instruction set also provides the option for three input arguments which allows functions with the format of c=a+b instead of the normally used format a=a+b which SSE is restricted to.

Important to note when using SIMD instructions is to keep track of what throughput the instructions used have and how that affects your application. An example is the AVX instruction “VDIVPD” whose latency is twice as high as the alternative SSE instruction “DIVPD”. Using the wrong instructions can have a negative impact and become a performance bottleneck.

A great resource for available intrinsics is available on Intel’s intrinsic guide. I will probably make a separate series on SIMD so I won’t touch more on this subject here but if you are confused by the _mm instructions in the following sections do know that they are intrinsic functions and their documentation is available in the intel’s guide.

Bypassing the cache

Sometimes we are working with data that will not be immediately used or with large data structures and because each store operation will read a full cache line first and then modify that cached data if we have a large enough structure that means that the earlier parts of it will be evicted from the cache as we process the later parts of the structure and this makes caching the writes ineffective.

To solve this we have something called non-temporal write operations. If you remember that temporal here refers to re-use within a relatively small time duration and since the operation is non-temporal there is no reason to cache it. These operations do not read a cache line and then modify it, instead, they write new content into memory directly.

Following are a couple of code snippets taken from Ulrich Drepper’s paper which perfectly display how to do this. In the first one we can see the available intrinsics to perform the non-temporal writes:

#include <emmintrin.h>
void _mm_stream_si32(int *p, int a);
void _mm_stream_si128(int *p, __m128i a);
void _mm_stream_pd(double *p, __m128d a);
#include <xmmintrin.h>
void _mm_stream_pi(__m64 *p, __m64 a);
void _mm_stream_ps(float *p, __m128 a);
#include <ammintrin.h>
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);

The next snippet shows how to use these intrinsics in order to initialize a matrix which can be useful when doing math or game development:

#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

This is also how the memset function within the C standard runtime deals with writing large blocks of memory. There is a single instruction available _mm_stream_load_si128for loading with the same non-temporal memory hint which means that you do have the option of loading memory without risking cache eviction.

Final notes

Using what we’ve learned so far in the series we should have all the tools available to understand how we can optimize our application’s cache usage. While the series has not presented every single optimization technique I think it has successfully laid the foundation for thinking about memory.

Moving on from here involves a lot of reading and experimentation, every application is different and hence needs different types of optimizations. Unfortunately for these types of things, there is no universal solution but to think about your data and how it is moving around in your application.

Nullsson

Discussion about this post