In the last article we learned about CPU caches and their different levels. We also covered cache lines and how they are stored and retrieved and, last but not least, touched on multiprocessing with a brief description of the MESI protocol.
In this article we will dive into virtual memory as well as non-uniform memory access (NUMA). I expect this to be the last article before we start to actually discuss code.
The virtual memory subsystem is part of the processor and provides a virtual address space to each process, which makes each process think it’s alone in the system. From the process’s perspective, main storage becomes one contiguous address space or a collection of contiguous segments of memory.
The part of the CPU that implements the virtual address space is the Memory Management Unit (MMU). The operating system manages the virtual address spaces and backs them with real memory, while the CPU automatically translates virtual addresses into physical addresses. A very beneficial consequence is that the memory available to processes can be extended beyond the capacity of the real main memory by utilizing disk storage.
The way the CPU looks up a physical address is by walking a set of hierarchically organized directories.
Taken from Drepper, Ulrich. “What Every Programmer Should Know About Memory.” (2007).
The above figure displays how a 64-bit address maps into the different directories. First, the top bits of the address are extracted and used as an index into the root directory, which is the L4 directory in this image.
Inside the L4 directory we determine whether the entry is valid. If it’s not, the lookup stops with a page fault; if it is, the entry tells us which L3 directory to look at next. The process is then repeated using the Level 3 index in the L3 directory, and so on down the levels. At the end, the entry in the L1 directory gives the location of the page in physical memory, and the remaining bits of the virtual address are used as an offset within that page.
All the caching we’ve gone through earlier is great because we don’t have to read main memory that often, but if we had to perform four reads instead of one just to look up the physical address, that would be very inefficient.
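The walk described above can be sketched as a toy model. This is only an illustration under stated assumptions: it assumes 4KB pages (12 offset bits) and four levels with 9 index bits each, which is the usual x86-64 layout but is not stated in the figure, and it uses nested dictionaries as stand-ins for the directories. The names `split_address`, `translate` and `PageFault` are hypothetical helpers made up for this example.

```python
OFFSET_BITS = 12  # assumption: 4KB pages
INDEX_BITS = 9    # assumption: 512 entries per directory
LEVELS = 4        # L4 (root) down to L1


class PageFault(Exception):
    """Raised when a directory has no valid entry for an index."""


def split_address(vaddr):
    """Split a virtual address into [L4, L3, L2, L1] indexes and a page offset."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indexes = []
    for level in range(LEVELS, 0, -1):
        shift = OFFSET_BITS + INDEX_BITS * (level - 1)
        indexes.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    return indexes, offset


def translate(root, vaddr):
    """Walk the toy directories and return the physical address."""
    indexes, offset = split_address(vaddr)
    entry = root
    for index in indexes:
        if index not in entry:
            raise PageFault(hex(vaddr))
        entry = entry[index]
    # After the L1 lookup, entry is the physical base address of the page.
    return entry + offset


# Toy mapping: one virtual page mapped to the physical page at 0x5000.
root = {1: {2: {3: {4: 0x5000}}}}
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xABC
print(hex(translate(root, vaddr)))  # 0x5abc
```

Each level costs one lookup, so a full translation here takes four directory reads before the actual memory access can happen.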
To solve that we have another cache called the Translation Look-Aside Buffer (TLB). TLBs are multi-level caches for virtual address translations. On my machine, which has a Haswell-based Intel CPU, there are two levels of TLBs, an L1 and an L2. The L1 has 64 entries for 4KB pages, 32 for 2MB pages, and 4 for 1GB pages. Each entry stores the translation for one page, so a 4KB entry covers 4096 physical bytes. The 64 4KB entries therefore cover a total of 256KB of memory whose addresses can be translated without walking the directories.
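To make the arithmetic concrete, here is a small sketch that computes how much memory each group of L1 TLB entries can cover, using only the entry counts quoted above. The helper name `tlb_reach` is made up for this example.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# (number of entries, page size) for the L1 TLB figures quoted in the text
L1_TLB = {
    "4KB": (64, 4 * KB),
    "2MB": (32, 2 * MB),
    "1GB": (4, 1 * GB),
}


def tlb_reach(entries, page_size):
    """Bytes of memory whose translations fit in the TLB at once."""
    return entries * page_size


for name, (entries, page_size) in L1_TLB.items():
    print(name, tlb_reach(entries, page_size) // KB, "KB")
```

With 4KB pages the covered range is 256KB, while the same calculation for 2MB pages gives 64MB and for 1GB pages gives 4GB.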
The 2MB and 1GB pages are often referred to as “huge pages”. They are somewhat tedious to set up: for example, they need administrator privileges on Windows, and the allocation can fail if you request more than the memory that is available. But they can be quite worthwhile, as each TLB entry then covers a much bigger range of memory, which helps you avoid TLB misses.
With all this said, the TLB is not something developers think about much, as it gets flushed on context switches.
Virtual memory gives the developer quite some power to do some very interesting things. When allocating memory through the virtual memory system, the memory is not actually allocated physically. Instead the address range is just reserved so that no other code can use it, and physical memory is only assigned once we start operating on the allocated pages.
Most commonly in ArrayList implementations, the capacity of the underlying array is first set to a default value like 4. Once that capacity is reached, a new array of double the size is allocated and the existing elements are copied over.
Using virtual memory we don’t have to go through this hassle. Instead, we can allocate a very big array from the start, for example one with room for 1 million objects.
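To get a feeling for what the doubling strategy costs, the following sketch simulates it without storing any data. It only counts how often the array would be reallocated and how many elements would be copied. The function name `simulate_growth` and the default capacity of 4 are assumptions for this illustration, not taken from any particular ArrayList implementation.

```python
def simulate_growth(n_items, initial_capacity=4):
    """Simulate appending n_items to an array that doubles when full.

    Returns (final capacity, number of reallocations, elements copied).
    """
    capacity = initial_capacity
    size = 0
    reallocations = 0
    copied = 0
    for _ in range(n_items):
        if size == capacity:
            # Array is full: allocate double the size and copy everything.
            capacity *= 2
            copied += size
            reallocations += 1
        size += 1
    return capacity, reallocations, copied


capacity, reallocations, copied = simulate_growth(1_000_000)
print(capacity, reallocations, copied)
```

For 1 million appends starting at capacity 4, this model performs 18 reallocations and copies a little over 1 million elements in total.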
This can be done using VirtualAlloc on Windows or mmap on Linux.
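As a rough sketch of the idea, Python’s standard `mmap` module can create an anonymous mapping, which uses mmap on Linux under the hood. The object size of 64 bytes and the helper name `reserve` are assumptions for this example. Note also that whether untouched pages consume physical memory depends on the operating system: on Linux they are typically only backed by physical memory once written to, and this sketch does not demonstrate the separate reserve and commit steps that VirtualAlloc offers on Windows.

```python
import mmap

OBJECT_SIZE = 64        # assumption: bytes per object, for illustration only
OBJECT_COUNT = 1_000_000


def reserve(n_bytes):
    """Create an anonymous, zero-filled mapping rounded up to whole pages."""
    pages = -(-n_bytes // mmap.PAGESIZE)  # ceiling division
    return mmap.mmap(-1, pages * mmap.PAGESIZE)


buf = reserve(OBJECT_SIZE * OBJECT_COUNT)

# Writing to the first object only touches the first page of the mapping.
buf[0:4] = b"\x01\x02\x03\x04"
print(len(buf), buf[0:4])
```

The array never has to be reallocated or copied as it fills up, because the full address range is available from the start.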
Non-uniform memory access (NUMA) is something we will only mention briefly.
NUMA comes into play when you are dealing with multi-chip solutions where you have more than one physical CPU. This is very common for servers or high-end machines.
The communication between these CPUs is expensive, and each CPU has its own set of RAM plugged directly into it. The RAM plugged into all of those CPUs together becomes the RAM of the computer, so it is possible for one CPU to access the RAM plugged into another, but this is of course slower. So non-uniform in this case means that memory in the system can’t be accessed in a uniform way: the access cost depends on which CPU the memory is plugged into and where your application is running.
Compared to a single shared memory, NUMA can improve performance by a factor that depends on the number of processors available.