Cache-friendly code: experimenting with cache hit and miss in C

I think I said a while ago that I would, one day, make a post about cache-miss and hit in my blog, or maybe not. Anyway, guess what this post is going to be about... Firstly, what is the CPU clock? Putting it in simple terms, it is a regular pulse of energy that keeps the pace of the operations in the CPU, somewhat similar to a drummer in a band: the faster the clock, more operations per second we have. However, this doesn't necessarily mean that we are going to complete every operation in one clock: some operations are more costly than others. For example, retrieving information from the hard drive is very costly in terms of operations - the CPU will request information from the hard drive and wait for hundreds or thousands (I actually don't know how many, but it is way too many) of clock cycles until the information arrives. To avoid using the slow hard-drive too much, we can also store information in DRAM, which is much faster than the hard drive. The DRAM requires less clock cycles to be accessed, however, the storage capacity is much smaller than the hard drive; but even so, if we are smart about it, we can simply put the information that we access often in the DRAM, and the information that we do not access too often in the hard drive. There are more of these layers. Let's put it this way, the layers are: Registers, L1, L2, L3, DRAM, Hard Drive. As we go down this list, we have less speed (more clock cycles required), but more storage space available. CPUs are (typically) capable of accessing the information in its own registers in only one clock cycle - they are super fast, but a CPU may have only one or two dozens of registers available (each one 32-64 bits wide). If we need to fetch additional information, we could just go to the DRAM and fetch it, but modern CPUs have some caches to avoid using the DRAM too much (the DRAM is considerably fast, but it is still incredibly slow compared to registers), these caches are: L1, L2, and L3, and this is how they work: if the CPU does not have the information in the registers and need to fetch it from the DRAM, first it goes to L1 and tries to find it there, if it does not find, it tries in L2, then in L3, and then in the DRAM. These caches are great because they improve the speed which we can retrieve information, but every time we miss a cache (the information is not there), it costs us some cycles - these are cache-misses, and sometimes we can avoid them. I absolutely loved the "baking chocolate chip cookies" analogy posted in this question on Stack Overflow: What is the Cost of an L1 Cache Miss?. Paraphrasing the author:
- Registers are your hands: if an ingredient is in your hands, you can just drop it in the dough.
- L1 is the kitchen counter. If something is not in your hands, it is still fairly quick to get if it is on the counter.
- L2 is the fridge. It is still quick, but you need to walk to the fridge, open it, get stuff out of the way, get the item, close the fridge...
- L3 is your cupboard: it's further away from the fridge, packed with other items, etc.
- DRAM is the corner store: you need to grab your wallet, put on your pants, forget your keys, and walk 5 minutes.
I will go ahead and add another layer: the hard drive is driving to another city to buy your can of beans. It is desirable to always make sure the information is available on cache, although this is not always possible. For example: if we fetch one word from DRAM (called a **memory hit**), it is likely that we will end up requesting something next to it (maybe we are getting a member of a struct, or looping through an array) - what happens then is that not only the information we asked for is recorded in the cache, but also the data around it. For example: let's say I have a cache (just one, no other layers) that can hold 5 integers, and I have the following array in my DRAM: ```c [ 1, 3, 5, 7] ``` If I try to access it, it could be fetched from memory and recorded in my cache: ```c // Cache: { [1, 3, 5, 7], ? } ``` Great. If I tried to loop through this array, the loop would probably be very fast, since all the information would be cached. However, say that our array is 20 elements long, but we still want to loop through the same element (say they are spaced 4 elements apart: ```c [ 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 0, 0, 7, 0, 0, 0, 0 ] ``` This is a problem, because we cannot fit the whole array in the cache: in this case, an for loop like the one below would cause a cache-miss every iteration: ```c int i; for (i = 0; i < 20; i += 5) //... ``` The solution is obvious (but sometimes hard to do): keep the data as close as possible. We cannot tell the system what to keep in cache - this is out of our control, the best we can do is make it easy to be cached. Here is one example in C: I made a program that holds an array with 10,000 elements, one in series, another scattered in regular intervals: ```c // 1.c #include #include #define ELEMENTS 10000 int main() { int* arr = calloc(ELEMENTS, sizeof(int)); int i; int sum = 0; int iterations = 0; for (i = 0; i < ELEMENTS; i++) arr[i] = 5; for (i = 0; i < ELEMENTS; i++) { sum += arr[i]; iterations++; } printf("Sum: %d; Iterations: %d ", sum, iterations); free(arr); return 0; } ``` ```c // 2.c #include #include #define ELEMENTS 10000 #define INTERVAL 10000 int main() { int* arr = calloc(ELEMENTS * INTERVAL, sizeof(int)); int i; int sum = 0; int iterations = 0; for (i = 0; i < ELEMENTS * INTERVAL; i++) { if (i % INTERVAL == 0) arr[i] = 5; else arr[i] = 0; } for (i = 0; i < ELEMENTS * INTERVAL; i += INTERVAL) { sum += arr[i]; iterations++; } printf("Sum: %d; Iterations: %d ", sum, iterations); free(arr); return 0; } ``` I added a "sum" and a "iterations" counter just to make sure all my numbers are right: the sums should be the same, as well as the number of iterations. Measuring the time that they take to execute is not really what I want (of course the first one is faster - we have way less elements in the array). I want to measure the number of misses both programs have in the cache. For this, I will use the following command: ```bash $ perf stat -B -e cache-references,cache-misses ``` This will tell me the number of hits and misses. I will also use the `-O0` flag for compiling without optimisation (I don't want GCC getting rid of my loops or anything). It is also important to notice that the number of hits and misses will be different every time I execute the program, since what is kept in cache is not really controlled by us. Still, it should be a nice test: ```bash Sum: 50000; Iterations: 10000 Performance counter stats for './1.o': 3,603 cache-references:u 1,056 cache-misses:u # 29.309 % of all cache refs 0.000519530 seconds time elapsed Sum: 50000; Iterations: 10000 Performance counter stats for './2.o': 293,737 cache-references:u 282,754 cache-misses:u # 96.261 % of all cache refs 0.380252259 seconds time elapsed ``` This is a very dramatic difference. If I repeat the test several times, the cache-misses for the first programs float between 20~50%, while the second one always stay above 80%. Now, let's play a little bit with the "interval" of the second program. That number tells how far apart the numbers will be. **Interval of 100,000:** ```bash 2,763,186 cache-misses:u # 97.134 % of all cache refs ``` **Interval of 10,000:** ```bash 278,654 cache-misses:u # 96.303 % of all cache refs ``` **Interval of 1,000:** ```bash 38,180 cache-misses:u # 90.485 % of all cache refs ``` **Interval of 100:** ```bash 3,666 cache-misses:u # 22.485 % of all cache refs ``` Notice that from an interval of 100,000 to 1,000 the cache-misses didn't really improve that much - only a 7% improvement (through several tries). However, when I go from 1,000 to 100, the cache misses suddenly dropped around 70%. Why? It's simple: it doesn't matter if they are 1,000 elements or 100,000 elements far apart: they still don't fit in the cache, and we end up missing most of them. But when we reduce them to 100 elements apart, the array suddenly fits in cache, and we get this huge improvement. This just shows how important it is to keep cache in mind when we design our data structures and algorithms. An iteration like this: ```c for (...) sum += arr[0][0] + arr[0][1] + arr[1][0] + arr[1][1]; ``` Is much more cache-friendly than this: ```c for (...) sum += arr[0][0] + arr[1][0] + arr[0][1] + arr[1][1]; ``` Let's compare what would happen for both, assuming we always put the information we retrieve in cache: **First example** - Fetch the row [0] from memory and put it in cache (memory hit) - Access column [0][0] - Access column [0][1] without a cache-miss - Fetch the row [1] from memory and put it in cache (memory hit) - Access column [1][0] - Access column [1][1] without a cache-miss In total, we had 2 memory hits - times that we had to find things in memory because they were not in cache. **Second example** - Fetch the row [0] from memory and put it in cache (memory hit) - Access column [0][0] - Fetch the row [1] from memory and put it in cache (memory hit) - Access column [1][0] - Fetch the row [0] from memory and put it in cache (memory hit) - Access column [0][1] - Fetch the row [1] from memory and put it in cache (memory hit) - Access column [1][1] Here we had 4 memory hits.