
# SPO600 Project [Part 3] - SIMD and Algorithm Optimization

This is a continuation (third part) of my final project for SPO600. For more details, read this post. In the last post, I experimented with different build flags for the compiler. In this post, I will start modifying the code and try to make the algorithm faster. I have two ideas here, and both of them are related to SIMD (Single Instruction, Multiple Data).

#### First approach: eliminating unnecessary bitwise operations

First, let's refer back to the original code:

```c
#include "toolbox.h"

uint32_t original_hash(u_char *data, size_t len)
{
    uint32_t h, k;
    h = 0 ^ len;

    while (len >= 4) {
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
        case 3: h ^= data[2] << 16; /* fall through */
        case 2: h ^= data[1] << 8;  /* fall through */
        case 1: h ^= data[0];
                h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}
```

I am curious to understand what the lines below do, so let's explore them a bit more:

```c
k = data[0];
k |= data[1] << 8;
k |= data[2] << 16;
k |= data[3] << 24;
```

Just to remember: `k` is a 32-bit integer, and `data` is an array of 8-bit char elements. So what do those bit-shift operations do, exactly? Let's imagine that `k` is initialized with all bits set to 0. Let's also imagine that `data` has 4 elements: [a, b, c, d]. This is how they could look in memory:

```
k
 8 bits    8 bits    8 bits    8 bits
|--------||--------||--------||--------|
 00000000  00000000  00000000  00000000

 data[0]   data[1]   data[2]   data[3]
 8 bits    8 bits    8 bits    8 bits
|--------||--------||--------||--------|
 01100001  01100010  01100011  01100100
```

Now what would happen with those operations? Notice what is happening: for every character in `data`, we shift it 8, 16, or 24 bits to the left and "or" it with `k`.
So, let's try figuring out how it would look:

```
# operation: k |= data[0]
data[0]:          00000000 00000000 00000000 01100001
k:                00000000 00000000 00000000 00000000
result:           00000000 00000000 00000000 01100001

# operation: k |= (data[1] << 8)
data[1]:          00000000 00000000 00000000 01100010
data[1] shifted:  00000000 00000000 01100010 00000000
k:                00000000 00000000 00000000 01100001
result:           00000000 00000000 01100010 01100001

# operation: k |= (data[2] << 16)
data[2]:          00000000 00000000 00000000 01100011
data[2] shifted:  00000000 01100011 00000000 00000000
k:                00000000 00000000 01100010 01100001
result:           00000000 01100011 01100010 01100001

# operation: k |= (data[3] << 24)
data[3]:          00000000 00000000 00000000 01100100
data[3] shifted:  01100100 00000000 00000000 00000000
k:                00000000 01100011 01100010 01100001
result:           01100100 01100011 01100010 01100001
```

Ok. Let's compare the final result then:

```
k    = 01100100 01100011 01100010 01100001
data = 01100001 01100010 01100011 01100100
```

What we have here is simply the bytes of `data` mirrored and recorded into an integer. We have some room for improvement here!

Going back to processor architecture for a moment: there is a thing called **endianness**, which refers to the order in which the bytes of a multi-byte numerical value are stored in memory. There are two main types of endianness: little-endian and big-endian. In big-endian, the most significant byte is stored first; in little-endian, it is the opposite. For example:

```
# Character array:
00000001 00000010 00000011 00000100

# Mapped to a 32-bit integer, big-endian:
00000001 00000010 00000011 00000100

# Mapped to a 32-bit integer, little-endian:
00000100 00000011 00000010 00000001
```

Which one is better?
I guess it's a matter of taste, like where to place your brackets or tabs vs. spaces, where one side defends the truth (K&R indentation and spaces are definitely the best), while the other side defends its right to be wrong. Personally, I try to use all indentation styles and both tabs and spaces in all files, because I like diversity.

So, as you noticed, what the algorithm is doing there with all those bit-shifts is basically enforcing a little-endian byte order, which is completely unnecessary on processors that are already little-endian. I can use this to speed things up: by checking the endianness of the processor, I can just cast that section of the array to an integer. According to the GNU manual, I can use the predefined macros `__BYTE_ORDER__` and `__ORDER_LITTLE_ENDIAN__` for this (assuming we are using the GNU compiler).

```c
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    k = *((uint32_t *)data);
#else
    k = data[0];
    k |= data[1] << 8;
    k |= data[2] << 16;
    k |= data[3] << 24;
#endif
```

Alright. Let's see if we get any difference.

```
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330718
Time for large modified: 0.330592

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000319

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330793
Time for large modified: 0.330683

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000333
Time for medium modified: 0.000319

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330750
Time for large modified: 0.330641

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000318

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330752
Time for large modified: 0.330636

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000334
Time for medium modified: 0.000318

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330834
Time for large modified: 0.330662

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000318

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
```

Looking at the compiled result, we can see that the processor architecture is, indeed, little-endian, because the `#if` branch was taken:

```asm
    while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
  4008a0:	b84046c0 	ldr	w0, [x22],#4
        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;
```

The difference was very small. Why? Because bit-shifting operations are already very fast. Still, it's our first improvement!

#### Second approach: vectorization

Looking at what the compiler produced, I did not see any mention of vector registers, which is a clear sign that no vectorization was done, even though there is room for it:

```c
while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    k = *((uint32_t *)data);
#else
    k = data[0];
    k |= data[1] << 8;
    k |= data[2] << 16;
    k |= data[3] << 24;
#endif

    // The 3 lines below can be vectorized
    k *= 0x5bd1e995;
    k ^= k >> 24;
    k *= 0x5bd1e995;

    h *= 0x5bd1e995;
    h ^= k;

    data += 4;
    len -= 4;
}
```

The value in `k` is a 32-bit integer which is subject to 3 operations. If we, somehow, vectorize this, we could process four *k*'s at the same time! This is really not that much, if you think about it: instead of 12 operations, we would be doing 3. It adds up, for sure, but I don't expect to see a huge improvement.

So, why is vectorization not happening there? It is not happening because the operations in that loop are too "tied" together: the operations before the marked lines cannot be vectorized, and the operations after them cannot be vectorized either. We'll have to take the loop apart ourselves. On top of that, `data` is probably not aligned in memory (I talked about memory alignment in this post).

What's the solution then?

##### 1- I will align the *data* memory myself.

Why?
Well, we can't align *data* at its source, but I can make sure that, when we enter that "main" loop, memory is properly aligned and ready to be processed in batches. This will help make vectorization possible for the compiler.

Imagine we have this situation, where *data* is spread in memory: the start is not aligned, and the end is in another page. The chunk that can be vectorized starts as soon as the element *data[n]* is aligned, and ends at the element *data[m]*, where the page ends.

```
#                                                  v-- END OF PAGE
.-- ??? -------..-- data ------------------------------..-- ??? ------.
|--------||--------||--------||--------||--------|     |--------||--------|
#          ^-- vectorizable chunk ------^
#          |                            |
#          data[n]                      data[m]
```

So, this is what I am going to do: repeat the *while* loop 3 times. The first loop will stop when the current element is aligned; then the second loop starts - this is where I expect the vectorization to happen, and it ends when the chunks processed by the vector registers are too big for what is left in memory; finally, the third loop takes care of whatever is left. If I don't do this last step, it is possible that the vectorized code will end up accessing memory outside of the current page that does not belong to me, which will result in a segmentation fault.

##### 2- I will isolate the 3 operations on *k* in another loop - this loop can then be vectorized.

Take a look at this:

```c
while (len >= 4) {
    k = *((uint32_t *)data);

    k *= 0x5bd1e995;
    k ^= k >> 24;
    k *= 0x5bd1e995;

    h *= 0x5bd1e995;
    h ^= k;

    data += 4;
    len -= 4;
}
```

An alternative way of doing this is isolating the first 4 operations in one loop, saving the results of *k* in an array, and then having a similar loop that consumes the values from the *k* array. For example:

```c
for (...)
{
    k[i] *= 0x5bd1e995;
    k[i] ^= k[i] >> 24;
    k[i] *= 0x5bd1e995;
}

for (...)
{
    h *= 0x5bd1e995;
    h ^= k[i];

    data += 4;
    len -= 4;
}
```

The advantage of doing this is that the operations on *k* are now isolated: that first loop can be vectorized! The disadvantage is: now we have two loops instead of one. If the gains from the vectorization are not enough to compensate for the second loop, the program will end up being slower.

There is another problem here: if I just make a huge array for *k*, we will have to worry about cache misses and RAM accesses. To avoid this, instead of making an array for all the values of *k*, I can make a small one and process the values in batches. For example: I can make an array of 32 *k*'s, which will be processed 4 at a time with SIMD; a loop will run 8 times to fill my *k* array, which, with some luck, the compiler will vectorize.

Another advantage of this "batch" approach: if we divide the batches in fixed lengths (for example: 32), the compiler will be aware of this constant, and can unroll the loop! It will be large enough to save some go-to's, and small enough to fit in cache.

##### Joining both approaches

This is what my code looks like after I checked for little-endian processors, aligned memory manually, and isolated the loops so the compiler would vectorize my code:

```c
#include "toolbox.h"

// How many "k"s at a time to store?
#define BL_SIZE 32

uint32_t modified_hash(u_char *data, size_t len)
{
    uint32_t h, k;
    int j;
    uint32_t k2[BL_SIZE];

    h = 0 ^ len;

    // Using the regular loop until the memory is aligned
    while (((size_t)data & 0b1111) != 0 && len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    // The memory is aligned: start split loops until the block
    // is too big for the leftover memory
    for (; len > BL_SIZE * 4; len -= BL_SIZE * 4) {
        // Fill the "k" array in batches - I expect this loop
        // to be vectorized
        for (j = 0; j < BL_SIZE; j++) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
            k2[j] = *((uint32_t *)data);
#else
            k2[j] = data[0];
            k2[j] |= data[1] << 8;
            k2[j] |= data[2] << 16;
            k2[j] |= data[3] << 24;
#endif

            k2[j] *= 0x5bd1e995;
            k2[j] ^= k2[j] >> 24;
            k2[j] *= 0x5bd1e995;

            data += 4;
        }

        // Going through the batch and multiplying H. This loop will not
        // be vectorized
        for (j = 0; j < BL_SIZE; j++) {
            h *= 0x5bd1e995;
            h ^= k2[j];
        }
    }

    // SIMD is done. Finish whatever we have left in memory with
    // the regular loop
    while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
        case 3: h ^= data[2] << 16; /* fall through */
        case 2: h ^= data[1] << 8;  /* fall through */
        case 1: h ^= data[0];
                h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}
```

Surely this monster will not work, right? Let's see:

```
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330718
Time for large modified: 0.322143

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000309

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330843
Time for large modified: 0.322202

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000334
Time for medium modified: 0.000310

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330706
Time for large modified: 0.322616

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000310

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330748
Time for large modified: 0.322156

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000310

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330801
Time for large modified: 0.322446

Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000328

Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
```

That's good! Although the code is a lot larger, we definitely gained some milliseconds!

*Note: I tested the code again with different data sizes (to check that I wouldn't get segmentation faults) and batch sizes. It seems like everything is working properly, and a batch size of 32 works really well.*

Now, the question: is vectorization actually being done, or is this just some sort of black magic?
Let's see the assembly:

```asm
        k2[j] *= 0x5bd1e995;
  400c48:	4ea09e18 	mul	v24.4s, v16.4s, v0.4s
        k2[j] = *((uint32_t *)data);
  400c4c:	3cdc0004 	ldur	q4, [x0,#-64]
        k2[j] *= 0x5bd1e995;
  400c50:	4ea09cf7 	mul	v23.4s, v7.4s, v0.4s
        k2[j] = *((uint32_t *)data);
  400c54:	3cdd0003 	ldur	q3, [x0,#-48]
        k2[j] *= 0x5bd1e995;
  400c58:	4ea09cd6 	mul	v22.4s, v6.4s, v0.4s
        k2[j] = *((uint32_t *)data);
  400c5c:	3cde0002 	ldur	q2, [x0,#-32]
        k2[j] *= 0x5bd1e995;
  400c60:	4ea09cb5 	mul	v21.4s, v5.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c64:	6f280710 	ushr	v16.4s, v24.4s, #24
        k2[j] = *((uint32_t *)data);
  400c68:	3cdf0001 	ldur	q1, [x0,#-16]
        k2[j] *= 0x5bd1e995;
  400c6c:	4ea09c94 	mul	v20.4s, v4.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c70:	6f2806e7 	ushr	v7.4s, v23.4s, #24
        k2[j] *= 0x5bd1e995;
  400c74:	4ea09c73 	mul	v19.4s, v3.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c78:	6f2806c6 	ushr	v6.4s, v22.4s, #24
        k2[j] *= 0x5bd1e995;
  400c7c:	4ea09c52 	mul	v18.4s, v2.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c80:	6e381e10 	eor	v16.16b, v16.16b, v24.16b
  400c84:	6f2806a5 	ushr	v5.4s, v21.4s, #24
        k2[j] *= 0x5bd1e995;
  400c88:	4ea09c31 	mul	v17.4s, v1.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c8c:	6e371ce7 	eor	v7.16b, v7.16b, v23.16b
  400c90:	6f280684 	ushr	v4.4s, v20.4s, #24
        k2[j] *= 0x5bd1e995;
  400c94:	4ea09e10 	mul	v16.4s, v16.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400c98:	6e361cc6 	eor	v6.16b, v6.16b, v22.16b
  400c9c:	6f280663 	ushr	v3.4s, v19.4s, #24
        k2[j] *= 0x5bd1e995;
  400ca0:	4ea09ce7 	mul	v7.4s, v7.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400ca4:	6e351ca5 	eor	v5.16b, v5.16b, v21.16b
  400ca8:	6f280642 	ushr	v2.4s, v18.4s, #24
        k2[j] *= 0x5bd1e995;
  400cac:	4ea09cc6 	mul	v6.4s, v6.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400cb0:	6e341c84 	eor	v4.16b, v4.16b, v20.16b
  400cb4:	6f280621 	ushr	v1.4s, v17.4s, #24
        k2[j] *= 0x5bd1e995;
  400cb8:	4ea09ca5 	mul	v5.4s, v5.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400cbc:	6e331c63 	eor	v3.16b, v3.16b, v19.16b
  400cc0:	6e321c42 	eor	v2.16b, v2.16b, v18.16b
        k2[j] *= 0x5bd1e995;
  400cc4:	3d8003f0 	str	q16, [sp]
  400cc8:	4ea09c84 	mul	v4.4s, v4.4s, v0.4s
        k2[j] ^= k2[j] >> 24;
  400ccc:	6e311c21 	eor	v1.16b, v1.16b, v17.16b
        k2[j] *= 0x5bd1e995;
  400cd0:	3d8007e7 	str	q7, [sp,#16]
  400cd4:	4ea09c63 	mul	v3.4s, v3.4s, v0.4s
  400cd8:	3d800be6 	str	q6, [sp,#32]
  400cdc:	4ea09c42 	mul	v2.4s, v2.4s, v0.4s
  400ce0:	3d800fe5 	str	q5, [sp,#48]
  400ce4:	4ea09c21 	mul	v1.4s, v1.4s, v0.4s
  400ce8:	3d8013e4 	str	q4, [sp,#64]
  400cec:	3d8017e3 	str	q3, [sp,#80]
  400cf0:	3d801be2 	str	q2, [sp,#96]
  400cf4:	3d801fe1 	str	q1, [sp,#112]
```

Not only do we have tons of vector registers being used, the compiler also unrolled the loop! That's a victory!