
SPO600 Project [Part 4] - Inline Assembly

This is a continuation (fourth part) of my final project for SPO600. For more details, read this post. In the last post, I modified the original algorithm so the compiler would be able to use SIMD instructions. In this post, I will experiment with Inline Assembly. I made a post about Inline Assembly a while ago; click here to see it.

First, why would inline assembly make this function faster, if we already activated vectorization for the compiler in the last post? The main problem with the solution I showed in the last post is that the algorithm is too dependent on loops: the *go-to*s they create are slow. I think we should be able to make it a lot faster if we "hand-craft" the Assembly code ourselves.

Second, what are the disadvantages of using Inline Assembly? The main ones are:

1. Portability. Think about it: if we program in assembly, it will only work for that particular CPU architecture. A workaround is to write the Inline Assembly for one architecture and keep the "regular" algorithm as a fallback for other architectures, which leads to the next problem:
2. The code is less maintainable. If we have Inline Assembly for a particular architecture, we will probably need another version for other architectures. Now you have two pieces of code to maintain instead of one. If you need to change something, you will have to do it in two places; if you need to test it, you have to test it on two platforms, and so on.

Still, I think it is a very interesting exercise. Let's see what happens if we write (almost) the whole thing in assembly!

Here is the plan: I will keep the original algorithm as a fallback for other architectures, and then, for Aarch64 (and Little Endian) machines, I will use my Assembly version. This is the file structure I will use:

```c
#include "toolbox.h"

uint32_t modified_hash(u_char *data, size_t len)
{
    uint32_t k, h;
    uint32_t c = 0x5bd1e995;

    h = 0 ^ len;

#if ... // Using Aarch64 and Little Endian

    // ASSEMBLY ALGORITHM
    // For Aarch64 + Little Endian

#else

    // ORIGINAL ALGORITHM
    // Original algorithm from NGINX
    // We will rewrite this loop in Assembly
    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len  -= 4;
    }

#endif

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= c;
    }

    h ^= h >> 13;
    h *= c;
    h ^= h >> 15;

    return h;
}
```

Before I start writing the Assembly code, how do we test if we are using Aarch64 and Little Endian? For Little Endian, we can use the same preprocessor flags we used in the previous post; for Aarch64, GCC provides the following predefined macro: `__aarch64__`. So this should do the trick:

```c
#if defined(__aarch64__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
```
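Just to convince myself the macros behave the way I expect, here is a tiny throwaway program (my own sketch, not part of the project files) that prints which branch of that `#if` would be compiled:

```c
#include <stdio.h>

int main(void)
{
#if defined(__aarch64__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    puts("Aarch64 + Little Endian: the assembly path would be compiled");
#else
    puts("Other architecture or byte order: the fallback C path would be compiled");
#endif
    return 0;
}
```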
#### Writing the base code

Alright. The mission now is to turn the original loop into Assembly code for Aarch64. One thing I want to do in this algorithm: use vectorization. For this, I will have to align the memory, like I did in the previous post. So I will use the modified algorithm from my previous post as a starting point. Since I am already checking if we are using Little Endian, I can remove the preprocessor checks from the code.

After I cleaned up the code a little bit, this is what I have:

```c
while (((size_t)data & 0b1111) != 0 && len >= 4) {
    k = *((uint32_t *)data);

    k *= c;
    k ^= k >> 24;
    k *= c;

    h *= c;
    h ^= k;

    data += 4;
    len  -= 4;
}

// The memory is aligned: start split loops until the block
// is too big for the leftover memory
for (; len > BL_SIZE * 4; len -= BL_SIZE * 4) {
    for (j = 0; j < BL_SIZE; j++) {
        k2[j] = *((uint32_t *)data);

        k2[j] *= c;
        k2[j] ^= k2[j] >> 24;
        k2[j] *= c;

        data += 4;
    }

    for (j = 0; j < BL_SIZE; j++) {
        h *= c;
        h ^= k2[j];
    }
}

// SIMD is done. Finish whatever we have left in memory with
// the regular loop
while (len >= 4) {
    k = *((uint32_t *)data);

    k *= c;
    k ^= k >> 24;
    k *= c;

    h *= c;
    h ^= k;

    data += 4;
    len  -= 4;
}
```

Ok. This will be turned into our Assembly algorithm: the first and last loops can be rewritten in "regular" assembly (no vectorization), and for the one in the middle, we can use the vector registers.

#### Rewriting the serial loops

The first and the last loop have the same body, which means I can reuse the same assembly code for both of them. This is what I came up with:

```c
asm(
    // k = *((uint32_t *)data);
    "ldr %w[k], [%x[data]]\n\t"

    // k *= 0x5bd1e995;
    "mul %w[k], %w[k], %w[c]\n\t"

    // k ^= k >> 24;
    "lsr w9, %w[k], #24\n\t"
    "eor %w[k], %w[k], w9\n\t"

    // k *= 0x5bd1e995;
    "mul %w[k], %w[k], %w[c]\n\t"

    // h *= 0x5bd1e995;
    "mul %w[h], %w[h], %w[c]\n\t"

    // h ^= k;
    "eor %w[h], %w[h], %w[k]\n\t"

    // data += 4;
    // len -= 4;
    "add %x[data], %x[data], #4\n\t"
    "sub %x[len], %x[len], #4\n\t"

    : [k] "=r" (k), [h] "+r" (h), [data] "+r" (data), [len] "+r" (len)
    : [c] "r" (c)
    : "x9"
);
```

There! The first and the last loop are done!
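Before wiring this into the real function, it helps to sanity-check that the asm body really does the same thing as the C body. A throwaway comparison along these lines (entirely my own sketch, not part of the project code) can be compiled on the Aarch64 machine:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t c = 0x5bd1e995;
    unsigned char buf[4] = { 't', 'e', 's', 't' };

    // C version of the loop body
    uint32_t k_c, h_c = 4;
    k_c  = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);
    k_c *= c;
    k_c ^= k_c >> 24;
    k_c *= c;
    h_c *= c;
    h_c ^= k_c;

    // Assembly version of the same body (the data/len updates are left out,
    // since we only process one 4-byte block here)
    uint32_t k_a, h_a = 4;
    unsigned char *data = buf;
    asm(
        "ldr %w[k], [%x[data]]\n\t"
        "mul %w[k], %w[k], %w[c]\n\t"
        "lsr w9, %w[k], #24\n\t"
        "eor %w[k], %w[k], w9\n\t"
        "mul %w[k], %w[k], %w[c]\n\t"
        "mul %w[h], %w[h], %w[c]\n\t"
        "eor %w[h], %w[h], %w[k]\n\t"
        : [k] "=r" (k_a), [h] "+r" (h_a)
        : [c] "r" (c), [data] "r" (data)
        : "x9", "memory"
    );

    printf("C:   k=%08x h=%08x\n", k_c, h_c);
    printf("asm: k=%08x h=%08x\n", k_a, h_a);
    return 0;
}
```

If both lines print the same values, the translation is faithful.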
#### Rewriting the SIMD loop

This is a little more complicated, but still the same idea. Since we can use SIMD here, we can calculate the value of four *k*s at once, and then do the calculation with *h* four times. Rinse and repeat.

```c
// Duplicating "c" (the constant) in a vector register
asm("dup v2.4s, %w0" : : "r" (c) : "v2");

while (len >= 16) {
    asm(
        // Here we do the following lines in the SIMD registers:
        //     k = *((uint32_t *)data);
        //     k *= 0x5bd1e995;
        //     k ^= k >> 24;
        //     k *= 0x5bd1e995;
        // Every number is 32 bits, so we can fit 4 of them in
        // every register. The following lines, respectively:
        // 1- Load 4 numbers in the vector at once
        // 2- Multiply them by the constants on the other register
        // 3- Shift the 4 numbers right and store the result somewhere else
        // 4- "XOR" the result from #3 with the current numbers
        // 5- Multiply the numbers by the constants again
        // We now have four results of "k"
        "ldur q0, [%x[data]]\n\t"
        "mul v0.4s, v0.4s, v2.4s\n\t"
        "ushr v1.4s, v0.4s, #24\n\t"
        "eor v0.16b, v0.16b, v1.16b\n\t"
        "mul v0.4s, v0.4s, v2.4s\n\t"

        // Since we have 4 results of "k", we have to perform
        // the operations in "h" 4 times. The lines below do, four times,
        // respectively:
        // 1- Multiply "h" by the constant
        // 2- Load one number from the vector register into a regular register
        // 3- "XOR" the value of "h" with the value from #2
        // Every time we repeat it, we take the next value from the vector
        // register (we have four lanes there, so value #0, #1, #2, and #3)
        "mul %w[h], %w[h], %w[c]\n\t"
        "mov w9, v0.s[0]\n\t"
        "eor %w[h], %w[h], w9\n\t"

        "mul %w[h], %w[h], %w[c]\n\t"
        "mov w9, v0.s[1]\n\t"
        "eor %w[h], %w[h], w9\n\t"

        "mul %w[h], %w[h], %w[c]\n\t"
        "mov w9, v0.s[2]\n\t"
        "eor %w[h], %w[h], w9\n\t"

        "mul %w[h], %w[h], %w[c]\n\t"
        "mov w9, v0.s[3]\n\t"
        "eor %w[h], %w[h], w9\n\t"

        // data += 16;
        // len -= 16;
        "add %[data], %[data], #16\n\t"
        "sub %[len], %[len], #16\n\t"

        : [h] "+r" (h), [data] "+r" (data), [len] "+r" (len)
        : [c] "r" (c)
        : "x9", "v0", "v1"
    );
}
```

#### Testing the algorithm

This is what we have so far:

```c
#include "toolbox.h"

uint32_t modified_hash(u_char *data, size_t len)
{
    uint32_t k, h;
    uint32_t c = 0x5bd1e995;

    h = 0 ^ len;

#if defined(__aarch64__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

    // Using the regular loop until the memory is aligned
    while (((size_t)data & 0b1111) != 0 && len >= 4) {
        asm(
            "ldr %w[k], [%x[data]]\n\t"
            "mul %w[k], %w[k], %w[c]\n\t"
            "lsr w9, %w[k], #24\n\t"
            "eor %w[k], %w[k], w9\n\t"
            "mul %w[k], %w[k], %w[c]\n\t"
            "mul %w[h], %w[h], %w[c]\n\t"
            "eor %w[h], %w[h], %w[k]\n\t"
            "add %x[data], %x[data], #4\n\t"
            "sub %x[len], %x[len], #4\n\t"
            : [k] "=r" (k), [h] "+r" (h), [data] "+r" (data), [len] "+r" (len)
            : [c] "r" (c)
            : "x9"
        );
    }

    // The memory is aligned: start split loops until the block
    // is too big for the leftover memory
    asm("dup v2.4s, %w0" : : "r" (c) : "v2");

    while (len >= 16) {
        asm(
            "ldur q0, [%x[data]]\n\t"
            "mul v0.4s, v0.4s, v2.4s\n\t"
            "ushr v1.4s, v0.4s, #24\n\t"
            "eor v0.16b, v0.16b, v1.16b\n\t"
            "mul v0.4s, v0.4s, v2.4s\n\t"

            "mul %w[h], %w[h], %w[c]\n\t"
            "mov w9, v0.s[0]\n\t"
            "eor %w[h], %w[h], w9\n\t"

            "mul %w[h], %w[h], %w[c]\n\t"
            "mov w9, v0.s[1]\n\t"
            "eor %w[h], %w[h], w9\n\t"

            "mul %w[h], %w[h], %w[c]\n\t"
            "mov w9, v0.s[2]\n\t"
            "eor %w[h], %w[h], w9\n\t"

            "mul %w[h], %w[h], %w[c]\n\t"
            "mov w9, v0.s[3]\n\t"
            "eor %w[h], %w[h], w9\n\t"

            "add %[data], %[data], #16\n\t"
            "sub %[len], %[len], #16\n\t"
            : [h] "+r" (h), [data] "+r" (data), [len] "+r" (len)
            : [c] "r" (c)
            : "x9", "v0", "v1"
        );
    }

    // SIMD is done. Finish whatever we have left in memory with
    // the regular loop
    while (len >= 4) {
        asm(
            "ldr %w[k], [%x[data]]\n\t"
            "mul %w[k], %w[k], %w[c]\n\t"
            "lsr w9, %w[k], #24\n\t"
            "eor %w[k], %w[k], w9\n\t"
            "mul %w[k], %w[k], %w[c]\n\t"
            "mul %w[h], %w[h], %w[c]\n\t"
            "eor %w[h], %w[h], %w[k]\n\t"
            "add %x[data], %x[data], #4\n\t"
            "sub %x[len], %x[len], #4\n\t"
            : [k] "=r" (k), [h] "+r" (h), [data] "+r" (data), [len] "+r" (len)
            : [c] "r" (c)
            : "x9"
        );
    }

#else

    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len  -= 4;
    }

#endif

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= c;
    }

    h ^= h >> 13;
    h *= c;
    h ^= h >> 15;

    return h;
}
```
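For reference, the timings below come from the same test driver I have been using in the previous parts of this series. I am not reproducing it here exactly, but it is roughly shaped like this sketch (the function names `original_hash`/`modified_hash`, the buffer size, and the fill pattern are my assumptions, not the actual project code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "toolbox.h"   // assumed to declare u_char, original_hash() and modified_hash()

// Time a single run of a hash function over the buffer
static double time_hash(uint32_t (*fn)(u_char *, size_t), u_char *data, size_t len)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    volatile uint32_t h = fn(data, len);   // volatile so the call is not optimized away
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)h;
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    size_t len = 500000000;   // the "large" case from the output below
    u_char *data = malloc(len);

    for (size_t i = 0; i < len; i++)
        data[i] = 'a' + (i % 26);
    printf("Generated %zu characters for large hash functions\n", len);

    if (original_hash(data, len) == modified_hash(data, len))
        printf("SUCCESS!\n - All hashes are identical for both functions\n");

    printf("Time for large original: %f\n", time_hash(original_hash, data, len));
    printf("Time for large modified: %f\n", time_hash(modified_hash, data, len));

    free(data);
    return 0;
}
```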
Just to be safe, I compiled and ran the program on my x86_64 computer to make sure it still works:

```
Generated 500000000 characters for large hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for large original: 0.198862
Time for large modified: 0.201308
Generated 500000 characters for medium hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for medium original: 0.000190
Time for medium modified: 0.000157
Generated 500 characters for small hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for small original: 0.000000
Time for small modified: 0.000000
```

No problem. Still working normally. Now let's try running it on Aarch64:

```
Generated 500000000 characters for large hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for large original: 0.330993
Time for large modified: 0.270605
Generated 500000 characters for medium hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001
Generated 500000000 characters for large hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for large original: 0.330845
Time for large modified: 0.270763
Generated 500000 characters for medium hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000257
Generated 500 characters for small hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for large original: 0.330849
Time for large modified: 0.270494
Generated 500000 characters for medium hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for large original: 0.330873
Time for large modified: 0.271337
Generated 500000 characters for medium hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for medium original: 0.000351
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS!
 - All hashes are identical for both functions
Time for small original: 0.000002
Time for small modified: 0.000000
```

Success! The algorithm is a lot faster now, and it still works on x86_64!