SPO600 Project [Part 4] - Inline Assembly

Date posted: 2018-01-03

This is a continuation (fourth part) of my final project for SPO600. For more details, read this post.

In the last post, I modified the original algorithm so the compiler would be able to use SIMD instructions. In this post, I will experiment with Inline Assembly.

I made a post about Inline Assembly a while ago; click here to see it.

First, why would inline assembly make this function faster if we already enabled vectorization in the compiler in the last post? The main problem with the solution I showed there is that the algorithm still relies heavily on loops: the compare-and-branch instructions generated for every iteration are slow. I think we should be able to make it a lot faster if we "hand-craft" the assembly code ourselves.

Second, what are the disadvantages of using Inline Assembly? The main ones would be:

  1. Portability. Think about it: if we program in assembly, it will only work for that particular CPU architecture. A workaround is to write the Inline Assembly for one architecture and keep the "regular" algorithm as a fallback for the other architectures, which leads to the next problem:
  2. The code is less maintainable. If we have Inline Assembly for a particular architecture, we will probably need another version for other architectures. Now you have two pieces of code to maintain instead of one. If you need to change something, you have to do it in two places; if you need to test it, you have to test it on two platforms; and so on.

Still, I think it is a very interesting exercise. Let's see what happens if we make (almost) the whole thing in assembly!

Here is the plan: I will have the original algorithm as a fallback for other architectures, and then, for Aarch64 (and Little Endian) machines, I will use my Assembly version. This is the file structure I will use:

#include "toolbox.h"

uint32_t
modified_hash(u_char *data, size_t len)
{
    uint32_t k, h;
    uint32_t c = 0x5bd1e995;

    h = 0 ^ len;

#if ... // Using Aarch64 and Little Endian

    // ASSEMBLY ALGORITHM
    // For Aarch64 + Little Endian

#else

    // ORIGINAL ALGORITHM
    // Original algorithm from NGINX
    // We will rewrite this loop in Assembly
    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len -= 4;
    }

#endif

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= c;
    }

    h ^= h >> 13;
    h *= c;
    h ^= h >> 15;

    return h;
}

Before I start writing the Assembly code, how do we test if we are using Aarch64 and Little Endian? For Little Endian, we can use the same preprocessor macros we used in the previous post; for Aarch64, GCC defines the macro __aarch64__. So this should do the trick:

#if defined(__aarch64__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

Writing the base code

Alright. The mission now is to turn the original loop into Assembly code for Aarch64.

One thing I want to do in this algorithm is use vectorization. For this, I will have to align the memory, like I did in the previous post, so I will use the modified algorithm from my previous post as a starting point. Since I am already checking that we are using Little Endian, I can remove the preprocessor checks from the code. After cleaning it up a little bit, this is what I have:

    while (((size_t)data & 0b1111) != 0 && len >= 4) {
        k = *((uint32_t *)data);

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len -= 4;
    }

    // The memory is aligned: run the split loops until less than a
    // full block of data remains

    for (; len > BL_SIZE * 4; len -= BL_SIZE * 4) {
        for (j = 0; j < BL_SIZE; j++) {
            k2[j] = *((uint32_t *)data);

            k2[j] *= c;
            k2[j] ^= k2[j] >> 24;
            k2[j] *= c;

            data += 4;
        }

        for (j = 0; j < BL_SIZE; j++) {
                h *= c;
                h ^= k2[j];
        }
    }

    // SIMD is done. Finish whatever we have left in memory with
    // the regular loop

    while (len >= 4) {
        k = *((uint32_t *)data);

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len -= 4;
    }

Ok. This will be turned into our Assembly algorithm: the first and last loops can be rewritten in "regular" assembly (no vectorization), and for the one in the middle, we can use the vector registers.

Rewriting the serial loops

The first and the last loop have the same body - this means that I can reuse the same assembly code for both of them. This is what I came up with:

asm(
    // k = *((uint32_t *)data);
    "ldr   %w[k], [%x[data]]\n"

    // k *= 0x5bd1e995;
    "mul   %w[k], %w[k], %w[c]\n"

    // k ^= k >> 24;
    "lsr   w9, %w[k], #24\n"
    "eor   %w[k], %w[k], w9\n"

    // k *= 0x5bd1e995;
    "mul   %w[k], %w[k], %w[c]\n"

    // h *= 0x5bd1e995;
    "mul   %w[h], %w[h], %w[c]\n"

    // h ^= k;
    "eor   %w[h], %w[h], %w[k]\n"

    // data += 4;
    // len -= 4;
    "add   %x[data], %x[data], #4\n"
    "sub   %x[len], %x[len], #4\n"

    // "k" is written before "c" is read, so it needs the early-clobber
    // modifier; "h" is read and written, so it must be "+r", not "=r"
    : [k] "=&r" (k),
      [h] "+r" (h),
      [data] "+r" (data),
      [len] "+r" (len)
    : [c] "r" (c)
    // we scratch w9 (the low half of x9) and read from *data
    : "x9", "memory"
);

There! The first and the last loop are done!

Rewriting the SIMD loop

This is a little more complicated, but still the same idea. Since we can use SIMD here, we can calculate the value of four ks at once, and then do the calculation with h four times. Rinse and repeat.

// Duplicating "c" (the constant) into a vector register
// (declared as a clobber so the compiler knows we overwrote v2)
asm("dup v2.4s, %w0" : : "r" (c) : "v2");

while (len >= 16) {
    asm(
        // Here we do the following lines in the SIMD registers:
        // k = *((uint32_t *)data);
        // k *= 0x5bd1e995;
        // k ^= k >> 24;
        // k *= 0x5bd1e995;
        // Every number is 32 bits, so we can fit 4 of them in
        // every register. The following lines, respectively:
        // 1- Load 4 numbers into the vector at once
        // 2- Multiply them by the constants in the other register
        // 3- Shift the 4 numbers right and store the result somewhere else
        // 4- "XOR" the result from #3 with the current numbers
        // 5- Multiply the numbers by the constants again
        // We now have four results of "k"
        "ldur    q0, [%x[data]]\n"
        "mul     v0.4s, v0.4s, v2.4s\n"
        "ushr    v1.4s, v0.4s, #24\n"
        "eor     v0.16b, v0.16b, v1.16b\n"
        "mul     v0.4s, v0.4s, v2.4s\n"

        // Since we have 4 results of "k", we have to perform
        // the operations on "h" 4 times. The lines below do, four times,
        // respectively:
        // 1- Multiply "h" by the constant
        // 2- Load one number from the vector register into a regular register
        // 3- "XOR" the value of "h" with the value from #2
        // Every time we repeat it, we take the next value from the vector
        // register (we have four lanes there, so values #0, #1, #2, and #3)
        "mul     %w[h], %w[h], %w[c]\n"
        "mov     w9, v0.s[0]\n"
        "eor     %w[h], %w[h], w9\n"

        "mul     %w[h], %w[h], %w[c]\n"
        "mov     w9, v0.s[1]\n"
        "eor     %w[h], %w[h], w9\n"

        "mul     %w[h], %w[h], %w[c]\n"
        "mov     w9, v0.s[2]\n"
        "eor     %w[h], %w[h], w9\n"

        "mul     %w[h], %w[h], %w[c]\n"
        "mov     w9, v0.s[3]\n"
        "eor     %w[h], %w[h], w9\n"

        // data += 16;
        // len -= 16;
        "add     %x[data], %x[data], #16\n"
        "sub     %x[len], %x[len], #16\n"

        // "h" is read and written, so "+r"; we also scratch w9, v0,
        // and v1, and read from *data
        : [h] "+r" (h),
          [data] "+r" (data),
          [len] "+r" (len)
        : [c] "r" (c)
        : "x9", "v0", "v1", "memory"
    );
}

Testing the algorithm

This is what we have so far:

#include "toolbox.h"

uint32_t
modified_hash(u_char *data, size_t len)
{
    uint32_t k, h;
    uint32_t c = 0x5bd1e995;

    h = 0 ^ len;

#if defined(__aarch64__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

    // Using the regular loop until the memory is aligned
    while (((size_t)data & 0b1111) != 0 && len >= 4) {
        asm(
            "ldr   %w[k], [%x[data]]\n"
            "mul   %w[k], %w[k], %w[c]\n"
            "lsr   w9, %w[k], #24\n"
            "eor   %w[k], %w[k], w9\n"
            "mul   %w[k], %w[k], %w[c]\n"

            "mul   %w[h], %w[h], %w[c]\n"
            "eor   %w[h], %w[h], %w[k]\n"

            "add   %x[data], %x[data], #4\n"
            "sub   %x[len], %x[len], #4\n"
            : [k] "=&r" (k),
              [h] "+r" (h),
              [data] "+r" (data),
              [len] "+r" (len)
            : [c] "r" (c)
            : "x9", "memory"
        );
    }

    // The memory is aligned: run the split loop until less than a
    // full block of data remains
    asm("dup v2.4s, %w0" : : "r" (c) : "v2");
    while (len >= 16) {
        asm(
            "ldur    q0, [%x[data]]\n"
            "mul     v0.4s, v0.4s, v2.4s\n"
            "ushr    v1.4s, v0.4s, #24\n"
            "eor     v0.16b, v0.16b, v1.16b\n"
            "mul     v0.4s, v0.4s, v2.4s\n"

            "mul     %w[h], %w[h], %w[c]\n"
            "mov     w9, v0.s[0]\n"
            "eor     %w[h], %w[h], w9\n"

            "mul     %w[h], %w[h], %w[c]\n"
            "mov     w9, v0.s[1]\n"
            "eor     %w[h], %w[h], w9\n"

            "mul     %w[h], %w[h], %w[c]\n"
            "mov     w9, v0.s[2]\n"
            "eor     %w[h], %w[h], w9\n"

            "mul     %w[h], %w[h], %w[c]\n"
            "mov     w9, v0.s[3]\n"
            "eor     %w[h], %w[h], w9\n"

            "add     %x[data], %x[data], #16\n"
            "sub     %x[len], %x[len], #16\n"
            : [h] "+r" (h),
              [data] "+r" (data),
              [len] "+r" (len)
            : [c] "r" (c)
            : "x9", "v0", "v1", "memory"
        );
    }

    // SIMD is done. Finish whatever we have left in memory with
    // the regular loop
    while (len >= 4) {
        asm(
            "ldr   %w[k], [%x[data]]\n"
            "mul   %w[k], %w[k], %w[c]\n"
            "lsr   w9, %w[k], #24\n"
            "eor   %w[k], %w[k], w9\n"
            "mul   %w[k], %w[k], %w[c]\n"

            "mul   %w[h], %w[h], %w[c]\n"
            "eor   %w[h], %w[h], %w[k]\n"

            "add   %x[data], %x[data], #4\n"
            "sub   %x[len], %x[len], #4\n"
            : [k] "=&r" (k),
              [h] "+r" (h),
              [data] "+r" (data),
              [len] "+r" (len)
            : [c] "r" (c)
            : "x9", "memory"
        );
    }

#else

    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= c;
        k ^= k >> 24;
        k *= c;

        h *= c;
        h ^= k;

        data += 4;
        len -= 4;
    }

#endif

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= c;
    }

    h ^= h >> 13;
    h *= c;
    h ^= h >> 15;

    return h;
}

Just to be safe, I compiled and ran the program on my x86_64 computer to make sure it still works:

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.198862
Time for large modified: 0.201308
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000190
Time for medium modified: 0.000157
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000000
Time for small modified: 0.000000

No problem. Still working normally.

Now let's try running it on Aarch64:

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330993
Time for large modified: 0.270605
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330845
Time for large modified: 0.270763
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000257
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330849
Time for large modified: 0.270494
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330873
Time for large modified: 0.271337
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000351
Time for medium modified: 0.000256
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000002
Time for small modified: 0.000000

Success! On Aarch64, the modified version is roughly 20% faster than the original on the large and medium inputs, and it still works on x86_64!