SPO600 Project [Part 3] - SIMD and Algorithm Optimization

Date posted: 2017-12-29

This is a continuation (third part) of my final project for SPO600. For more details, read this post.

In the last post I experimented with different build flags for the compiler. In this post, I will start modifying the code and try to make the algorithm faster. I have two ideas here, and both of them are related to SIMD (Single Instruction, Multiple Data).

First approach: eliminating unnecessary bitwise operations

First, let's refer back to the original code:

#include "toolbox.h"

uint32_t
original_hash(u_char *data, size_t len)
{
    uint32_t  h, k;

    h = 0 ^ len;

    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

I am curious to understand what the lines below do, so let's explore them a bit more:

k  = data[0];
k |= data[1] << 8;
k |= data[2] << 16;
k |= data[3] << 24;

Just as a reminder: k is a 32-bit integer, and data is an array of 8-bit char elements. So what do those bit-shift operations do, exactly?

Let's imagine that k is initialized with all bits as 0. Let's also imagine that data has 4 elements: [a, b, c, d]. This is how they might look in memory:

 k
 8 bits    8 bits    8 bits    8 bits
|--------||--------||--------||--------|
 00000000  00000000  00000000  00000000 

 data[0]   data[1]   data[2]   data[3]
 8 bits    8 bits    8 bits    8 bits
|--------||--------||--------||--------|
 01100001  01100010  01100011  01100100 

Now, what happens with those operations? Notice the pattern: each character in data is shifted left by 0, 8, 16, or 24 bits (one byte position further for each element) and then OR'd into k. Let's walk through it:

# operation: k |= data[0]
data[0]: 00000000  00000000  00000000  01100001
k:       00000000  00000000  00000000  00000000
result:  00000000  00000000  00000000  01100001

# operation: k |= (data[1] << 8)
data[1]:          00000000  00000000  00000000  01100010
data[1] shifted:  00000000  00000000  01100010  00000000
k:                00000000  00000000  00000000  01100001
result:           00000000  00000000  01100010  01100001

# operation: k |= (data[2] << 16)
data[2]:          00000000  00000000  00000000  01100011
data[2] shifted:  00000000  01100011  00000000  00000000
k:                00000000  00000000  01100010  01100001
result:           00000000  01100011  01100010  01100001

# operation: k |= (data[3] << 24)
data[3]:          00000000  00000000  00000000  01100100
data[3] shifted:  01100100  00000000  00000000  00000000
k:                00000000  01100011  01100010  01100001
result:           01100100  01100011  01100010  01100001

Ok. Let's compare the final result then:

k    = 01100100  01100011  01100010  01100001
data = 01100001  01100010  01100011  01100100 

What we have here is simply the bytes of data packed into an integer in reverse order. We have some room for improvement here!
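To make this concrete, here is a minimal standalone sketch (not part of the project code) that builds k both ways from the string "abcd": once with the shifts, and once by copying the raw bytes into an integer. On the machines I am working with (which are little-endian - more on endianness below), both lines print 0x64636261:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    unsigned char data[4] = { 'a', 'b', 'c', 'd' };   /* 0x61 0x62 0x63 0x64 */
    uint32_t k_shift, k_copy;

    /* The shift-and-or construction used by the hash */
    k_shift  = data[0];
    k_shift |= data[1] << 8;
    k_shift |= data[2] << 16;
    k_shift |= data[3] << 24;

    /* Reinterpreting the same 4 bytes as a 32-bit integer */
    memcpy(&k_copy, data, sizeof(k_copy));

    printf("k (shifts): 0x%08x\n", (unsigned int)k_shift);
    printf("k (memcpy): 0x%08x\n", (unsigned int)k_copy);   /* matches only on little-endian */

    return 0;
}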

Going back to processor architecture for a moment: there is a thing called endianness, which refers to the order in which the bytes of a multi-byte numerical value are stored in memory. There are two main types of endianness: little-endian and big-endian. In big-endian, the most significant byte is stored first (at the lowest address), while in little-endian the opposite is true. For example:

# Character array:
00000001 00000010 00000011 00000100

# Mapped to a 32-bit integer, big-endian:
00000001 00000010 00000011 00000100

# Mapped to a 32-bit integer, little-endian:
00000100 00000011 00000010 00000001
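If you are curious which one your own machine uses, a quick throwaway check (again, not part of the project code) is to store a known integer and look at which byte ends up first in memory:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t value = 0x01020304;
    unsigned char *bytes = (unsigned char *)&value;

    /* On little-endian, the least significant byte (0x04) comes first in memory */
    if (bytes[0] == 0x04)
        printf("This machine is little-endian\n");
    else
        printf("This machine is big-endian\n");

    return 0;
}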

Which one is better? I guess it's a matter of taste, like where to place your brackets or tabs vs. spaces: one side defends the truth (K&R indentation and spaces are definitely the best), while the other side defends its right to be wrong. Personally, I try to use all indentation styles and both tabs and spaces in all files, because I like diversity.

So, as you may have noticed, what the algorithm is doing with all those bit-shifts is essentially enforcing a little-endian byte order, which is completely unnecessary on processors that are already little-endian. I can use this to speed things up: by checking the endianness of the processor, I can just cast that section of the array to an integer. According to the GNU manual, I can use the predefined macros __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ for this (assuming we are using the GNU compiler).

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

Alright. Let's see if we get any difference.

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330718
Time for large modified: 0.330592
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000319
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330793
Time for large modified: 0.330683
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000333
Time for medium modified: 0.000319
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330750
Time for large modified: 0.330641
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000318
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330752
Time for large modified: 0.330636
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000334
Time for medium modified: 0.000318
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000001
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330834
Time for large modified: 0.330662
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000318
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

Looking at the compiled result, we can see that the processor architecture is, indeed, little-endian, because the preprocessor conditional selected the cast branch:

    while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
  4008a0:       b84046c0        ldr     w0, [x22],#4

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

The difference was very small. Why? Because bit-shifting operations are already very fast. Still, it's our first improvement!

Second approach: vectorization

Looking at what the compiler produced, I did not see any mention of vector registers, which is a clear sign that no vectorization was done, even though there is room for it:

    while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

        // The 3 lines below can be vectorized
        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

The value in k is a 32-bit integer that goes through 3 operations. If we somehow vectorize this, we could process four k's at the same time! This is really not that much, if you think about it: for every four k's, instead of 12 scalar operations we would be doing 3 vector operations. It adds up, for sure, but I don't expect to see a huge improvement.
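Just to illustrate what "four k's at the same time" means, here is a rough sketch using ARM NEON intrinsics (the helper name mix_four_k is made up for this example, and the actual plan is to let the compiler auto-vectorize the code rather than write intrinsics by hand):

#include <stdint.h>
#include <arm_neon.h>

/* Illustrative only: apply the 3 operations to four k values at once,
   one vector instruction per step instead of one scalar instruction per k. */
static inline void mix_four_k(uint32_t k4[4])
{
    uint32x4_t m = vdupq_n_u32(0x5bd1e995);    /* broadcast the constant  */
    uint32x4_t k = vld1q_u32(k4);              /* load k[0..3]            */

    k = vmulq_u32(k, m);                       /* k *= 0x5bd1e995         */
    k = veorq_u32(k, vshrq_n_u32(k, 24));      /* k ^= k >> 24            */
    k = vmulq_u32(k, m);                       /* k *= 0x5bd1e995         */

    vst1q_u32(k4, k);                          /* store the results back  */
}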

So, why is vectorization not happening there? It is not happening because the operations in that loop are too "tied" together: the 3 operations on k are sandwiched between the load before them and the update of h after them, and the update of h depends on the result of the previous iteration, so the compiler cannot vectorize the loop as a whole. We'll have to take the loop apart ourselves. On top of that, data is probably not aligned in memory (I talked about memory alignment in this post). What's the solution then?

1- I will align the data memory myself.

Why? Well, we can't align data at its source, but I can make sure that, when we enter that "main" loop, memory will be properly aligned and ready to be processed in batches. This will help make vectorization possible for the compiler.

Imagine we have this situation, where data is spread in memory: the start is not aligned, and the end is in another page. The chunk that can be vectorized starts as soon as element data[n] is aligned, and ends at element data[m], where the page ends.

#                                                 v-- END OF PAGE
.-- ??? -------..-- data ------------------------------..-- ??? ------.
|--------||--------||--------||--------||--------| |--------||--------|
#                   ^-- vectorizable chunk ------^
#                   | data[n]            data[m] |
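As a side note, checking whether a pointer is aligned is just a matter of masking its low bits. Here is a quick sketch of the idea (the helper bytes_to_alignment is made up for illustration; the actual code further down simply loops until the pointer is aligned):

#include <stdio.h>
#include <stdint.h>

/* How many bytes until p reaches the next 16-byte boundary? */
static size_t bytes_to_alignment(const unsigned char *p)
{
    return (16 - ((uintptr_t)p & 0xF)) & 0xF;
}

int main(void)
{
    unsigned char buffer[64];
    int i;

    for (i = 0; i < 4; i++)
        printf("buffer + %d -> %zu byte(s) away from a 16-byte boundary\n",
               i, bytes_to_alignment(buffer + i));

    return 0;
}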

So, this is what I am going to do: split the work into 3 loops. The first one is the regular loop, and it stops as soon as the current element is properly aligned; the second loop is where I expect the vectorization to happen, and it runs until the remaining data is smaller than one batch; the third loop then takes care of whatever is left, one word at a time. If I skip this last step, the vectorized loop could end up reading memory past the data that belongs to me - possibly outside the current page - which would result in a segmentation fault.

2- I will isolate the 3 operations on k in another loop - this loop can then be vectorized.

Take a look at this:

    while (len >= 4) {
        k = *((uint32_t *)data);

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

An alternative is to isolate the load and the 3 operations on k in one loop, saving the results in an array, and then having a second loop consume the values from that k array. For example:

    for (...) {
        k[i] = *((uint32_t *)data);

        k[i] *= 0x5bd1e995;
        k[i] ^= k[i] >> 24;
        k[i] *= 0x5bd1e995;

        data += 4;
        len -= 4;
    }

    for (...) {
        h *= 0x5bd1e995;
        h ^= k[i];
    }

The advantage of doing this is that the operations on k are now isolated: that first loop can now be vectorized! The disadvantage is: now we have two loops instead of one. If the gains from the vectorization are not enough to compensate for the second loop, the program will end up being slower.

There is another problem here: if I just make one huge array for all the k's, we will have to worry about cache misses and trips to RAM. To avoid this, instead of making an array for all the values of k, I can make a small one and process the data in batches. For example: I can make an array of 32 k's; the loop that fills it does 32 iterations, but since SIMD handles 4 k's per instruction, that amounts to only 8 vector iterations - and, with some luck, the compiler will vectorize it exactly like that.

Another advantage of this "batch" approach: if the batches have a fixed length (for example, 32), the compiler knows the trip count at compile time and can unroll the loop! The batch is large enough to save a good number of jumps, and small enough for the k array to fit in cache.

Joining both approaches

This is what my code looks like after checking for little-endian processors, aligning memory manually, and isolating the loops so the compiler would vectorize them:

#include "toolbox.h"

// How many "k"s at a time to store?
#define BL_SIZE 32

uint32_t
modified_hash(u_char *data, size_t len)
{
    uint32_t  h, k;
    int j;
    uint32_t k2[BL_SIZE];

    h = 0 ^ len;

    // Using the regular loop until the memory is aligned

    while (((size_t)data & 0b1111) != 0 && len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    // The memory is aligned: run the split loops until a full
    // batch no longer fits in the leftover data

    for (; len > BL_SIZE * 4; len -= BL_SIZE * 4) {

        // Fill the "k" array in batches - I expect this loop
        // to be vectorized
        for (j = 0; j < BL_SIZE; j++) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
            k2[j] = *((uint32_t *)data);
#else
            k2[j]  = data[0];
            k2[j] |= data[1] << 8;
            k2[j] |= data[2] << 16;
            k2[j] |= data[3] << 24;
#endif

            k2[j] *= 0x5bd1e995;
            k2[j] ^= k2[j] >> 24;
            k2[j] *= 0x5bd1e995;

            data += 4;
        }

        // Go through the batch and fold each k into h. This loop will not
        // be vectorized
        for (j = 0; j < BL_SIZE; j++) {
            h *= 0x5bd1e995;
            h ^= k2[j];
        }
    }

    // SIMD is done. Finish whatever we have left in memory with
    // the regular loop

    while (len >= 4) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        k = *((uint32_t *)data);
#else
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

Surely this monster will not work, right? Let's see:

Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330718
Time for large modified: 0.322143
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000309
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330843
Time for large modified: 0.322202
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000334
Time for medium modified: 0.000310
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330706
Time for large modified: 0.322616
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000310
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330748
Time for large modified: 0.322156
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000331
Time for medium modified: 0.000310
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000
Generated 500000000 characters for large hash functions
SUCCESS! - All hashes are identical for both functions
Time for large original: 0.330801
Time for large modified: 0.322446
Generated 500000 characters for medium hash functions
SUCCESS! - All hashes are identical for both functions
Time for medium original: 0.000332
Time for medium modified: 0.000328
Generated 500 characters for small hash functions
SUCCESS! - All hashes are identical for both functions
Time for small original: 0.000001
Time for small modified: 0.000000

That's good! Although the code is a lot larger, we definitely gained some milliseconds!

Note: I tested the code again with different data sizes (to make sure I wouldn't get segmentation faults) and batch sizes. It seems like everything is working properly, and a batch size of 32 works really well.

Now, the question: is vectorization actually being done, or is this just some sort of black magic? Let's see the assembly:

            k2[j] *= 0x5bd1e995;
  400c48:       4ea09e18        mul     v24.4s, v16.4s, v0.4s
            k2[j] = *((uint32_t *)data);
  400c4c:       3cdc0004        ldur    q4, [x0,#-64]
            k2[j] *= 0x5bd1e995;
  400c50:       4ea09cf7        mul     v23.4s, v7.4s, v0.4s
            k2[j] = *((uint32_t *)data);
  400c54:       3cdd0003        ldur    q3, [x0,#-48]
            k2[j] *= 0x5bd1e995;
  400c58:       4ea09cd6        mul     v22.4s, v6.4s, v0.4s
            k2[j] = *((uint32_t *)data);
  400c5c:       3cde0002        ldur    q2, [x0,#-32]
            k2[j] *= 0x5bd1e995;
  400c60:       4ea09cb5        mul     v21.4s, v5.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c64:       6f280710        ushr    v16.4s, v24.4s, #24
            k2[j] = *((uint32_t *)data);
  400c68:       3cdf0001        ldur    q1, [x0,#-16]
            k2[j] *= 0x5bd1e995;
  400c6c:       4ea09c94        mul     v20.4s, v4.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c70:       6f2806e7        ushr    v7.4s, v23.4s, #24
            k2[j] *= 0x5bd1e995;
  400c74:       4ea09c73        mul     v19.4s, v3.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c78:       6f2806c6        ushr    v6.4s, v22.4s, #24
            k2[j] *= 0x5bd1e995;
  400c7c:       4ea09c52        mul     v18.4s, v2.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c80:       6e381e10        eor     v16.16b, v16.16b, v24.16b
  400c84:       6f2806a5        ushr    v5.4s, v21.4s, #24
            k2[j] *= 0x5bd1e995;
  400c88:       4ea09c31        mul     v17.4s, v1.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c8c:       6e371ce7        eor     v7.16b, v7.16b, v23.16b
  400c90:       6f280684        ushr    v4.4s, v20.4s, #24
            k2[j] *= 0x5bd1e995;
  400c94:       4ea09e10        mul     v16.4s, v16.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400c98:       6e361cc6        eor     v6.16b, v6.16b, v22.16b
  400c9c:       6f280663        ushr    v3.4s, v19.4s, #24
            k2[j] *= 0x5bd1e995;
  400ca0:       4ea09ce7        mul     v7.4s, v7.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400ca4:       6e351ca5        eor     v5.16b, v5.16b, v21.16b
  400ca8:       6f280642        ushr    v2.4s, v18.4s, #24
            k2[j] *= 0x5bd1e995;
  400cac:       4ea09cc6        mul     v6.4s, v6.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400cb0:       6e341c84        eor     v4.16b, v4.16b, v20.16b
  400cb4:       6f280621        ushr    v1.4s, v17.4s, #24
            k2[j] *= 0x5bd1e995;
  400cb8:       4ea09ca5        mul     v5.4s, v5.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400cbc:       6e331c63        eor     v3.16b, v3.16b, v19.16b
  400cc0:       6e321c42        eor     v2.16b, v2.16b, v18.16b
            k2[j] *= 0x5bd1e995;
  400cc4:       3d8003f0        str     q16, [sp]
  400cc8:       4ea09c84        mul     v4.4s, v4.4s, v0.4s
            k2[j] ^= k2[j] >> 24;
  400ccc:       6e311c21        eor     v1.16b, v1.16b, v17.16b
            k2[j] *= 0x5bd1e995;
  400cd0:       3d8007e7        str     q7, [sp,#16]
  400cd4:       4ea09c63        mul     v3.4s, v3.4s, v0.4s
  400cd8:       3d800be6        str     q6, [sp,#32]
  400cdc:       4ea09c42        mul     v2.4s, v2.4s, v0.4s
  400ce0:       3d800fe5        str     q5, [sp,#48]
  400ce4:       4ea09c21        mul     v1.4s, v1.4s, v0.4s
  400ce8:       3d8013e4        str     q4, [sp,#64]
  400cec:       3d8017e3        str     q3, [sp,#80]
  400cf0:       3d801be2        str     q2, [sp,#96]
  400cf4:       3d801fe1        str     q1, [sp,#112]

Not only do we have tons of vector registers being used, the compiler also unrolled the loop! That's a victory!