
SPO600 Project [Part 1] - Planning (Stage 1)

As a requirement for my SPO600 course, I need to complete and blog about a project that identifies a checksum/hash algorithm in an open-source project and improves its performance. The full requirements can be found here. This post is Stage 1 of my project, which is composed of the following steps:

1. Find an open source package that has a checksum/hash algorithm implemented in a language that gets compiled to machine code (C, C++, Assembly, etc.).
2. Benchmark its performance on AArch64.
3. Identify a strategy to improve the algorithm, for example: altered build options, algorithm optimization, changes in the code to allow better optimization by the compiler, and in-line assembly.

#### Finding an open source package

One of our professor's recommendations was a hash called MurmurHash. I found the algorithm very interesting: it is simple, very easy to port, and because of that simplicity, improving it can be very challenging. I decided to follow the recommendation, and I found that **Nginx** uses this algorithm. I don't think it will be an easy task to make it more performant, but I am up for the challenge.

#### Benchmark the performance on AArch64

Testing the performance of this algorithm should be relatively straightforward: the function can easily be detached from its file, which will allow me to put a wrapper around it specifically made for testing its performance. A few important factors when testing the performance of this algorithm:

- I do not want other users on the system (I am using a computer that is shared among several other users) spawning processes that would interfere with the algorithm's performance, so I made sure to run the tests when no other users were using the system.
- Since I can easily detach the function, I will take the time right before and right after the function runs, so I will not be measuring anything except that function in particular.
- A quick search in the Nginx repository shows that it is built with the **-O2** compiler flag for optimization and a few other options. I will assume that this is true for all cases. Because of this, my tests will be run with the **-O3** compiler flag.
- Sometimes a program will run much faster the second time it is executed (because of how it is cached). To avoid these possible errors in my measurements, I will run the benchmark 10 times.

So, here is my extracted algorithm:

```c
// murmur_original.c
// The include below is just for me to define a few
// constants, if I ever need it
#include "toolbox.h"

uint32_t original_hash(u_char *data, size_t len) {
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4) {
        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}
```

I duplicated this file and called it `murmur_modified.c` (changing the name of the function as well). I will use that file to modify the function, so I can compare the two versions side by side. All I need to do is make a wrapper and call the function.
This is the wrapper I made:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/types.h>

#include "murmur_original.c"
#include "murmur_modified.c"

#define DATA_LENGTH 500000002

int main() {
    uint32_t result_original;
    uint32_t result_modified;
    u_char tmp;
    u_char *data_o;
    int i;
    int are_identical;
    struct timeval timer_ori_before, timer_ori_after, timer_ori_result,
                   timer_mod_before, timer_mod_after, timer_mod_result;

    srand(time(NULL));

    // Creating random characters
    data_o = calloc(DATA_LENGTH, sizeof(u_char));
    for (i = 0; i < DATA_LENGTH; i++) {
        tmp = (u_char)((rand() % 26) + 'a');
        data_o[i] = tmp;
    }

    gettimeofday(&timer_ori_before, NULL);
    result_original = original_hash(data_o, sizeof(u_char) * DATA_LENGTH);
    gettimeofday(&timer_ori_after, NULL);

    gettimeofday(&timer_mod_before, NULL);
    result_modified = modified_hash(data_o, sizeof(u_char) * DATA_LENGTH);
    gettimeofday(&timer_mod_after, NULL);

    timersub(&timer_ori_after, &timer_ori_before, &timer_ori_result);
    timersub(&timer_mod_after, &timer_mod_before, &timer_mod_result);

    are_identical = result_original == result_modified;

    printf("Generated %d characters for hash functions\n", DATA_LENGTH);

    if (are_identical)
        printf("SUCCESS! - All hashes are identical for both functions\n");
    else
        printf("!!! ERROR !!! The hashes for both functions do not match\n");

    printf("Time for original: %ld.%06ld\n",
           (long int)timer_ori_result.tv_sec, (long int)timer_ori_result.tv_usec);
    printf("Time for modified: %ld.%06ld\n",
           (long int)timer_mod_result.tv_sec, (long int)timer_mod_result.tv_usec);

    free(data_o);
    return 0;
}
```

Still not perfect, but a good start. This wrapper:

1. Creates a long array of random characters
2. Starts the timer for the original algorithm
3. Sends the array to the first hash function and records the result
4. Stops the timer for the original algorithm
5. Starts the timer for the modified algorithm
6. Sends the array to the second hash function and records the result
7. Stops the timer for the modified algorithm
8. Compares both results. If they are different, my modified hash function is broken and the test is invalid; otherwise, they are comparable
9. Prints the time for the original and the modified function
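One caveat with this setup: the original function always runs first, so it absorbs any remaining warm-up cost (CPU frequency ramp-up, branch predictors, and so on), which might be part of why the "modified" copy consistently measures a hair faster in the results below. A tweak I may add later is an untimed warm-up pass before either timed call. Here is a minimal sketch of that idea, meant to be dropped into the wrapper above just before the first `gettimeofday()` call (this is my own possible change, not something the results below include):

```c
    /* Possible future tweak (not used for the results below):
     * one untimed warm-up pass so that neither timed call pays
     * the initial warm-up cost. */
    volatile uint32_t warmup;

    warmup = original_hash(data_o, sizeof(u_char) * DATA_LENGTH);
    (void) warmup;   /* result discarded; the volatile keeps -O3 from dropping the call */
```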
Here are the results for the first test run (the "modified" version is still the original algorithm):

```
Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330580
Time for modified: 0.330509

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.333051
Time for modified: 0.332968

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330849
Time for modified: 0.330729

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330410
Time for modified: 0.330330

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330744
Time for modified: 0.330551

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330604
Time for modified: 0.330511

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330427
Time for modified: 0.330301

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330391
Time for modified: 0.330341

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330375
Time for modified: 0.330274

Generated 500000002 characters for hash functions
SUCCESS! - All hashes are identical for both functions
Time for original: 0.330385
Time for modified: 0.330292
```

Not bad. The results seem to be very uniform.

#### Identifying the strategies

First, about altering the build options: Nginx has a very complete build configuration, and I am not comfortable messing with its build options. However, I do know that some compiler flags available for AArch64 may offer hints to the compiler for optimizations. When I investigated the file **/proc/cpuinfo** on the machine, I saw the following line:

```bash
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
```

Some AArch64 processors offer features to speed up hashing algorithms. This one in particular seems to have such features for algorithms like SHA-1 and SHA-2, but not for Murmur. I am not particularly optimistic about compiler flags, but it seems like SIMD may be an option for optimization. I will try a few options.

Second, about algorithm optimization: the code is very straightforward and there is not much room for improvement there. It is already using bitwise operations, and it has a very simple structure. I think it is safe to say that I will not find anything to improve in terms of the algorithm itself.

Third, about changing the code to allow better optimization by the compiler: I see some room for improvement here. Maybe I can make some changes in the code so the compiler will vectorize (SIMD) a few operations. It may give me some milliseconds. A rough sketch of the first idea I want to try is at the end of this post.

Lastly, inline assembly: probably not a tool that should be used very often. I find inline assembly really neat, but it has a fatal flaw: it kills portability. If I write inline assembly for AArch64, it will not run on x86-64 at all. It is something I should avoid, but it may be useful if I really can't find anything else to improve and I still want better performance.

#### Other to-dos

- Investigate how to test with different string lengths
- Check if the performance for x86-64 is affected by the changes
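#### A first sketch to try

To make the third strategy more concrete, here is the first change I plan to try in `murmur_modified.c`. This is only a sketch of my own idea, not something I have benchmarked yet: instead of assembling each 32-bit block from four byte loads and shifts, copy the block with `memcpy`, which GCC typically turns into a single 32-bit load and which should make the loop friendlier to the vectorizer. Note that this relies on the machine being little-endian (true for the AArch64 and x86-64 systems I am testing on); on a big-endian machine the hashes would no longer match.

```c
// murmur_modified.c (sketch of an idea, not benchmarked yet)
// Same algorithm as murmur_original.c, but each 4-byte block is
// loaded with memcpy instead of four shifted byte loads. On a
// little-endian machine this produces identical hashes.
#include <string.h>
#include "toolbox.h"

uint32_t modified_hash(u_char *data, size_t len) {
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4) {
        memcpy(&k, data, 4);    /* one word load instead of four byte loads */

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len) {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}
```

If this version still produces identical hashes in the wrapper, I can start comparing timings; even if it turns out not to help, the experiment will tell me how sensitive the loop is to the way the data is loaded.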