I will not introduce new information here. This post is mostly a summary of what I tried, what worked, what didn't and why, and the conclusions I drew from this project. For more information, please read the previous posts - they include a lot more detail.
Open-Source package and algorithm
The open-source project chosen was Nginx, which uses the MurmurHash algorithm.
Methodology for benchmarking
I made a timer that started right before the hash function was called and stopped right after the call returned, so I was timing only the function's execution (and the overhead of calling it, of course, but that is negligible). To make sure the results were reliable, I used 3 different string lengths and 10 repetitions of every test.
I tested the hash with randomly generated strings of three different lengths: 500,000,000 characters, 500,000 characters, and 500 characters.
Every test (with the 3 variations) was repeated 10 times, by executing the benchmark with a for loop in bash, similar to this:
for (( i = 0; i < 10; i++ )); do ./benchmark; done
Having similar results for every repetition is a good indicator that the results were not being affected by occasional abnormalities in the system, caching, etc.
Strategies for performance improvement
I tried 3 strategies for performance improvement. One of them was unsuccessful; the other two worked.
1 - Changing compiler flags
NGINX is already compiled with the appropriate -On optimization flags, so instead I tried activating some architecture-specific compiler flags that expose special features of the CPU.
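For illustration, flags of this kind on an AArch64 GCC toolchain look roughly like the following (these particular flags are my example, not necessarily the exact set used in the project):

```shell
# Illustrative GCC invocation enabling AArch64 SIMD/crypto extensions.
# The exact flag set used in the project may have differed.
gcc -O2 -march=armv8-a+crc+crypto -c murmurhash.c -o murmurhash.o
```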
These flags failed to yield any improvement, which can be explained by two facts: the default code was not SIMD-friendly, and the CPU's crypto features are not aimed at MurmurHash but mainly at the SHA family.
2 - Algorithm improvement and making the algorithm SIMD friendly
By using some preprocessor checks, I was able to eliminate some code from the algorithm, and by isolating some elements from the loop and forcing memory alignment, I was able to make the compiler use SIMD instructions.
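The idea can be sketched like this (a simplified illustration, not the actual Nginx MurmurHash code): when each 32-bit block is mixed independently and the compiler knows the buffer is aligned, it is free to process several blocks per vector instruction.

```c
/* Simplified sketch of a SIMD-friendly hash loop (not the exact Nginx
   MurmurHash code): each block is mixed independently, so the compiler
   can auto-vectorize the loop. */
#include <stddef.h>
#include <stdint.h>

static uint32_t hash_blocks(const uint8_t *data, size_t len)
{
    /* Tell GCC/Clang the buffer is 16-byte aligned, enabling aligned
       vector loads (an assumption this sketch makes about its caller). */
    const uint32_t *p = __builtin_assume_aligned(data, 16);
    size_t nblocks = len / 4;
    uint32_t h = 0;

    for (size_t i = 0; i < nblocks; i++)
        h ^= p[i] * 0x5bd1e995u; /* MurmurHash2's multiply constant */

    return h;
}
```

The real algorithm has a serial dependency between blocks; breaking that dependency apart is exactly what "isolating some elements from the loop" refers to.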
The gains in performance were small but significant. The fact that the gains were not huge can be explained by the extra overhead created by splitting the loop into several parts: we gained SIMD, but the program became significantly larger and more complex.
3 - Inline Assembly
Building "on top" of the previous solution, I rewrote the contents of the loop in assembly. Because I knew exactly how the assembly code should handle the information in the vector registers, I was able to make the code significantly faster.
In the second strategy, even though the compiler was able to vectorize my solution, the way it handled the values afterwards was suboptimal, and this particular part is where the inline assembly solution thrived.
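To give the flavor of this (an illustrative NEON fragment, not the project's actual loop), here is a small AArch64 inline-assembly routine that multiplies four 32-bit lanes by a constant, with a portable C fallback on other architectures:

```c
/* Illustrative AArch64 inline-assembly fragment (not the project's exact
   loop): multiply four 32-bit lanes by a constant using NEON, falling
   back to plain C on other architectures. */
#include <stdint.h>

static void mul4_lanes(uint32_t out[4], const uint32_t in[4], uint32_t m)
{
#if defined(__aarch64__)
    __asm__ volatile(
        "ld1  {v0.4s}, [%[in]]     \n\t" /* load 4 lanes           */
        "dup  v1.4s, %w[m]         \n\t" /* broadcast the constant */
        "mul  v0.4s, v0.4s, v1.4s  \n\t" /* lane-wise multiply     */
        "st1  {v0.4s}, [%[out]]    \n\t" /* store the result       */
        :
        : [in] "r" (in), [out] "r" (out), [m] "r" (m)
        : "v0", "v1", "memory");
#else
    for (int i = 0; i < 4; i++)
        out[i] = in[i] * m;
#endif
}
```

Writing the vector handling by hand like this is what let me avoid the compiler's suboptimal lane shuffling.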
Compatibility with other CPU architectures
In the end, I used the C preprocessor to determine the CPU architecture: an AArch64 CPU with the right endianness uses my solution, while other architectures fall back to the default algorithm. I tested the solutions on both x86_64 and AArch64 machines, and both worked: the AArch64 machine showed a significant improvement in performance, while the x86_64 machine kept its previous performance.
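The dispatch check itself can be sketched like this (the macro name is mine, for illustration; the guard I actually submitted may have been worded differently):

```c
/* Sketch of the architecture dispatch (macro name is illustrative):
   the optimized path is compiled only on little-endian AArch64, and
   every other target keeps the portable default algorithm. */
#if defined(__aarch64__) && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define HASH_USE_OPTIMIZED 1
#else
#define HASH_USE_OPTIMIZED 0
#endif
```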