In the last post I explained and talked about some approaches for my final project. On this post, I will investigate the first one: trying different build options.
Firstly, what exactly am I going to do? Passing different compiler flags to GCC may make it more aware of the platform we are building for, as well as being able to use its resources more wisely. For example: it can know exactly the size of the cache, which processor tools are available, and then use them for optimizing the application. So, what is my target machine?
[...@aarchie proc]$ lscpu Architecture: aarch64 Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 2 Socket(s): 4 NUMA node(s): 1 Model: 2 BogoMIPS: 500.00 NUMA node0 CPU(s): 0-7 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
The command lscpu gave me some hints: it's an aarch64 machine with 8 cores. It also has a few interesting features such as fp, asimd, sha1, sha2, etc. These processor features are optimizations for floating point operations, SIMD, SHA1 and SHA2 - if I was using SHA1 or SHA2, I would have struck gold! I could activate these flags for the compiler, and our algorithm would be blazing fast. Unfortunately, there is nothing for Murmur hash there, but there is an intersting flag there: asimd. It is a flag for using with Advanced SIMD instructions. Using this flag may save some milliseconds!
I still don't know much about the CPU, and unfortunately, the command dmidecode didn't give me much information either:
Processor Information Socket Designation: P0 Type: Central Processor Family: ARM Manufacturer: AMD ID: 00 00 00 00 10 00 2F 01 Version: N/A Voltage: 1.0 V External Clock: 100 MHz Max Speed: 2000 MHz Current Speed: 2000 MHz Status: Populated, Enabled Upgrade: Daughter Board L1 Cache Handle: 0x000A L2 Cache Handle: 0x000B L3 Cache Handle: 0x000C Serial Number: Not Specified Asset Tag: Not Specified Part Number: Not Specified Core Count: 8 Core Enabled: 8 Thread Count: 8 Characteristics: 64-bit capable Multi-Core Execute Protection Enhanced Virtualization
Luckily, our professor told us: it is a Cortex A57.
Now, let's dive a bit in the GCC manual for AArch64. According to the manual, I can activate the SIMD flags, floating point instructions (fp) and encryption extensions only by adding the crypto feature modifier, together with the target architecture. This is how the flag would look like:
Since I will be using SIMD, it will also be useful to have the option
-fopt-info-vec-all. This option will print diagnostics for the vectorized loops: when it succeeds, and when it can not be done. This will help me identifying places that I can improve.
Now, here is the plan: I'm going to compile the original algorithm twice. One with the mcpu flag, and the other one without it, then I will test their performance. I also modified my tester so it will test with 3 different array lengths instead of only one. The code is very similar to the older one, so I will just show you the results:
------------------------------ O3 Only ------------------------------ Generated 500000000 characters for large hash functions. Result: 3741523049 Time for large original: 0.330880 Generated 500000 characters for medium hash functions. Result: 3986793340 Time for medium original: 0.000332 Generated 500 characters for small hash functions. Result: 2959172869 Time for small original: 0.000001 ------------------------------ O3 + Processor architecture + Crypto flags ------------------------------ Generated 500000000 characters for large hash functions. Result: 3644189640 Time for large original: 0.331015 Generated 500000 characters for medium hash functions. Result: 2710302921 Time for medium original: 0.000331 Generated 500 characters for small hash functions. Result: 1629857860 Time for small original: 0.000000
The results are the same. Why? There are several reasons:
- The Crypto flags are directed at SHA algorithms, not Murmur hash
- We are not actually doing many floating point instructions
- The code is not written to facilitate SIMD instructions
We can verify the last reason with the option -fopt-info-vec-all. This is what I get when I try compiling the code with this option:
murmur_original.c:10:11: note: === vect_analyze_data_refs === murmur_original.c:10:11: note: got vectype for stmt: k_90 = MEM[(u_char *)data_158]; vector(4) unsigned int murmur_original.c:10:11: note: not vectorized: not enough data-refs in basic block. murmur_original.c:10:11: note: ===vect_slp_analyze_bb=== murmur_original.c:10:11: note: ===vect_slp_analyze_bb=== murmur_original.c:27:5: note: === vect_analyze_data_refs === murmur_original.c:27:5: note: not vectorized: not enough data-refs in basic block. murmur_original.c:27:5: note: ===vect_slp_analyze_bb=== murmur_original.c:29:11: note: === vect_analyze_data_refs === murmur_original.c:29:11: note: got vectype for stmt: _103 = MEM[(u_char *)data_75 + 2B]; vector(16) unsigned char murmur_original.c:29:11: note: not vectorized: not enough data-refs in basic block. murmur_original.c:29:11: note: ===vect_slp_analyze_bb=== murmur_original.c:32:11: note: === vect_analyze_data_refs === murmur_original.c:32:11: note: got vectype for stmt: _109 = MEM[(u_char *)data_75 + 1B]; vector(16) unsigned char murmur_original.c:32:11: note: not vectorized: not enough data-refs in basic block. murmur_original.c:32:11: note: ===vect_slp_analyze_bb=== murmur_original.c:36:11: note: === vect_analyze_data_refs === murmur_original.c:36:11: note: got vectype for stmt: _115 = *data_75;vector(16) unsigned char murmur_original.c:36:11: note: not vectorized: not enough data-refs in basic block. murmur_original.c:36:11: note: ===vect_slp_analyze_bb===
We have many messages about vectorization there for the algorithm, and all of them say that the vectorization was not possible. Line 10 is where we have the while loop, and where we would like the vectorization to happen.
There is one very important detail to notice: just because these flags didn't make a difference NOW, it does not mean that they won't ever make a difference. The flags themselves are not going to make any difference if there is no code they can optimize. In other words, the program and the compiler flags are not independent - we need to make them work with each other.
In conclusion for this post, just using compiler flags for specifying the processor and using its flags did not make a difference: I will have to make the code more optimization-friendly, for example: making the loops vectorizable. After I do this, I can test the compiler flags again to see if they make a difference.