This is a continuation (second part) of my final project for SPO600. For more details, read this post.
In the last post I explained and talked about some approaches for my final project. On this post, I will investigate the first one: trying different build options.
Firstly, what exactly am I going to do? Passing different compiler flags to GCC may make it more aware of the platform we are building for, as well as being able to use its resources more wisely. For example: it can know exactly the size of the cache, which processor tools are available, and then use them for optimizing the application. So, what is my target machine?
[...@aarchie proc]$ lscpu
Byte Order: Little Endian
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
The command **lscpu** gave me some hints: it's an aarch64 machine with 8 cores. It also has a few interesting features such as **fp**, **asimd**, **sha1**, **sha2**, etc. These processor features are optimizations for floating point operations, SIMD, SHA1 and SHA2 - if I was using SHA1 or SHA2, I would have struck gold! I could activate these flags for the compiler, and our algorithm would be blazing fast. Unfortunately, there is nothing for Murmur hash there, but there is an intersting flag there: **asimd**. It is a flag for using with Advanced SIMD instructions. Using this flag may save some milliseconds!
I still don't know much about the CPU, and unfortunately, the command **dmidecode** didn't give me much information either:
Socket Designation: P0
Type: Central Processor
ID: 00 00 00 00 10 00 2F 01
Voltage: 1.0 V
External Clock: 100 MHz
Max Speed: 2000 MHz
Current Speed: 2000 MHz
Status: Populated, Enabled
Upgrade: Daughter Board
L1 Cache Handle: 0x000A
L2 Cache Handle: 0x000B
L3 Cache Handle: 0x000C
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 8
Core Enabled: 8
Thread Count: 8
Luckily, our professor told us: it is a Cortex A57.
Now, let's dive a bit in the GCC manual for AArch64. According to the manual, I can activate the SIMD flags, floating point instructions (**fp**) and encryption extensions only by adding the **crypto** feature modifier, together with the target architecture. This is how the flag would look like:
Since I will be using SIMD, it will also be useful to have the option `-fopt-info-vec-all`. This option will print diagnostics for the vectorized loops: when it succeeds, and when it can not be done. This will help me identifying places that I can improve.
Now, here is the plan: I'm going to compile the original algorithm twice. One with the **mcpu** flag, and the other one without it, then I will test their performance. I also modified my tester so it will test with 3 different array lengths instead of only one. The code is very similar to the older one, so I will just show you the results:
Generated 500000000 characters for large hash functions. Result: 3741523049
Time for large original: 0.330880
Generated 500000 characters for medium hash functions. Result: 3986793340
Time for medium original: 0.000332
Generated 500 characters for small hash functions. Result: 2959172869
Time for small original: 0.000001
O3 + Processor architecture + Crypto flags
Generated 500000000 characters for large hash functions. Result: 3644189640
Time for large original: 0.331015
Generated 500000 characters for medium hash functions. Result: 2710302921
Time for medium original: 0.000331
Generated 500 characters for small hash functions. Result: 1629857860
Time for small original: 0.000000
The results are the same. Why? There are several reasons:
- The Crypto flags are directed at SHA algorithms, not Murmur hash
- We are not actually doing many floating point instructions
- The code is not written to facilitate SIMD instructions
We can verify the last reason with the option **-fopt-info-vec-all**. This is what I get when I try compiling the code with this option:
murmur_original.c:10:11: note: === vect_analyze_data_refs ===
murmur_original.c:10:11: note: got vectype for stmt: k_90 = MEM[(u_char *)data_158]; vector(4) unsigned int
murmur_original.c:10:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:10:11: note: ===vect_slp_analyze_bb===
murmur_original.c:10:11: note: ===vect_slp_analyze_bb===
murmur_original.c:27:5: note: === vect_analyze_data_refs ===
murmur_original.c:27:5: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:27:5: note: ===vect_slp_analyze_bb===
murmur_original.c:29:11: note: === vect_analyze_data_refs ===
murmur_original.c:29:11: note: got vectype for stmt: _103 = MEM[(u_char *)data_75 + 2B]; vector(16) unsigned char
murmur_original.c:29:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:29:11: note: ===vect_slp_analyze_bb===
murmur_original.c:32:11: note: === vect_analyze_data_refs ===
murmur_original.c:32:11: note: got vectype for stmt: _109 = MEM[(u_char *)data_75 + 1B]; vector(16) unsigned char
murmur_original.c:32:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:32:11: note: ===vect_slp_analyze_bb===
murmur_original.c:36:11: note: === vect_analyze_data_refs ===
murmur_original.c:36:11: note: got vectype for stmt: _115 = *data_75;vector(16) unsigned char
murmur_original.c:36:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:36:11: note: ===vect_slp_analyze_bb===
We have many messages about vectorization there for the algorithm, and all of them say that the vectorization was not possible. Line 10 is where we have the *while* loop, and where we would like the vectorization to happen.
There is one very important detail to notice: just because these flags didn't make a difference NOW, it does not mean that they won't ever make a difference. The flags themselves are not going to make any difference if there is no code they can optimize. In other words, the program and the compiler flags are not independent - we need to make them work with each other.
In conclusion for this post, just using compiler flags for specifying the processor and using its flags did not make a difference: I will have to make the code more optimization-friendly, for example: making the loops vectorizable. After I do this, I can test the compiler flags again to see if they make a difference.