SPO600 Project [Part 2] - Testing different build flags

Date posted: 2017-12-22

This is a continuation (second part) of my final project for SPO600. For more details, read this post.

In the last post I explained and talked about some approaches for my final project. On this post, I will investigate the first one: trying different build options.

Firstly, what exactly am I going to do? Passing different compiler flags to GCC may make it more aware of the platform we are building for, as well as being able to use its resources more wisely. For example: it can know exactly the size of the cache, which processor tools are available, and then use them for optimizing the application. So, what is my target machine?

[...@aarchie proc]$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           4
NUMA node(s):        1
Model:               2
BogoMIPS:            500.00
NUMA node0 CPU(s):   0-7
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

The command lscpu gave me some hints: it's an aarch64 machine with 8 cores. It also has a few interesting features such as fp, asimd, sha1, sha2, etc. These processor features are optimizations for floating point operations, SIMD, SHA1 and SHA2 - if I was using SHA1 or SHA2, I would have struck gold! I could activate these flags for the compiler, and our algorithm would be blazing fast. Unfortunately, there is nothing for Murmur hash there, but there is an intersting flag there: asimd. It is a flag for using with Advanced SIMD instructions. Using this flag may save some milliseconds!

I still don't know much about the CPU, and unfortunately, the command dmidecode didn't give me much information either:

Processor Information
    Socket Designation: P0
    Type: Central Processor
    Family: ARM
    Manufacturer: AMD
    ID: 00 00 00 00 10 00 2F 01
    Version: N/A
    Voltage: 1.0 V
    External Clock: 100 MHz
    Max Speed: 2000 MHz
    Current Speed: 2000 MHz
    Status: Populated, Enabled
    Upgrade: Daughter Board
    L1 Cache Handle: 0x000A
    L2 Cache Handle: 0x000B
    L3 Cache Handle: 0x000C
    Serial Number: Not Specified
    Asset Tag: Not Specified
    Part Number: Not Specified
    Core Count: 8
    Core Enabled: 8
    Thread Count: 8
    Characteristics:
        64-bit capable
        Multi-Core
        Execute Protection
        Enhanced Virtualization

Luckily, our professor told us: it is a Cortex A57.

Now, let's dive a bit in the GCC manual for AArch64. According to the manual, I can activate the SIMD flags, floating point instructions (fp) and encryption extensions only by adding the crypto feature modifier, together with the target architecture. This is how the flag would look like:

-mcpu=cortex-a57+crypto

Since I will be using SIMD, it will also be useful to have the option -fopt-info-vec-all. This option will print diagnostics for the vectorized loops: when it succeeds, and when it can not be done. This will help me identifying places that I can improve.

Now, here is the plan: I'm going to compile the original algorithm twice. One with the mcpu flag, and the other one without it, then I will test their performance. I also modified my tester so it will test with 3 different array lengths instead of only one. The code is very similar to the older one, so I will just show you the results:

------------------------------
O3 Only
------------------------------
Generated 500000000 characters for large hash functions. Result: 3741523049
Time for large original: 0.330880
Generated 500000 characters for medium hash functions. Result: 3986793340
Time for medium original: 0.000332
Generated 500 characters for small hash functions. Result: 2959172869
Time for small original: 0.000001

------------------------------
O3 + Processor architecture + Crypto flags
------------------------------
Generated 500000000 characters for large hash functions. Result: 3644189640
Time for large original: 0.331015
Generated 500000 characters for medium hash functions. Result: 2710302921
Time for medium original: 0.000331
Generated 500 characters for small hash functions. Result: 1629857860
Time for small original: 0.000000

The results are the same. Why? There are several reasons:

We can verify the last reason with the option -fopt-info-vec-all. This is what I get when I try compiling the code with this option:

murmur_original.c:10:11: note: === vect_analyze_data_refs ===
murmur_original.c:10:11: note: got vectype for stmt: k_90 = MEM[(u_char *)data_158]; vector(4) unsigned int
murmur_original.c:10:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:10:11: note: ===vect_slp_analyze_bb===
murmur_original.c:10:11: note: ===vect_slp_analyze_bb===
murmur_original.c:27:5:  note: === vect_analyze_data_refs ===
murmur_original.c:27:5:  note: not vectorized: not enough data-refs in basic block.
murmur_original.c:27:5:  note: ===vect_slp_analyze_bb===
murmur_original.c:29:11: note: === vect_analyze_data_refs ===
murmur_original.c:29:11: note: got vectype for stmt: _103 = MEM[(u_char *)data_75 + 2B]; vector(16) unsigned char
murmur_original.c:29:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:29:11: note: ===vect_slp_analyze_bb===
murmur_original.c:32:11: note: === vect_analyze_data_refs ===
murmur_original.c:32:11: note: got vectype for stmt: _109 = MEM[(u_char *)data_75 + 1B]; vector(16) unsigned char
murmur_original.c:32:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:32:11: note: ===vect_slp_analyze_bb===
murmur_original.c:36:11: note: === vect_analyze_data_refs ===
murmur_original.c:36:11: note: got vectype for stmt: _115 = *data_75;vector(16) unsigned char
murmur_original.c:36:11: note: not vectorized: not enough data-refs in basic block.
murmur_original.c:36:11: note: ===vect_slp_analyze_bb===

We have many messages about vectorization there for the algorithm, and all of them say that the vectorization was not possible. Line 10 is where we have the while loop, and where we would like the vectorization to happen.

There is one very important detail to notice: just because these flags didn't make a difference NOW, it does not mean that they won't ever make a difference. The flags themselves are not going to make any difference if there is no code they can optimize. In other words, the program and the compiler flags are not independent - we need to make them work with each other.

In conclusion for this post, just using compiler flags for specifying the processor and using its flags did not make a difference: I will have to make the code more optimization-friendly, for example: making the loops vectorizable. After I do this, I can test the compiler flags again to see if they make a difference.