Auto-vectorization with GCC on AArch64, and also my SPO600 Lab 5

Date posted: 2017-10-09

In my last post I talked about some optimizations that the C compiler can do to our code. This time, I will talk a little about vectorization and how the compiler can also do it automatically.

First, I want to talk about what vectorization is. Suppose we have the following code:

short arr1[] = { 1, 1, 2, 2 };
short arr2[] = { 3, 4, 4, 3 };

Now, say we want to sum the elements of these arrays pairwise (1 + 3, 1 + 4, 2 + 4, and 2 + 3) into a third array, arr3. How could we do this?

A simple approach is the following:

for (size_t i = 0; i < 4; i++) arr3[i] = arr1[i] + arr2[i];

This works, but we are not being as efficient as we could be. Let's think about memory for a bit: a short is typically 2 bytes (16 bits) long, and words on modern processors are 64 bits long. This means that for every iteration, we are loading a 16-bit number into a 64-bit register:

(I am going to use decimal notation for some of these examples, because it is much easier to read.)

// Imagine we are loading the numbers "4" and "1" in registers
// in order to add them. This is what the compiler would do:

      16 bit            16 bit           16 bit           16 bit
|----------------|----------------||----------------|----------------|
        4                 0                0                 0 // <- a[]
        1                 0                0                 0 // <- b[]
        5                 0                0                 0 // <- c[] (result)

This is a lot of wasted space. If we somehow loaded more numbers into that wasted space, we would be doing 4 operations at a time, instead of only 1:

      16 bit            16 bit           16 bit           16 bit
|----------------|----------------||----------------|----------------|
        1                 1                2                 2 // <- a[]
        3                 4                4                 3 // <- b[]
        4                 5                6                 5 // <- c[] (result)

This would be great! We could theoretically align the numbers in memory (or not, if they are already aligned), load them into the register, and do these operations in parallel! But there is a catch: if the numbers are large enough, their sum will carry over into the lane to the left, ruining the whole operation:

//                                       0 + 0         65535 + 1 *
      16 bit            16 bit           16 bit           16 bit
|----------------|----------------||----------------|----------------|
 ???????????????? ????????????????  0000000000000000 1111111111111111
 ???????????????? ????????????????  0000000000000000 0000000000000001

 ???????????????? ????????????????  0000000000000000 0000000000000000 // <- what we want
 ???????????????? ????????????????  0000000000000001 0000000000000000 // <- what we get
//                                                 ^- This should not be here!

// * assuming unsigned values

Checking for this pitfall manually would be too tedious and too complicated - it would likely make the process slow again, and it would just not be worth it. However, modern CPUs have special registers for exactly this kind of operation, called vector registers: we can tell them how long each element is, and they will keep the lanes separate, avoiding this overflow. Since they are built for a high volume of data, they can also be wider than regular registers - 128 bits instead of 64, for example. Using these registers to perform operations in parallel is called vectorization - it is the classic way of applying a single instruction to multiple data (SIMD).

Luckily, compilers like GCC have built-in vectorization - this means that if we are compiling with optimization, the compiler will try to vectorize operations whenever it finds the opportunity. To make this possible, however, we need to adopt some good practices. These practices make sure that vectorization stays easy and does not carry extra overhead that would ruin the performance. The two main practices I would like to talk about are memory alignment and data dependencies.

1. Memory alignment

Memory is just a long strip of data. Say this is what our memory looks like, with both our arrays in it:

 Word 1                                  Word 2
|---------------------------------------|---------------------------------------|
0x0  0x1  0x2  0x3  0x4  0x5  0x6  0x7  0x8  0x9  0xA  0xB  0xC  0xD  0xE  0xF
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
  ?    ?    1    1    2    2    ?    ?    ?    3    4    4    3    ?    ?    ?

Cool. Now let's try loading both words in vector registers to add them together:

|----|----|----|----|----|----|----|----|
  ?    ?    1    1    2    2    ?    ?
  ?    3    4    4    3    ?    ?    ?

Well, that is a problem: the arrays are misaligned, so matching elements do not land in matching lanes. We can't do the vectorization here properly. Well, we can, but the compiler would have to emit extra code to fix up the misalignment first, which costs time. This is why memory alignment matters.

However, C is a really cool language, and it has some tools we can use to align the memory:

short arr1[] __attribute__((__aligned__(16))) = { 1, 1, 2, 2 };
short arr2[] __attribute__((__aligned__(16))) = { 3, 4, 4, 3 };

It is not very pretty, but it will align our arrays: we are telling the compiler to put the start of each array at an address that is a multiple of 16. Why 16? Because 16 bytes is 128 bits - which lines our data up with the beginning of every 128-bit word for the vector registers.

Notice that even if we are using those attributes, our data can still be misaligned. For example, if this is our operation:

for (size_t i = 0; i < 4; i++) arr3[i] = arr1[i] + arr2[i + 1];

We don't want arr1[i] to be aligned with arr2[i], we want arr1[i] to be aligned with arr2[i + 1]! We must keep this in mind when aligning our values.

C structs are a good way to keep variables together. For example:

struct example {
  int   a;  /* 32 bits */
  char  b;  /*  8 bits */
  short c;  /* 16 bits */
};

These three members are laid out sequentially in memory. Assuming an int of 32 bits, a char of 8 bits, and a short of 16 bits, that is 56 bits of data; with padding, each struct is aligned to occupy 64 bits. However, if we wanted to vectorize the sum of the member "a" across several of these structs, we would run into another problem: consecutive "a" members are 64 bits apart in memory, interleaved with "b", "c", and padding, so we cannot easily fit more than one of them into the same vector register.

For this reason, if you are planning to vectorize, it is recommended to use a structure of arrays instead of an array of structures: keep the values you want to vectorize close to each other.

2. Data dependencies

Say that we are doing this operation in the loop, and it will be vectorized:

a[i + 1] = a[i] + b[i];

This is a big problem: in a scalar loop, each iteration reads the value that the previous iteration just wrote, which works fine. But vectorization would compute a[i] and a[i + 1] at the same time, before the write from the "previous" iteration has happened, so the result would change. Loop-carried dependencies like this can make vectorization too difficult, or outright impossible.

3. Other practices

Other practices that will make it easier for the compiler to vectorize our code (but that I won't spend too much time talking about) are: keeping loop bounds simple and countable at compile time, avoiding function calls and complex branching inside the loop body, and avoiding pointer aliasing (or ruling it out with the restrict keyword).

4. Vectorizing in practice

I wrote the following application in C:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    srand(1);

    int a1[1000] __attribute__((__aligned__(16)));
    int a2[1000] __attribute__((__aligned__(16)));
    int a3[1000] __attribute__((__aligned__(16)));

    int sum = 0;

    for (int i = 0; i < 1000; i++) {
        a1[i] = (rand() % 2000) - 1000;
        a2[i] = (rand() % 2000) - 1000;
    }

    for (int i = 0; i < 1000; i++) {
        a3[i] = a1[i] + a2[i];
    }

    for (int i = 0; i < 1000; i++) {
        sum += a3[i];
    }

    printf("Sum: %d\n", sum);

    return 0;
}

The idea is simple: 3 arrays, load two arrays with random numbers, add the numbers into the third array, add the numbers from the third array into a variable "sum", print the sum, exit.
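For reference, this is roughly how the listings below can be reproduced; the file name lab5.c is my own placeholder, and -fopt-info-vec-optimized is an optional GCC flag that reports which loops were vectorized:

```shell
# Build with full optimization (vectorization is enabled at -O3):
gcc -O3 lab5.c -o lab5

# Disassemble the optimized binary to inspect the generated code:
objdump -d lab5 | less

# Or ask GCC directly which loops it managed to vectorize:
gcc -O3 -fopt-info-vec-optimized lab5.c -o lab5
```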

I compiled the code above on an AArch64 machine, with the flag -O3 (optimized), and here is the result in assembly:

I cleaned up the listing a little, removing some useless comments and leaving part of the interleaved debugging source. I also added some observations pointing out what is important.

/* Just some initialization stuff */
mov   x16, #0x2f20
sub   sp, sp, x16

/* Seeding the random generator */
mov   w0,  #0x1
stp   x29, x30, [sp]
mov   x29, sp
stp   x19, x20, [sp,#16]

/* for (int i = 0; i < 1000; i++) {    */

/*  a1[i] = (rand() % 2000) - 1000; */
mov   w20, #0x4dd3
stp   x21, x22, [sp,#32]
/*  a1[i] = (rand() % 2000) - 1000; */
movk  w20, #0x1062, lsl #16
str   x23, [sp,#48]
add   x22, x29, #0x40
add   x21, x29, #0xfe0
/*  a1[i] = (rand() % 2000) - 1000; */
mov   w19, #0x7d0
mov   x23, #0x0
bl    400510 <srand@plt>
/*  a1[i] = (rand() % 2000) - 1000; */
bl    4004e0 <rand@plt>
smull x1, w0, w20
asr   x1, x1, #39
sub   w1, w1, w0, asr #31
msub  w0, w1, w19, w0
sub   w0, w0, #0x3e8
str   w0, [x22,x23]
/*  a2[i] = (rand() % 2000) - 1000; */
bl    4004e0 <rand@plt>
smull x1, w0, w20
asr   x1, x1, #39
sub   w1, w1, w0, asr #31
msub  w0, w1, w19, w0
sub   w0, w0, #0x3e8
str   w0, [x21,x23]
add   x23, x23, #0x4
/* for (int i = 0; i < 1000; i++) { */
cmp   x23, #0xfa0
b.ne  40056c <main+0x3c>
mov   x2, #0x1f80
add   x1, x29, x2
mov   x0, #0x0
/* } */

/* for (int i = 0; i < 1000; i++) { */
/*  a3[i] = a1[i] + a2[i]; */
ldr   q0, [x22,x0]
ldr   q1, [x21,x0]
add   v0.4s, v0.4s, v1.4s /* <--- Notice the name of these strange registers! */
str   q0, [x1,x0]
add   x0, x0, #0x10
cmp   x0, #0xfa0
b.ne  4005bc <main+0x8c>
movi  v0.4s, #0x0 /* <--- WOW!!!! */
mov   x0, x1
mov   x1, #0x2f20
add   x1, x29, x1
/* } */

/* for (int i = 0; i < 1000; i++) { */
/*  sum += a3[i]; */
ldr   q1, [x0],#16
add   v0.4s, v0.4s, v1.4s /* <--- O NO!!!!!!! */
cmp   x0, x1
b.ne  4005e8 <main+0xb8>
/* } */

/* printf("Sum: %d\n", sum); */
addv  s0, v0.4s
adrp  x0, 400000 <_init-0x498>
add   x0, x0, #0x7f0
mov   w1, v0.s[0]  /* <--- SEND HELP!!!!!!!!!!!!! */
bl    400520 <printf@plt>

/* return 0; */
/* } */
ldr   x23, [sp,#48]
ldp   x19, x20, [sp,#16]
mov   w0, #0x0
ldp   x21, x22, [sp,#32]
mov   x16, #0x2f20
ldp   x29, x30, [sp]
add   sp, sp, x16
ret
.inst 0x00000000 ; undefined

If the compiler were not optimizing this code, we would see three plain scalar loops, each handling one element per iteration - but that is not what happens here: the loops are vectorized and effectively unrolled, with each iteration processing several elements at once. You can see more details about this in my previous post.

The most important thing, however, is those weird registers, like v0.4s. You will never guess what that v in the name of the register stands for in this blog post about vector registers. They are vector registers! The compiler optimized the code so it would use vector registers to perform the operations!

The name v0.4s refers to the first vector register (index 0), divided into 4 lanes of 32 bits each - the "s" stands for "single word" (32 bits). This means we can fit 4 pieces of 32-bit data in each vector register. You can find more information about this naming convention here.

Now, let's take a closer look at this part:

/* Load the 128-bit quadword register q0 from the address x22 + x0 */
ldr   q0, [x22,x0]

/* Load the 128-bit quadword register q1 from the address x21 + x0 */
ldr   q1, [x21,x0]

/* Add vector 0 and vector 1 lane by lane, storing the result in vector 0 */
add   v0.4s, v0.4s, v1.4s

/* Store the contents of q0 (our result) at the address x1 + x0 */
str   q0, [x1,x0]

And there it is. Our vectorized code.

Now, I also compiled the same code with -O0, which should NOT vectorize the code - and indeed, it was not vectorized. Here it is; notice that there are no vector registers being used:

mov   x16, #0x2f00
sub   sp, sp, x16
stp   x29, x30, [sp]
mov   x29, sp
/* srand(1); */
mov   w0, #0x1
bl    400510 <srand@plt>

/*  int a1[1000] __attribute__((__aligned__(16))); */
/*  int a2[1000] __attribute__((__aligned__(16))); */
/*  int a3[1000] __attribute__((__aligned__(16))); */

/* int sum = 0; */
str   wzr, [x29,#12028]

/*  for (int i = 0; i < 1000; i++) { */
str   wzr, [x29,#12024]
b     4006f0 <main+0xbc>
/*   a1[i] = (rand() % 2000) - 1000; */
bl    4004e0 <rand@plt>
mov   w1, w0
mov   w0, #0x4dd3
movk  w0, #0x1062, lsl #16
smull x0, w1, w0
lsr   x0, x0, #32
asr   w2, w0, #7
asr   w0, w1, #31
sub   w0, w2, w0
mov   w2, #0x7d0
mul   w0, w0, w2
sub   w0, w1, w0
sub   w2, w0, #0x3e8
ldrsw x0, [x29,#12024]
lsl   x0, x0, #2
add   x1, x29, #0x1, lsl #12
add   x1, x1, #0xf50
str   w2, [x1,x0]
/*   a2[i] = (rand() % 2000) - 1000; */
bl    4004e0 <rand@plt>
mov   w1, w0
mov   w0, #0x4dd3
movk  w0, #0x1062, lsl #16
smull x0, w1, w0
lsr   x0, x0, #32
asr   w2, w0, #7
asr   w0, w1, #31
sub   w0, w2, w0
mov   w2, #0x7d0
mul   w0, w0, w2
sub   w0, w1, w0
sub   w2, w0, #0x3e8
ldrsw x0, [x29,#12024]
lsl   x0, x0, #2
add   x1, x29, #0xfb0
str   w2, [x1,x0]
/*  for (int i = 0; i < 1000; i++) { */
ldr   w0, [x29,#12024]
add   w0, w0, #0x1
str   w0, [x29,#12024]
ldr   w0, [x29,#12024]
cmp   w0, #0x3e7
b.le  400658 <main+0x24>
/*  } */

/*  for (int i = 0; i < 1000; i++) { */
str   wzr, [x29,#12020]
b     400748 <main+0x114>
/*   a3[i] = a1[i] + a2[i]; */
ldrsw x0, [x29,#12020]
lsl   x0, x0, #2
add   x1, x29, #0x1, lsl #12
add   x1, x1, #0xf50
ldr   w1, [x1,x0]
ldrsw x0, [x29,#12020]
lsl   x0, x0, #2
add   x2, x29, #0xfb0
ldr   w0, [x2,x0]
add   w2, w1, w0
ldrsw x0, [x29,#12020]
lsl   x0, x0, #2
add   x1, x29, #0x10
str   w2, [x1,x0]
/* for (int i = 0; i < 1000; i++) { */
ldr   w0, [x29,#12020]
add   w0, w0, #0x1
str   w0, [x29,#12020]
ldr   w0, [x29,#12020]
cmp   w0, #0x3e7
b.le  400704 <main+0xd0>
/*  } */

/*  for (int i = 0; i < 1000; i++) { */
str   wzr, [x29,#12016]
b     400784 <main+0x150>
/*   sum += a3[i]; */
ldrsw x0, [x29,#12016]
lsl   x0, x0, #2
add   x1, x29, #0x10
ldr   w0, [x1,x0]
ldr   w1, [x29,#12028]
add   w0, w1, w0
str   w0, [x29,#12028]
/*  for (int i = 0; i < 1000; i++) { */
ldr   w0, [x29,#12016]
add   w0, w0, #0x1
str   w0, [x29,#12016]
ldr   w0, [x29,#12016]
cmp   w0, #0x3e7
b.le  40075c <main+0x128>
/*  } */

/*  printf("Sum: %d\n", sum); */
adrp  x0, 400000 <_init-0x498>
add   x0, x0, #0x870
ldr   w1, [x29,#12028]
bl    400520 <printf@plt>

/*  return 0; */
mov   w0, #0x0
/* } */
ldp   x29, x30, [sp]
mov   x16, #0x2f00
add   sp, sp, x16
ret
.inst 0x00000000 ; undefined