
# Auto-vectorization with GCC on AArch64, and also my SPO600 Lab 5

In my last post, I talked about some optimizations that the C compiler can do to our code. This time, I will talk a little about vectorization and how the compiler can also do it automatically.

First, I want to talk about what vectorization is. Suppose we have the following code:

```c
short a[] = { 1, 1, 2, 2 };
short b[] = { 3, 4, 4, 3 };
```

Now, say we want to sum the elements of these arrays (1 + 3, 1 + 4, 2 + 4, and 2 + 3) into a third array **c**. How could we do this? A simple approach is the following:

```c
for (size_t i = 0; i < 4; i++)
    c[i] = a[i] + b[i];
```

This works, but we are not being as efficient as we could be. Let's think about memory for a bit: a **short** is -typically- 2 bytes (16 bits) long, and words on modern processors are 64 bits long. This means that on every iteration, we are loading 16-bit numbers into 64-bit registers:

*I am going to use decimal notation for some examples, because it is much easier to read.*

```
// Imagine we are loading the numbers "4" and "1" into registers
// in order to add them. This is what the compiler would do:
      16 bit           16 bit           16 bit           16 bit
|----------------|----------------|----------------|----------------|
        4                0                0                0    // <- a[]
        1                0                0                0    // <- b[]
        5                0                0                0    // <- c[] (result)
```

This is a lot of wasted space. If we somehow loaded more numbers into that wasted space, we would be doing 4 operations at a time instead of only 1:

```
      16 bit           16 bit           16 bit           16 bit
|----------------|----------------|----------------|----------------|
        1                1                2                2    // <- a[]
        3                4                4                3    // <- b[]
        4                5                6                5    // <- c[] (result)
```

This would be great! We could theoretically align the numbers in memory (or not, if they are already aligned), load them into the register, and do these operations in parallel!
But there is a catch: if the numbers are large enough, they will overflow into the lane to their left, ruining the whole operation:

```
// 0 + 0, and 65535* + 1
      16 bit           16 bit           16 bit           16 bit
|----------------|----------------|----------------|----------------|
 ???????????????? ???????????????? 0000000000000000 1111111111111111
 ???????????????? ???????????????? 0000000000000000 0000000000000001
 ???????????????? ???????????????? 0000000000000000 0000000000000000 // <- what we want
 ???????????????? ???????????????? 0000000000000001 0000000000000000 // <- what we get
//                                                ^- This should not be here!
// * assuming unsigned
```

Checking for this pitfall manually would be too tedious and too complicated - it would likely make the process slow again. It would just not be worth it. However, modern CPUs have special registers for this kind of operation, called **vector** registers: we can tell them how long our data is, and they will avoid this overflow. Since they are optimized for a high volume of data, they can also process larger words, like 128 bits instead of 64.

The process of using these registers to perform operations in parallel is called **vectorization** - it is great for applying a single instruction to multiple pieces of data (**SIMD**). Luckily, compilers like GCC have built-in vectorization - this means that if we are using optimization, the compiler will try to vectorize operations when it finds it worthwhile. To make this possible, however, we need to adopt some good practices. These practices make sure that the vectorization process stays easy and does not carry extra overhead that would ruin the performance. The two main practices I would like to talk about are memory alignment and data dependencies.

##### 1. Memory alignment

Memory is just a long strip of data.
Say this is how our memory looks, with both our arrays in it (each slot below is one 2-byte **short**, so each 128-bit word holds eight of them):

```c
                  Word 1                                  Word 2
|---------------------------------------|---------------------------------------|
 0x0  0x2  0x4  0x6  0x8  0xA  0xC  0xE  0x10 0x12 0x14 0x16 0x18 0x1A 0x1C 0x1E
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
  ?    ?    1    1    2    2    ?    ?    ?    3    4    4    3    ?    ?    ?
```

Cool. Now let's try loading both words into vector registers to add them together:

```c
|----|----|----|----|----|----|----|----|
  ?    ?    1    1    2    2    ?    ?
  ?    3    4    4    3    ?    ?    ?
```

Well, that is a problem: they are misaligned. We can't do the vectorization here properly. Well, we can, but it would take some time to properly align the memory in order to make the vectorization easy. This is why I said memory alignment is a problem. However, C is a really cool language, and it has some tools we can use to align the memory:

```c
short arr1[] __attribute__((__aligned__(16))) = { 1, 1, 2, 2 };
short arr2[] __attribute__((__aligned__(16))) = { 3, 4, 4, 3 };
```

It is not very pretty, but it will align our arrays: we are telling the compiler to put the start of each array at an address that is a multiple of 16. Why 16? Because 16 bytes is 128 bits - which aligns our data right at the beginning of every 128-bit word for the vector registers.

Notice that even if we use those attributes, our data can still be misaligned relative to each other. For example, if this is our operation:

```c
for (size_t i = 0; i < 4; i++)
    c[i] = a[i] + b[i + 1];
```

We don't want `a[i]` to be aligned with `b[i]`; we want `a[i]` to be aligned with `b[i + 1]`! We must keep this in mind when aligning our values.

C **structs** are a good way to keep variables together. For example:

```c
struct {
    int a;
    char b;
    short c;
};
```

These three variables are laid out sequentially in memory. Assuming a 32-bit *int*, an 8-bit *char*, and a 16-bit *short*, we have 56 bits of data in total; the compiler pads the struct so that each one occupies 64 bits.
However, if we wanted to vectorize the sum of the member "a" across several of these structs, we would run into another problem: the "a" members of consecutive structs are too far apart in memory, and we would not be able to easily fit more than one of them into the same vector register. For this reason, if you are planning to vectorize some values, it is recommended to use a **structure of arrays** instead of an **array of structures**: keep the values you want to vectorize close to each other.

##### 2. Data dependencies

Say that we are doing this operation in a loop, and it will be vectorized:

```c
a[i + 1] = a[i] + b[i];
```

This is a big problem: in a normal loop, each iteration modifies the next value in the array, which is totally fine. However, since vectorization works on `a[i]` and `a[i + 1]` at the same time, this is not possible. Data dependencies like this can make the vectorization process too difficult or outright impossible.

##### 3. Other practices

Other practices that make it easier for the compiler to vectorize our code (but that I won't spend too much time talking about) are:

+ Make the number of iterations easily countable
+ Keep the loop single-entry and single-exit (no breaks, continues, etc.)
+ Avoid branches, like switches and function calls
+ Use the loop index ("i") for accessing the array

# Vectorizing

I wrote the following application in C:

```c
#include <stdio.h>
#include <stdlib.h>

int main() {
    srand(1);
    int a1[1000] __attribute__((__aligned__(16)));
    int a2[1000] __attribute__((__aligned__(16)));
    int a3[1000] __attribute__((__aligned__(16)));
    int sum = 0;

    for (int i = 0; i < 1000; i++) {
        a1[i] = (rand() % 2000) - 1000;
        a2[i] = (rand() % 2000) - 1000;
    }

    for (int i = 0; i < 1000; i++) {
        a3[i] = a1[i] + a2[i];
    }

    for (int i = 0; i < 1000; i++) {
        sum += a3[i];
    }

    printf("Sum: %d\n", sum);
    return 0;
}
```

The idea is simple: 3 arrays; load two arrays with random numbers, add the numbers into the third array, add the numbers from the third array into a variable "sum", print the sum,
exit.

I compiled the code above on an **AArch64** machine with the **-O3** flag (optimized), and here is the result in assembly:

*I cleaned up the code a little bit, removing some useless comments and leaving part of the debugging source. I also added some observations about what is important.*

```armasm
/* Just some initialization stuff */
    mov   x16, #0x2f20
    sub   sp, sp, x16
/* Seeding the random generator */
    mov   w0, #0x1
    stp   x29, x30, [sp]
    mov   x29, sp
    stp   x19, x20, [sp,#16]
/* for (int i = 0; i < 1000; i++) { */
/*     a1[i] = (rand() % 2000) - 1000; */
    mov   w20, #0x4dd3
    stp   x21, x22, [sp,#32]
    movk  w20, #0x1062, lsl #16
    str   x23, [sp,#48]
    add   x22, x29, #0x40
    add   x21, x29, #0xfe0
    mov   w19, #0x7d0
    mov   x23, #0x0
    bl    400510
/*     a1[i] = (rand() % 2000) - 1000; */
    bl    4004e0
    smull x1, w0, w20
    asr   x1, x1, #39
    sub   w1, w1, w0, asr #31
    msub  w0, w1, w19, w0
    sub   w0, w0, #0x3e8
    str   w0, [x22,x23]
/*     a2[i] = (rand() % 2000) - 1000; */
    bl    4004e0
    smull x1, w0, w20
    asr   x1, x1, #39
    sub   w1, w1, w0, asr #31
    msub  w0, w1, w19, w0
    sub   w0, w0, #0x3e8
    str   w0, [x21,x23]
    add   x23, x23, #0x4
/* for (int i = 0; i < 1000; i++) { */
    cmp   x23, #0xfa0
    b.ne  40056c
    mov   x2, #0x1f80
    add   x1, x29, x2
    mov   x0, #0x0
/* } */
/* for (int i = 0; i < 1000; i++) { */
/*     a3[i] = a1[i] + a2[i]; */
    ldr   q0, [x22,x0]
    ldr   q1, [x21,x0]
    add   v0.4s, v0.4s, v1.4s   /* <--- Notice the name of these strange registers! */
    str   q0, [x1,x0]
    add   x0, x0, #0x10
    cmp   x0, #0xfa0
    b.ne  4005bc
    movi  v0.4s, #0x0           /* <--- WOW!!!! */
    mov   x0, x1
    mov   x1, #0x2f20
    add   x1, x29, x1
/* } */
/* for (int i = 0; i < 1000; i++) { */
/*     sum += a3[i]; */
    ldr   q1, [x0],#16
    add   v0.4s, v0.4s, v1.4s   /* <--- O NO!!!!!!! */
    cmp   x0, x1
    b.ne  4005e8
/* } */
/* printf("Sum: %d\n", sum); */
    addv  s0, v0.4s
    adrp  x0, 400000 <_init-0x498>
    add   x0, x0, #0x7f0
    mov   w1, v0.s[0]           /* <--- SEND HELP!!!!!!!!!!!!! */
    bl    400520
/* return 0; */
/* } */
    ldr   x23, [sp,#48]
    ldp   x19, x20, [sp,#16]
    mov   w0, #0x0
    ldp   x21, x22, [sp,#32]
    mov   x16, #0x2f20
    ldp   x29, x30, [sp]
    add   sp, sp, x16
    ret
    .inst 0x00000000 ; undefined
```

If the compiler were not optimizing this code, we would see 3 separate loops, but that does not happen here: the loops get merged together, and they also get unrolled. You can see more details about this in my previous post. The most important thing, however, is those weird registers, like **v0.4s**. You will never guess what the **v** in the register name stands for in this blog post about vector registers. They are vector registers! The compiler optimized the code to make use of vector registers for these operations!

The name **v0.4s** refers to vector register 0 (index 0), divided into 4 lanes of 32 bits each (that's the meaning of the "s"). This means that we can fit 4 pieces of 32-bit data in each vector register. You can find more information about this naming convention here.

Now, let's take a closer look at this part:

```armasm
/* Load Quadword 0 (a 128-bit register) with the content at x22 + x0 */
    ldr   q0, [x22,x0]
/* Load Quadword 1 (a 128-bit register) with the content at x21 + x0 */
    ldr   q1, [x21,x0]
/* Add vector 0 + vector 1, lane by lane, storing the result in vector 0 */
    add   v0.4s, v0.4s, v1.4s
/* Store the content of Quadword 0 (our result) at x1 + x0 */
    str   q0, [x1,x0]
```

And there it is: our vectorized code. Now, I also compiled the same code with **-O0**, which should NOT vectorize the code - and indeed, it was not vectorized.
Here it is; notice that there are no vector registers being used:

```armasm
    mov   x16, #0x2f00
    sub   sp, sp, x16
    stp   x29, x30, [sp]
    mov   x29, sp
/* srand(1); */
    mov   w0, #0x1
    bl    400510
/* int a1[1000] __attribute__((__aligned__(16))); */
/* int a2[1000] __attribute__((__aligned__(16))); */
/* int a3[1000] __attribute__((__aligned__(16))); */
/* int sum = 0; */
    str   wzr, [x29,#12028]
/* for (int i = 0; i < 1000; i++) { */
    str   wzr, [x29,#12024]
    b     4006f0
/*     a1[i] = (rand() % 2000) - 1000; */
    bl    4004e0
    mov   w1, w0
    mov   w0, #0x4dd3
    movk  w0, #0x1062, lsl #16
    smull x0, w1, w0
    lsr   x0, x0, #32
    asr   w2, w0, #7
    asr   w0, w1, #31
    sub   w0, w2, w0
    mov   w2, #0x7d0
    mul   w0, w0, w2
    sub   w0, w1, w0
    sub   w2, w0, #0x3e8
    ldrsw x0, [x29,#12024]
    lsl   x0, x0, #2
    add   x1, x29, #0x1, lsl #12
    add   x1, x1, #0xf50
    str   w2, [x1,x0]
/*     a2[i] = (rand() % 2000) - 1000; */
    bl    4004e0
    mov   w1, w0
    mov   w0, #0x4dd3
    movk  w0, #0x1062, lsl #16
    smull x0, w1, w0
    lsr   x0, x0, #32
    asr   w2, w0, #7
    asr   w0, w1, #31
    sub   w0, w2, w0
    mov   w2, #0x7d0
    mul   w0, w0, w2
    sub   w0, w1, w0
    sub   w2, w0, #0x3e8
    ldrsw x0, [x29,#12024]
    lsl   x0, x0, #2
    add   x1, x29, #0xfb0
    str   w2, [x1,x0]
/* for (int i = 0; i < 1000; i++) { */
    ldr   w0, [x29,#12024]
    add   w0, w0, #0x1
    str   w0, [x29,#12024]
    ldr   w0, [x29,#12024]
    cmp   w0, #0x3e7
    b.le  400658
/* } */
/* for (int i = 0; i < 1000; i++) { */
    str   wzr, [x29,#12020]
    b     400748
/*     a3[i] = a1[i] + a2[i]; */
    ldrsw x0, [x29,#12020]
    lsl   x0, x0, #2
    add   x1, x29, #0x1, lsl #12
    add   x1, x1, #0xf50
    ldr   w1, [x1,x0]
    ldrsw x0, [x29,#12020]
    lsl   x0, x0, #2
    add   x2, x29, #0xfb0
    ldr   w0, [x2,x0]
    add   w2, w1, w0
    ldrsw x0, [x29,#12020]
    lsl   x0, x0, #2
    add   x1, x29, #0x10
    str   w2, [x1,x0]
/* for (int i = 0; i < 1000; i++) { */
    ldr   w0, [x29,#12020]
    add   w0, w0, #0x1
    str   w0, [x29,#12020]
    ldr   w0, [x29,#12020]
    cmp   w0, #0x3e7
    b.le  400704
/* } */
/* for (int i = 0; i < 1000; i++) { */
    str   wzr, [x29,#12016]
    b     400784
/*     sum += a3[i]; */
    ldrsw x0, [x29,#12016]
    lsl   x0, x0, #2
    add   x1, x29, #0x10
    ldr   w0, [x1,x0]
    ldr   w1, [x29,#12028]
    add   w0, w1, w0
    str   w0, [x29,#12028]
/* for (int i = 0; i < 1000; i++) { */
    ldr   w0, [x29,#12016]
    add   w0, w0, #0x1
    str   w0, [x29,#12016]
    ldr   w0, [x29,#12016]
    cmp   w0, #0x3e7
    b.le  40075c
/* } */
/* printf("Sum: %d\n", sum); */
    adrp  x0, 400000 <_init-0x498>
    add   x0, x0, #0x870
    ldr   w1, [x29,#12028]
    bl    400520
/* return 0; */
    mov   w0, #0x0
/* } */
    ldp   x29, x30, [sp]
    mov   x16, #0x2f00
    add   sp, sp, x16
    ret
    .inst 0x00000000 ; undefined
```