blog

GCC compiler optimizations for loops

The GCC compiler provides several optimization flags that, when used, will signal the compiler to try to optimize the code being produced. Using the flag **-O0**, the compiler will not optimize the generated code too much - this is ideal when we just want to compile the program to test it. The **-O3** flag, however, will heavily optimize the output, but it is a much slower process. In this post, I will compare some outputs of compiled programs using the **-O0** and **-O3** flags. First, I wrote the following code in C: ```C #include #include #define SZ 1000 int main() { srand(1); int arr1[SZ]; long long sum = 0; for (size_t i = 0; i < SZ; i++) arr1[i] = rand(); for (size_t i = 0; i < SZ; i++) sum += arr1[i]; return 0; } ``` It's a very simple program: it fills an array with a bunch of random numbers, then sum all of them together into the variable *sum*. I compiled this program with both the O0 and O3 flags, and here are the results (more specifically, the resulted assembly code for the **main** function): ```x86asm /* COMPILED WITH O0 */ /*------------------------------------------------------------------------------ Just some initialization stuff. Nothing to see here ------------------------------------------------------------------------------*/ 72a: push %rbp 72b: mov %rsp,%rbp 72e: sub $0xfd0,%rsp 735: mov %fs:0x28,%rax 73e: mov %rax,-0x8(%rbp) 742: xor %eax,%eax /*------------------------------------------------------------------------------ Pushing "1" as argument and calling "srand" ------------------------------------------------------------------------------*/ 744: mov $0x1,%edi 749: callq 5f0 /*------------------------------------------------------------------------------ Initializing "sum" to 0 ------------------------------------------------------------------------------*/ 74e: movq $0x0,-0xfc8(%rbp) /*------------------------------------------------------------------------------ FIRST LOOP ---------- Here is where the first loop starts Initializing the first "i" to 0 Notice that we are also jumping to line 783. On line 783 we are checking if the condition is true ------------------------------------------------------------------------------*/ 759: movq $0x0,-0xfc0(%rbp) 764: jmp 783 /*------------------------------------------------------------------------------ Getting the random number and recording it in the array ------------------------------------------------------------------------------*/ 766: callq 600 76b: mov %eax,%edx 76d: mov -0xfc0(%rbp),%rax 774: mov %edx,-0xfb0(%rbp,%rax,4) /*------------------------------------------------------------------------------ Incrementing "i" ------------------------------------------------------------------------------*/ 77b: addq $0x1,-0xfc0(%rbp) /*------------------------------------------------------------------------------ Comparing "i" with the limit: if it was not met, jump back to line 766 ------------------------------------------------------------------------------*/ 783: cmpq $0x3e7,-0xfc0(%rbp) 78e: jbe 766 /*------------------------------------------------------------------------------ SECOND LOOP ----------- Here is where the second loop starts Initializing the second "i" to 0 Notice that we are also jumping to line 7bc. On line 7bc we are checking if the condition is true ------------------------------------------------------------------------------*/ 790: movq $0x0,-0xfb8(%rbp) 79b: jmp 7bc /*------------------------------------------------------------------------------ Getting the number from the array and adding it to "sum" ------------------------------------------------------------------------------*/ 79d: mov -0xfb8(%rbp),%rax 7a4: mov -0xfb0(%rbp,%rax,4),%eax 7ab: cltq 7ad: add %rax,-0xfc8(%rbp) /*------------------------------------------------------------------------------ Incrementing "i" ------------------------------------------------------------------------------*/ 7b4: addq $0x1,-0xfb8(%rbp) /*------------------------------------------------------------------------------ Comparing "i" with the limit: if it was not met, jump back to line 79d ------------------------------------------------------------------------------*/ 7bc: cmpq $0x3e7,-0xfb8(%rbp) 7c7: jbe 79d /*------------------------------------------------------------------------------ Magic that returns the function. Here be dragons. Trust me: stay away from it, except for one little part: notice how it is pushing 0 into eax ------------------------------------------------------------------------------*/ 7c9: mov $0x0,%eax 7ce: mov -0x8(%rbp),%rcx 7d2: xor %fs:0x28,%rcx 7db: je 7e2 7dd: callq 5e0 <__stack_chk_fail@plt> 7e2: leaveq 7e3: retq 7e4: nopw %cs:0x0(%rax,%rax,1) 7ee: xchg %ax,%ax ``` Great. It really looks a lot like my original code. Now, before I show the code for O3, let me tell you a tale about processors. Every time we have an operation, the resulting value is not the only thing we get back: processors also have things called **flags**, which they set in order to give you quick information about the result. For example: is the result equal to 0? Is it greater than 0? Did it overflow? And so on. On x86 machines, the "equal to zero" flag happens to be called **ZF**. These flags are what the jump commands generally use. For example, the command "jne" (jump if not equal to 0) will look at the **ZF** flag to decide if it should jump it or not. Why I am saying this is a total mystery and 100% not related to the following piece of code, which was compiled with the O3 option: ```x86asm /* COMPILED WITH O3 */ /*------------------------------------------------------------------------------ Just some initialization stuff. Nothing to see here ------------------------------------------------------------------------------*/ 5c0: push %rbx /*------------------------------------------------------------------------------ Pushing "1" as argument and calling the "srand" function. Notice what it is doing with the b register: it is pushing the value 0x3e8 into it. What is 0x3e8? It happens to be "1000": the number of times to loop ------------------------------------------------------------------------------*/ 5c1: mov $0x1,%edi 5c6: mov $0x3e8,%ebx 5cb: callq 590 /*------------------------------------------------------------------------------ Calling the "rand" function ------------------------------------------------------------------------------*/ 5d0: callq 5a0 /*------------------------------------------------------------------------------ Subtracting 1 from register b, which is where the original "1000" was: every time it loops, it subtracts 1. The result of this operation will set the flags in the processor. For example: if we reach 0, it will set the ZF flag. This would be a very simple way to know when we finished looping. Still, totally not related to what happens in the next line ------------------------------------------------------------------------------*/ 5d5: sub $0x1,%rbx /*------------------------------------------------------------------------------ Jumping back to the beginning of the loop if the ZF flag was set. Here, the compiler took step to avoid extra operations: it is not even using the "cmp" (compare) operation, but checking the ZF flag from the subtraction directly to know if the loop is done or not ------------------------------------------------------------------------------*/ 5d9: jne 5d0 /*------------------------------------------------------------------------------ More magic to return from the function. No flags here (I think), but there is one more thing: instead of pushing 0 into eax, it is doing a XOR on itself. Any value XORed with itself results in 0 - this is a quicker way to set a value to 0 than pushing 0 into it ------------------------------------------------------------------------------*/ 5db: xor %eax,%eax 5dd: pop %rbx 5de: retq 5df: nop ``` Just by looking at the code produced, it is clear that the compiler took a huge step to make the code more efficient: it completely eliminated a loop, and it got rid of my useless array (I wasn't using it anyway!). It also eliminated the use of the "i" variables that keep track of my loop index, and instead, just used a counter that went from 1000 to 0. There is one last test I want to do: I noticed how the compiler eliminates the "i" variable in order to make things faster. But what if I need that variable to do something else? This time, I compiled these programs: ```C #include int main() { for (size_t i = 0; i < 10; i++) puts("a"); return 0; } ``` ```C #include int main() { for (size_t i = 0; i < 10; i++) printf("%d", i); return 0; } ``` One program uses the variable "i", the other one does not. Let's check the first version (this time, I will remove some of the non-important parts of the assembly code): ```x86asm /* COMPILED WITH O3 */ 572: lea 0x1bb(%rip),%rbp /*------------------------------------------------------------------------------ Moving "10" into the b register ------------------------------------------------------------------------------*/ 579: mov $0xa,%ebx 57e: sub $0x8,%rsp 582: nopw 0x0(%rax,%rax,1) 588: mov %rbp,%rdi 58b: callq 550 /*------------------------------------------------------------------------------ Subtracting 1 from 10 and comparing the ZF flag to jump back to 588 ------------------------------------------------------------------------------*/ 590: sub $0x1,%rbx 594: jne 588 ``` Again, the compiler optimized the loop by removing the "i" variable. How about in the second version?: ```x86asm /* COMPILED WITH O3 */ 582: lea 0x1bb(%rip),%rbp /*------------------------------------------------------------------------------ Setting register b to 0 ------------------------------------------------------------------------------*/ 589: xor %ebx,%ebx 58b: sub $0x8,%rsp 58f: nop 590: mov %rbx,%rsi 593: xor %eax,%eax 595: mov %rbp,%rdi /*------------------------------------------------------------------------------ Adding 1 to register b ------------------------------------------------------------------------------*/ 598: add $0x1,%rbx 59c: callq 560 /*------------------------------------------------------------------------------ Comparing b to 10 and jumping if they are still different ------------------------------------------------------------------------------*/ 5a1: cmp $0xa,%rbx 5a5: jne 590 ``` This time, since we need the variable "i", the compiler gave us one. On my next post, I will talk about another strategy for optimization: **vectorization**.