Henry S. Coelho

GCC compiler optimizations for loops

The GCC compiler provides several optimization flags that, when used, will signal the compiler to try to optimize the code being produced. Using the flag -O0, the compiler will not optimize the generated code too much - this is ideal when we just want to compile the program to test it. The -O3 flag, however, will heavily optimize the output, but it is a much slower process.

In this post, I will compare some outputs of compiled programs using the -O0 and -O3 flags.

First, I wrote the following code in C:

#include <stdio.h>
#include <stdlib.h>

#define SZ 1000

int main()
{
  srand(1);

  int arr1[SZ];

  long long sum = 0;

  for (size_t i = 0; i < SZ; i++) arr1[i] = rand();

  for (size_t i = 0; i < SZ; i++) sum += arr1[i];

  return 0;
}

It's a very simple program: it fills an array with a bunch of random numbers, then sum all of them together into the variable sum.

I compiled this program with both the O0 and O3 flags, and here are the results (more specifically, the resulted assembly code for the main function):

/* COMPILED WITH O0 */
/*------------------------------------------------------------------------------
  Just some initialization stuff. Nothing to see here
------------------------------------------------------------------------------*/
72a:    push   %rbp
72b:    mov    %rsp,%rbp
72e:    sub    $0xfd0,%rsp
735:    mov    %fs:0x28,%rax
73e:    mov    %rax,-0x8(%rbp)
742:    xor    %eax,%eax

/*------------------------------------------------------------------------------
  Pushing "1" as argument and calling "srand"
------------------------------------------------------------------------------*/
744:    mov    $0x1,%edi
749:    callq  5f0 <srand@plt>

/*------------------------------------------------------------------------------
  Initializing "sum" to 0
------------------------------------------------------------------------------*/
74e:    movq   $0x0,-0xfc8(%rbp)

/*------------------------------------------------------------------------------
  FIRST LOOP
  ----------
  Here is where the first loop starts
  Initializing the first "i" to 0
  Notice that we are also jumping to line 783. On line 783 we are checking if 
  the condition is true
------------------------------------------------------------------------------*/
759:    movq   $0x0,-0xfc0(%rbp)
764:    jmp    783 <main+0x59>

/*------------------------------------------------------------------------------
  Getting the random number and recording it in the array
------------------------------------------------------------------------------*/
766:    callq  600 <rand@plt>
76b:    mov    %eax,%edx
76d:    mov    -0xfc0(%rbp),%rax
774:    mov    %edx,-0xfb0(%rbp,%rax,4)

/*------------------------------------------------------------------------------
  Incrementing "i"
------------------------------------------------------------------------------*/
77b:    addq   $0x1,-0xfc0(%rbp)

/*------------------------------------------------------------------------------
  Comparing "i" with the limit: if it was not met, jump back to line 766
------------------------------------------------------------------------------*/
783:    cmpq   $0x3e7,-0xfc0(%rbp)
78e:    jbe    766 <main+0x3c>

/*------------------------------------------------------------------------------
  SECOND LOOP
  -----------
  Here is where the second loop starts
  Initializing the second "i" to 0
  Notice that we are also jumping to line 7bc. On line 7bc we are checking if 
  the condition is true
------------------------------------------------------------------------------*/
790:    movq   $0x0,-0xfb8(%rbp)
79b:    jmp    7bc <main+0x92>

/*------------------------------------------------------------------------------
  Getting the number from the array and adding it to "sum"
------------------------------------------------------------------------------*/
79d:    mov    -0xfb8(%rbp),%rax
7a4:    mov    -0xfb0(%rbp,%rax,4),%eax
7ab:    cltq   
7ad:    add    %rax,-0xfc8(%rbp)

/*------------------------------------------------------------------------------
  Incrementing "i"
------------------------------------------------------------------------------*/
7b4:    addq   $0x1,-0xfb8(%rbp)

/*------------------------------------------------------------------------------
  Comparing "i" with the limit: if it was not met, jump back to line 79d
------------------------------------------------------------------------------*/
7bc:    cmpq   $0x3e7,-0xfb8(%rbp)
7c7:    jbe    79d <main+0x73>

/*------------------------------------------------------------------------------
  Magic that returns the function. Here be dragons. Trust me: stay away from it,
  except for one little part: notice how it is pushing 0 into eax
------------------------------------------------------------------------------*/
7c9:    mov    $0x0,%eax
7ce:    mov    -0x8(%rbp),%rcx
7d2:    xor    %fs:0x28,%rcx
7db:    je     7e2 <main+0xb8>
7dd:    callq  5e0 <__stack_chk_fail@plt>
7e2:    leaveq 
7e3:    retq   
7e4:    nopw   %cs:0x0(%rax,%rax,1)
7ee:    xchg   %ax,%ax

Great. It really looks a lot like my original code. Now, before I show the code for O3, let me tell you a tale about processors. Every time we have an operation, the resulting value is not the only thing we get back: processors also have things called flags, which they set in order to give you quick information about the result. For example: is the result equal to 0? Is it greater than 0? Did it overflow? And so on. On x86 machines, the "equal to zero" flag happens to be called ZF. These flags are what the jump commands generally use. For example, the command "jne" (jump if not equal to 0) will look at the ZF flag to decide if it should jump it or not.

Why I am saying this is a total mystery and 100% not related to the following piece of code, which was compiled with the O3 option:

/* COMPILED WITH O3 */
/*------------------------------------------------------------------------------
  Just some initialization stuff. Nothing to see here
------------------------------------------------------------------------------*/
5c0:    push   %rbx

/*------------------------------------------------------------------------------
  Pushing "1" as argument and calling the "srand" function.
  Notice what it is doing with the b register: it is pushing the value 0x3e8
  into it. What is 0x3e8? It happens to be "1000": the number of times to loop
------------------------------------------------------------------------------*/
5c1:    mov    $0x1,%edi
5c6:    mov    $0x3e8,%ebx
5cb:    callq  590 <srand@plt>

/*------------------------------------------------------------------------------
  Calling the "rand" function
------------------------------------------------------------------------------*/
5d0:    callq  5a0 <rand@plt>

/*------------------------------------------------------------------------------
  Subtracting 1 from register b, which is where the original "1000" was: every
  time it loops, it subtracts 1. The result of this operation will set the flags
  in the processor. For example: if we reach 0, it will set the ZF flag. This
  would be a very simple way to know when we finished looping. Still, totally
  not related to what happens in the next line
------------------------------------------------------------------------------*/
5d5:    sub    $0x1,%rbx

/*------------------------------------------------------------------------------
  Jumping back to the beginning of the loop if the ZF flag was set.
  Here, the compiler took step to avoid extra operations: it is not even using
  the "cmp" (compare) operation, but checking the ZF flag from the subtraction
  directly to know if the loop is done or not
------------------------------------------------------------------------------*/
5d9:    jne    5d0 <main+0x10>

/*------------------------------------------------------------------------------
  More magic to return from the function. No flags here (I think), but there
  is one more thing: instead of pushing 0 into eax, it is doing a XOR on itself.
  Any value XORed with itself results in 0 - this is a quicker way to set a
  value to 0 than pushing 0 into it
------------------------------------------------------------------------------*/
5db:    xor    %eax,%eax
5dd:    pop    %rbx
5de:    retq   
5df:    nop

Just by looking at the code produced, it is clear that the compiler took a huge step to make the code more efficient: it completely eliminated a loop, and it got rid of my useless array (I wasn't using it anyway!). It also eliminated the use of the "i" variables that keep track of my loop index, and instead, just used a counter that went from 1000 to 0.

There is one last test I want to do: I noticed how the compiler eliminates the "i" variable in order to make things faster. But what if I need that variable to do something else?

This time, I compiled these programs:

#include <stdio.h>

int main()
{
  for (size_t i = 0; i < 10; i++) puts("a");
  return 0;
}
#include <stdio.h>

int main()
{
  for (size_t i = 0; i < 10; i++) printf("%d", i);
  return 0;
}

One program uses the variable "i", the other one does not. Let's check the first version (this time, I will remove some of the non-important parts of the assembly code):

/* COMPILED WITH O3 */
572:    lea    0x1bb(%rip),%rbp

/*------------------------------------------------------------------------------
  Moving "10" into the b register
------------------------------------------------------------------------------*/
579:    mov    $0xa,%ebx

57e:    sub    $0x8,%rsp
582:    nopw   0x0(%rax,%rax,1)

588:    mov    %rbp,%rdi
58b:    callq  550 <puts@plt>

/*------------------------------------------------------------------------------
  Subtracting 1 from 10 and comparing the ZF flag to jump back to 588
------------------------------------------------------------------------------*/
590:    sub    $0x1,%rbx
594:    jne    588 <main+0x18>

Again, the compiler optimized the loop by removing the "i" variable. How about in the second version?:

/* COMPILED WITH O3 */
582:    lea    0x1bb(%rip),%rbp

/*------------------------------------------------------------------------------
  Setting register b to 0
------------------------------------------------------------------------------*/
589:    xor    %ebx,%ebx

58b:    sub    $0x8,%rsp
58f:    nop

590:    mov    %rbx,%rsi
593:    xor    %eax,%eax
595:    mov    %rbp,%rdi

/*------------------------------------------------------------------------------
  Adding 1 to register b
------------------------------------------------------------------------------*/
598:    add    $0x1,%rbx

59c:    callq  560 <printf@plt>

/*------------------------------------------------------------------------------
  Comparing b to 10 and jumping if they are still different
------------------------------------------------------------------------------*/
5a1:    cmp    $0xa,%rbx
5a5:    jne    590 <main+0x10>

This time, since we need the variable "i", the compiler gave us one.

On my next post, I will talk about another strategy for optimization: vectorization.