blog

Experimenting with Inline Assembly in C

As I said in a previous post, C is a really cool language, and as any cool language, it allows us to embed Assembly instructions in the code. This is useful in scenarios where you really want to optimise an operation, so you want to write it in pure assembly code, or there is no available implementation for a functionality you want (say you don't have the Standard C Library, for example), so you need to implement it in asm. In this post I will make a little snippet in C and tell the compiler to optimize it as much as it can (**O3**), then I will try to implement something even faster with inline assembly. This is the application I will use. It is very simple: we take 3 arrays, sum the first with the second, and put it in the third. Any other functionality there is just to prevent the compiler from removing parts of my code (for example, if I make an array that I end up not using, the compiler will probably remove it). Summing the result from the third array at the end is also useful to make sure both my programs are correct. ```c #include #include #include #define LENGTH 250000000 int main() { srand(1); // We will use this to record the running time struct timeval timerBefore, timerAfter, timerResult; int16_t* arr1 __attribute__((__aligned__(16))); int16_t* arr2 __attribute__((__aligned__(16))); int16_t* arr3 __attribute__((__aligned__(16))); size_t i; long long sum = 0; // Allocating the memory arr1 = calloc(LENGTH, 16); arr2 = calloc(LENGTH, 16); arr3 = calloc(LENGTH, 16); // Seeding the array for (i = 0; i < LENGTH; i++) { arr1[i] = rand(); arr2[i] = rand(); } gettimeofday(&timerBefore, NULL); // Adding the arrays for (i = 0; i < LENGTH; i++) arr3[i] += arr2[i] + arr1[i]; gettimeofday(&timerAfter, NULL); // Getting the sum for (i = 0; i < LENGTH; i++) sum += arr3[i]; free(arr1); free(arr2); free(arr3); timersub(&timerAfter, &timerBefore, &timerResult); printf("Time elapsed: %ld.%06ld ", (long int)timerResult.tv_sec, (long int)timerResult.tv_usec); printf("Sum: %ld ", sum); return 0; } ``` My arrays are arrays of 16 bit integers (shorts). I aligned them in memory for every 128 bits to make it simple for the compiler to vectorize my code (more about it in this post). I tried to make it as simple as possible for the compiler to optimise it as much as it could. This was the result when I ran it: ```json Time elapsed: 0.187425 Sum: 39268368 ``` Not bad. Let's see if I can make it better. But first, some explanation about inline assembly. To write inline assembly, we simply use the command `asm` and pass a string with the instructions. For example: ```c asm("mov a, b " "add c, d " ); ``` That works. But what if we need to access a variable? To do this, we could tell the compiler to put them in a register. The `register` keyword in C does that, but it is only a suggestion for the compiler. To make it actually put the variable in a register, we do something like this: ```c register int a asm ("r15"); ``` Now we can access the variable "a" in the register "r15". Something like this works, but it's not the best method. What if in this code we end up using the register `rax`, but the compiler is using it for something else? If possible, we should use **extended asm**. It will give us a friendlier way to use inline assembly. The syntax is very similar: ```c asm("mov a, b " "add c, d " : [Output operands] : [Input operands] : [Clobbers] ); ``` **Output operands** are the registers/variables that we will write to, **input operands** are the registers/variables that we will read from, and **clobbers** will tell the compiler which registers we are going to use, so it will not have anything important there when we run our asm instructions. Example: ```c // dst and src are variables in the C program asm("mov %1, %0 " "add $1, %0 " "mov %0, %%rax" : "=r" (dst) : "r" (src) : "rax" ); ``` The "=r" and "r" means that we are using general registers, and "=" means we are writing into it. These are constraints and modifiers. When we pass input/output operands, their names in the asm instructions will be *%*. Since "dst" was the first operand passed, it is `%0`. Registers are accessed using the prefix `%%`. We can also give names to the operands using square brackets: ```c // dst and src are variables in the C program asm("mov %[s], %[d] " "add $[s], %[d] " "mov %[d], %%rax" : [d] "=r" (dst) : [s] "r" (src) : "rax" ); ``` The syntax is a bit weird, but we get used to it quite quickly. More information: - Extended asm - Constraints - Modifiers Alright. So, my objective here is to replace that for loop with asm code. This is what I made: ```c u_int64_t max = LENGTH; asm volatile( "xor %%r10, %%r10 " "loop: " // Body of the iteration "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "cmp %[max], %%r10 " "jl loop " : [vec] "=x" (vector) : [z] "r" (arr3), [y] "r" (arr2), [x] "r" (arr1), [max] "r" (max) : "r10" ); ``` Why did I use the keyword **volatile**? This keyword (typically used with variables) tell the compiler that the value can change at any moment, from an external source (for example, if we are using an embedded device). A side effect is that the compiler will not perform any optimisation that could cause a problem for this volatile variable (maybe like unrolling the loop, if the variable is the loop index). This is not the case for my asm code, but the side effect will be useful for me: it will prevent GCC from optimising my asm code. Also, I know that GCC used vectorization to make the process faster, so if I wanted this thing to be faster, it would have to use vectorization as well. But there is another detail: GCC probably cheated and unrolled my loop, so I will do this as well: I will copy and paste those four lines below "Body of the iteration" 20 times. Why 20? Well, the ideal would be 250,000,000, so we wouldn't have any iteration at all, but we can't do so many: if we do that, my code will not fit in the cache (I will make a post some day about cache miss) and the processor will have to fetch the rest of the code in memory, which is very slow. I know for a fact that 20 times will fit, so I will use this number. This is how it looks like: ```c #include #include #include #include #define LENGTH 250000000 int main() { srand(1); // We will use this to record the running time struct timeval timerBefore, timerAfter, timerResult; int16_t* arr1 __attribute__((__aligned__(16))); int16_t* arr2 __attribute__((__aligned__(16))); int16_t* arr3 __attribute__((__aligned__(16))); size_t i; long long sum = 0; __m128 vector; // Allocating the memory arr1 = calloc(LENGTH, 16); arr2 = calloc(LENGTH, 16); arr3 = calloc(LENGTH, 16); // Seeding the array for (i = 0; i < LENGTH; i++) { arr1[i] = rand(); arr2[i] = rand(); } gettimeofday(&timerBefore, NULL); u_int64_t max = LENGTH; asm volatile( "xor %%r10, %%r10 " "loop: " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "movdqa (%[y],%%r10,2), %[vec] " "paddw (%[x],%%r10,2), %[vec] " "movdqa %[vec], (%[z],%%r10,2) " "add $8, %%r10 " "cmp %[max], %%r10 " "jl loop " : [vec] "=x" (vector) : [z] "r" (arr3), [y] "r" (arr2), [x] "r" (arr1), [max] "r" (max) : "r10" ); gettimeofday(&timerAfter, NULL); // Getting the sum for (i = 0; i < LENGTH; i++) sum += arr3[i]; free(arr1); free(arr2); free(arr3); timersub(&timerAfter, &timerBefore, &timerResult); printf("Time elapsed: %ld.%06ld ", (long int)timerResult.tv_sec, (long int)timerResult.tv_usec); printf("Sum: %ld ", sum); return 0; } ``` Just as a reminder, this was the code without inline assembly: ```json Time elapsed: 0.187425 Sum: 39268368 ``` And now the moment of truth! Code with inline ASM: ```json Time elapsed: 0.160923 Sum: 39268368 ``` Well, it is definitely faster! But only by 0.02 seconds! GCC is good, really good. In conclusion, inline assembly is crucial for situations when we don't have the standard library available, and very useful for when we can make dramatic improvements in the code by writing it directly in assembly, but it should be avoided if not absolutely necessary. In my snippet, for example, the gains were minimal, and unless those 0.02 seconds were extremely important, it is better to let GCC do the job.