Why Duff's Device works (and dissecting switches and loops in assembly)

There are two things I really like: pomegranates and weird hacks in programming languages that you should never use. Not because you can use them in your projects to make them hard to maintain (and I am not talking about pomegranates here), but because they are interesting and eye-openers. Duff's Device is one of my favourites - although it is not exactly something that should never be used (it was useful in that particular situation), it's probably something that nobody nowadays would ever have to use (thankfully). If you don't know what it is, here is a short explanation: it is a clever way to implement loop unrolling in C with the capability of jumping into the middle of the loop. Here is the original code in all its K&R glory: ```c // The 3 lines below are the function declaration, nowadays it would be: // send(register short *to, register short *from, register count) send(to, from, count) register short *to, *from; register count; { register n = (count + 7) / 8; switch (count % 8) { case 0: do { *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; } while (--n > 0); } } ``` This thing, which looks more like a horrible accident with macros on VIM, is actually valid syntax, and it works. You can read more about the story of this thing here. But what does it do, exactly? Let's make a simpler example. Take a look at this: ```c int main() { int counter = 40; int sum = 0; do { sum++; } while(--counter > 0); printf("Total: %d\n", sum); return 0; } ``` This is a very simple script: it loops 40 times, and for every iteration, it adds 1 to sum. At the end, it prints `Total: 40` - no big deal. Now, suppose we want to make it faster. One alternative is to unroll the loop: instead of doing one operation per iteration, we can do 10: ```c int main() { int counter = 4; int sum = 0; do { sum++; sum++; sum++; sum++; sum++; sum++; sum++; sum++; sum++; sum++; } while (--counter > 0); printf("Total: %d\n", sum); return 0; } ``` The result will be the same, but it is a bit more efficient. There is one problem here though: it would only work for numbers that are multiple of 10 (we could not print 46, for example) UNLESS we could jump into the middle of the loop somehow, and that is the trick! Using a switch case, we can skip part of an iteration, and if we add one more, we will be able to print whatever number we want: ```c int main() { int counter = 4; int sum = 0; int s = 6; switch (s) { do { case 0: sum++; case 9: sum++; case 8: sum++; case 7: sum++; case 6: sum++; case 5: sum++; case 4: sum++; case 3: sum++; case 2: sum++; case 1: sum++; } while (counter-- > 0); } printf("Total: %d\n", sum); return 0; } ``` In this case, `s` is `6`, which would make the program jump to this line: `case 6: sum++;` We would then execute the instructions until we hit the `} while (counter-- > 0);` line, and then jump back to the beginning of the loop (do {) and loop over and over again. This is cool and all, but I was more interested in **why** this works, and not **how**. It seems that this neat trick would be somehow lost when we compiled the code (or even more likely, be a syntax error). Let's peek at the assembly code (x86 64, no optimization). First, this is how the do-while loop gets translated: ```x86asm ; The do { does not get translated to anything do { ; This is where we increment 1 to %rbp-4 (a position in the stack), which is where the sum is sum++; 660: 83 45 fc 01 addl $0x1,-0x4(%rbp) ; These lines are, respectively: subtracting one from rbp-8 (a position in the stack) which is ; where is where the counter is; comparing the counter to 0, jumping back to line 660, which ; is where the sum++ is (the line above) if counter is still greater than 0 } while (--counter > 0); 688: 83 6d f8 01 subl $0x1,-0x8(%rbp) 68c: 83 7d f8 00 cmpl $0x0,-0x8(%rbp) 690: 7f ce jg 660 ``` Cool. Now let's see how a switch works. I made a very simple switch-case to see how it gets translated: ```c switch (num) { case 1: printf("One\n"); break; case 2: printf("Two or more\n"); case 3: printf("Three\n"); break; default: printf("!= 1, 2, or 3"); break; } ``` In Assembly: ```x86asm ; The next few lines are interesting. C gives us the impression that the comparisons ; happen as the program goes though the cases, but in reality, all the comparisons happen ; in the "switch" header. The following lines are: ; 1- Moving the value of "num" (stack - 4) to eax ; 2- Comparing eax to 2 and jumping to line 6c9 if equal ; 3- Comparing eax to 3 and jumping to line 6d5 if equal ; 4- Comparing eax to 1 and jumping to line 6e3, which is the "default" case if NOT equal switch (num) { 6a9: 8b 45 fc mov -0x4(%rbp),%eax 6ac: 83 f8 02 cmp $0x2,%eax 6af: 74 18 je 6c9 6b1: 83 f8 03 cmp $0x3,%eax 6b4: 74 1f je 6d5 6b6: 83 f8 01 cmp $0x1,%eax 6b9: 75 28 jne 6e3 ; If we did not make a jump, which happens to the case 1, we execute the following part. ; Notice that the "break" causes the program to jumpt 6f5, which is the next instruction ; after the end of the switch case 1: printf("One\n"); 6bb: 48 8d 3d d2 00 00 00 lea 0xd2(%rip),%rdi # 794 <_IO_stdin_used+0x4> 6c2: e8 a9 fe ff ff callq 570 break; 6c7: eb 2c jmp 6f5 ; This is case 2. It is similar to case 1, except that it does not have the "break", so ; it does not jump at the end, and ends up falling through to case "3" case 2: printf("Two or more\n"); 6c9: 48 8d 3d c8 00 00 00 lea 0xc8(%rip),%rdi # 798 <_IO_stdin_used+0x8> 6d0: e8 9b fe ff ff callq 570 ; The case 3 is similar to case 1 case 3: printf("Three\n"); 6d5: 48 8d 3d c8 00 00 00 lea 0xc8(%rip),%rdi # 7a4 <_IO_stdin_used+0x14> 6dc: e8 8f fe ff ff callq 570 break; 6e1: eb 12 jmp 6f5 ; The default case is similar to the other ones, but instead of jumping out of the ; switch, it just has a "no operation" instruction (nop), since it doesn't really need ; to jump anywhere default: printf("!= 1, 2, or 3"); 6e3: 48 8d 3d c0 00 00 00 lea 0xc0(%rip),%rdi # 7aa <_IO_stdin_used+0x1a> 6ea: b8 00 00 00 00 mov $0x0,%eax 6ef: e8 8c fe ff ff callq 580 break; 6f4: 90 nop } ``` Suddenly it doesn't seem that hard to join both control structures, right? It looks weird in C, but it seems like it would flow very naturally in Assembly. Let's take a look at the compiled code when we join them: ```x86asm ; Here is the switch header. We are going to subtract "s" from 9, jump out of the switch if ; "s" is above 9 (doesn't fit in any case), move "s" to eax, and then do some pointer ; calculations ("lea" means Load Effective Address, which means we are moving the address ; of something instead of its value) to figure out where we are going to jump to. This ; jump will be from 691 to 6b5, which is where we are incrementing "sum" switch (s) { 667: 83 7d fc 09 cmpl $0x9,-0x4(%rbp) 66b: 77 59 ja 6c6 66d: 8b 45 fc mov -0x4(%rbp),%eax 670: 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx 677: 00 678: 48 8d 05 01 01 00 00 lea 0x101(%rip),%rax # 780 <_IO_stdin_used+0x10> 67f: 8b 04 02 mov (%rdx,%rax,1),%eax 682: 48 63 d0 movslq %eax,%rdx 685: 48 8d 05 f4 00 00 00 lea 0xf4(%rip),%rax # 780 <_IO_stdin_used+0x10> 68c: 48 01 d0 add %rdx,%rax 68f: ff e0 jmpq *%rax ; This part is just a repetition of increments for the variable "sum" - the switch will ; put us in one of the lines below do { case 0: sum++; 691: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 9: sum++; 695: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 8: sum++; 699: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 7: sum++; 69d: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 6: sum++; 6a1: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 5: sum++; 6a5: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 4: sum++; 6a9: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 3: sum++; 6ad: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 2: sum++; 6b1: 83 45 f8 01 addl $0x1,-0x8(%rbp) case 1: sum++; 6b5: 83 45 f8 01 addl $0x1,-0x8(%rbp) ; When we reach this part, we just check the loop counter and jump back to line 691 (the ; first increment) if we are still not done } while (counter-- > 0); 6b9: 8b 45 f4 mov -0xc(%rbp),%eax 6bc: 8d 50 ff lea -0x1(%rax),%edx 6bf: 89 55 f4 mov %edx,-0xc(%rbp) 6c2: 85 c0 test %eax,%eax 6c4: 7f cb jg 691 } ``` Notice how the code above is just a regular "do" loop with a header at the beginning, making us jump and skip some instructions if some conditions are met. At the end, this hack seems very simple when we look at the code generated by the compiler.