
SPO600 - Lab 6 - Algorithm Selection

In this post, I will evaluate three ways to scale digital sound samples by a volume factor. The full instructions for this lab can be found here.

To begin with, what is sound? Sound is a wave: a vibration that travels through some medium (normally air, for us). These are analog waves. Look at my beautiful ASCII analog wave:

```
+1 |      *                       *                       *
   |   *     *                 *     *                 *     *
 0 |*-----------*-----------*-----------*-----------*-----------*-----------*
   |               *     *                 *     *                 *     *
-1 |                  *                       *                       *
```

So, how do we store this in memory? A good way to do it is to sample the wave **n** times per second and store the results. For example, say that we take 8 samples for every oscillation of the wave. This would be our result:

```
+1 |      *                       *                       *
   |   *  '  *                 *  '  *                 *  '  *
 0 |*--'--'--'--*-----------*--'--'--'--*-----------*--'--'--'--*-----------*
   |               *  '  *                 *  '  *                 *  '  *
-1 |                  *                       *                       *
    0     1     0    -1     0     1     0    -1     0     1     0    -1     0
       0.7   0.7  -0.7  -0.7   0.7   0.7  -0.7  -0.7   0.7   0.7  -0.7  -0.7
```

The digital representation of the first oscillation would be `[0, 0.7, 1, 0.7, 0, -0.7, -1, -0.7]`, and this we can store in memory. It is safe to say that the more samples we take, the better our representation of the analog wave will be, but the more space it will take in memory.

So, what happens, for example, if our volume is at 50% instead of 100%? In this case, our wave would only have half of its amplitude, and to represent this, we would have to multiply every sample by 0.5. The previous wave would then become `[0, 0.35, 0.5, 0.35, 0, -0.35, -0.5, -0.35]`.

One important observation: we typically don't store sound as floats, but as 16-bit signed integers. This means that the highest value for a sample is 32,767 and the lowest is -32,768, so our wave would be stored like this: `[0, 22936, 32767, 22936, 0, -22937, -32768, -22937]`.

Alright. What is the problem? Two things. First, an `integer * float` multiplication is relatively slow. Second, a typical song is sampled at 44.1 kHz, or 44,100 samples per second. Yes, this is a lot. You also have to consider that we typically have two channels, so it is really 88,200 samples per second. If our volume is at any level other than 1, we have to perform 88,200 of these slow multiplications every second just to play a song. We need a good algorithm to do this quickly.

This lab proposes three algorithms:

1. The "naive" one
2. Caching the results in an array for quick lookup
3. Converting the volume factor into an integer, so the calculation becomes `integer * integer` instead of `float * integer`

I will implement all three algorithms and run them on two machines, one x86_64 and one AArch64, so we can compare and discuss the results.
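Before diving into the algorithms, here is a minimal sketch (my own, not part of the lab code) of the two ideas above: converting the eight float samples of one oscillation into 16-bit integers, and then applying a 50% volume. The asymmetric scale factors (32767 for positive samples, 32768 for negative ones) are an assumption on my part that happens to reproduce the stored values quoted above.

```c
// Minimal sketch: store one oscillation of the sampled wave as 16-bit
// integers, then scale it to 50% volume. The scale factors below are an
// assumption chosen to reproduce the values quoted in the text.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const float wave[8] = { 0.0f, 0.7f, 1.0f, 0.7f, 0.0f, -0.7f, -1.0f, -0.7f };
    const float volume  = 0.5f;

    for (int i = 0; i < 8; i++) {
        float   scale  = wave[i] < 0.0f ? 32768.0f : 32767.0f;
        int16_t stored = (int16_t)(wave[i] * scale);   // 16-bit representation
        int16_t scaled = (int16_t)(stored * volume);   // 50% volume
        printf("%5.2f -> stored %6d -> at 50%% volume %6d\n",
               wave[i], stored, scaled);
    }
    return 0;
}
```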
# Algorithm 1

Here we simply multiply every sample by the volume factor.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define LENGTH 250000000
#define VOLUME 0.75

int main() {
    srand(1);

    // We will use this to record the running time
    struct timeval timerBefore, timerAfter, timerResult;
    int16_t* arr;
    size_t i;
    long long sum = 0;

    // Allocating the memory
    arr = calloc(LENGTH, sizeof(int16_t));

    // Seeding the array
    for (i = 0; i < LENGTH; i++)
        arr[i] = rand();

    // I will set the first 11 values manually, so we
    // can check if the results are correct
    arr[0] = 0;
    arr[1] = 1;
    arr[2] = -1;
    arr[3] = 2;
    arr[4] = -2;
    arr[5] = 10;
    arr[6] = -10;
    arr[7] = 11;
    arr[8] = -11;
    arr[9] = 32767;
    arr[10] = -32768;

    gettimeofday(&timerBefore, NULL);

    // Scaling the samples.
    // We simply multiply each sample by the volume. Since the result of the
    // multiplication is a double, we cast it back to int16_t.
    for (i = 0; i < LENGTH; i++)
        arr[i] = (int16_t)(arr[i] * VOLUME);

    gettimeofday(&timerAfter, NULL);

    // Printing the first 11 results
    for (i = 0; i < 11; i++)
        printf("%zu: %d ", i, arr[i]);

    // Calculate the sum
    for (i = 0; i < LENGTH; i++)
        sum += arr[i];

    // Freeing memory and printing the results
    free(arr);
    timersub(&timerAfter, &timerBefore, &timerResult);
    printf("Time elapsed: %ld.%06ld ",
           (long int)timerResult.tv_sec, (long int)timerResult.tv_usec);
    printf("Sum: %lld ", sum);

    return 0;
}
```

# Algorithm 2

Here we pre-calculate the results for every possible sample value, from -32,768 to 32,767, and store them in an array, so we can simply look them up instead of calculating them on the fly. The obvious disadvantage of this algorithm is its memory footprint: we are keeping 65,536 pre-calculated values (128 KiB) around.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define LENGTH 250000000
#define VOLUME 0.75

int main() {
    srand(1);

    // We will use this to record the running time
    struct timeval timerBefore, timerAfter, timerResult;
    int16_t* arr;
    size_t i;
    long long sum = 0;
    int16_t lookupTable[65536];

    // Allocating the memory
    arr = calloc(LENGTH, sizeof(int16_t));

    // Seeding the array
    for (i = 0; i < LENGTH; i++)
        arr[i] = rand();

    // I will set the first 11 values manually, so we
    // can check if the results are correct
    arr[0] = 0;
    arr[1] = 1;
    arr[2] = -1;
    arr[3] = 2;
    arr[4] = -2;
    arr[5] = 10;
    arr[6] = -10;
    arr[7] = 11;
    arr[8] = -11;
    arr[9] = 32767;
    arr[10] = -32768;

    // Calculate the lookup table.
    // One important observation: the loop index runs from 0 to 65535, but
    // everything above 32767 actually stands for a negative sample (index
    // 65535 is the sample -1). If we multiplied the raw index, we would be
    // calculating 65535 * 0.75 instead of -1 * 0.75, so we cast the index
    // back to int16_t before multiplying.
    for (i = 0; i < 65536; i++)
        lookupTable[i] = (int16_t)((int16_t)i * VOLUME);

    gettimeofday(&timerBefore, NULL);

    // Scaling the samples.
    // There is one very important detail here: arr[] holds sample values
    // from -32,768 to 32,767, but the table indexes go from 0 to 65,535.
    // The easiest way to bridge the two is to cast the sample to uint16_t:
    // negative samples then map to the upper half of the index range, which
    // is exactly how the table was built above.
    // Another alternative would be a pointer into the middle of the table,
    // letting negative samples index backwards from there, but I find the
    // cast much cleaner.
    for (i = 0; i < LENGTH; i++)
        arr[i] = lookupTable[(uint16_t)arr[i]];

    gettimeofday(&timerAfter, NULL);

    // Printing the first 11 results
    for (i = 0; i < 11; i++)
        printf("%zu: %d ", i, arr[i]);

    // Calculate the sum
    for (i = 0; i < LENGTH; i++)
        sum += arr[i];

    // Freeing memory and printing the results
    free(arr);
    timersub(&timerAfter, &timerBefore, &timerResult);
    printf("Time elapsed: %ld.%06ld ",
           (long int)timerResult.tv_sec, (long int)timerResult.tv_usec);
    printf("Sum: %lld ", sum);

    return 0;
}
```
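The cast in `lookupTable[(uint16_t)arr[i]]` is worth a closer look. Here is a minimal standalone sketch (my own, separate from the lab program) showing how that cast maps every sample onto a table index:

```c
// Minimal sketch of the index trick used by the lookup table: converting a
// negative int16_t sample to uint16_t wraps it modulo 65536, which lands it
// in the upper half of the table (indexes 32768 to 65535).
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t samples[] = { 0, 1, -1, 100, -100, 32767, -32768 };

    for (int i = 0; i < 7; i++) {
        uint16_t index = (uint16_t)samples[i];   // well-defined wrap-around
        printf("sample %6d -> table index %5u\n", samples[i], (unsigned)index);
    }
    return 0;
}
```

Since the table itself is built by casting the index back to `int16_t` before multiplying, entry 65535 holds the result for sample -1, entry 65436 the result for -100, and so on, so the lookup returns exactly what algorithm 1 would have calculated.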
# Algorithm 3

Here is the theory behind the last algorithm: we can replace the slow `int * float` multiplication with an `int * int` multiplication and a bit shift. First, we calculate the volume factor as an integer:

```c
int32_t volumeInt = VOLUME * 256;
```

Why 256? Because it is a power of two (2^8), so we can undo it later with a simple bit shift. Take a look at what this factor looks like in binary for a few volume levels:

```
volume | volume * 256
     0 | 0000 0000 0000 0000
  0.75 | 0000 0000 1100 0000
   0.5 | 0000 0000 1000 0000
     1 | 0000 0001 0000 0000
  -0.5 | 1111 1111 1000 0000
 -0.75 | 1111 1111 0100 0000
    -1 | 1111 1111 0000 0000
```

What is so special about this factor? Simple: multiplying a sample by it scales the sample by the volume while pushing the bits to the left. Let's widen the samples to 32-bit numbers first:

```
sample  | 16 bits             | 32 bits
     0  | 0000 0000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000
     1  | 0000 0000 0000 0001 | 0000 0000 0000 0000 0000 0000 0000 0001
    -1  | 1111 1111 1111 1111 | 1111 1111 1111 1111 1111 1111 1111 1111
   100  | 0000 0000 0110 0100 | 0000 0000 0000 0000 0000 0000 0110 0100
  -100  | 1111 1111 1001 1100 | 1111 1111 1111 1111 1111 1111 1001 1100
 32767  | 0111 1111 1111 1111 | 0000 0000 0000 0000 0111 1111 1111 1111
-32768  | 1000 0000 0000 0000 | 1111 1111 1111 1111 1000 0000 0000 0000
```

Now we multiply each 32-bit sample by `0000 0000 1000 0000`, which is a volume of 0.5 (0.5 * 256 = 128):

```
sample  | before                                  | after
     0  | 0000 0000 0000 0000 0000 0000 0000 0000 | 0000 0000 0000 0000 0000 0000 0000 0000
     1  | 0000 0000 0000 0000 0000 0000 0000 0001 | 0000 0000 0000 0000 0000 0000 1000 0000
    -1  | 1111 1111 1111 1111 1111 1111 1111 1111 | 1111 1111 1111 1111 1111 1111 1000 0000
   100  | 0000 0000 0000 0000 0000 0000 0110 0100 | 0000 0000 0000 0000 0011 0010 0000 0000
  -100  | 1111 1111 1111 1111 1111 1111 1001 1100 | 1111 1111 1111 1111 1100 1110 0000 0000
 32767  | 0000 0000 0000 0000 0111 1111 1111 1111 | 0000 0000 0011 1111 1111 1111 1000 0000
-32768  | 1111 1111 1111 1111 1000 0000 0000 0000 | 1111 1111 1100 0000 0000 0000 0000 0000
```

Notice how the numbers remained unchanged; they just got shifted 7 bits to the left! Now, what happens if we shift them 8 bits to the right and convert them back to 16 bits?

```
before shifting (32 bits)                | after shifting (16 bits)
0000 0000 0000 0000 0000 0000 0000 0000  | 0000 0000 0000 0000 =      0
0000 0000 0000 0000 0000 0000 1000 0000  | 0000 0000 0000 0000 =      0
1111 1111 1111 1111 1111 1111 1000 0000  | 1111 1111 1111 1111 =     -1
0000 0000 0000 0000 0011 0010 0000 0000  | 0000 0000 0011 0010 =     50
1111 1111 1111 1111 1100 1110 0000 0000  | 1111 1111 1100 1110 =    -50
0000 0000 0011 1111 1111 1111 1000 0000  | 0011 1111 1111 1111 =  16383
1111 1111 1100 0000 0000 0000 0000 0000  | 1100 0000 0000 0000 = -16384
```

For a volume of 0.5 we could have achieved the same result by simply shifting the samples one bit to the right, but that is exactly the point of this algorithm: multiplying by `volume * 256` scales the sample by the volume while also shifting it 8 bits to the left (since `volume * 256` is just the volume scaled by 2^8), and the final shift of 8 bits to the right undoes that factor of 256, leaving us with `sample * volume` computed entirely with integers. In this case, a 50% volume shifted the bits 7 places to the left and then 8 places to the right, which is the same as dividing the sample by 2.
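To double-check the walkthrough above, here is a minimal standalone sketch (my own, separate from the lab programs) that applies the `(sample * volumeInt) >> 8` trick to the same hand-picked samples at a volume of 0.5:

```c
// Minimal sketch of the fixed-point scaling trick at a volume of 0.5.
// It relies on arithmetic right shift for negative values, which is what
// GCC does on both test machines.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t samples[] = { 0, 1, -1, 100, -100, 32767, -32768 };
    int32_t volumeInt = (int32_t)(0.5 * 256);   // 128

    for (int i = 0; i < 7; i++) {
        int32_t wide   = samples[i] * volumeInt;   // scaled and shifted left
        int16_t scaled = (int16_t)(wide >> 8);     // shifted back down to 16 bits
        printf("%6d * %d >> 8 = %6d\n", samples[i], volumeInt, scaled);
    }
    return 0;
}
```

The printed results (0, 0, -1, 50, -50, 16383, -16384) match the table above.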
Here is how the code looks:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

#define LENGTH 250000000
#define VOLUME 0.75

int main() {
    srand(1);

    // We will use this to record the running time
    struct timeval timerBefore, timerAfter, timerResult;
    int16_t* arr;
    size_t i;
    long long sum = 0;

    // Allocating the memory
    arr = calloc(LENGTH, sizeof(int16_t));

    // Seeding the array
    for (i = 0; i < LENGTH; i++)
        arr[i] = rand();

    // I will set the first 11 values manually, so we
    // can check if the results are correct
    arr[0] = 0;
    arr[1] = 1;
    arr[2] = -1;
    arr[3] = 2;
    arr[4] = -2;
    arr[5] = 10;
    arr[6] = -10;
    arr[7] = 11;
    arr[8] = -11;
    arr[9] = 32767;
    arr[10] = -32768;

    // Getting the volume factor as an integer
    int32_t volumeInt = VOLUME * 256;

    gettimeofday(&timerBefore, NULL);

    // Scaling the samples: integer multiplication, then shifting the
    // result back down by 8 bits
    for (i = 0; i < LENGTH; i++)
        arr[i] = (int16_t)((arr[i] * volumeInt) >> 8);

    gettimeofday(&timerAfter, NULL);

    // Printing the first 11 results
    for (i = 0; i < 11; i++)
        printf("%zu: %d ", i, arr[i]);

    // Calculate the sum
    for (i = 0; i < LENGTH; i++)
        sum += arr[i];

    // Freeing memory and printing the results
    free(arr);
    timersub(&timerAfter, &timerBefore, &timerResult);
    printf("Time elapsed: %ld.%06ld ",
           (long int)timerResult.tv_sec, (long int)timerResult.tv_usec);
    printf("Sum: %lld ", sum);

    return 0;
}
```

# Results for x86_64

These are the results I get on my computer (x86_64):

```bash
./1.o
0: 0 1: 0 2: 0 3: 1 4: -1 5: 7 6: -7 7: 8 8: -8 9: 24575 10: -24576 Time elapsed: 0.187393 Sum: -202518466

./2.o
0: 0 1: 0 2: 0 3: 1 4: -1 5: 7 6: -7 7: 8 8: -8 9: 24575 10: -24576 Time elapsed: 0.159220 Sum: -202518466

./3.o
0: 0 1: 0 2: -1 3: 1 4: -2 5: 7 6: -8 7: 8 8: -9 9: 24575 10: -24576 Time elapsed: 0.070779 Sum: -296276756
```

As expected, the third algorithm was a lot faster than the other two. There is a difference in the output for negative samples: instead of 0 we get -1, instead of -1 we get -2, and so on. This comes down to how each version rounds: casting the floating-point result to an integer truncates toward zero (so -1 * 0.75 = -0.75 becomes 0), while the arithmetic right shift rounds toward negative infinity (so -0.75 becomes -1). That is also why the sums differ. However, this difference is not important: it is so small that we can safely ignore it.
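To confirm that the mismatch is purely a rounding difference, here is a minimal side-by-side check (my own sketch, not part of the lab programs) of the two conversions for a few of the hand-set samples:

```c
// Minimal check of the rounding difference between algorithms 1 and 3:
// casting the floating-point result truncates toward zero, while the
// arithmetic right shift rounds toward negative infinity.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t samples[] = { 1, -1, 2, -2, 11, -11 };
    int32_t volumeInt = (int32_t)(0.75 * 256);   // 192

    for (int i = 0; i < 6; i++) {
        int16_t viaFloat = (int16_t)(samples[i] * 0.75);              // algorithm 1
        int16_t viaShift = (int16_t)((samples[i] * volumeInt) >> 8);  // algorithm 3
        printf("%4d -> float cast: %4d, shift: %4d\n",
               samples[i], viaFloat, viaShift);
    }
    return 0;
}
```

For the negative inputs the shifted version comes out one lower, which matches the differences between the outputs of `./1.o` and `./3.o` above.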
One thing I found very relevant: examining the assembly code generated for the third program, I noticed these lines for the bit-shifting loop:

```x86asm
movdqa    %xmm2,%xmm1
pmulhw    %xmm3,%xmm2
pmullw    %xmm3,%xmm1
movdqa    %xmm1,%xmm0
punpckhwd %xmm2,%xmm1
punpcklwd %xmm2,%xmm0
psrad     $0x8,%xmm1
psrad     $0x8,%xmm0
movdqa    %xmm0,%xmm2
punpcklwd %xmm1,%xmm0
punpckhwd %xmm1,%xmm2
movdqa    %xmm0,%xmm1
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm1
punpcklwd %xmm1,%xmm0
```

These **xmm** registers are vector (SSE) registers on x86_64 machines. The compiler was smart enough to vectorize the loop for us.

# Results for AArch64

Running the same code on an AArch64 machine, these are the results I got:

```bash
./1.o
0: 0 1: 0 2: 0 3: 1 4: -1 5: 7 6: -7 7: 8 8: -8 9: 24575 10: -24576 Time elapsed: 0.442601 Sum: -202518466

./2.o
0: 0 1: 0 2: 0 3: 1 4: -1 5: 7 6: -7 7: 8 8: -8 9: 24575 10: -24576 Time elapsed: 0.720578 Sum: -202518466

./3.o
0: 0 1: 0 2: -1 3: 1 4: -2 5: 7 6: -8 7: 8 8: -9 9: 24575 10: -24576 Time elapsed: 0.219452 Sum: -296276756
```

Not exactly what I was expecting: it seems that looking the values up in the table actually took longer than doing the multiplications. I am not exactly an expert on AArch64, so I will need some time to do some research and understand why this is happening.

# Other conclusions

For all tests I used an array of 250,000,000 samples, and every approach processed it in less than one second. At 44.1 kHz with 2 channels, that is 88,200 samples per second, so the array corresponds to roughly 47 minutes of audio; both platforms can easily keep up with real-time playback.

An alternative to make the third algorithm even faster would be to use multiple cores to process the different channels of a song in parallel (a rough sketch of this idea is at the end of this post).

I am also interested in which architecture consumed more energy to achieve these results. Unfortunately, I could not find a reliable formula to calculate it, but we can make some guesses. The factors to consider are how much energy the processor spends per instruction and how many instructions it needs for each pass through the loop. The raw speed of the processor is not a good measure on its own: a machine that executes 10 instructions per second at 5 joules each consumes exactly as much as a machine that executes 5 instructions per second at 10 joules each. We can (kind of) find out how many instructions were needed by looking at the assembly code; it is not very accurate, and the code generated at **-O3** is hard to read, but it is better than nothing. The loop was unrolled 16 times on both machines, and in total the unrolled loop body took around 170 instructions on my x86_64 machine and around 150 on the ARM machine. Those numbers are very close. However, the ARM architecture is known for being more energy efficient per instruction than x86_64, so I think it is safe-ish to assume that the AArch64 machine was, indeed, more energy efficient.
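As a closing aside, here is a rough sketch of the multi-core idea mentioned above: one thread per channel of an interleaved stereo buffer, each applying the integer scaling from algorithm 3. This is my own sketch and I have not benchmarked it; splitting the buffer into two contiguous halves instead of striding by channel might vectorize better.

```c
// Hypothetical sketch: scale an interleaved stereo buffer with one thread
// per channel, using the integer scaling from algorithm 3. Not benchmarked.
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LENGTH   250000000   // total samples, both channels interleaved
#define CHANNELS 2
#define VOLUME   0.75

struct job {
    int16_t *samples;    // the interleaved buffer
    size_t   start;      // first index of this channel (0 = left, 1 = right)
    int32_t  volumeInt;  // volume as a fixed-point integer (volume * 256)
};

static void *scale_channel(void *arg) {
    struct job *j = arg;
    // Walk every CHANNELS-th sample, i.e. one channel of the interleaved data
    for (size_t i = j->start; i < LENGTH; i += CHANNELS)
        j->samples[i] = (int16_t)((j->samples[i] * j->volumeInt) >> 8);
    return NULL;
}

int main(void) {
    int16_t *arr = calloc(LENGTH, sizeof(int16_t));
    if (!arr)
        return 1;

    srand(1);
    for (size_t i = 0; i < LENGTH; i++)
        arr[i] = rand();

    pthread_t  threads[CHANNELS];
    struct job jobs[CHANNELS];
    for (int c = 0; c < CHANNELS; c++) {
        jobs[c] = (struct job){ arr, (size_t)c, (int32_t)(VOLUME * 256) };
        pthread_create(&threads[c], NULL, scale_channel, &jobs[c]);
    }
    for (int c = 0; c < CHANNELS; c++)
        pthread_join(threads[c], NULL);

    printf("%d %d %d\n", arr[0], arr[1], arr[2]);
    free(arr);
    return 0;
}
```

It would be compiled with something like `gcc -O3 -pthread`.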