Henry S. Coelho

More Inline Assembly

A while ago I made two posts: Algorithm Selection, about different algorithms for scaling digital sound, and Experimenting with Inline Assembly in C, about how to use inline assembly in C. For this lab, I will combine the two by writing a better scaling algorithm with inline assembly.

Here is the logic behind it: instead of going through every single element in the array, we can use vector registers to operate on 8 elements at a time (8 elements of 16 bits each = 128 bits). To make sure the operation is as efficient as possible, we will implement this with inline assembly.

And here are the changes:

  // This variable will hold the last address of the array
  int16_t* limit;

  // Will be used for the assembler code: the volume will now be
  // stored in register r22, and a cursor for the array (for looping)
  // will be stored in register r20
  // - Question: Is there an alternate approach for this?
  // - Answer: Yes. Instead of naming the registers we are going to use,
  // we can leave the choice to the compiler. If we do this, the asm()
  // statement will have to change, since we will not be able to refer
  // to the registers by name anymore
  register int16_t  volumeInt asm("r22");
  register int16_t *arrCursor asm("r20");

  // First and last address of the array
  arrCursor = arr;
  limit = arr + LENGTH;

Those are the variables I will need. Now, for the algorithm that goes inside the loop:

  // Here I am duplicating the volume factor in vector v1.8h
  // - Question: What do you mean by "duplicating"?
  // - Answer: By duplicating, I mean that the volume factor
  // will be repeated in every lane of the vector. For example, say
  // the volume factor is "19" and the vector has 8 lanes. The vector
  // register will then be filled as "19 19 19 19 19 19 19 19"
  asm("dup v1.8h, w22");

  // While we did not reach the last element of the array...
  while(arrCursor < limit) {
    asm(
      // Load eight shorts into the vector v0.8h (q0)
      "ldr q0, [x20]                  \n\t"

      // Multiply vector v0.8h by the volume factor and save
      // into v0.8h. sqdmulh doubles each product and keeps the
      // high 16 bits, so the factor acts as a fixed-point fraction
      "sqdmulh v0.8h, v0.8h, v1.8h    \n\t"

      // Store the value of v0.8h back in the array (all 8 shorts at once)
      // and advance the cursor to the next 8 shorts (16 bytes)
      "str q0, [x20],#16              \n\t"

      // Using "arrCursor" as input/output
      : "+r"(arrCursor)

      // Specifying that "limit" is a read-only register
      : "r"(limit)

      // The asm block writes to the array, so memory is clobbered
      : "memory"
    );
  }
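To make the arithmetic in the loop body easier to follow, here is a minimal portable sketch of what dup and sqdmulh compute, written in plain C. The names sqdmulh_lane, volume_to_q15 and scale_block8 are made up for this illustration, and the sketch assumes the volume factor is stored as a fixed-point fraction (volume * 32768), which is how sqdmulh effectively treats it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of one SQDMULH lane: double the product, keep the
 * high 16 bits, and saturate on overflow. */
int16_t sqdmulh_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;

    /* Only (-32768) * (-32768) overflows once doubled; it saturates. */
    if (product == 0x40000000)
        return INT16_MAX;

    /* (2 * product) >> 16 equals product >> 15. This relies on an
     * arithmetic right shift of negative values, which is what GCC
     * and Clang implement. */
    return (int16_t)(product >> 15);
}

/* Hypothetical helper: convert a volume in [0.0, 1.0) to the
 * fixed-point factor that the multiply expects. */
int16_t volume_to_q15(float volume)
{
    return (int16_t)(volume * 32768.0f);
}

/* Model of one loop iteration: "dup" the factor into 8 lanes, then
 * multiply 8 samples lane by lane. */
void scale_block8(int16_t *samples, int16_t factor_q15)
{
    int16_t lanes[8];

    assert(samples != NULL);

    for (size_t i = 0; i < 8; i++)
        lanes[i] = factor_q15;                            /* dup     */

    for (size_t i = 0; i < 8; i++)
        samples[i] = sqdmulh_lane(samples[i], lanes[i]);  /* sqdmulh */
}
```

For example, a factor of 16384 (a volume of 0.5) turns a sample of 1000 into 500, which is the kind of result the vector version should produce for each of its 8 lanes.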

Another question that arises: since we address the registers directly, do we need the input/output sections in the inline assembly at all? In theory, no. In practice, leaving them out confuses the compiler: it does not know what the asm block does and treats it as a "black box", so it has no idea that the loop control in while(arrCursor < limit) changes inside it. It may then conclude that the condition never changes, treat the loop as infinite, and replace it with a faster alternative that never re-checks the condition. By declaring arrCursor as an input/output operand, we explicitly tell the compiler that the asm block modifies it, so the compiler knows the loop is not infinite and evaluates the condition on every iteration.
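For comparison, here is a rough sketch of the alternative mentioned in the code comments: letting the compiler choose the registers and referring to them through named operands. This is not the version I benchmarked. The function name scale_samples and the operand names cur and vol are invented for the example, the length is assumed to be a multiple of 8, and the scalar branch only exists so the code still builds on machines that are not AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale `length` samples in place by a fixed-point factor
 * (volume * 32768). Assumes `length` is a multiple of 8. */
void scale_samples(int16_t *arr, size_t length, int16_t factor_q15)
{
    int16_t *cursor = arr;
    int16_t *limit = arr + length;

    assert(length % 8 == 0);

    while (cursor < limit) {
#if defined(__aarch64__)
        __asm__ volatile(
            /* Duplicate the factor into all 8 lanes of v1 */
            "dup     v1.8h, %w[vol]        \n\t"
            /* Load 8 samples, multiply, store, advance 16 bytes */
            "ldr     q0, [%[cur]]          \n\t"
            "sqdmulh v0.8h, v0.8h, v1.8h   \n\t"
            "str     q0, [%[cur]], #16     \n\t"
            : [cur] "+r"(cursor)        /* read and written by the asm  */
            : [vol] "r"(factor_q15)     /* read-only input              */
            : "v0", "v1", "memory");    /* registers and memory touched */
#else
        /* Scalar fallback with the same arithmetic as sqdmulh */
        for (int i = 0; i < 8; i++) {
            int32_t product = (int32_t)cursor[i] * factor_q15;
            cursor[i] = (product == 0x40000000)
                            ? INT16_MAX
                            : (int16_t)(product >> 15);
        }
        cursor += 8;
#endif
    }
}
```

Keeping dup inside the same asm block costs one extra instruction per iteration, but it avoids depending on v1 surviving between two separate asm statements, which the compiler does not guarantee unless it is told about the register.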

The result was slightly better than the previous implementation, which took around 0.22 seconds: the new version finishes in around 0.19 seconds. My guess is that the previous implementation had already been vectorized by the compiler, so the difference was not too significant.

Inline Assembly in mjpegtools

The next part of the lab is to pick an open source library that uses Inline Assembly and determine:

The library I chose is called mjpegtools. According to the documentation, it provides:

Programs for MJPEG recording and playback and simple cut-and-paste editting and MPEG compression of audio and video under Linux.

These are the instances of Inline Assembly I found:

file utils/cpuinfo.c

// In the instructions below, the application copies register B
// (rbx/ebx) into the source index register (rsi/esi), emits the bytes
// 0x0f and 0xa2, and then exchanges the contents of the two registers.
// Judging by the macro names, those bytes are the opcodes for CPUID
// (and 0x0f, 0x31 for RDTSC), written as raw bytes instead of mnemonics.
// Beyond that, I can't really tell what these sections are doing or
// whether they could be better.
// The platform is obviously x86; the only variation seems to be whether
// the platform is 64-bit or not: the 32-bit version uses the prefix "e",
// while the 64-bit version uses "r" (extended registers).
// What happens on other platforms? If I did not miss anything, no
// other platforms are supported.

#define CPUID   ".byte 0x0f, 0xa2; "
#ifdef __x86_64__
  asm("mov %%rbx, %%rsi\n\t"
#else
  asm("mov %%ebx, %%esi\n\t"
#endif
      CPUID"\n\t"
#ifdef __x86_64__
      "xchg %%rsi, %%rbx"
#else
      "xchg %%esi, %%ebx"
#endif

// ...

#define RDTSC   ".byte 0x0f, 0x31; "
  asm volatile (RDTSC : "=A"(i) : );

file utils/cpu_accel.c

#ifdef HAVE_X86CPU 

/* Some miscelaneous stuff to allow checking whether SSE instructions cause
   illegal instruction errors.
*/

static sigjmp_buf sigill_recover;

static RETSIGTYPE sigillhandler(int sig )
{
    siglongjmp( sigill_recover, 1 );
}

typedef RETSIGTYPE (*__sig_t)(int);

static int testsseill()
{
    int illegal;
#if defined(__CYGWIN__)
    /* SSE causes a crash on CYGWIN, apparently.
       Perhaps the wrong signal is being caught or something along
       those line ;-) or maybe SSE itself won't work...
    */
    illegal = 1;
#else
    __sig_t old_handler = signal( SIGILL, sigillhandler);
    if( sigsetjmp( sigill_recover, 1 ) == 0 )
    {
        asm ( "movups %xmm0, %xmm0" );
        illegal = 0;
    }
    else
        illegal = 1;
    signal( SIGILL, old_handler );
#endif
    return illegal;
}

static int x86_accel (void)
{
    long eax, ebx, ecx, edx;
    int32_t AMD;
    int32_t caps;

    /* Slightly weirdified cpuid that preserves the ebx and edi required
       by gcc for PIC offset table and frame pointer */

#if defined(__LP64__) || defined(_LP64)
#  define REG_b "rbx"
#  define REG_S "rsi"
#else
#  define REG_b "ebx"
#  define REG_S "esi"
#endif

#define cpuid(op,eax,ebx,ecx,edx)    \
    asm ( "push %%"REG_b"\n" \
          "cpuid\n" \
          "mov   %%"REG_b", %%"REG_S"\n" \
          "pop   %%"REG_b"\n"  \
     : "=a" (eax),          \
       "=S" (ebx),          \
       "=c" (ecx),          \
       "=d" (edx)           \
     : "a" (op)         \
     : "cc", "edi")

    asm ("pushf\n\t"
     "pop %0\n\t"
     "mov %0,%1\n\t"
     "xor $0x200000,%0\n\t"
     "push %0\n\t"
     "popf\n\t"
     "pushf\n\t"
     "pop %0"
         : "=a" (eax),
           "=c" (ecx)
     :
     : "cc");


    if (eax == ecx)        // no cpuid
    return 0;

    cpuid (0x00000000, eax, ebx, ecx, edx);
    if (!eax)            // vendor string only
    return 0;

    AMD = (ebx == 0x68747541) && (ecx == 0x444d4163) && (edx == 0x69746e65);

    cpuid (0x00000001, eax, ebx, ecx, edx);
    if (! (edx & 0x00800000))    // no MMX
    return 0;

    caps = ACCEL_X86_MMX;
    /* If SSE capable CPU has same MMX extensions as AMD
       and then some. However, to use SSE O.S. must have signalled
       it use of FXSAVE/FXRSTOR through CR4.OSFXSR and hence FXSR (bit 24)
       here
    */
    if ((edx & 0x02000000))    
        caps = ACCEL_X86_MMX | ACCEL_X86_MMXEXT;
    if( (edx & 0x03000000) == 0x03000000 )
    {
        /* Check whether O.S. has SSE support... has to be done with
           exception 'cos those Intel morons put the relevant bit
           in a reg that is only accesible in ring 0... doh! 
        */
        if( !testsseill() )
            caps |= ACCEL_X86_SSE;
    }

    cpuid (0x80000000, eax, ebx, ecx, edx);
    if (eax < 0x80000001)    // no extended capabilities
        return caps;

    cpuid (0x80000001, eax, ebx, ecx, edx);

    if (edx & 0x80000000)
    caps |= ACCEL_X86_3DNOW;

    if (AMD && (edx & 0x00400000))    // AMD MMX extensions
    {
        caps |= ACCEL_X86_MMXEXT;
    }

    return caps;
}
#endif


#ifdef HAVE_ALTIVEC
/* AltiVec optimized library for MJPEG tools MPEG-1/2 Video Encoder
 * Copyright (C) 2002  James Klicman <james@klicman.org>
 *
 * The altivec_detect() function has been moved here to workaround a bug in a
 * released version of GCC (3.3.3). When -maltivec and -mabi=altivec are
 * specified, the bug causes VRSAVE instructions at the beginning and end of
 * functions which do not use AltiVec. GCC 3.3.3 also lacks support for
 * '#pragma altivec_vrsave off' which would have been the preferred workaround.
 *
 * This AltiVec detection code relies on the operating system to provide an
 * illegal instruction signal if AltiVec is not present. It is known to work
 * on Mac OS X and Linux.
 */

static sigjmp_buf jmpbuf;

static void sig_ill(int sig)
{
    siglongjmp(jmpbuf, 1);
}

int detect_altivec()
{
    volatile int detected = 0; /* volatile (modified after sigsetjmp) */
    struct sigaction act, oact;

    act.sa_handler = sig_ill;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;

    if (sigaction(SIGILL, &act, &oact)) {
    perror("sigaction");
    return 0;
    }

    if (sigsetjmp(jmpbuf, 1))
    goto noAltiVec;

    /* try to read an AltiVec register */ 
    altivec_copy_v0();

    detected = 1;

noAltiVec:
    if (sigaction(SIGILL, &oact, (struct sigaction *)0))
    perror("sigaction");

    return detected;
}
#endif

The code above is a bit cryptic, to say the least. The inline assembly is again x86 only, although this file also has a separate AltiVec (PowerPC) path that detects the feature by catching an illegal instruction signal; other platforms don't seem to be supported. What the assembly does is, again, not exactly obvious, but I can guess: it does not seem to process any data itself. Instead, it runs cpuid and checks bits of the result to build a set of flags (ACCEL_X86_MMX, ACCEL_X86_MMXEXT, ACCEL_X86_SSE, ACCEL_X86_3DNOW), and it executes a single SSE instruction (movups) to see whether it raises an illegal instruction error. So it probably finds out which vector extensions (SIMD) the CPU offers, so that the library can pick the most efficient routines. This makes sense, since it is a library for MJPEG, dealing with big streams of data.

Since I am not exactly sure what the code does, I can't say I have a strong opinion about it. It does seem to have a good reason to be there: knowing which vector instructions are safe to use. I am not sure, however, whether it really needs to be written in assembly or could be left to the compiler. I would say that the loss in portability is definitely a problem, since other architectures are becoming more popular now: it would probably be a good idea to write more cases for the other platforms, or look for an alternative solution.
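As one possible alternative solution, GCC and Clang provide the built-ins __builtin_cpu_init() and __builtin_cpu_supports() on x86, which report CPU features without any hand-written cpuid assembly. The sketch below is only an illustration of that idea, not code from mjpegtools: the CAP_* flags and the function detect_simd_caps are invented names, and I have not checked whether the built-ins cover the operating system check that testsseill() performs.

```c
#include <assert.h>

/* Hypothetical flags for this sketch (not the ACCEL_X86_* values) */
#define CAP_MMX (1 << 0)
#define CAP_SSE (1 << 1)

#if defined(__x86_64__) || defined(__i386__)
#  define ON_X86 1
#else
#  define ON_X86 0
#endif

/* Return a bit mask of the SIMD extensions found on this CPU.
 * On targets other than x86 the mask is simply empty. */
int detect_simd_caps(void)
{
    int caps = 0;

#if ON_X86 && (defined(__GNUC__) || defined(__clang__))
    /* Initialize the compiler's CPU feature data before querying it */
    __builtin_cpu_init();

    if (__builtin_cpu_supports("mmx"))
        caps |= CAP_MMX;
    if (__builtin_cpu_supports("sse"))
        caps |= CAP_SSE;
#endif

    /* Only the two flags defined above may ever be set */
    assert((caps & ~(CAP_MMX | CAP_SSE)) == 0);
    return caps;
}
```

The assembly does not disappear with this approach, it just moves into the compiler's runtime, but the project no longer has to maintain separate 32-bit and 64-bit variants itself.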