SIMD/Uses/Conversion

In multimedia coding, converting between data types is common. For instance, in JPEG and MPEG the reference DCT operates on double-precision intermediates, but the resulting coefficients are stored as integer values. Converting floating-point values to integer types looks innocent enough when written in C:

int i;
double d = 0.99;
i = (int) d;

The resulting value for i is 0, because a cast in C discards everything after the decimal point (it truncates toward zero) -- no rounding will occur. The same is true for JavaScript, where the bitwise operators such as | trigger a truncating integer conversion.
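Truncation applies to negative values as well, where it moves the result toward zero rather than down. A minimal sketch of this behaviour follows; the helper name truncate_to_int is hypothetical and only wraps the cast shown above.

```c
/* Hypothetical helper: a plain C cast, which truncates toward zero. */
int truncate_to_int(double d)
{
    return (int) d;   /* the fractional part is discarded */
}
```

With this helper, 0.99 and -0.99 both become 0, while 1.99 becomes 1 and -1.99 becomes -1.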

To get proper rounding, the following approach is often employed:

if(d > 0) {
   i = (int) (d + 0.5);
} else {
   i = (int) (d - 0.5);
}

often written in a more compact fashion using a ternary operator:

i = (int) (d > 0 ? (d + 0.5) : (d - 0.5));

This works, but is not particularly fast: a conditional is evaluated for every value, making the code very "branchy", and pipelined processors pay a penalty whenever a branch is mispredicted. Whichever branch is taken, an additional arithmetic operation is needed before the type conversion to apply the appropriate offset.

SIMD instruction sets like SSE2 have specialized data conversion operations -- including, e.g., conversion of floating-point values to integer values with rounding. The rounding follows the processor's current rounding mode, which by default is round-to-nearest with ties going to the even integer, so exact halves (such as 2.5) can give a different result than the add-0.5-and-truncate idiom. In the case of SSE2, both scalar and vector values can be converted (i.e., a single value at a time or a vector of values). In the latter case, the conversion not only avoids costly branches and additional arithmetic operations, but also yields several converted values with a single operation.
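The scalar and vector forms can be sketched with the SSE2 intrinsics from <emmintrin.h>. The function names round_scalar_sse2 and round_pair_sse2 are hypothetical, and the sketch assumes an x86 target with SSE2 enabled (the default on x86-64; 32-bit builds typically need a compiler flag such as -msse2) and the default rounding mode.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper, scalar form: convert one double to int
   using the current rounding mode (no branch, no +/- 0.5). */
int round_scalar_sse2(double d)
{
    return _mm_cvtsd_si32(_mm_set_sd(d));
}

/* Hypothetical helper, vector form: convert in[0] and in[1]
   with a single conversion instruction. */
void round_pair_sse2(const double *in, int *out)
{
    __m128d v = _mm_loadu_pd(in);          /* load two doubles (unaligned) */
    __m128i r = _mm_cvtpd_epi32(v);        /* two int32 in the low 64 bits */
    _mm_storel_epi64((__m128i *) out, r);  /* store the low 64 bits only */
}
```

Under the default rounding mode, round_scalar_sse2(0.99) gives 1 and round_scalar_sse2(-0.99) gives -1, whereas round_scalar_sse2(2.5) gives 2 because ties go to the even integer.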

Example

From a public domain MPEG encoder (conversion of the DCT-output to integer values):

  for(sptr = sourcematrix,mptr=newmatrix;mptr<newmatrix+BLOCKSIZE;sptr++) {
#ifdef __SSE2__
      *(mptr++) = _mm_cvtsi128_si32(_mm_cvtpd_epi32(_mm_load1_pd(sptr)));
#else
      *(mptr++) = (int) (*sptr > 0 ? (*(sptr)+0.5):(*(sptr)-0.5));
#endif
  }

(mptr is defined as int *mptr, sptr as double *sptr; the SSE2 intrinsics are declared in the header emmintrin.h).

If the code is compiled with SSE2 enabled (so that the compiler defines __SSE2__), SSE2 intrinsics are used to load a double value into both halves of an SSE2 register, convert the register contents to 32-bit integers, and extract the lowest 32-bit integer, which is subsequently written to memory. Although this code only converts one value at a time, the speed increase compared to the traditional conversion approach turns out to be significant.
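Since the packed conversion handles two doubles per instruction, the same loop could in principle convert two values per iteration. The following is only a sketch under assumptions, not the encoder's actual code: the function name convert_block_sse2 and the explicit length parameter n (standing in for BLOCKSIZE) are hypothetical, and unaligned loads are used because the alignment of the source matrix is not known here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Hypothetical helper: round n doubles from src into n ints at dst,
   using the current rounding mode. */
void convert_block_sse2(const double *src, int *dst, size_t n)
{
    size_t i = 0;

    /* Main loop: two doubles in, two 32-bit integers out per iteration. */
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);           /* load src[i], src[i+1] */
        __m128i r = _mm_cvtpd_epi32(v);              /* convert both values */
        _mm_storel_epi64((__m128i *) (dst + i), r);  /* store two ints */
    }

    /* Tail: one leftover element when n is odd. */
    if (i < n) {
        dst[i] = _mm_cvtsd_si32(_mm_load_sd(src + i));
    }
}
```

Whether this is measurably faster than the one-value-at-a-time version depends on the compiler and processor, so it should be benchmarked rather than assumed.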