SIMD operations are indispensible for high-performance computing. For the latest generation of x86 CPU's, SSE even quadruples the floating-point performance, making a Core 2 Quad at 3 GHz more powerful than a Geforce 8600 GTS. So it can really pay off to convert performance-critical scalar code into vector code.
Unfortunately, compilers are still not adequately capable of generating SIMD code themselves. A first reason for this is that not every scalar operation can be translated directly to a vector equivalent. One of these is the shift operation. You can shift every element by the same shift value, but not by independent values. A second reason why writing SIMD code is difficult is that you can't branch per element (or at least not without losing the benefits of SIMD). In this Code Gem I will address both issues to convert 32-bit floating-point values to 16-bit floating-point values.
Shifting four 32-bit values in an SSE register by independent values in another SSE register won't become available as a single instruction until SSE5 appears in late 2008. Fortunately there's a trick to do almost the same thing with good old SSE2 (available since Pentium 4's and Athlon 64's). Shifting an integer value is the same thing as multiplying (or dividing) it by a power of two:
x << y = x * 2^y
So if we can convert our shift values to this power of two we just have to multiply to perform our shift operation. For this we can (ab)use the IEEE-754 floating-point format. It consists of a sign bit (s), 8 exponent bits (e), and a mantissa (m), and represents the following number:
s * 2^(e - 127) * m
Note that it performs a power of two operation. By setting the exponent field to y + 127, and the sign and mantissa both to 1, we get the value we need in floating-point representation. All we have to do next is convert to integer and multiply. But there's another issue here. There is no SSE instruction for multiplying four 32-bit integer numbers. There is one for floating-point numbers though. So instead we convert our values to be shifted (x) to floating-point, multiply in floating-point format, and then convert back to integer. Let's sum this up in code:
static __declspec(align(16)) unsigned int bias[4] = {127, 127, 127, 127};
__asm
{
paddd xmm1, bias // y in xmm1
pslld xmm1, 23
cvtdq2ps xmm0, xmm0 // x in xmm0
mulps xmm0, xmm1
cvttps2dq xmm0, xmm0 // Result in xmm0
}
(Note that this code is aimed at Visual C++ 32-bit)What about precision and such? Converting an integer into a floating-point number can be done without losing any precision, as long as it fits into the mantissa field, which is 23 bits long. And because of an implicit 1 and the sign bit you can use signed integers up to 25 bits long. Anything that falls out of this range will loose some of the lower precision bits, which may or may not be critical for your application. Also, this version only works with positive shift values for shifting left, and when the result overflows it returns 0x80000000.
So it has several restrictions but for the situations where it's usable these five instructions are likely the fastest solution. Where I found this approach particularly useful is the parallel conversion of 32-bit floating-point numbers to 16-bit floating-point numbers (also known as the 'half' type).
This conversion has to deal with three cases: infinity, normal values, and denormal values. It's the denormal case where the SIMD shift comes in. Dealing with the three cases without branching can be achieved by masking (AND operation) and combining (OR operation). I'll save you from the details (but do not hesitate to ask questions) and immediately present the code:
static __declspec(align(16)) unsigned int abs[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};
static __declspec(align(16)) unsigned int infinity[4] = {0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF};
static __declspec(align(16)) unsigned int denormal[4] = {0x38800000, 0x38800000, 0x38800000, 0x38800000};
static __declspec(align(16)) unsigned int mantissa[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};
static __declspec(align(16)) unsigned int oneDot[4] = {0x00800000, 0x00800000, 0x00800000, 0x00800000};
static __declspec(align(16)) unsigned int exponent[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};
static __declspec(align(16)) unsigned int fixup[4] = {0x48000000, 0x48000000, 0x48000000, 0x48000000};
static __declspec(align(16)) unsigned int round1[4] = {0x00000001, 0x00000001, 0x00000001, 0x00000001};
static __declspec(align(16)) unsigned int round2[4] = {0x00000FFF, 0x00000FFF, 0x00000FFF, 0x00000FFF};
static __declspec(align(16)) unsigned int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000};
static __declspec(align(16)) unsigned int base[4] = {0x00007FFF, 0x00007FFF, 0x00007FFF, 0x00007FFF};
static __declspec(align(16)) unsigned int adjust[4] = {0x07000000, 0x07000000, 0x07000000, 0x07000000};
__asm
{
movaps xmm1, xmm0 // Input in xmm0
// Compute masks
andps xmm0, abs
movaps xmm2, xmm0
movaps xmm3, xmm0
cmpnltps xmm2, infinity
cmpltps xmm3, denormal
// Denormal case
movaps xmm4, xmm0
andps xmm4, exponent
paddd xmm4, adjust
movaps xmm6, xmm0
andps xmm6, mantissa
orps xmm6, oneDot
cvtdq2ps xmm6, xmm6
mulps xmm6, xmm4
cvttps2dq xmm6, xmm6
// Normal case and combine
paddd xmm0, fixup
andps xmm6, xmm3
andnps xmm3, xmm0
orps xmm6, xmm3
// Correct rounding
movaps xmm0, xmm6
psrld xmm0, 13
andps xmm0, round1
paddd xmm0, round2
paddd xmm0, xmm6
// Combine with sign and infinity
psrld xmm0, 13
andps xmm1, sign
psrld xmm1, 16
orps xmm0, xmm2
andps xmm0, base
orps xmm0, xmm1 // Result in lower words of each element
}
A Core 2 Duo executes this code in about 14 clock cycles, or 3.5 cycles per element, which is clearly hard to beat with scalar code. Conversion from half to float can be done with a lookup table (one element at a time; SSE does not support gather operations yet).So I hope this inspires more people to explore the world of SIMD computation. This particular example might be useful for image processing, ray tracing, compression, etc, but the techniques have a much wider applicability. So feel free to discuss SIMD and especially SSE below.
Cheers,
Nicolas
P.S: SSE5 will include float-to-half and half-to-float conversion in a single instruction...












