Jump to content


SIMD shift and float-to-half conversion


14 replies to this topic

#1 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 25 September 2007 - 02:00 PM

Hi all,

SIMD operations are indispensible for high-performance computing. For the latest generation of x86 CPU's, SSE even quadruples the floating-point performance, making a Core 2 Quad at 3 GHz more powerful than a Geforce 8600 GTS. So it can really pay off to convert performance-critical scalar code into vector code.

Unfortunately, compilers are still not adequately capable of generating SIMD code themselves. A first reason for this is that not every scalar operation can be translated directly to a vector equivalent. One of these is the shift operation. You can shift every element by the same shift value, but not by independent values. A second reason why writing SIMD code is difficult is that you can't branch per element (or at least not without losing the benefits of SIMD). In this Code Gem I will address both issues to convert 32-bit floating-point values to 16-bit floating-point values.

Shifting four 32-bit values in an SSE register by independent values in another SSE register won't become available as a single instruction until SSE5 appears in late 2008. Fortunately there's a trick to do almost the same thing with good old SSE2 (available since Pentium 4's and Athlon 64's). Shifting an integer value is the same thing as multiplying (or dividing) it by a power of two:

x << y = x * 2^y

So if we can convert our shift values to this power of two we just have to multiply to perform our shift operation. For this we can (ab)use the IEEE-754 floating-point format. It consists of a sign bit (s), 8 exponent bits (e), and a mantissa (m), and represents the following number:

s * 2^(e - 127) * m

Note that it performs a power of two operation. By setting the exponent field to y + 127, and the sign and mantissa both to 1, we get the value we need in floating-point representation. All we have to do next is convert to integer and multiply. But there's another issue here. There is no SSE instruction for multiplying four 32-bit integer numbers. There is one for floating-point numbers though. So instead we convert our values to be shifted (x) to floating-point, multiply in floating-point format, and then convert back to integer. Let's sum this up in code:

static __declspec(align(16)) unsigned int bias[4] = {127, 127, 127, 127};


__asm

{

    paddd	xmm1, bias   // y in xmm1

    pslld	xmm1, 23


    cvtdq2ps	xmm0, xmm0   // x in xmm0

    mulps	xmm0, xmm1

    cvttps2dq	xmm0, xmm0   // Result in xmm0

}

(Note that this code is aimed at Visual C++ 32-bit)

What about precision and such? Converting an integer into a floating-point number can be done without losing any precision, as long as it fits into the mantissa field, which is 23 bits long. And because of an implicit 1 and the sign bit you can use signed integers up to 25 bits long. Anything that falls out of this range will loose some of the lower precision bits, which may or may not be critical for your application. Also, this version only works with positive shift values for shifting left, and when the result overflows it returns 0x80000000.

So it has several restrictions but for the situations where it's usable these five instructions are likely the fastest solution. Where I found this approach particularly useful is the parallel conversion of 32-bit floating-point numbers to 16-bit floating-point numbers (also known as the 'half' type).

This conversion has to deal with three cases: infinity, normal values, and denormal values. It's the denormal case where the SIMD shift comes in. Dealing with the three cases without branching can be achieved by masking (AND operation) and combining (OR operation). I'll save you from the details (but do not hesitate to ask questions) and immediately present the code:

static __declspec(align(16)) unsigned int abs[4]      = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};

static __declspec(align(16)) unsigned int infinity[4] = {0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF};

static __declspec(align(16)) unsigned int denormal[4] = {0x38800000, 0x38800000, 0x38800000, 0x38800000};

static __declspec(align(16)) unsigned int mantissa[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};

static __declspec(align(16)) unsigned int oneDot[4]   = {0x00800000, 0x00800000, 0x00800000, 0x00800000};

static __declspec(align(16)) unsigned int exponent[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};

static __declspec(align(16)) unsigned int fixup[4]    = {0x48000000, 0x48000000, 0x48000000, 0x48000000};

static __declspec(align(16)) unsigned int round1[4]   = {0x00000001, 0x00000001, 0x00000001, 0x00000001};

static __declspec(align(16)) unsigned int round2[4]   = {0x00000FFF, 0x00000FFF, 0x00000FFF, 0x00000FFF};

static __declspec(align(16)) unsigned int sign[4]     = {0x80000000, 0x80000000, 0x80000000, 0x80000000};

static __declspec(align(16)) unsigned int base[4]     = {0x00007FFF, 0x00007FFF, 0x00007FFF, 0x00007FFF};

static __declspec(align(16)) unsigned int adjust[4]   = {0x07000000, 0x07000000, 0x07000000, 0x07000000};


__asm

{

    movaps	xmm1, xmm0   // Input in xmm0


    // Compute masks

    andps	xmm0, abs

    movaps	xmm2, xmm0

    movaps	xmm3, xmm0

    cmpnltps	xmm2, infinity

    cmpltps	xmm3, denormal


    // Denormal case

    movaps	xmm4, xmm0

    andps	xmm4, exponent

    paddd	xmm4, adjust

    movaps	xmm6, xmm0

    andps	xmm6, mantissa

    orps	xmm6, oneDot

    cvtdq2ps	xmm6, xmm6

    mulps	xmm6, xmm4

    cvttps2dq	xmm6, xmm6


    // Normal case and combine

    paddd	xmm0, fixup

    andps	xmm6, xmm3

    andnps	xmm3, xmm0

    orps	xmm6, xmm3


    // Correct rounding

    movaps	xmm0, xmm6

    psrld	xmm0, 13

    andps	xmm0, round1

    paddd	xmm0, round2

    paddd	xmm0, xmm6


    // Combine with sign and infinity

    psrld	xmm0, 13

    andps	xmm1, sign

    psrld	xmm1, 16

    orps	xmm0, xmm2

    andps	xmm0, base

    orps	xmm0, xmm1   // Result in lower words of each element

}

A Core 2 Duo executes this code in about 14 clock cycles, or 3.5 cycles per element, which is clearly hard to beat with scalar code. Conversion from half to float can be done with a lookup table (one element at a time; SSE does not support gather operations yet).

So I hope this inspires more people to explore the world of SIMD computation. This particular example might be useful for image processing, ray tracing, compression, etc, but the techniques have a much wider applicability. So feel free to discuss SIMD and especially SSE below.

Cheers,

Nicolas

P.S: SSE5 will include float-to-half and half-to-float conversion in a single instruction...

#2 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 27 September 2007 - 06:53 PM

Very cool trick with the shifting!
I just might go through my code and see if I can't use it somewhere :)

Couldn't you do the reverse for half-to-float conversions? Now, there is no NaN or denormal checking in this code, but it works when the values are valid, albeit with some lost precission :happy:


static __declspec(align(16)) unsigned half_sign[4]	  = {0x00008000, 0x00008000, 0x00008000, 0x00008000};

static __declspec(align(16)) unsigned half_exponent[4]	  = {0x00007C00, 0x00007C00, 0x00007C00, 0x00007C00};

static __declspec(align(16)) unsigned half_mantissa[4]	  = {0x000003FF, 0x000003FF, 0x000003FF, 0x000003FF};

static __declspec(align(16)) unsigned half_bias_offset[4] = {0x0001C000, 0x0001C000, 0x0001C000, 0x0001C000};


__asm

{

	movaps	xmm1, xmm0  // Input in xmm0

	movaps	xmm2, xmm0


	andps	xmm0, half_sign

	andps	xmm1, half_exponent

	andps	xmm2, half_mantissa

	paddd	xmm1, half_bias_offset


	pslld	xmm0, 16

	pslld	xmm1, 13

	pslld	xmm2, 13


	orps	xmm1, xmm2

	orps	xmm0, xmm1  // Result in xmm0

}


"Stupid bug! You go squish now!!" - Homer Simpson

#3 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 28 September 2007 - 06:38 AM

Kenneth Gorking said:

Couldn't you do the reverse for half-to-float conversions?
Probably yes. The trickiest part is conversion of denormal half values to normalized float values. It requires a log2 operation, and this can be achieved by converting from integer to floating-point format, and extracting the exponent...

But performance-wise a lookup table for the whole 16-bit value is likely hard to beat, even if it has to happen one element at a time. If an Intel/AMD guy is reading this: we need true scatter/gather operations in SSE6! :w00t: It would be useful for a lot more than just handling half values.

#4 iollmann

    New Member

  • Members
  • Pip
  • 1 posts

Posted 03 October 2007 - 12:18 AM

You don't need a variable shift to do the denormals. Just use cvtdq2ps.

Also, if you are on MacOS X, you can make use of:

vImageConvert_Planar16FtoPlanarF
vImageConvert_PlanarFtoPlanar16F

...in Accelerate.framework to do these conversions.

Best Regards,

Ian Ollmann

#5 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 07 October 2007 - 11:05 AM

Hi Ian,

iollmann said:

You don't need a variable shift to do the denormals. Just use cvtdq2ps.
You probably meant cvt(t)ps2dq, like below?

static __declspec(align(16)) unsigned int abs[4]      = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};

static __declspec(align(16)) unsigned int infinity[4] = {0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF, 0x47FFEFFF};

static __declspec(align(16)) unsigned int denormal[4] = {0x38800000, 0x38800000, 0x38800000, 0x38800000};

static __declspec(align(16)) unsigned int fixup[4]    = {0x48000000, 0x48000000, 0x48000000, 0x48000000};

static __declspec(align(16)) unsigned int round1[4]   = {0x00000001, 0x00000001, 0x00000001, 0x00000001};

static __declspec(align(16)) unsigned int round2[4]   = {0x00000FFF, 0x00000FFF, 0x00000FFF, 0x00000FFF};

static __declspec(align(16)) unsigned int sign[4]     = {0x80000000, 0x80000000, 0x80000000, 0x80000000};

static __declspec(align(16)) unsigned int base[4]     = {0x00007FFF, 0x00007FFF, 0x00007FFF, 0x00007FFF};

static __declspec(align(16)) unsigned int integer[4]  = {0x52000000, 0x52000000, 0x52000000, 0x52000000};


__asm

{

    movaps	xmm1, xmm0   // Input in xmm0


    // Compute masks

    andps	xmm0, abs

    movaps	xmm2, xmm0

    movaps	xmm3, xmm0

    cmpnltps	xmm2, infinity

    cmpltps	xmm3, denormal


    // Denormal case

    movaps	xmm6, xmm0

    mulps	xmm6, integer

    cvttps2dq	xmm6, xmm6


    // Normal case and combine

    paddd	xmm0, fixup

    andps	xmm6, xmm3

    andnps	xmm3, xmm0

    orps	xmm6, xmm3


    // Correct rounding

    movaps	xmm0, xmm6

    psrld	xmm0, 13

    andps	xmm0, round1

    paddd	xmm0, round2

    paddd	xmm0, xmm6


    // Combine with sign and infinity

    psrld	xmm0, 13

    andps	xmm1, sign

    psrld	xmm1, 16

    orps	xmm0, xmm2

    andps	xmm0, base

    orps	xmm0, xmm1   // Result in lower words of each element

}

Quote

Also, if you are on MacOS X, you can make use of:

vImageConvert_Planar16FtoPlanarF
vImageConvert_PlanarFtoPlanar16F
Interesting! Are those fully SSE optimized? I'm not experienced enough with Xcode to have a peek at the disassembly...

#6 J22

    Member

  • Members
  • PipPip
  • 92 posts

Posted 13 October 2007 - 02:03 PM

Nick said:

For the latest generation of x86 CPU's, SSE even quadruples the floating-point performance, making a Core 2 Quad at 3 GHz more powerful than a Geforce 8600 GTS
Just wondering, but isn't the throughput of basic SIMD instructions on quad core CPUs (INTEL Core 2 Extreme QX6850) only 4x3G=12G inst/s, while 8600GTS runs at 32x1.45G=46.4G inst/s, and 8800 Ultra runs at 128x1.5G=192G inst/s?

#7 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 13 October 2007 - 04:18 PM

J22 said:

Just wondering, but isn't the throughput of basic SIMD instructions on quad core CPUs (INTEL Core 2 Extreme QX6850) only 4x3G=12G inst/s, while 8600GTS runs at 32x1.45G=46.4G inst/s, and 8800 Ultra runs at 128x1.5G=192G inst/s?
Core 2 can execute one 4-component vector multiplication and one vector addition per clock cycle. So that's 8 floating-point operations x 4 cores x 3 GHz = 96 GFLOPS.

GeForce 8600 GTS can also do a multiplication and addition in the same cycle, but each stream processor is scalar, so it totals 92.8 GFLOPS.

#8 J22

    Member

  • Members
  • PipPip
  • 92 posts

Posted 14 October 2007 - 09:01 AM

Ah, that's right of course. I was under impression that stream processor operates on 4d vectors, not scalars, thus the confusion.

#9 GandiTheGrey

    New Member

  • Members
  • Pip
  • 2 posts

Posted 20 December 2007 - 09:03 AM

Nick - thanks indeed for the cool trick!

One technical question:
the max finite half-precision number possible is 65504.0, hexed to single as 0x477fe000. the 'next' single is 0x477fe001, which is the minimal single to qualify as an infinite (in half prec). So why set the 'infinite' mask to 0x47ffefff? Or maybe i got the rationale wrong altogether?

thanks,

-G

#10 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 20 December 2007 - 11:33 AM

I think that is because 0x477fe001 through 0x477fefff gets rounded down to 65504.0 anyway, so they're perfectly acceptable values.
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#11 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 20 December 2007 - 04:06 PM

I used the behavior of D3DXFloat32To16Array/D3DXFloat16To32Array, which does not have a representation for NaN nor infinity. It can represent up to 131008.0, but clamps anything higher to the representation of this number (0x7FFF).

So they give it a slightly bigger range than it would have had if all 1's in the exponent meant NaN or infinity. I think this is done to allow storing +-65536 without causing it to immediately jump to infinity. So you could store a 16-bit integer texture in a 16-bit floating-point texture.

'Infinity' is probably not the best name for the 0x47FFEFFF value I used. Maybe 'limit' would have prevented confusion.

Anyway, this is an interesting topic. The IEEE 754r draft currently describes a 16-bit floating-point format that does have a representation for NaN and infinity. I'm not sure if this makes sense because this format is intended for storate only; arithmetic operations are only defined staring from 32-bit. You never want to create a texture or vertex buffer with a NaN or infinite value. And I can't think of an application outside of multimedia where it does make sense...

#12 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 21 December 2007 - 10:20 AM

Nick said:

I used the behavior of D3DXFloat32To16Array/D3DXFloat16To32Array, which does not have a representation for NaN nor infinity.
I've done some extra testing and found out this is only true for DirectX 9, not DirectX 10. :excl: It also appears to have different rounding rules...

#13 GandiTheGrey

    New Member

  • Members
  • Pip
  • 2 posts

Posted 23 December 2007 - 12:10 PM

Nick said:

I used the behavior of D3DXFloat32To16Array/D3DXFloat16To32Array, which does not have a representation for NaN nor infinity. It can represent up to 131008.0, but clamps anything higher to the representation of this number (0x7FFF).

So they give it a slightly bigger range than it would have had if all 1's in the exponent meant NaN or infinity. I think this is done to allow storing +-65536 without causing it to immediately jump to infinity. So you could store a 16-bit integer texture in a 16-bit floating-point texture.

'Infinity' is probably not the best name for the 0x47FFEFFF value I used. Maybe 'limit' would have prevented confusion.

Anyway, this is an interesting topic. The IEEE 754r draft currently describes a 16-bit floating-point format that does have a representation for NaN and infinity. I'm not sure if this makes sense because this format is intended for storate only; arithmetic operations are only defined staring from 32-bit. You never want to create a texture or vertex buffer with a NaN or infinite value. And I can't think of an application outside of multimedia where it does make sense...
Ahh. Thanks!
doesn't D3DXFloat32To16Array use vectorized implementation anyhow? (odd, but possible)

#14 aqnuep

    New Member

  • Members
  • Pip
  • 1 posts

Posted 08 September 2009 - 05:15 PM

Nick said:

Core 2 can execute one 4-component vector multiplication and one vector addition per clock cycle. So that's 8 floating-point operations x 4 cores x 3 GHz = 96 GFLOPS.

GeForce 8600 GTS can also do a multiplication and addition in the same cycle, but each stream processor is scalar, so it totals 92.8 GFLOPS.

GeForce 8600 -> 113-139 GFLOPS (MAD) means 226-178 GFLOPS
Radeon HD2600 -> 144-192 GFLOPS (MAD) means 288-384 GFLOPS

And these values are valid not just for vector operations but also for independent scalar ones, of which those weak CPUs aren't able at the rate you mentioned.

Don't say that CPUs are faster than even mid-range GPUs. GPUs are far more powerfull than CPUs as raw computational hardware and also they even perform more powerful because they have a wider memory bus and higher memory clock rates. Both get starved sometimes and GPU is of course a limited piece of hardware but for raw calculations it performs much, much better.

#15 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 08 September 2009 - 09:06 PM

aqnuep said:

GeForce 8600 -> 113-139 GFLOPS (MAD) means 226-178 GFLOPS
I don't think so. The GeForce 8600 GTS has 32 stream processors at 1450 MHz, each capable of a single MAD. That's 92.8 GFLOPS total.

Quote

And these values are valid not just for vector operations but also for independent scalar ones, of which those weak CPUs aren't able at the rate you mentioned.
The GeForce's stream processors are clustered together to work as an SIMD array. That's exactly the same as a CPU's SIMD vectors. So on a CPU you can just as easily use the full computational power even for scalar operations, if you process multiple elements in parallel.

Quote

Don't say that CPUs are faster than even mid-range GPUs.
I'm only summing up the facts.

Quote

...also they even perform more powerful because they have a wider memory bus and higher memory clock rates.
Again that's not quite true. The GeForce 8600 GTS has 32 GB/s of bandwidth. A Core 2 Quad has up to 25.6 GB/s while the Core i7 can have 64 GB/s. But that's not all. A GPU wastes bandwidth on storing intermediate results in RAM memory. The CPU can store vast amounts of data in its caches, which offer a whopping 384 GB/s of bandwidth. For a lot of applications it's the average bandwidth that really matters.

Furthermore, a GPU completely chokes once you've used up all physical registers. On a GeForce 8600, each stream processor has only 16 physical registers. You run out of those really quickly if you try anything a little complex. It 'solves' this by running less threads, effectively decimating performance. Unlike the GPU, a CPU has a large stack and the cache hierarchy ensures it functions efficiently even for very long pieces of code.

This is why we often see the GPU fail against modern CPUs at GPGPU tasks. They are designed for doing graphics and other highly coherent tasks, but performance crumbles as soon as you need a bit of flexibility or complexity.

The simple conclusion is that for games you really shouldn't be using the GPU for anything other than graphics. Doing anything else is more often than not highly inefficient, eating away performance needed for rendering. It also has terrible round-trip latencies. So for a large number of other tasks, even computationally intensive ones, you're still best off doing them on the CPU. With quad-core starting to get real traction in the mainstream market you'd be a fool not to tap into that power.





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users