Could something like this be done? (pseudocode)

mulps xmm0, {65536^2, 65536, 65536^2, 65536}

cvtps2pi mm0, xmm0 // get {x,0,z,0} to {x,z}

movhlps xmm0, xmm0 // get {0,y,0,w} to {y,0,w,0}

cvtps2pi mm1, xmm0 // get {y,0,w,0} to {y,w}

por mm0, mm1 // bitwise combine (OR, since the two bit fields don't overlap)..

the idea:

the first mulps “bitshifts” the values..

from x,y,z,w to x<<32,y<<16,z<<32,w<<16

then you manually extract the <<32 and the <<16 pairs, and add them together, and you get the 4 words together..

or so..

well, that's my idea..
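
To make the bit layout of this idea concrete, here is a scalar C model of the scale-then-OR step. It is only a sketch under stated assumptions: the function name `pack_by_scaling` is made up for illustration, inputs are assumed to lie in [0, 1), and unsigned truncating conversion is assumed. Two things seem worth noting when carrying it over to real instructions. First, a value scaled by 65536^2 still carries fraction bits below its 16-bit field, so those have to be masked off before the OR. Second, cvtps2pi converts only the two low lanes and produces signed 32-bit integers, so a lane scaled by 65536^2 would go out of range as soon as the input reaches 0.5; the model sidesteps that by using an unsigned conversion.

```c
#include <stdint.h>

/* Hypothetical helper: scalar model of the scale-then-OR idea.
 * v[0] and v[2] are scaled by 65536^2 (they land in the high word of a dword),
 * v[1] and v[3] are scaled by 65536 (they land in the low word).
 * Assumes every input is in [0, 1). */
static uint64_t pack_by_scaling(const float v[4])
{
    uint64_t out = 0;
    for (int pair = 0; pair < 2; ++pair) {
        /* Multiplying by a power of two acts like a shift on the fixed-point value.
         * The mask drops the fraction bits that fall below the 16-bit field. */
        uint32_t hi = (uint32_t)(v[2 * pair] * 4294967296.0f) & 0xFFFF0000u;
        uint32_t lo = (uint32_t)(v[2 * pair + 1] * 65536.0f);
        /* The two fields do not overlap, so OR combines them. */
        out |= (uint64_t)(hi | lo) << (32 * pair);
    }
    return out;
}
```

With the inputs {0.5, 0.25, 0.75, 0.0} this model yields 0xC000000080004000: each dword holds the even lane in its high word and the odd lane in its low word.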

Hi all,

I’m looking for a way to perform the operation of the nonexistent packusdw MMX instruction. I need it to convert four floating-point numbers in an SSE register to 0.16 fixed-point format, without saturation. Currently I use this:

But that’s significantly slower than it could be if packusdw existed. It’s really important to me because it’s the bottleneck of my application. If you have any ideas for doing this conversion/packing faster, please let me know!

Thanks.
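
One common way to get an unsigned, non-saturating dword-to-word pack out of the signed packssdw is to sign-extend the low 16 bits of each dword first, so the signed pack has nothing left to clamp. The sketch below shows that with C intrinsics. It is not the code referred to in the question, and it rests on assumptions: SSE2 is available (otherwise a scalar fallback is compiled), inputs are in [0, 1), truncation rather than rounding is acceptable, and the name `floats_to_fixed016` is hypothetical.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: convert four floats in [0, 1) to four 0.16
 * fixed-point words, keeping the low 16 bits without saturation. */
static void floats_to_fixed016(const float in[4], uint16_t out[4])
{
#if defined(__SSE2__)
    /* Scale to 0..65535 and truncate to signed 32-bit integers. */
    __m128 scaled = _mm_mul_ps(_mm_loadu_ps(in), _mm_set1_ps(65536.0f));
    __m128i d = _mm_cvttps_epi32(scaled);
    /* Sign-extend the low 16 bits of each dword so every value lies in
     * -32768..32767; the signed pack then keeps the bits unchanged. */
    d = _mm_srai_epi32(_mm_slli_epi32(d, 16), 16);
    __m128i w = _mm_packs_epi32(d, d);       /* packssdw */
    _mm_storel_epi64((__m128i *)out, w);     /* store the low four words */
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 4; ++i)
        out[i] = (uint16_t)(uint32_t)(in[i] * 65536.0f);
#endif
}
```

The shift pair costs two extra instructions per register. An alternative with the same effect would be to subtract 32768 before the pack and flip the top bit of each word afterwards; which one is faster likely depends on the target CPU.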