102 Oct 20, 2004 at 13:35

Hi all,

I’m looking for a way to perform the operation of the non-existent packusdw MMX instruction. I need it to convert four floating-point numbers in an SSE register to 0.16 fixed-point format, without saturation. Currently I use this:

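    mulps     xmm0, _65536       // scale to 0.16 fixed point
    cvtps2pi  mm0, xmm0          // lower two floats -> two 32-bit integers
    movhlps   xmm0, xmm0         // move the upper two floats down
    cvtps2pi  mm1, xmm0          // upper two floats -> two 32-bit integers
    // packusdw  mm0, mm1
    pshufw    mm0, mm0, 0x08     // low words of both dwords -> low dword
    pshufw    mm1, mm1, 0x08
    punpckldq mm0, mm1           // combine the two low dwords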


But that’s significantly slower than it could be if packusdw existed. It’s really important to me because it’s the bottleneck of my application. If you have any ideas for doing this conversion/packing faster, please let me know!
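For readers who don't read MMX, here is a scalar C sketch of what the sequence above is meant to compute per lane. It is only a model, not the poster's code: the function names are made up for illustration, and it assumes the scaled value still fits in a signed 32-bit integer. A C cast truncates, whereas cvtps2pi rounds according to MXCSR, so the lowest bit can differ from the real instruction.

```c
#include <stdint.h>

/* One lane: scale to 0.16 fixed point and keep only the low 16 bits, so
   values outside [0, 1) wrap around instead of saturating.
   Assumption: |u * 65536| stays below 2^31, otherwise the cast is
   undefined behaviour in C. */
static uint16_t to_fixed_0_16(float u)
{
    int32_t scaled = (int32_t)(u * 65536.0f);      /* truncating conversion */
    return (uint16_t)((uint32_t)scaled & 0xFFFFu); /* drop the integer part */
}

/* Four lanes packed into 64 bits, lane 0 in the lowest 16-bit word. */
static uint64_t pack_wrap4(const float v[4])
{
    uint64_t packed = 0;
    for (int i = 0; i < 4; ++i)
        packed |= (uint64_t)to_fixed_0_16(v[i]) << (16 * i);
    return packed;
}
```

With this model, 1.25 and 0.25 both map to 0x4000, and -0.25 maps to 0xC000 (the same as 0.75), which is the wrapping behaviour asked for.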

Thanks.

#### 18 Replies

101 Oct 20, 2004 at 23:11

could something like this be done? (pseudocode)
mulps xmm0, {65536^2,65536,65536^2,65536}
cvtps2pi mm0, xmm0 // get {x,0,z,0} to {x,z}
movhlps xmm0, xmm0 // get {0,y,0,w} to {y,0,w,0}
cvtps2pi mm1, xmm0 // get {y,0,w,0} to {y,w}
por mm0, mm1 // bitwise combine..

the idea:
the first mulps “bitshifts” the values..

from x,y,z,w to x<<32,y<<16,z<<32,w<<16

then you manually extract the <<32 and the <<16 pairs, and add them together, and you get the 4 words together..

or so..

well, thats my idea..

102 Oct 21, 2004 at 08:13

Unfortunately that doesn’t work. :unsure: When the scaled floating-point value is too large to be represented as a 32-bit integer, cvtps2pi always returns 0x80000000. But I need it to wrap. Therefore I think there is no other way than to multiply by 65536, convert to integer, and keep the lower 16 bits. So one alternative approach would be:

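    mulps     xmm0, _65536       // scale to 0.16 fixed point
    cvtps2pi  mm0, xmm0          // lower two floats -> two 32-bit integers
    movhlps   xmm0, xmm0         // move the upper two floats down
    cvtps2pi  mm1, xmm0          // upper two floats -> two 32-bit integers
    // packusdw  mm0, mm1
    pslld     mm1, 16            // move low words of mm1 into the high words
    pand      mm0, _0000FFFF0000FFFF  // keep only the low words of mm0
    por       mm0, mm1           // merge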


But that’s the same number of instructions. :mellow: I’m not giving up yet, but it’s getting a tiny bit frustrating to be stuck at this current performance level just because Intel didn’t include a packusdw instruction. It would also have worked if pshufw was like shufps.
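The two packing tails (pshufw/punpckldq versus pslld/pand/por) can be compared with a scalar C model. This is a sketch with hypothetical helper names that treats an MMX register as a `uint64_t` with the first dword in the low half. If the model is right, both keep the same low 16 bits of every lane, but the shift-and-mask version leaves the words in the order x, z, y, w rather than x, y, z, w, which matters if later code depends on the order.

```c
#include <stdint.h>

/* Build a 64-bit "MMX register" from two dwords, first dword in the low half. */
static uint64_t make_mm(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}

/* pshufw mm, mm, 0x08: result words 0 and 1 are source words 0 and 2,
   i.e. the low words of both dwords end up in the low dword. */
static uint32_t low_words(uint64_t mm)
{
    return (uint32_t)(mm & 0xFFFFu) | ((uint32_t)((mm >> 32) & 0xFFFFu) << 16);
}

/* pshufw + pshufw + punpckldq: word order x, y, z, w. */
static uint64_t pack_shuffle(uint64_t mm0, uint64_t mm1)
{
    return ((uint64_t)low_words(mm1) << 32) | low_words(mm0);
}

/* pslld + pand + por: word order x, z, y, w. */
static uint64_t pack_mask(uint64_t mm0, uint64_t mm1)
{
    /* pslld shifts each dword separately, so bits must not cross dwords. */
    uint64_t shifted = (mm1 << 16) & 0xFFFF0000FFFF0000ULL;
    uint64_t masked  = mm0 & 0x0000FFFF0000FFFFULL;
    return masked | shifted;
}
```

For example, with mm0 = {0x11112222, 0x33334444} and mm1 = {0x55556666, 0x77778888}, the shuffle version gives 0x8888666644442222 and the mask version gives 0x8888444466662222.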

Anyway, maybe if I rewrite my code to work on the unpacked data, it will end up faster. Little chance of that, but I have to try it…

101 Oct 21, 2004 at 09:47

hm okay.. well i just had a short look at it yesterday night (well i guess it was even today..:D).

so.. what is it? for texture sampling? or anything else?

you need the xmm0 to wrap in 0..1 range, and then, what’s the end result format you need? xmm0 with 4 16bit integers? or back again xmm0 with 4 floats?

i guess 4 16bit integers, so that you can then do memory accesses into the texture (if its for texture sampling at all..:D)

well, i’ll take more time to think it over.. :D but yeah, the intel team has made some funny choices with their instruction set sometimes.. but hey, we’re still stuck in an x86 world, so the instruction set sucks anyways, no matter what :D

102 Oct 21, 2004 at 10:32

Indeed it’s for texture sampling. :D

The input is two SSE registers, one with four u coordinates, one with four v coordinates, for a 2x2 block. I convert them to integer and pack them together in two MMX registers. So every 16-bit component is a coordinate between 0.0 and 1.0 in fixed-point 0.16 format. From there it’s quite efficient to get sixteen offsets in total for all the texels to be read.

It all works perfectly with my new rasterizer, but it only equals my previous bilinear sampler in performance. This doesn’t seem like a bad starting point (many other optimizations are possible with 2x2 blocks and 8x8 blocks), but to reach my goals it has to be faster. The problem is that 2x2 blocks on edges and small polygons are not completely filled. To make it effective I have to be able to take four samples for at most the price of three. :cool:

I haven’t completely located all problems yet, but the average instruction throughput is one every two clock cycles, while it should be the other way around on my Pentium M. :unsure: One of the bottlenecks is packing the coordinates together, because all other instructions depend on that step…

101 Oct 21, 2004 at 19:21

The same number of instructions doesn’t always mean the same speed.

Aside from that, I can’t help you any more, because I’m not an assembly programmer and the posts in this topic are confusing me utterly ;).

102 Oct 21, 2004 at 21:21

Woot!

I just got chills down my spine. :D Because of a small but important detail I overlooked, some calculations were repeated. After a quick fix, I suddenly got nearly double the framerate! In every situation, performance is higher than the previous version. And this is just the ‘prototype’ code to test the feasibility. I’m sure that I can actually fine-tune it further, and there are many new possibilities with this approach that add even more speedup. :happy:

101 Oct 22, 2004 at 00:00

ditto with fyhuang.

I can’t understand the assembly, but I can understand 2x fps.

Congratulations!

101 Oct 22, 2004 at 10:30

i understand both. i’d just be interested to know what math you did twice you could remove now :D

but we can still think about finding a more optimal way to tex-sample, right? :D

102 Oct 22, 2004 at 11:17

I can’t understand the assembly…

MMX code can look very complicated, but it’s actually not that bad. They all use the same naming convention. For example, packuswb is Pack Unsigned Saturated Words into Bytes. With a little drawing it becomes immediately clear:

The site I got that picture from actually has a very nice introduction to both MMX and SSE: Tommesani.com. Getting started can be a little difficult, but once you know the common instructions and your applications run twice as fast, there’s no way back. :D
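In case a drawing isn't at hand, the packuswb naming example can also be read as a few lines of C. This is only a scalar model with made-up function names: each signed 16-bit word is clamped to the unsigned byte range 0..255, with the destination register's words filling the low four bytes and the source register's words the high four.

```c
#include <stdint.h>

/* Unsigned saturation: clamp a signed word to the range of an unsigned byte. */
static uint8_t saturate_u8(int16_t w)
{
    if (w < 0)   return 0;
    if (w > 255) return 255;
    return (uint8_t)w;
}

/* Model of "packuswb dst, src": four words from each register become
   eight bytes, dst first. */
static void packuswb_model(const int16_t dst[4], const int16_t src[4],
                           uint8_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[i]     = saturate_u8(dst[i]);
        out[i + 4] = saturate_u8(src[i]);
    }
}
```

So the words {-5, 0, 200, 300} pack to the bytes {0, 0, 200, 255}: negative values clamp to 0 and anything above 255 clamps to 255.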

102 Oct 22, 2004 at 11:29

@davepermen

i understand both. i’d just be interested to know what math you did twice you could remove now :D

Well, it wasn’t just twice. :blush: With my new rasterizer, I was (stupidly) sampling 8x8 pixels at once. This worked fine for something like a skybox of course. But for a car model with many polygons the performance was half that of the previous rasterizer that sampled one pixel at a time. I first thought it was because 2x2 blocks aren’t always completely covered, which is inevitable, so I optimized it to the maximum but still didn’t reach my goal. Then I noticed my mistake, fixed it, and the framerate skyrocketed. :D

but we can still think about finding a more optimal way to tex-sample, right? :D

There’s always room for improvement. :cool: But for now I’m going to leave it like this. I could start using ‘legit hacks’, but at the cost of flexibility and maintenance. Premature optimization is still the root of all evil…

102 Oct 22, 2004 at 15:58

Then I noticed my mistake, fixed it, and the framerate skyrocketed.

I’m not sure if I read correctly, but what was your mistake exactly and how did you correct it? The only thing I know is that you repeated some calculations, but you didn’t give details :)

102 Oct 23, 2004 at 12:31

@john

I’m not sure if I read correctly, but what was your mistake exactly and how did you correct it? The only thing I know is that you repeated some calculations, but you didn’t give details :)

Sorry. It’s in the context of my new rasterizer. My stupidity was to sample the texture for a whole 8x8 block even when only a small fraction of it is visible. But the algorithm makes it easy to determine coverage for 2x2 pixels. I simply had to add such a check -before- starting to sample the texture.

So the repeated calculations were texture samples for pixels that are never even written to the screen. :blush: There’s still a bit of loss, because with my parallel approach I always sample 2x2 pixels, and for tiny polygons (and edges) these are not always all visible. But because of the reduced setup work these polygons still render faster. :D
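The coverage check described here could look roughly like the C sketch below. Everything in it is an assumption made for illustration, not the actual rasterizer: the 8x8 block's coverage is taken to be a 64-bit mask with bit `y * 8 + x` set for each covered pixel, and the function only counts how many 2x2 quads would need sampling.

```c
#include <stdint.h>

/* Hypothetical layout: bit (y * 8 + x) of `coverage` is set when pixel
   (x, y) of the 8x8 block is covered by the polygon. */
static int quad_covered(uint64_t coverage, int qx, int qy)
{
    int x = qx * 2;
    int y = qy * 2;
    /* Two adjacent bits in each of two consecutive rows. */
    uint64_t quad_bits = (3ULL << (y * 8 + x)) | (3ULL << ((y + 1) * 8 + x));
    return (coverage & quad_bits) != 0;
}

/* Count the 2x2 quads that need texture sampling; empty quads are skipped
   before any sampling work is done. */
static int count_quads_to_sample(uint64_t coverage)
{
    int count = 0;
    for (int qy = 0; qy < 4; ++qy)
        for (int qx = 0; qx < 4; ++qx)
            if (quad_covered(coverage, qx, qy))
                ++count;
    return count;
}
```

A fully covered block still samples all 16 quads, but a block where only one corner pixel is visible samples a single quad instead of 16, which is where the saving on small polygons and edges would come from.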

101 Oct 23, 2004 at 14:38

okay, got it! :D

well, such mistakes can always happen :D

101 Oct 29, 2004 at 08:52

Use this. It is tested, and it works!

    movaps   xmm1, xmm0
    movhlps  xmm1, xmm0
    cvtps2pi mm0, xmm1
    cvtps2pi mm1, xmm0
    packssdw mm1, mm0
    movq     mm0, mm1

102 Oct 29, 2004 at 09:38

@Kenneth Gorking

Use this. It is tested, and it works!

    movaps   xmm1, xmm0
    movhlps  xmm1, xmm0
    cvtps2pi mm0, xmm1
    cvtps2pi mm1, xmm0
    packssdw mm1, mm0
    movq     mm0, mm1


Hi Kenneth. That’s not entirely what I needed. The packssdw instruction saturates the converted 32-bit integers to the range of a 16-bit signed integer, while I don’t want any saturation at all. I think my current method is the shortest possible.
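The difference between the two behaviours is easy to see in a scalar C model. This is a sketch with illustrative function names, operating on one already-converted 32-bit lane: the first function models what packssdw does per lane, the second models the wrap-around pack wanted in this thread.

```c
#include <stdint.h>

/* Per-lane behaviour of packssdw: clamp to the signed 16-bit range. */
static int16_t pack_saturate(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Per-lane behaviour wanted here: keep the low 16 bits, no saturation. */
static uint16_t pack_wrap(int32_t x)
{
    return (uint16_t)((uint32_t)x & 0xFFFFu);
}
```

A coordinate of 0.75 scales to 49152 (0xC000). Saturation turns that into 32767 (0x7FFF), so every coordinate of 0.5 and above would collapse to nearly the same value, while wrapping keeps 0xC000.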

Anyway, now that I’ve fixed repeating the same calculations, I’ve reached the performance I was aiming for. And after some profiling I found out that reordering my code will most probably have a much greater effect than trying to eliminate instructions.

Thanks for trying! :blush:

101 Oct 29, 2004 at 16:33

when, nick, do you think, can we see your improvements? :)

102 Oct 29, 2004 at 17:03

@davepermen

when, nick, do you think, can we see your improvements? :)

I’m planning to create a demo as soon as possible. Last week was quite busy and I first want to release the new SoftWire article. But if everything goes well it can be finished in the next few weeks. It will be exactly the same per-pixel shaded car demo. And the framerate will be equal. :rolleyes: The big difference will be that when you get closer to the car, performance doesn’t suddenly go into the single digits. The biggest advantage of the parallel approach is filling many pixels. Maybe I’ll also try rendering a Quake 3 scene again…

I’ve reached 75 FPS for a skybox with accurate bilinear filtering and per-pixel perspective correction. Such numbers were simply not possible for software rendering before. :cool: Although this still seems slow compared to hardware, it really starts to become useful. It scales very well when sampling more than one texture per pixel (setup is done only once). And most of all, shading operations are really fast compared to sampling. Furthermore I haven’t even started with low-level hacks yet. I can do a faster kind of filtering, a slightly less accurate kind of perspective correction, etc. without noticeable quality loss.

It’s really fun again now that my effort is paying off! :happy:

101 Oct 29, 2004 at 23:03

i’m very happy with you, and proud of your work (can i say that.. uhm.. i guess thats incorrect english :D anyways).

it sounds really great, all you’ve written. especially the zoom in without real performance drop, that sounds _REALLY_ cool!

and yeah, you have the same problem as i have with raytracing. now you got something that scales well. but the startup barrier is still quite high. but, as you say, 75fps for a fullscreen 1 textured skybox, that’s the barrier you had to beat. realtime fullscreen rendering. now, with the good scaling, it can go on.

i’m currently at a much different level, trying to get realtime at the most simple scenes. scalability is not a big problem in raytracing anyways. but the initial pay for each frame is very high.

i can really feel with you now having fun again. now that you got enough power to work with, now you can start playing :D lowlevel, highlevel, etc.. thats just too cool! *congrats*

as a sidenote. today was my first time i was live on stage as ‘dj davepermen’ in the club escape (www.club-escape.ch). wohoow, that was a great feeling, too.. and now, i’m going to the netherlands, to dj tiësto (www.id-t.com/tiestoinconcert/2004/)…

so i’ll not be around that weekend :( anyways, have a nice time, and good luck with softwire and sw-shader!!! can’t wait for the demos! i’m back at monday night.

dave’s off.