

I've just wet my pants...


4 replies to this topic

#1 Nils Pipenbrinck

    Senior Member

  • Members
  • 597 posts

Posted 20 November 2007 - 08:43 PM

I browsed the instruction reference of the upcoming SSE4 instruction set, and I have to say that this stuff will rock!


  • Finally we have a parallel 32*32 bit integer multiplication. All C compilers that have stubs for auto-vectorization (gcc and icc) will be able to *do* something with ordinary C code. This will make a _huge_ difference for everyone with a single recompile once the stuff is integrated into the compilers.


  • We get instructions that improve string search and string compare operations by almost an order of magnitude. Not that I do much of them in my code, but I do know how bad the x86 architecture is at this discipline. From experience I know that coders are lazy and do string stuff a lot. Now we'll be able to do 128 bits worth of compare in a single instruction! Hooray! Once these instructions find their way into the libraries we'll all see a huge performance improvement!


  • Population count (number of one-bits) of a register. Sounds useless unless you need it! Our antialiased polygon routine calls this a lot and it offers lots of optimization opportunities for those who deal with numbers on a bit level. Thanks a lot Intel! Doing popcount without a dedicated instruction was too expensive to ever explore the hackish ways to abuse it. I'm sure we'll see a lot of cool bit-twiddling stuff in the near future.


  • A dedicated CRC32 instruction. Well - not bad for a start, but if it were up to me I'd have made a dedicated Galois-field multiplier instruction with a polynomial of choice. Nevertheless - even if you don't need to do CRC32, you can abuse it in very creative ways...


  • And a lot of other things that make life much easier... Dot-Products with variable width (it was about time). Insert and Extract instructions and much more..


  • Almost forgot to mention: The most lame instruction ever - the integer divide - is now twice as fast (they changed from Radix8 to Radix16). I love that. IDIV is in my performance-chart in the top fifth.. Takes way too much time on the x86.
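
For anyone who wants to play with these before the compilers catch up, here is a rough portable C sketch of three items from the list above: a plain 32-bit multiply loop of the kind an auto-vectorizer could turn into a packed multiply, a software popcount, and a bitwise CRC. The function names (`mul32_arrays`, `popcount32_sw`, `crc32c_sw`) are made up for illustration. Note that the SSE4.2 CRC32 instruction is hard-wired to the CRC-32C (Castagnoli) polynomial; the init/final inversion below is the usual CRC-32C convention in software, an assumption on my side, not something the instruction does for you.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 32-bit multiply (low 32 bits of each product).
   A plain loop like this is what an auto-vectorizer could map to a
   packed 32x32 multiply; 'restrict' promises the arrays don't overlap. */
void mul32_arrays(uint32_t *restrict dst, const uint32_t *restrict a,
                  const uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];
}

/* Software population count: sum bits in parallel within the word. */
uint32_t popcount32_sw(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* add 4 bytes */
}

/* Bitwise CRC-32C, reflected polynomial 0x82F63B78.
   Slow, one bit per step; only meant as a reference fallback. */
uint32_t crc32c_sw(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

Start with `crc32c_sw(0, buf, len)`; to continue over a second buffer, pass the previous result back in as `crc`.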


Just checked what's up with the GCC folks. They are working hard to integrate the new instructions into the vectorizer and optimizer..

I've become somewhat of a DSP-guy during the last month, and I already have all the stuff (for a price - general purpose code sucks like hell!). Nowadays I have to say that SSE with the last additions comes really close..

I still miss bit-interleaving, bit-reversal and bit-replication instructions (can live with the poor man's Galois multiplier aka CRC32) but who knows.. maybe in a year or two they might add them..
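
To show what is meant by bit-reversal and bit-interleaving, here is a minimal software sketch of both, using the usual mask-and-shift tricks. The names `bit_reverse32`, `spread16` and `bit_interleave16` are hypothetical, and this is just one common way to do it without hardware support.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word: swap neighbouring bits,
   then pairs, nibbles, bytes and finally the two 16-bit halves. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Spread the low 16 bits of x onto the even bit positions. */
static uint32_t spread16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Interleave two 16-bit values: bits of 'lo' land on even positions,
   bits of 'hi' on odd positions (a 2D Morton code). */
uint32_t bit_interleave16(uint16_t lo, uint16_t hi)
{
    return spread16(lo) | (spread16(hi) << 1);
}
```

Each routine costs a dozen or so shift/mask operations per word, which is why a single instruction for it would be welcome.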
My music: http://myspace.com/planetarchh <-- my music

My stuff: torus.untergrund.net <-- some diy electronic stuff and more.

#2 .oisyn

    DevMaster Staff

  • Moderators
  • 1822 posts

Posted 20 November 2007 - 09:38 PM

SSE4 is nice... BUT, AMD is not going to use it. So I don't know how useful it will be to support it in your game, unless you're willing to create two binaries.

AMD has proposed SSE5 though, and there's a bit of overlap with SSE4, but I don't exactly know which instructions.
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#3 Nils Pipenbrinck

    Senior Member

  • Members
  • 597 posts

Posted 20 November 2007 - 10:12 PM

I had a brief look at it as well.

They need one more of those useless patent court fights. Then the AMD and Intel lawyers will work out a cross-licence agreement. In the end both of them can implement SSE4 and SSE5 and everyone is the winner (especially the lawyers in the middle, but never mind..)


It will just happen as it did x times before.. no surprise..


As for the instruction set changes themselves:

SSE5 looks nice but they just added obvious stuff. No real logic from a hardware point of view required. All those visionary things that I've seen in SSE4 are missing. Nice but no cigar. The parallel integer multiplication (I think even AMD will have it) and, much more importantly, the string compare things will make the *real* difference.

#4 Nick

    Senior Member

  • Members
  • 1225 posts

Posted 20 November 2007 - 11:31 PM

.oisyn said:

SSE4 is nice... BUT, AMD is not going to use it. So I don't know how useful it will be to support it in your game, unless you're willing to create two binaries.
According to Wikipedia the Phenom processors support SSE4a...

Edit: Okay, it looks like SSE4a is nowhere close to Intel's SSE4.1. That sucks. When will they (both Intel and AMD) learn that not sharing their instruction sets will only delay the adoption of them, making it dead silicon for many years? Nobody likes writing multiple versions of their software, let alone testing them. Plus you practically have to buy both a Penryn and a K10 if you want to be on the cutting edge, or pick sides and risk getting bad criticism from the very vocal early adopters.

It's sad that after almost 10 years of SSE they still have to add extensions to make it functionally complete. I fear this all originates from bad communication/understanding between hardware designers and software developers.

#5 Nick

    Senior Member

  • Members
  • 1225 posts

Posted 20 November 2007 - 11:37 PM

What's still missing is true scatter/gather; storing and loading multiple elements to/from entirely different memory locations. This would require multiple memory controllers, but with multi-core they already have those...

With scatter/gather just about anything can be vectorized without requiring complicated memory layout and/or shuffling.
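
To make clear what I mean, here is a rough scalar sketch of the two operations in plain C. The names `gather_f32` and `scatter_f32` are made up for illustration; a hardware version would do the same thing for a whole vector of indices in one instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Gather: load n elements from arbitrary positions base[idx[i]]
   into a contiguous destination. */
void gather_f32(float *dst, const float *base,
                const uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = base[idx[i]];
}

/* Scatter: store n contiguous elements to arbitrary positions
   base[idx[i]]. If an index repeats, the later store wins. */
void scatter_f32(float *base, const uint32_t *idx,
                 const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        base[idx[i]] = src[i];
}
```

Without this, a loop with indexed accesses like `a[idx[i]]` has to be fed element by element or rearranged by hand before the vector units can do anything with it.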




