- Finally we have a parallel 32x32-bit integer multiplication. All C compilers that have support for auto-vectorization (GCC and ICC) will be able to *do* something with ordinary C code. This will make a _huge_ difference for everyone with a single recompile, once the stuff is integrated into the compilers.
- We get instructions that improve string search and string compare operations by almost an order of magnitude. Not that I do much of them in my code, but I do know how bad the x86 architecture is at this discipline. From experience I know that coders are lazy and do string stuff a lot. Now we'll be able to do 128 bits' worth of compares in a single cycle! Hooray! Once these instructions find their way into the libraries we will all see a huge performance improvement!
- Population count (number of one-bits) of a register. Sounds useless unless you need it! Our antialiased polygon routine calls this a lot, and it offers lots of optimization opportunities for those who deal with numbers on a bit level. Thanks a lot, Intel! Doing popcount without a dedicated instruction was too expensive to ever explore the hackish ways to abuse it. I'm sure we'll see a lot of cool bit-twiddling stuff in the near future.
- A dedicated CRC32 instruction. Well - not bad for a start, but if it were up to me I'd have made a dedicated Galois multiplier instruction with a polynomial of choice. Nevertheless - even if you don't need to do CRC32, you can abuse it in very creative ways...
- And a lot of other things that make life much easier... Dot products with variable width (it was about time), insert and extract instructions, and much more.
- Almost forgot to mention: the lamest instruction ever - the integer divide - is now twice as fast (they changed the divider from radix-4 to radix-16). I love that. IDIV is in the top fifth of my performance charts. It takes way too much time on the x86.
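To make the popcount point above a bit more concrete, here is a minimal sketch of what code has to do without the instruction, next to the version a compiler can turn into a single POPCNT. The function names are made up for illustration, and `__builtin_popcount` is a GCC/Clang builtin, so this assumes one of those compilers; whether it really becomes one instruction depends on the target flags (for example `-mpopcnt` or `-msse4.2`).

```c
#include <stdint.h>

/* Classic bit-twiddling popcount: sums the bits in parallel in
   2-, 4- and 8-bit groups, then adds the four byte counts with a
   multiply. This is the kind of work a dedicated POPCNT replaces. */
static unsigned popcount32_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the bytes */
}

/* Hypothetical wrapper: uses the compiler builtin where available
   (assumption: GCC or Clang), otherwise the portable fallback. */
static unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__)
    return (unsigned)__builtin_popcount(x);
#else
    return popcount32_swar(x);
#endif
}
```

A coverage mask from an antialiasing routine would be a typical caller: `popcount32(mask)` gives the number of covered samples.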
Just checked what's up with the GCC folks. They are working hard to integrate the new instructions into the vectorizer and optimizer.
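For the record, this is the kind of ordinary C loop I mean when I talk about the parallel 32-bit multiply. It is only a sketch: the function name is made up, and whether a given compiler actually vectorizes it (with the packed 32-bit multiply, PMULLD) depends on the compiler version and flags, something like `-O3 -msse4.1` on GCC.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 32-bit multiply, keeping the low 32 bits of each
   product. `restrict` tells the compiler the arrays do not overlap,
   which makes the loop much easier to auto-vectorize. */
void mul32_arrays(uint32_t *restrict dst,
                  const uint32_t *restrict a,
                  const uint32_t *restrict b,
                  size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

Nothing in the source is SSE-specific, which is the whole point: the same code should simply get faster after a recompile once the vectorizer knows the instruction.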
I've become something of a DSP guy over the last month, and on those chips I already have all this stuff (at a price - general-purpose code sucks like hell!). Nowadays I have to say that SSE with the latest additions comes really close.
I still miss bit-interleaving, bit-reversal and bit-replication instructions (I can live with the poor man's Galois multiplier, aka CRC32), but who knows... maybe in a year or two they might add them.
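Until then, bit reversal and bit interleaving have to be done by hand. A rough sketch of the usual mask-and-shift approach follows; the function names are mine and purely illustrative, and this is just one common way to do it, not the only one.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word by swapping ever larger
   groups: single bits, pairs, nibbles, bytes, then half-words. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Spread the low 16 bits of x to the even bit positions 0, 2, 4, ... */
static uint32_t spread16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Interleave two 16-bit values: bits of `lo` land on even positions,
   bits of `hi` on odd positions (a Morton code, if you like). */
uint32_t interleave16(uint16_t lo, uint16_t hi)
{
    return spread16(lo) | (spread16(hi) << 1);
}
```

That is five or so dependent shift/mask steps per operation, which is exactly why a single instruction for each would be nice to have.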












