Posted 21 March 2011 - 12:45 AM
[quote name='Elhardt']Since it's too much of a pain to quote each of your comments, I'll just address them without quoting.[/quote]
It's pretty easy if you start a reply with the Quote button, and then select some text and hit the button that looks like a text balloon. Doesn't really matter to me though.
[QUOTE]1) Nondestructive three and four operand instructions most certainly make code more efficient and faster. That's why all modern CPU's have that feature and that's why Intel and AMD are adding it, and AMD is demanding the fused multiply accumulate have 4 operands. Intel is really talking up that feature in their AVX preview document. Anytime you can reduce instruction count to do the same amount of work, that's an improvement in speed and reduction in code size and a more straight forward and elegant solution for us assembly programmers.[/QUOTE]
No, there is no certainty that it's more efficient or faster.
First of all, many operations are inherently destructive anyway (the result overwrites one of the inputs), so for those, encoding a separate destination and tracking an extra register dependency would be wasted effort.
Secondly, modern x86 microarchitectures are so wide that the extra move instructions rarely saturate the issue width in the first place.
And last but not least, there's no reason why a move instruction followed by a dependent arithmetic instruction couldn't be fused into one micro-instruction, if the designers wanted to.
So I'm not saying non-destructive instructions are a bad idea, but they're not without disadvantages and even if it's a win there's no reason it can't be done with existing x86 encodings.
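To make the trade-off concrete, here's a rough sketch (SSE vs. AVX syntax) of what the destructive form actually costs:

```asm
; Destructive two-operand form (SSE): preserving xmm1 takes a copy.
movaps xmm0, xmm1        ; xmm0 = copy of xmm1
addps  xmm0, xmm2        ; xmm0 = xmm1 + xmm2 (xmm1 preserved via the copy)

; Non-destructive three-operand form (AVX):
vaddps xmm0, xmm1, xmm2  ; xmm0 = xmm1 + xmm2, xmm1 left intact
```

A decoder that fused that movaps+addps pair into a single micro-instruction would get most of the benefit without any new encoding.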
[quote]2) Auto Incrementing isn't about RISC or advanced addressing modes. The main thing a computer does is scoop up data from memory, process it and write it back out. That means incrementing pointers, and all modern and not so modern CISC CPU's also auto increment and decrement including the 68K. On the Intel it's more add or subtract instructions added to your code. I like tight and efficient and Intel isn't that.[/quote]
Again, it's more complicated than that, and you have to look at the disadvantages too. The majority of instructions don't need to increment any pointer, so you'd have an ALU sitting there doing nothing most of the time. Also, auto-increment needs a write-back path for the pointer, and it creates another dependency.
Also, in short loops where the pointer incrementing could be a bottleneck, unrolling can amortize or completely hide that cost. And you can also use the loop counter as an index instead of incrementing the pointers.
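For example, a rough 32-bit sketch (assuming esi/edi hold the source/destination base pointers, edx the element count, and xmm1 the value to add), where one loop counter indexes both arrays instead of two pointer increments per iteration:

```asm
    xor   ecx, ecx              ; i = 0
loop_top:
    movss xmm0, [esi + ecx*4]   ; load src[i]
    addss xmm0, xmm1
    movss [edi + ecx*4], xmm0   ; store dst[i]
    add   ecx, 1                ; one increment instead of two
    cmp   ecx, edx
    jl    loop_top
```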
[quote]3) I've just backed up what I said with the above two comments. Naturally when an instruction can do more than one operation at the same time it's going to be faster and reduce code.[/quote]
In theory, yes, but in practice it comes at a cost, and that cost may force the CPU designers to make the core narrower, clock it lower, or charge customers more.
x86 CPUs have always had an excellent performance/price ratio. So you really have to look at the bigger picture.
[quote]Having written the world's fastest software 3D render in 100% assembly code on the 68K, PowerPC, and Intel chips, sometimes the amount of code, especially in small loops can be significantly more on the Intel than the PowerPC just because of 1 and 2 above...[/quote]
For what it's worth, I'm the lead developer of SwiftShader, and I'm not convinced that the x86 ISA has any limitations that are severe enough to compromise its foreseeable future.
[quote]As for Apple choosing Intel, they certainly didn't do it because of performance or architecture. After all, Apple, Atari, Commodore, Next, Sun, Silicon Graphics, etc. all could have picked the primitive Intel architecture at several points in the past and didn't.[/quote]
The past tells us very little here. A modern x86 CPU uses RISC micro-instructions, and the decoding takes only a few percent of the die space. The ISA has very little effect on performance/price.
Nowadays performance/power is also an extremely important factor. The PowerPC 970FX at 2.5 GHz consumes up to 100 Watt, while a Pentium 4 at 3.8 GHz consumes 65 Watt (both at 90 nm). And that's for the old and horribly inefficient NetBurst microarchitecture. The Core, Core 2, and second-generation Core microarchitectures offer vastly more performance/Watt. Whatever the reasons, clearly the PPC ISA wasn't enough of a benefit to keep up with Intel.
[quote]Twice by accident I came across people porting PowerPC Altivec code to Intel's SIMD complaining their code was running at about half the speed even on Intel's running at a higher clock rate.[/quote]
Was that on anything prior to Core 2 or Phenom? They used to have 64-bit execution units for 128-bit SSE operations, so it took two micro-instructions.
[quote]That's what residual 1975 8085 architecture gets you.[/quote]
Seems like a really premature conclusion to me.
[quote]4) I don't know who plays God and determines the x87 is no longer needed.[/quote]
I didn't say it's no longer needed. I said it's deprecated (in favor of SSE2).
[quote]The x87 does a lot more than just +,-,/,* and sqrt. It does trig, logs...[/quote]
Those use slow microcode. On x64 they're implemented as library functions using SSE2 instructions, without loss of performance or precision.
[quote]...floating point compares you can branch on...[/quote]
The SSE comiss instruction does that too.
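For example, branching on a single-precision compare without ever touching the x87 stack:

```asm
comiss xmm0, xmm1   ; sets ZF, PF and CF from the comparison
jb     less_than    ; taken when xmm0 < xmm1 (CF=1)
                    ; note: an unordered result (NaN) sets ZF=PF=CF=1
```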
[quote]It also works on double/extended precision.[/quote]
SSE2 introduced double-precision support and is implemented by every x64 processor. Extended precision is also deprecated.
[quote]And Intel just recently brought radix-16 division to the x87 for much faster divides than in SSE.[/quote]
They're both the same speed.
[quote]And unless the entire world has thrown away all computers and applications prior to SSE2, the x87 is still doing most of the FP computations out there.[/quote]
The x87 instructions are executed by the same execution units as SSE. Transcendental functions are implemented in microcode. So aside from the instruction decoding and register stack implementation, it is practically completely replaced by SSE.
[quote]Performance critical software has always been important, even before 64 bit CPU's. And again, as a software developer, I don't like writing software that only runs for a tiny percentage of people with the latest CPU's and 64bit OS's. Even with 16 registers, the Intel register set isn't large compared to most modern CPU's, but at least they finally addressed that miserable problem, though way too late and in a limited manner.[/quote]
Regardless of the architecture, the move from 32-bit to 64-bit takes many years. And x86's small architectural register set is also compensated for by a large rename register file and relatively fast L1 caches.
[quote]6) "x86 has conditional move instructions too.". Yes, and that's a nice feature, though not very often usable. I've heard that Visual Studio doesn't even use it. Take the ARM CPU, all of it's instructions can be executed conditionally.[/quote]
And once more, it doesn't come for free. It requires more encoding bits and more logic in the critical path, which increases latency even when the feature isn't being used.
[quote]7) If the Intel has prefixes for branch hints, that must be something new.[/quote]
It has been supported for over a decade now. Hardly something new.
[quote]8) Maybe on the Intel you want Cmps and Jccs placed together, but you really don't have a choice in the matter.[/quote]
Intel has supported macro-op fusion of compare and jump instructions since Core 2, and AMD's Bulldozer will support it as well. So optimizing compilers pair them together anyway.
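That is, the pairing a compiler emits anyway looks like this, and the decoder can collapse it into a single macro-op:

```asm
cmp eax, ebx    ; adjacent compare...
jne not_equal   ; ...and conditional jump can decode as one fused op
```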
[quote]9) I'm glad AVX code doesn't suffer from misaligned code, but again, that doesn't help the hundreds of millions of computers out there now, nor the regular CPU instruction set.[/quote]
So what? It's not as if talking about PowerPC helps them either.
x86 has always offered high performance at an affordable price. And it doesn't look like this is about to change.
[quote]And the instruction set they keep expanding is the SIMD stuff, while the archaic main CPU continues.[/quote]
Single-threaded scalar performance has increased tremendously over the years, despite the ISA staying the same. This shows that there are more critical factors for real-life performance than the instruction set architecture.
[quote]Their constant half-assed changes and additions are making it hell for developers. Instead of taking their time and doing things right in the first place, like Motorola did with Altivec, they just keep spreading their SIMD additions piecemeal over many years. It's a mess.[/quote]
SSE2 includes the vast majority of SIMD instructions you'll use. The rest is largely application specific. So the "mess" isn't all that horrible really.
Personally I started using LLVM for dynamic code generation. It explicitly supports vector operations and fully abstracts the target architecture. So you hardly have to worry about whether or not a specific instruction is supported.
[quote]So back to my point, it's the Intel's archaic architecture that's reducing the CPU's throughput. Hell, I could do 3D vertex transforms in fewer clock cycles on the PowerPC's regular old FPU than on Intel's early SSE using vector instructions.[/quote]
And what about wall clock time, and processor cost?
[quote]Lack of non destructive 3 and 4 operand instructions and fused multiply/accumulate really hold/held SSE back. They should have been there on day one.[/quote]
That would have been nice for developers, but the economic reality dictated otherwise. It's a gamble whether a certain ISA extension will offer a good ROI. Other architectures have come and gone, while x86 holds a dominant position after a full three decades.
[quote]Now we have to wait another 10 years for AVX to fully penetrate into people's hands. And it looks like Intel may be playing their same old game again; releasing AVX without FMA so that after everybody buys a new computer, a few months later they'll need to buy another to get FMA. The turmoil never ends.[/QUOTE]
You don't seem to realize the cost. FMA requires both port 0 and port 1 to be equipped with FMA ALUs, each needing an additional 256-bit operand, and the register file also has to be capable of providing all these operands to the two pipelines. And that's not all: because of the increase in throughput, the memory subsystem has to be upgraded as well.
All this really isn't trivial, and it may take several years for the investment to be worth it. After all, the benefit in practice is much smaller than the theoretical doubling in performance.
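For reference, here's what FMA buys, using the FMA3 mnemonics Intel has specified (a sketch, not shipping code):

```asm
; Without FMA: two dependent instructions, two roundings
vmulps ymm0, ymm1, ymm2      ; ymm0 = ymm1 * ymm2
vaddps ymm0, ymm0, ymm3      ; ymm0 = ymm1*ymm2 + ymm3

; With FMA: one instruction, one rounding
vfmadd213ps ymm1, ymm2, ymm3 ; ymm1 = ymm1*ymm2 + ymm3
```

Doubling the peak FLOPs this way is exactly what demands the wider register file ports and beefier memory subsystem.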
Personally I would love to see it sooner rather than later, and I believe the lack of gather/scatter support is even more critical, but when looking at the bigger picture I really think we're getting a lot of bang for the buck and there's some interesting technology on the roadmap that keeps it exciting.