

Fast and accurate sine/cosine


103 replies to this topic

#101 Elhardt

    New Member

  • Members
  • 3 posts

Posted 15 March 2011 - 02:14 AM

Nick said:

Thanks for sharing the results of your implementation!

You're welcome.

Nick said:

The instruction set may be flawed, but that hardly matters. Decoding the x86 instructions into RISC micro-operations only takes a few percent of the chip space. And it still shrinks with every generation.

The only thing that's sorely lacking from the instruction set is support for gather/scatter operations (the parallel equivalent of load/store). That would allow loops to be parallelized much more efficiently and, together with FMA, would make the CPU capable of very high effective throughput.

There's a lot more missing than that in the regular instruction set. No 3 operand non-destructive instructions means wasted time duplicating values all the time. No auto incrementing registers means constantly having to manually increment address pointers. Just those two things alone can increase the size of the code and reduce its speed by a large factor in some cases. Not to mention no bit field operations, an awful stack based FPU (still needed for many things), and way too few registers unless you're programming in and running on a 64 bit OS. Lacks multiple status registers like the PowerPC that can be used to reduce the number of conditional branches. No way to specify whether a branch is more likely to be taken or fall through like on the PowerPC. No way to specify whether an instruction modifies the status register or not like on the PowerPC, so compare instructions need to be placed right before conditional branches, giving little time for branch prediction and also limiting flexibility in coding. Even its horrible variable sized instruction set, with instructions up to 17 bytes in size, has led to my benchmarks on this very topic running at different speeds depending on how the code aligns in memory. My empty loop test can slow down 50% just by its placement in memory.

-Elhardt

#102 Nick

    Senior Member

  • Members
  • 1227 posts
  • Location: Ottawa, Ontario, Canada

Posted 15 March 2011 - 10:46 AM

Elhardt said:

There's a lot more missing than that in the regular instruction set. No 3 operand non-destructive instructions means wasted time duplicating values all the time.
MOV instructions can execute on port 0, 1 and 5 on Intel's recent architectures. So the lack of non-destructive instructions hardly has any effect. Also keep in mind that keeping track of additional operand dependencies would complicate out-of-order execution.

The reason they did choose to support non-destructive instructions for AVX is probably mainly a power efficiency decision, and I can also see it affecting the register file implementation.

Quote

No auto incrementing registers means constantly having to manually increment address pointers.
Again that's just a compromise. RISC architectures have to separately load operands, so it makes sense to also make this load instruction do another bit of useful work. x86 can reference memory with advanced addressing modes in every instruction.

Quote

Just those two things alone can increase the size of the code and reduce its speed by a large factor in some cases.
Do you have any data to back up just how large that factor really is, taking the hardware costs into account as well?

I think the fact that Apple switched to Intel CPUs speaks for itself. The PowerPC instruction set looks nicer but at the end of the day that's not the main thing determining performance and cost.

Quote

...awful stack based FPU (still needed for many things)...
What would you still need it for? x87 was officially deprecated for all x86-64 ABIs eight years ago.

Quote

...way too few registers unless you're programming in and running on a 64 bit OS.
True, but there's little reason not to build an x64 version of performance critical software.

Quote

Lacks multiple status registers like the PowerPC that can be used to reduce the number of conditional branches.
x86 has conditional move instructions too.
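As a rough illustration of what a conditional move replaces, here is a minimal C sketch. A compiler targeting x86 may turn the first function into CMP plus CMOVcc instead of a compare-and-branch, but that choice belongs to the optimizer and is not guaranteed. The second function makes the branch-free select explicit with a mask. The names `select_max` and `select_mask` are made up for this example.

```c
#include <stdint.h>

/* Simple select. An x86 compiler may emit CMP + CMOVcc here
   instead of a conditional jump; that is up to the optimizer. */
int32_t select_max(int32_t a, int32_t b)
{
    return (a > b) ? a : b;
}

/* Explicit branch-free select: returns a when cond is nonzero,
   b otherwise. */
int32_t select_mask(int32_t cond, int32_t a, int32_t b)
{
    /* mask is all ones when cond != 0, all zeros otherwise */
    uint32_t mask = (uint32_t)-(int32_t)(cond != 0);
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}
```

Both forms compute both candidate values up front, which is why a conditional move only pays off when the two sides are cheap and the branch would be hard to predict.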

Quote

No way to specify whether a branch is more likely to be taken or fall through like on the PowerPC.
Yes there is. The 2Eh and 3Eh prefix bytes can be used as hints. Also, there are well defined rules for branches for which there is no history in the predictor. Forward jumps are assumed not taken, while backward jumps are assumed taken. This allows you (or the compiler) to structure the code more optimally.
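The static rule (forward jumps assumed not taken, backward jumps assumed taken) can be exploited from C by telling the compiler which path is rare, so that it can lay the rare path out as a forward branch. The sketch below assumes a GCC or Clang style compiler, where `__builtin_expect` exists; on other compilers the macro falls back to a plain condition. `UNLIKELY` and `sum_samples` are hypothetical names for this example, and whether the compiler emits any hint prefix at all is not something this code controls.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)   /* no hint available: plain condition */
#endif

/* Sums the samples; returns -1 as soon as a negative sample is seen.
   The error check is marked as the rare path, and the loop itself
   closes with a backward branch that is normally taken. */
long sum_samples(const int *samples, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (UNLIKELY(samples[i] < 0)) {
            return -1;    /* rare path */
        }
        total += samples[i];
    }
    return total;
}
```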

Quote

No way to specify whether an instruction modifies the status register or not like on the PowerPC, so compare instructions need to be placed right before conditional branches giving little time for branch prediction and also limiting flexibility in coding.
Actually you want them to be placed together, so macro-op fusion can be performed and they get executed as one.

Quote

Even its horrible variable sized instruction set, with instructions up to 17 bytes in size, has led to my benchmarks on this very topic running at different speeds depending on how the code aligns in memory. My empty loop test can slow down 50% just by its placement in memory.
The Sandy Bridge architecture doesn't suffer from misaligned code. Also, variable sized instructions actually make the code more dense in some cases, and it has allowed Intel and AMD to keep expanding the capabilities over many decades. So again it's a mixed bag of advantages and disadvantages.

#103 Elhardt

    New Member

  • Members
  • 3 posts

Posted 17 March 2011 - 09:45 AM

Since it's too much of a pain to quote each of your comments, I'll just address them without quoting.

1) Nondestructive three and four operand instructions most certainly make code more efficient and faster. That's why all modern CPUs have that feature and that's why Intel and AMD are adding it, and AMD is demanding the fused multiply accumulate have 4 operands. Intel is really talking up that feature in their AVX preview document. Anytime you can reduce instruction count to do the same amount of work, that's an improvement in speed and reduction in code size and a more straightforward and elegant solution for us assembly programmers.

2) Auto Incrementing isn't about RISC or advanced addressing modes. The main thing a computer does is scoop up data from memory, process it and write it back out. That means incrementing pointers, and all modern and not so modern CISC CPUs also auto increment and decrement, including the 68K. On the Intel it's more add or subtract instructions added to your code. I like tight and efficient and Intel isn't that.

3) I've just backed up what I said with the above two comments. Naturally when an instruction can do more than one operation at the same time it's going to be faster and reduce code. Having written the world's fastest software 3D renderer in 100% assembly code on the 68K, PowerPC, and Intel chips, sometimes the amount of code, especially in small loops, can be significantly more on the Intel than the PowerPC just because of 1 and 2 above... As for Apple choosing Intel, they certainly didn't do it because of performance or architecture. After all, Apple, Atari, Commodore, NeXT, Sun, Silicon Graphics, etc. all could have picked the primitive Intel architecture at several points in the past and didn't. Even modern game machines are using PowerPC or MIPS. Apple has published many benchmarks showing the PowerPC's superior performance, and that wasn't even using assembly code where more of the PowerPC's advanced features can be used. Twice by accident I came across people porting PowerPC Altivec code to Intel's SIMD complaining their code was running at about half the speed even on Intel chips running at a higher clock rate. That's what residual 1975 8085 architecture gets you.

4) I don't know who plays God and determines the x87 is no longer needed. The x87 does a lot more than just +,-,/,* and sqrt. It does trig, logs, floating point compares you can branch on, and more. It also works on double/extended precision. And Intel just recently brought radix-16 division to the x87 for much faster divides than in SSE. And unless the entire world has thrown away all computers and applications prior to SSE2, the x87 is still doing most of the FP computations out there.

5) Performance critical software has always been important, even before 64-bit CPUs. And again, as a software developer, I don't like writing software that only runs for a tiny percentage of people with the latest CPUs and 64-bit OSes. Even with 16 registers, the Intel register set isn't large compared to most modern CPUs, but at least they finally addressed that miserable problem, though way too late and in a limited manner.

6) "x86 has conditional move instructions too." Yes, and that's a nice feature, though not very often usable. I've heard that Visual Studio doesn't even use it. Take the ARM CPU: all of its instructions can be executed conditionally.

7) If the Intel has prefixes for branch hints, that must be something new. I've never run across any assembly code symbols that would specify such a thing in the past. On the PowerPC one puts a + or - after the opcode to specify. The other rules about forward and backward branches are the same as on the PowerPC.

8) Maybe on the Intel you want Cmps and Jccs placed together, but you really don't have a choice in the matter.

9) I'm glad AVX code doesn't suffer from misaligned code, but again, that doesn't help the hundreds of millions of computers out there now, nor the regular CPU instruction set. And the instruction set they keep expanding is the SIMD stuff, while the archaic main CPU continues. Their constant half-assed changes and additions are making it hell for developers. Instead of taking their time and doing things right in the first place, like Motorola did with Altivec, they just keep spreading their SIMD additions piecemeal over many years. It's a mess.

So back to my point, it's the Intel's archaic architecture that's reducing the CPU's throughput. Hell, I could do 3D vertex transforms in fewer clock cycles on the PowerPC's regular old FPU than on Intel's early SSE using vector instructions. Lack of non destructive 3 and 4 operand instructions and fused multiply/accumulate really hold/held SSE back. They should have been there on day one. Now we have to wait another 10 years for AVX to fully penetrate into people's hands. And it looks like Intel may be playing their same old game again; releasing AVX without FMA so that after everybody buys a new computer, a few months later they'll need to buy another to get FMA. The turmoil never ends.

#104 Nick

    Senior Member

  • Members
  • 1227 posts
  • Location: Ottawa, Ontario, Canada

Posted 21 March 2011 - 12:45 AM

[quote name='Elhardt']Since it's too much of a pain to quote each of your comments, I'll just address them without quoting.[/quote]
It's pretty easy if you start a reply with the Quote button, and then select some text and hit the button that looks like a text balloon. Doesn't really matter to me though.
[quote]1) Nondestructive three and four operand instructions most certainly make code more efficient and faster. That's why all modern CPUs have that feature and that's why Intel and AMD are adding it, and AMD is demanding the fused multiply accumulate have 4 operands. Intel is really talking up that feature in their AVX preview document. Anytime you can reduce instruction count to do the same amount of work, that's an improvement in speed and reduction in code size and a more straightforward and elegant solution for us assembly programmers.[/quote]
No, there is no certainty that it's more efficient or faster.

First of all there are lots of operations that are destructive. So encoding a non-destructive instruction and analyzing an extra register dependency would be a waste.

Secondly, modern x86 microarchitectures are so wide it's hard to ever exceed the issue width.

And last but not least, there's no reason why a move instruction followed by a dependent arithmetic instruction can't be fused into one micro-instruction, if they wanted to.

So I'm not saying non-destructive instructions are a bad idea, but they're not without disadvantages and even if it's a win there's no reason it can't be done with existing x86 encodings.
[quote]2) Auto Incrementing isn't about RISC or advanced addressing modes. The main thing a computer does is scoop up data from memory, process it and write it back out. That means incrementing pointers, and all modern and not so modern CISC CPUs also auto increment and decrement, including the 68K. On the Intel it's more add or subtract instructions added to your code. I like tight and efficient and Intel isn't that.[/quote]
Again, it's more complicated than that and you have to look at the disadvantages too. The majority of instructions don't need to increment any pointer. So you have an ALU sitting there doing nothing most of the time. Also, you need a write back path for the pointer, and it creates another dependency.

Also, in short loops where the pointer incrementing could be a bottleneck, unrolling can amortize or completely hide that cost. And you can also use the loop counter as an index instead of incrementing the pointers.
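A minimal C sketch of both ideas, assuming nothing beyond what is said above: one loop counter serves as the index for all three arrays, so there is a single increment instead of three pointer updates, and unrolling by four spreads that one increment over four elements. The function name `add_arrays` is made up for this example.

```c
#include <stddef.h>

/* dst[i] = a[i] + b[i] for i in [0, n).
   One shared index replaces three separately incremented pointers,
   and the unrolled body pays for one index update per four elements. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = a[i + 0] + b[i + 0];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

On x86 the constant offsets in the unrolled body can be folded into the addressing mode of each load and store, so they need no extra arithmetic instructions.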
[quote]3) I've just backed up what I said with the above two comments. Naturally when an instruction can do more than one operation at the same time it's going to be faster and reduce code.[/quote]
In theory, yes, but in practice it comes at a cost. And this cost may force the CPU designers to make the CPU more narrow, clock it lower, or charge more to customers.

x86 CPUs have always had an excellent performance/price ratio. So you really have to look at the bigger picture.
[quote]Having written the world's fastest software 3D renderer in 100% assembly code on the 68K, PowerPC, and Intel chips, sometimes the amount of code, especially in small loops, can be significantly more on the Intel than the PowerPC just because of 1 and 2 above...[/quote]
For what it's worth, I'm the lead developer of SwiftShader, and I'm not convinced that the x86 ISA has any limitations that are severe enough to compromise its foreseeable future.
[quote]As for Apple choosing Intel, they certainly didn't do it because of performance or architecture. After all, Apple, Atari, Commodore, NeXT, Sun, Silicon Graphics, etc. all could have picked the primitive Intel architecture at several points in the past and didn't.[/quote]
The past tells us very little here. A modern x86 CPU uses RISC micro-instructions, and the decoding takes only a few percent of the die space. The ISA has very little effect on performance/price.

Nowadays performance/power is also an extremely important factor. The PowerPC 970FX at 2.5 GHz consumes up to 100 Watt, while a Pentium 4 at 3.8 GHz consumes 65 Watt (both at 90 nm). And that's for the old and horribly inefficient NetBurst microarchitecture. The Core, Core 2 and second generation Core 2 microarchitecture offer vastly more performance/Watt. Whatever the reasons, clearly the PPC ISA wasn't enough of a benefit to keep up with Intel.
[quote]Twice by accident I came across people porting PowerPC Altivec code to Intel's SIMD complaining their code was running at about half the speed even on Intel chips running at a higher clock rate.[/quote]
Was that on anything prior to Core 2 or Phenom? They used to have 64-bit execution units for 128-bit SSE operations, so it took two micro-instructions.
[quote]That's what residual 1975 8085 architecture gets you.[/quote]
Seems like a really premature conclusion to me.
[quote]4) I don't know who plays God and determines the x87 is no longer needed.[/quote]
I didn't say it's no longer needed. I said it's deprecated (in favor of SSE2).
[quote]The x87 does a lot more than just +,-,/,* and sqrt. It does trig, logs...[/quote]
Those use slow microcode. On x64 they're implemented as library functions using SSE2 instructions, without loss of performance or precision.
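To give a feel for what "a library function built from ordinary arithmetic" looks like, here is a deliberately small C sketch. It is not how any real math library computes sine: it uses a simple parabolic approximation with one refinement step, is only valid for inputs in [-pi, pi], and its error is on the order of 0.001, far worse than a real library. The point is only that it needs nothing but multiplies, adds and absolute values, which on x64 a compiler would normally emit as scalar SSE instructions. The name `fast_sin` and the constants are choices made for this example.

```c
/* Parabolic sine approximation, valid only for x in [-pi, pi].
   Uses only multiply, add and absolute value. */
float fast_sin(float x)
{
    const float PI = 3.14159265358979f;
    const float B  = 4.0f / PI;
    const float C  = -4.0f / (PI * PI);
    const float P  = 0.225f;   /* weight of the refinement step */

    /* First pass: parabola through (0,0), (pi/2,1) and (pi,0),
       mirrored for negative x by using x*|x|. */
    float ax = (x < 0.0f) ? -x : x;
    float y  = B * x + C * x * ax;

    /* Refinement: blend y with y*|y| to reduce the error. */
    float ay = (y < 0.0f) ? -y : y;
    return P * (y * ay - y) + y;
}
```

A production implementation would add range reduction and a higher-order polynomial, but the structure stays the same: plain arithmetic, no microcoded instruction.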
[quote]...floating point compares you can branch on...[/quote]
The SSE comiss instruction does that too.
[quote]It also works on double/extended precision.[/quote]
SSE2 introduced double precision support, and is supported by all x64 implementations. Extended precision is also deprecated.
[quote]And Intel just recently brought radix-16 division to the x87 for much faster divides than in SSE.[/quote]
They're both the same speed.
[quote]And unless the entire world has thrown away all computers and applications prior to SSE2, the x87 is still doing most of the FP computations out there.[/quote]
The x87 instructions are executed by the same execution units as SSE. Transcendental functions are implemented in microcode. So aside from the instruction decoding and register stack implementation, it is practically completely replaced by SSE.
[quote]Performance critical software has always been important, even before 64-bit CPUs. And again, as a software developer, I don't like writing software that only runs for a tiny percentage of people with the latest CPUs and 64-bit OSes. Even with 16 registers, the Intel register set isn't large compared to most modern CPUs, but at least they finally addressed that miserable problem, though way too late and in a limited manner.[/quote]
Regardless of the architecture, the move from 32-bit to 64-bit takes many years. And x86's lack of architectural registers is also compensated by a large rename register set and relatively fast L1 caches.
[quote]6) "x86 has conditional move instructions too." Yes, and that's a nice feature, though not very often usable. I've heard that Visual Studio doesn't even use it. Take the ARM CPU: all of its instructions can be executed conditionally.[/quote]
And once more it doesn't come for free. It requires more encoding bits, and requires more logic in the critical path, which increases the latency even when it's not being used.
[quote]7) If the Intel has prefixes for branch hints, that must be something new.[/quote]
It has been supported for over a decade now. Hardly something new.
[quote]8) Maybe on the Intel you want Cmps and Jccs placed together, but you really don't have a choice in the matter.[/quote]
AMD Bulldozer will also support macro-op fusion for compare and jump instructions. So optimizing compilers all pair them together.
[quote]9) I'm glad AVX code doesn't suffer from misaligned code, but again, that doesn't help the hundreds of millions of computers out there now, nor the regular CPU instruction set.[/quote]
So what? It's not as if talking about PowerPC helps them either.

x86 has always offered high performance at an affordable price. And it doesn't look like this is about to change.
[quote]And the instruction set they keep expanding is the SIMD stuff, while the archaic main CPU continues.[/quote]
Single-threaded scalar performance has increased tremendously over the years, despite the ISA staying the same. This shows that there are more critical things to real-life performance than the instruction set architecture.
[quote]Their constant half-assed changes and additions are making it hell for developers. Instead of taking their time and doing things right in the first place, like Motorola did with Altivec, they just keep spreading their SIMD additions piecemeal over many years. It's a mess.[/quote]
SSE2 includes the vast majority of SIMD instructions you'll use. The rest is largely application specific. So the "mess" isn't all that horrible really.

Personally I started using LLVM for dynamic code generation. It explicitly supports vector operations and fully abstracts the target architecture. So you hardly have to worry about whether or not a specific instruction is supported.
[quote]So back to my point, it's the Intel's archaic architecture that's reducing the CPU's throughput. Hell, I could do 3D vertex transforms in fewer clock cycles on the PowerPC's regular old FPU than on Intel's early SSE using vector instructions.[/quote]
And what about wall clock time, and processor cost?
[quote]Lack of non destructive 3 and 4 operand instructions and fused multiply/accumulate really hold/held SSE back. They should have been there on day one.[/quote]
That would have been nice for developers, but the economic reality dictated otherwise. It's a gamble whether a certain ISA extension will offer a good ROI. Other architectures have come and gone, while x86 holds a dominant position after a full three decades.
[quote]Now we have to wait another 10 years for AVX to fully penetrate into people's hands. And it looks like Intel may be playing their same old game again; releasing AVX without FMA so that after everybody buys a new computer, a few months later they'll need to buy another to get FMA. The turmoil never ends.[/quote]
You don't seem to realize the cost. FMA requires both port 0 and port 1 to be equipped with FMA ALUs, and each of them needs an additional 256-bit operand. But the register file also has to be capable of providing all these operands to the two pipelines. And that's not all. Because of the increase in throughput the memory subsystem has to be upgraded as well.

All this really isn't trivial, and it may take several years for the investment to be worth it. After all, the benefit in practice is much less than the theoretical doubling in performance.
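One small, purely arithmetic reason the gain stays below 2x can be seen in the vertex transform mentioned earlier in the thread. The C sketch below is an illustration, not a benchmark: it writes a 4x4 matrix times 4-component vertex product as one multiply followed by three multiply-accumulate steps per row. With separate multiply and add instructions that is 4 x (4 + 3) = 28 arithmetic operations; if every multiply-accumulate step became a single fused instruction it would be 4 x (1 + 3) = 16, a ratio of 1.75 before memory traffic is even considered. The name `transform_vertex` and the row-major layout are assumptions of this example.

```c
/* out = m * v, with m a row-major 4x4 matrix.
   Each row is 1 multiply plus 3 multiply-accumulate steps. The
   multiply-accumulate steps are the operations FMA can fuse; whether
   a compiler actually does so depends on the target and its flags. */
void transform_vertex(float out[4], const float m[16], const float v[4])
{
    for (int row = 0; row < 4; ++row) {
        const float *r = &m[row * 4];
        float acc = r[0] * v[0];   /* plain multiply           */
        acc = r[1] * v[1] + acc;   /* multiply-accumulate step */
        acc = r[2] * v[2] + acc;   /* multiply-accumulate step */
        acc = r[3] * v[3] + acc;   /* multiply-accumulate step */
        out[row] = acc;
    }
}
```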

Personally I would love to see it sooner rather than later, and I believe the lack of gather/scatter support is even more critical, but when looking at the bigger picture I really think we're getting a lot of bang for the buck and there's some interesting technology on the roadmap that keeps it exciting.
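For readers unfamiliar with the terms, gather and scatter are the indexed counterparts of a vector load and store. A scalar C sketch of their semantics is below; without hardware support, vectorized code has to perform the equivalent of these loops one element at a time, which is the cost being discussed. The function names `gather_f32` and `scatter_f32` are hypothetical and do not correspond to any particular instruction or library.

```c
#include <stddef.h>

/* Gather: dst[i] = base[index[i]] for i in [0, n). */
void gather_f32(float *dst, const float *base,
                const size_t *index, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = base[index[i]];
    }
}

/* Scatter: base[index[i]] = src[i] for i in [0, n).
   If an index repeats, the last write wins in this scalar version. */
void scatter_f32(float *base, const size_t *index,
                 const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        base[index[i]] = src[i];
    }
}
```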




