Nick said:
pfrcp has 14-bit mantissa precision, pfrsqrt 15-bit. That's not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn't matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.
Reciprocals and square roots do happen outside of vector normalization, you know.
Fact is, I need both of them fast and robust, and if possible with a bit pattern that doesn't look like I got it from an RNG.
Plus I wasn't addressing the relative latency of 3DNow! & SSE but functionality. It seems you missed the wildcard in my previous post, so please check the "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.
Compare that to:
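static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) {
    // One Newton-Raphson step on the rsqrtps estimate: 0.5 * nr * (3 - arg * nr * nr).
    const __m128
        mask   = cmpeq(arg, all_zero()),        // lanes where arg == 0
        nr     = rsqrtps(arg),                  // initial estimate (+inf where arg == 0)
        muls   = mulps(mulps(nr, nr), arg),     // arg * nr^2 (NaN where arg == 0)
        filter = andnps(mask, muls),            // (~mask) & muls: zero the NaN lanes
        beta   = mulps(rt::cst::section.half, nr),
        gamma  = subps(rt::cst::section.three, filter),
        final  = mulps(beta, gamma);
    return final;
}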
I could have spammed the same kind of ugliness for related ops like 1/x etc. Not so attractive anymore, eh? (Or robust, in fact.)
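For anyone who wants to try this without my wrappers, here is a rough sketch of the same routine written against the stock SSE intrinsics from <xmmintrin.h>. The mapping from cmpeq/rsqrtps/mulps/andnps/subps to the _mm_* calls is an assumption about what those wrappers do, and the names rsqrt_nr_robust_ps and lane are made up for this example.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// One Newton-Raphson step on top of RSQRTPS: y1 = 0.5 * y0 * (3 - x * y0 * y0).
// For x == 0 the estimate y0 is +inf and x * y0 * y0 is NaN, so that product
// is masked to 0 in those lanes and the result there is 0.5 * inf * 3 = +inf.
static inline __m128 rsqrt_nr_robust_ps(__m128 x) {
    const __m128 half    = _mm_set1_ps(0.5f);
    const __m128 three   = _mm_set1_ps(3.0f);
    const __m128 is_zero = _mm_cmpeq_ps(x, _mm_setzero_ps());  // all-ones where x == 0
    const __m128 y0      = _mm_rsqrt_ps(x);                    // ~12-bit estimate
    const __m128 xyy     = _mm_mul_ps(_mm_mul_ps(y0, y0), x);
    const __m128 safe    = _mm_andnot_ps(is_zero, xyy);        // (~is_zero) & xyy
    return _mm_mul_ps(_mm_mul_ps(half, y0), _mm_sub_ps(three, safe));
}

// Hypothetical helper: read one lane back for inspection.
static inline float lane(__m128 v, int i) {
    assert(i >= 0 && i < 4);
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Seven instructions plus two constants for what is conceptually one operation, and it only guards the zero case; negative or denormal inputs are not handled here.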
Nick said:
Sorry, what credit are you asking for then? :huh: And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4) so it's hard to get my attention with that, sorry. I also got several courses about processor architecture and digital design if that matters...
That's fine & dandy, but one has got to wonder if we're reading the same manuals.
From a quick search through the IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-012:
2-85, Vectorization
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches"
2-89, User/Source Coding Rules
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. 2-86"
3-37, Instruction Selection
"The following section gives some guidelines for choosing instructions to complete a task. One barrier to SIMD computation can be the existence of data-dependent branches. Conditional moves can be used to eliminate data-dependent branches. Conditional moves can be emulated in SIMD computation by using masked compares and logicals, as shown in Example 3-21."
4-2, General Rules on SIMD Integer Code
"Emulate conditional moves by using masked compares and logicals instead of using conditional branches."
Should I quote some more, or did you get the point by now?
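To make the quoted rule concrete, here is a small sketch of a loop whose data-dependent branch is replaced by a masked compare and a logical. It is not the manual's Example 3-21, just an illustration of the same idea; the function name is hypothetical and the length is assumed to be a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Branchy scalar version:  if (v[i] < 0.0f) v[i] = 0.0f;
// SIMD version: build a mask from the compare, then keep v only where the
// mask is clear. No branch depends on the data.
static inline void clamp_negatives_to_zero(float* v, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no scalar tail handling
    const __m128 zero = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 x   = _mm_loadu_ps(v + i);
        const __m128 neg = _mm_cmplt_ps(x, zero);      // all-ones where x < 0
        _mm_storeu_ps(v + i, _mm_andnot_ps(neg, x));   // (~neg) & x
    }
}
```

This particular case could also be a single maxps; the compare-and-mask form is shown because it generalizes to conditions that have no dedicated instruction.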
Nick said:
My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that's not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that's a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.
Excuse me, but isn't one of the main points of a CISC arch to provide code compression via byzantine instruction encoding?
I was kind enough to restrict my example to x86, and I never alluded to flags. I merely pointed out the inconsistency: you have cmov* on the ALU side, fcmov* on the FPU side, and nothing like that for SSE. You'll notice those cmov* and fcmov* could also be emulated.
I would have expected a single op doing exactly what an andn/and/or sequence does; admittedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator), or encode it as a prefix, or whatever.
It makes no sense, when decoding bandwidth is scarce and you're struggling with throughput, to say the norm is to replace conditional branches with a 3 freaking op sequence.
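For reference, the three-op sequence in question looks roughly like this with the stock SSE intrinsics. The helper names select_ps and abs_diff_ps are made up for the example, and the mask is assumed to be all-ones or all-zeros per lane, which is what the SSE compares produce.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Per-lane "mask ? a : b" built from and / andn / or.
// This is the sequence a single conditional-move-style op would replace.
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a),       // a where mask is set
                     _mm_andnot_ps(mask, b));   // b where mask is clear
}

// Example use: |a - b| per lane without a branch.
// Picks (a - b) where a >= b and (b - a) elsewhere.
static inline __m128 abs_diff_ps(__m128 a, __m128 b) {
    const __m128 ge = _mm_cmpge_ps(a, b);       // all-ones where a >= b
    return select_ps(ge, _mm_sub_ps(a, b), _mm_sub_ps(b, a));
}
```

Counting the compare, that is four instructions where the scalar side gets by with a compare and one cmov.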
Nick said:
Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain't perfect, but it's what makes my PC tick and almost everyone else's. As for x87, yes the register stack can be annoying, but its design started in 1976 and uses the robust IEEE 754 standard (also created by Intel) we still use today.
Revisionist kids these days... :)
You know, there was a (computing) world before Intel, and it will still be there when everybody has forgotten about them.
Repeat after me: stack-based FPUs stink; they never ever made sense. Ever.
Again, it might come as a surprise, but others got it mostly right:
http://en.wikipedia..../Motorola_68882
Now, rather than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rose-tinted glasses.
http://www.cs.berkel...s/754story.html