one word: interesting =)


hehe, i’ve gone through this quite a while ago, sitting in school, bored by the math lessons i knew already. i tried different ways to approximate sine, cosine, and all the others, but i don’t remember them. i didn’t have your idea to use a lerp between the parabola and the squared parabola, though. i think i used x^3 because it looks more like sine by default.

exp was great: you could split the value into an integer and a fractional part, do the integer part with bit shifting, and the fraction with some nice approximation. something like that, i remember..

man, thats years back :D

hm.. you should present a fast way to fit a number into the -pi/2 to +pi/2 range.

is there a fractional() instruction in SSE? or some modulus directly?

Hello,

This is my first post here, but I’ve visited here for a while now.

I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision.

Here’s the code, if you’d like to have a look to make sure I didn’t do something that might be giving false figures.

```
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

const float PI     = 3.14159265358979323846264338327950288f;
const float PI2    = 6.28318530717958647692528676655900577f;
const float PID2   = 1.57079632679489661923132169163975144f;
const float PI_SQR = 9.86960440108935861883449099987615114f;

template <typename T, bool accurate>
T FastSin(const T& theta)
{
    // note: the constants are stored as float, which limits precision for T = double
    const T B = 4 / PI;
    const T C = -4 / PI_SQR;
    T y = B * theta + C * theta * std::abs(theta); // the parabola
    if (accurate)
    {
        // const T Q = 0.775;
        const T P = 0.225;
        y = P * (y * std::abs(y) - y) + y; // Q * y + P * y * abs(y)
    }
    return y;
}

template <typename T>
T randf() // uniform in [0, 1)
{
    return (T)((double)(std::rand() % RAND_MAX) / (double)RAND_MAX);
}

void CheckAccuracy()
{
    const size_t checks = 10000;
    typedef double real;
    std::cout << "Checking accuracy randomly." << std::endl;
    std::srand((unsigned)std::time(0));

    real Sin[checks], fastSin[checks], fastAccurSin[checks];
    for (size_t i = 0; i < checks; i++)
    {
        real randFloat = randf<real>(); // note: only samples [0, 1), not the full [-pi, pi]
        Sin[i]          = std::sin(randFloat);
        fastSin[i]      = FastSin<real, false>(randFloat);
        fastAccurSin[i] = FastSin<real, true>(randFloat);
    }

    real SinToFastSinAccurs[checks], SinToFastAccurSinAccurs[checks];
    for (size_t i = 0; i < checks; i++)
    {
        // relative error: (measured - accurate) / accurate
        SinToFastSinAccurs[i]      = (fastSin[i] - Sin[i]) / Sin[i];
        SinToFastAccurSinAccurs[i] = (fastAccurSin[i] - Sin[i]) / Sin[i];
    }

    real FinalSinToFastSin = 0, FinalSinToFastAccurSin = 0;
    for (size_t i = 0; i < checks; i++)
    {
        FinalSinToFastSin      += SinToFastSinAccurs[i];
        FinalSinToFastAccurSin += SinToFastAccurSinAccurs[i];
    }
    FinalSinToFastSin      /= checks / (real)100.0; // average, in percent
    FinalSinToFastAccurSin /= checks / (real)100.0;

    std::cout << "FinalSinToFastSin %: " << FinalSinToFastSin
              << "\nFinalSinToFastAccurSin %: " << FinalSinToFastAccurSin << std::endl;
}

int main()
{
    CheckAccuracy();
    std::system("pause"); // Windows-only
    return 0;
}
```

Slaru

Note: Slaru, i’m sure you could gain a couple ulp by adding more digits to your constants. :blink:

@Nick

I’d love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.

See this patch for gcc contributed by AMD. All vectorized. May be too precise and slow for your purpose tho.

http://gcc.gnu.org/ml/gcc-patches/2006-03/msg01611.html

Further down the thread there are also some rather amusing remarks from Intel drones.

@davepermen

is there a fractional() instruction in SSE? or some modulus directly?

No. There’s some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that’s it. Heck, even the usual cond move is missing.

@Slaru

I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision. Here’s the code, if you’d like to have a look to make sure I didn’t do something that might be giving false figures…

You’re measuring relative errors, which are bigger than absolute errors. You can however improve relative accuracy a tiny bit by using Q = 0.782, P = 0.218, as mentioned in the article.

By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.

@Nick

By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.

Relative errors do make sense when talking about signals, because the amplitude isn’t always equal to one. And it also makes sense to compare the error for different angles …

@tbp

No. There’s some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that’s it. Heck, even the usual cond move is missing.

It also has very fast approximations of the reciprocal and reciprocal square root (12-bit mantissa precision). For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.

@bramz

Relative errors do make sense when talking about signals, because the amplitude isn’t always equal to one. And it also makes sense to compare the error for different angles …

In particular, it also makes sense to talk about the total harmonic distortion.

@bramz

Relative errors do make sense when talking about signals, because the amplitude isn’t always equal to one. And it also makes sense to compare the error for different angles …

Sure, but when the signal is around zero the relative error can explode even when the approximation is almost coincident with the real curve.

So what we really need is a function that weighs the error absolutely near zero, and somewhat relatively towards infinity. exp(-x) seems well suited. This way the maximum ‘exponential’ error of the parabola is 0.037, compared to 0.075 for the Taylor series. The squared parabola gives me 0.0006 with adjusted Q = 0.7775, P = 0.2225.

No matter how you look at it, this is a fast and accurate approximation. :happy:

@Nick

So what we really need is a function that weighs the error absolutely near zero, and somewhat relatively towards infinity. exp(-x) seems well suited.

Let us not mix two kinds of error measurements, ok? =) The reason why it goes towards infinity there is because … well … because you’re making an infinitely large error, no matter how you look at it. However, we don’t need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion, as I mentioned in the other post. This is in particular true if you’re using this function to generate signals.

And btw, I _do_ like your approximation ;)

@Nick

It also has very fast approximations of the reciprocal and reciprocal square root (12-bit mantissa precision).

… which are pretty much useless by themselves. You’ll need a Newton-Raphson step to get within a couple ulp. And then you’ll need to fix the NaN black holes at 0 and infinity.

Even if you only care about ]0, inf[, latency has already gone through the roof.

Anyway, you’re right i forgot to mention them :)

@Nick

For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.

And someone posted code to get something looking like a sin/cos with a
bunch of adds & muls… so that doesn’t say much ;)

I stand by my initial point: paying the latency, decoding bandwidth etc. for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with… you know what.

By the way, for those who think the squared parabola factor fell out of thin air, expand the formula and compare it with the Taylor series about x = pi/2:

Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2

= Q 4/pi x - (Q 4/pi^2 - P 16/pi^2) x^2 - P 32/pi^3 x^3 + P 16/pi^4 x^4

~= 0.986 x + 0.0507 x^2 - 0.232 x^3 + 0.0370 x^4

sin(x ~ pi/2)

~= 1 - 1/2 (x - 1/2 pi)^2 + 1/24 (x - 1/2 pi)^4

~= 0.0200 + 0.925 x + 0.117 x^2 - 0.262 x^3 + 0.0417 x^4

Note they are both polynomials with roughly comparable coefficients. The parabola method just has ‘corrected’ coefficients to make sure sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 and the errors are minimal.

In fact you can also approach the problem starting from a generic order 4 polynomial:

A + B x + C x^2 + D x^3 + E x^4

Using the restrictions sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 yields:

A = 0, B = 1/(4 pi) (16 + 2 D pi^3 + 3 E pi^4), C = -1/4 (16 + 6 D pi^3 + 7 E pi^4) / pi^2

D and E are still free parameters, so we can choose two more restrictions, for example that the first and second derivative at pi/2 should be 0 and -1 respectively. This yields:

A = 0, B = -1/2 (-16 + pi^2) / pi, C = 1/2 (-48 + 5 pi^2) / pi^2, D = -4 (-8 + pi^2) / pi^3, E = 2 (-8 + pi^2) / pi^4

which gives us:

0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4

This is pretty close to the squared parabola method and has some nice properties, but the maximum error is a little bigger. By the way, for order 3 and 4 polynomials we both get the simple parabola.

Anyway, the ad-hoc explanation is a bit easier to understand I think, but this gives it a bit of theoretical background. It’s nothing more than fitting a function with a polynomial using some restrictions (exact points and derivatives). Minimizing the error is another very useful restriction…

@bramz

The reason why it goes towards infinity there, is because … well … because you’re making an infinitely large error, no matter how you look at it.

Not me, Taylor. :blink:

However, we don’t need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you’re using this function to generate signals.

How does that work exactly? The Wikipedia explanation is a bit cryptic…

And btw, I _do_ like your approximation ;)

I wasn’t doubting that. :happy:

@Nick

Not me, Taylor. :blink:

ok, fair enough ;)

How does that work exactly? The Wikipedia explanation is a bit cryptic…

Well, basically, you’ll need to decompose your ‘signal’ (-pi..pi) into its harmonics … you know, things like

y(theta) = sum_{k=1}^inf A_k sin(k theta)

(Normally you won’t have cosine terms, since you have an odd function, except maybe for a DC term, though I don’t think you have one. If you would have cosine terms, you’d better write it down in exponential form with complex amplitudes.)

Then the THD is defined as the ratio of the **RMS** of the harmonics to the fundamental:

THD = sqrt( sum_{k=2}^inf A_k^2 ) / |A_1|

One thingy though: it silently ignores the DC term … so if you simply would add a number like 0.123456 to y, the THD would not change.

**edit: need to use the total RMS, not total power**

@tbp

… which are pretty much useless by themselves. You’ll need a Newton-Raphson step to get within a couple ulp. And then you’ll need to fix NaN blackholes at 0 and infinity.

Even if you only care about ]0, inf[ latency has already gone through the roof.

That’s really the best possible. Remember that a divps/sqrtps is 39
clock cycles **non-pipelined**. So if you only need 12-bit or 24-bit
mantissa precision these instructions are great. What more could you
expect? They can’t break the laws of semiconductor physics.

I stand by my initial point: paying the latency, decoding bandwidth etc. for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with… you know what.

Why is that silly? For many situations, like the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and branch prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.

So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they’re doing. x86 ain’t perfect, but it has legacy (tons of software support) and it’s flexible (new instructions get added every generation). It’s certainly possible to create faster chips with a new instruction set tuned for today’s applications, but if it’s not flexible enough it’s not going to survive. Intel beat Alpha and PowerPC with affordable and fast processors.

@Nick

That’s really the best possible. … What more could you expect?

1998 has called and wants the 3DNow pfrcp*/pfrsq* back.

@Nick

Why is that silly? For many situations, like for the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and jump prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.

You can spin it till hell freezes over; the Intel & AMD optimization manuals clearly say: thou shalt remove thy ugly branches. If you don’t give me any credit, maybe that should get your attention.

So we have to remove branches, and for that, more often than not, we’ll get into the cond move pattern.

Now let’s see what we have had for ages on the integer side of those same CPUs. Surprise! There are both those “binary logic” ops and… tada … cmov.

I’m sure the dudes at Intel thought they had encoding bits and silicon to waste and were like, let’s add another instruction just for the fun of it.

Really, i’m asking for consistency. And common sense.

@Nick

So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they’re doing. x86 ain’t perfect but it’s legacy (tons of support for it) and it’s flexible (new instructions get added every generation).

Err, correct me if i’m wrong, but i’m pretty sure there were like zero legacy issues when Intel first introduced SSE.

And allow me to put the “they know what they’re doing” back into proper historical context: these are the same guys that went with a segmented memory model at the same time Motorola was going the flat way. Or those that later saw the light again, after thinking a stack-based FPU was the best thing since sliced bread while everyone and his brother was using registers. :whistle:

Please save me that kind of PR, because even if my memory may not be what it used to be, i was there.

@tbp

1998 has called and wants the 3DNow pfrcp*/pfrsq* back.

pfrcp has 14-bit mantissa precision, pfrsqrt 15-bit. That’s not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for, it shouldn’t matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on the Athlon 64 as well.

You can spin it till hell freezes over, Intel & AMD optimization manuals clearly say: thou shall remove thy ugly branches. If you don’t lend me any credit maybe that should get your attention, i think.

Sorry, what credit are you asking for then? :huh: And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4), so it’s hard to get my attention with that, sorry. I also took several courses on processor architecture and digital design, if that matters…

So we have to remove branches, and for that, more often than not, we’ll get into the cond move pattern.

Now let’s see what we have had for ages on the integer side of those same CPUs. Surprise! There are both those “binary logic” ops and… tada … cmov.

I’m sure the dudes at Intel thought they had encoding bits and silicon to waste and were like, let’s add another instruction just for the fun of it. Really, i’m asking for consistency. And common sense.

My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that’s not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that’s a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.

Implementing cmov in the integer pipeline is easy since flags have always been there. They did the right thing for SSE by giving it a cleaner modern architecture. That means less consistency but I don’t mind. A single scalar SSE version of cmov would probably have been doable but then there is no vector equivalent (no consistency) and it’s pretty useless anyway. For predictable branches you can still use comiss.

So I’m not sure what you’re asking for exactly… :surrender

And allow me to put the “they know what they’re doing” back into proper historical context: these are the same guys that went with a segmented memory model at the same time Motorola was going the flat way. Or those that later saw the light again, after thinking a stack-based FPU was the best thing since sliced bread while everyone and his brother was using registers. :whistle:

Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain’t perfect, but it’s what makes my PC tick and almost everyone else’s. As for x87, yes the register stack can be annoying, but its design started in 1976 and uses the robust IEEE 754 standard (also created by Intel) we still use today.

Please save me that kind of PR, because even if my memory may not be what it used to be, i was there.

I’m sorry, but what PR? I have an AMD desktop and an Intel laptop. And I’m willing to change to something else if it’s fast, affordable, easy to develop for, and flexible enough for the future.

Anyway, I have good hopes for the future of x86. Intel finally realized clock frequency isn’t the only thing that determines performance. By increasing issue width, making SIMD units 4-wide instead of only 2-wide, and using reverse hyper-threading, floating-point performance can still increase a lot.

@Nick

pfrcp has 14-bit mantissa precision, pfsqrt 15-bit. That’s not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn’t matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.

Reciprocals and square roots do happen outside of vector normalization, you know.

Fact is, i need both of them, fast and robust, and if possible with a bit pattern that doesn’t look like i got it from an RNG.

Plus i wasn’t addressing the relative latency of 3DNow! & SSE but functionality. It seems you’ve missed the wildcard in my previous post, so please check the “3DNow! Technology Manual” for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.

Compare that to:

```
// Note: cmpeq, mulps, etc. are thin wrappers over the _mm_*_ps intrinsics,
// and rt::cst::section holds preloaded constants (0.5f and 3.0f here).
static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) {
    const __m128
        mask   = cmpeq(arg, all_zero()),           // lanes where arg == 0
        nr     = rsqrtps(arg),                     // ~12-bit rsqrt estimate
        muls   = mulps(mulps(nr, nr), arg),
        filter = andnps(mask, muls),               // kill the NaN-feeding term for arg == 0
        beta   = mulps(rt::cst::section.half, nr),
        gamma  = subps(rt::cst::section.three, filter),
        final  = mulps(beta, gamma);               // Newton-Raphson: 0.5 * nr * (3 - arg * nr^2)
    return final;
}
```

I could have spammed the same kind of ugliness for related ops like 1/x etc… Not so attractive anymore, eh? (or robust, in fact)

@Nick

Sorry, what credit are you asking for then? :huh: And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4), so it’s hard to get my attention with that, sorry. I also took several courses on processor architecture and digital design, if that matters…

That’s fine & dandy, but one has got to wonder if we’re reading the same manuals.

From a quick search through the **IA-32 Intel Architecture Optimization Reference Manual, Order Number: 248966-012**:

2-85, Vectorization

“User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of
conditional branches inside loops and consider using SSE instructions to
eliminate branches”

2-89, User/Source Coding Rules

“User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of
conditional branches inside loops and consider using SSE instructions to
eliminate branches. 2-86”

3-37, Instruction Selection

“The following section gives some guidelines for choosing instructions
to complete a task. One barrier to SIMD computation can be the existence
of data-dependent branches. Conditional moves can be used to eliminate
data-dependent branches. Conditional moves can be emulated in SIMD
computation by using masked compares and logicals, as shown in
Example 3-21.”

4-2, General Rules on SIMD Integer Code

“Emulate conditional moves by using masked compares and logicals
instead of using conditional branches.”

Should i quote some more or did you get the point by now?

@Nick

My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that’s not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that’s a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.

Excuse me, but isn’t one of the main points of a CISC arch to provide code compression via byzantine instruction encoding?

I was kind enough to restrict my example to x86, and i’ve never ever alluded to flags. I’ve merely pointed out the inconsistency where on one hand you have cmov* on the ALU side and fcmov* on the FPU side, and nothing like that for SSE. You’ll notice those cmov* and fcmov* could also be emulated.

I would have expected a single op doing exactly what an andn/and/or sequence does; admittedly, ternary ops aren’t the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator), or encode it as a prefix, or whatever.

It makes no sense, when decoding bandwidth is scarce and you’re struggling with throughput, to say the norm is to replace conditional branches with a 3 freaking op sequence.

@Nick

Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain’t perfect, but it’s what makes my PC tick and almost everyone else’s. As for x87, yes the register stack can be annoying, but its design started in 1976 and uses the robust IEEE 754 standard (also created by Intel) we still use today.

Revisionist kids these days… :)

You know, there was a (computing) world before Intel, and it will still be there when everybody has forgotten about them.

Repeat after me: stack-based FPUs stink, they never ever made sense. Ever.

Again it might come as a surprise but others got it mostly right:
http://en.wikipedia.org/wiki/Motorola_68882

Now, better than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rose-tinted glasses.

http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

@tbp

Reciprocals and square roots do happen outside of vector normalization, you know.

Indeed I know.

Fact is, i need both of them, fast and robust, and if possible with a bit pattern that doesn’t look like i got it from an RNG.

Then please explain what you’re using it for exactly and what’s the source of your frustration. Because right now you’re barking up the wrong tree.

Plus i wasn’t addressing the relative latency of 3DNow! & SSE but functionality. It seems you’ve missed the wildcard in my previous post, so please check the “3DNow! Technology Manual” for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.

Compare that to:`static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) { const __m128 mask = cmpeq(arg, all_zero()), nr = rsqrtps(arg), muls = mulps(mulps(nr, nr), arg), filter = andnps(mask, muls), beta = mulps(rt::cst::section.half, nr), gamma = subps(rt::cst::section.three, filter), final = mulps(beta, gamma); return final; }`

I could have spammed the same kind of uglyness for related ops like 1/x etc… Not so attractive anymore eh? (or robust in fact)

I already know how Newton-Raphson refinement works, and the related 3DNow! instructions; I use it myself. Now what’s your point? That SSE intrinsics can be unattractive? I got hundreds of lines of run-time intrinsics that are very tricky, and it’s *not* all SSE code.

Should i quote some more or did you get the point by now?

No need to quote anything, I know the optimization manual by heart. I still don’t get your point though. You want an SSE cmov? Show me how that would work.

Excuse me, but isn’t one of the main points of a CISC arch to provide code compression via byzantine instruction encoding?

As I’m sure you know, every modern processor is RISC behind the decoder. So even if they added every exotic instruction you can think of, it wouldn’t execute much faster than when you implement it yourself. That’s also why SSE is very RISC-like, and adding more instructions would just be a waste of opcodes and microcode ROM. Compact code ain’t that important any more.

Besides, just because you see a use for certain instructions in your specific application doesn’t mean they need to be added to the instruction set. They constantly have to evaluate transistor cost and overhead versus speedup over *all* existing software. It simply pays off more to improve the architecture and not touch the instruction set. Even though it’s CISC, you can’t go adding instructions like mad.

I would have expected a single op doing exactly what an andn/and/or sequence does; admittedly, ternary ops aren’t the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator), or encode it as a prefix, or whatever.

Sorry, that’s not a good idea. You would be giving a nice modern architecture a special-purpose accumulator register and have instructions with side effects. That’s actually worse than having an FPU with a register stack.

Ternary operations would require register files with extra read ports. The area of a register file is proportional to the square of the number of ports. It also has a direct influence on latencies. The only time that can be acceptable is if all operations are ternary (which saves move operations).

It makes no sense, when decoding bandwidth is scarce and you’re struggling with throughput, to say the norm is to replace conditional branches with a 3 freaking op sequence.

There is nothing we can change about the masking sequence. But that’s not the actual problem anyway. Future x86 CPUs will just need better decoders, higher issue width and wider execution units, like I said before. That’s infinitely better than breaking the architecture for that one application where it could add a few percent performance.

Revisionist kids these days… :)

Don’t mock me. I’m only giving explanations, not excuses. I too wish that x86 was a cleaner architecture, or that something superior took over. For the time being we’re just going to have to live with it and understand its limitations so we still get the best out of it.

You know, there was a (computing) world before Intel, and it will still be there when everybody has forgotten about them.

Indeed I know.

Repeat after me: stack-based FPUs stink, they never ever made sense. Ever.

Early compiler technology was stack based for the most part, so the FPU’s register stack made it easier to write compilers. Nowadays it makes no sense to write FPU code manually, compilers generate perfect code, so why bother. Also, fxch has zero latency, so it’s pretty much like accessing a flat register file anyway.

Again it might come as a surprise but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882

That’s not a surprise at all considering it was designed 10 years later.

Now, better than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rose-tinted glasses.

http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

I have already read that article. Not so long ago actually. I was looking for a way to detect problems caused by NaNs, you remember. The answer is not in the article but it was an interesting read. But what’s your point? It’s Intel who initiated the standard. It would be foolish to ignore their influence on the digital world.

As a matter of fact, I got cool new glasses last Saturday, black with copper orange. Not rose-tinted though, crystal clear.

So, I did a quick test on the THD, and this is what I’ve found:

taylor: 2.4%

parabolic: 3.8%

extra precision: 0.078%

So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the taylor one. That’s of course because you’re making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it’s continuous.

The parabolic one also has only odd harmonics (k = 2n+1), while Taylor has both even and odd harmonics. The latter, however, might be in favour of the Taylor series when it comes to psychoacoustics (how we experience the signal when we hear it), because the third harmonic gives a ‘covered’ sound, while having the 2nd, 3rd, 4th, … all together gives a ‘brassy’ sound. It’s a bit like transistors vs tubes …

It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD.

Bramz

BTW, that other one you proposed, 0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4: its THD is 0.32%.

This is definitely the most interesting and detailed post i have seen on any forum :)

I’m going to try this out myself and will keep you posted if i do get something for the other functions :)

@bramz

So, I did a quick test on the THD, and this is what I’ve found:

taylor: 2.4%

parabolic: 3.8%

extra precision: 0.078%

So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the taylor one. That’s of course because you’re making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it’s continuous. It also has only odd harmonics (k = 2n+1), while taylor has both even and odd harmonics.

Thanks! :yes: THD really is an interesting error measure. Looking at the graphs, it makes sense that the single parabola is worse than the Taylor series, except towards x = pi. Anyway, the Taylor series is totally useless compared to the squared parabola method. :happy:

It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD.

I’ll give that a try when I find the time.

@Nick

I already know how Newton-Raphson refinement works, and the related 3DNow! instructions; I use it myself. Now what’s your point?

My point was that not providing instructions to refine results with SSE was a mistake, and that, as AMD demonstrated in 1998, it was possible to provide that kind of functionality on x86 in an elegant and effective fashion; see the code posted up there, where it takes 8 ops and 2 memory accesses to get within a couple ulp over a restricted domain. That doesn’t sound like “really the best possible” in my book.

While it makes me all warm inside to know you’ve read all the books - “la chair est triste, hélas! et j’ai lu tous les livres” (“the flesh is sad, alas! and I have read all the books”) - written awesome software, and are still on track to cure cancer, i was kinda expecting after 3 posts that you would answer the point, as opposed to waving hands and making broad claims.

@Nick

<…>

There is nothing we can change about the masking sequence. But that’s not the actual problem anyway. Future x86 CPUs will just need better decoders, higher issue width and wider execution units, like I said before. That’s infinitely better than breaking the architecture for that one application where it could add a few percent performance.

Which can be summed up as: let’s comfortably forget there was nothing to break when SSE was introduced (introduction, as in it didn’t exist before) and keep faith in Moore’s law to save us all.

Btw, I’m glad you decreed that wasn’t a problem, but could you convince my current cpu to comply with your ruling? Thank you.

@Nick

Early compiler technology was stack based for most part. So the FPU’s register stack made it easier to write compilers.

Eh? I must be living in an alternate reality.

Oh and fxch was made almost ‘free’ by the time of the P3, that is a couple of decades later.

@Nick

I have already read that article. Not so long ago actually.

Then, why do you say “they’ve created it”? It makes no sense.

x87 wasn’t the first FPU to be designed, and as a matter of fact there was no PC to speak of before some wacko at IBM picked the worst processor design of the time back in 1980.

No I’m not frustrated. No I’m not barking. I just find it incredibly funny that one could put “they know what they’re doing” and Intel in the same sentence given their track record. Circumstances and fortunate events paving the way for a monopoly in a given sector are not to be mixed up with engineering prowess.

Note that I’m not saying they systematically mess it up. I’m just asking for some critical judgement.

So please compare, gah, 4 iterations of SSE to 3DNow!, AltiVec, etc etc…

You could start here,
http://www.eecg.toronto.edu/\~corinna/vector/svx/index.html

On that note, since the Moore’s law argument was made, there’s no need for further discussion really. Forgive me if I’ll selectively remember your code and not your amusing remarks about books you’ve read or courses you’ve taken.

Thanks for your attention.

We could also use a Taylor series for the exp function.

I did it for my math library (http://glm.g-truc.net) and they are quite efficient but low precision. My functions are built to use “half precision”.

```
inline float fast(float x)
{
    // The following (commented) Horner form looks nicer and performs the same
    // in release mode, but is slower in debug mode.
    // return 1.0f + x * (1.0f + x * 0.5f * (1.0f + x * 0.3333333333f * (1.0f + x * 0.25f * (1.0f + x * 0.2f))));
    float x2 = x * x;
    float x3 = x2 * x;
    float x4 = x3 * x;
    float x5 = x4 * x;
    return 1.0f + x + (x2 * 0.5f) + (x3 * 0.1666666667f) + (x4 * 0.041666667f) + (x5 * 0.008333333333f);
}
```

I also tried it with the log function, but even though the result was good, VC7.1’s is really the fastest.

It also works well with the asin, acos, atan and tan functions:

```
inline float fastAsin(float x)
{
    return x + (x * x * x * 0.166666667f) + (x * x * x * x * x * 0.075f) + (x * x * x * x * x * x * x * 0.0446428571f) + (x * x * x * x * x * x * x * x * x * 0.0303819444f);// + (x * x * x * x * x * x * x * x * x * x * x * 0.022372159f);
}

inline float fastAcos(float x)
{
    return 1.5707963267948966192313216916398f - glm::fastAsinGTX(x); //(PI / 2)
}

inline float fastAtan(float x)
{
    return x - (x * x * x * 0.333333333333f) + (x * x * x * x * x * 0.2f) - (x * x * x * x * x * x * x * 0.1428571429f) + (x * x * x * x * x * x * x * x * x * 0.111111111111f) - (x * x * x * x * x * x * x * x * x * x * x * 0.0909090909f);
}

inline float fastTan(float x)
{
    return x + (x * x * x * 0.3333333333f) + (x * x * x * x * x * 0.1333333333333f) + (x * x * x * x * x * x * x * 0.0539682539f);
}
```

I haven’t tried it with SSE instructions and I didn’t know this “optimisation”, so I’m very interested :p

Note that i’m not saying they systematically mess it up. I’m just that asking for some critical judgement.

I think this is the essential part of our discussion. I fully agree that x86 ain’t perfect. SIMD performance is definitely below par compared to several other CPUs. So yes I’m critical about that.

Despite that, these processors are cheap and they are everywhere. You may regret that, but frankly I don’t care much how this came to be. I just want to get the maximum out of them. There’s little point complaining about lack of instructions or the ‘right’ register layout. Heck, most people don’t even know assembly, they only care about ‘C++’ performance. And that’s understandably what Intel focusses on, by including mainly (but not exclusively) instructions that can be used by compilers, fixing old mistakes (fxch for free), improving jump prediction, removing the decoder bottleneck (trace cache), doubling integer throughput (rapid execution engine), HyperThreading, etc. And considering that developing a new architecture takes 4-6 years and they’re limited by the legacy instruction set I still think they’re doing the best possible. It’s always easy to criticise afterwards…

Apple’s switch from IBM PowerPC to Intel also sounds to me like they admit Intel produces great processors. They even claim it’s 2x faster. And they must be looking forward too, knowing that Core will be very cost effective. They’re not switching for no reason. And I’m sure AMD is working on a great new architecture as well. Reverse HyperThreading is a great idea and an integrated memory controller for XDR could blow away Intel’s memory performance.

So I guess I’m a realistic optimist. x86 still has great potential and clearly the instruction set doesn’t matter all that much. Both Intel and AMD could beat AltiVec if they wanted to. They’re just waiting for the right moment to make this economical, which might be sooner than expected…

tbp & Nick:

If you’ll forgive me for disturbing when the masters are talking, and for being completely off topic…

First of all, I have never done any asm coding myself, and I am by no means literate on low-level processor architecture. But I am interested!

What do you think of the Cell architecture for low-level coding? Will it be a nightmare to work with, with a totally messed up design incorporating every flaw you could make… Or will it be pure pleasure with total control and possibilities for writing high performing code?

It seemed neat to me, although I wouldn’t really know.

@geon

What do you think of the Cell architecture for low-level coding? Will it be a nightmare to work with, with a totally messed up design incorporating every flaw you could make… Or will it be pure pleasure with total control and possibilities for writing high performing code?

Hi geon,

It’s nice to hear your have an interest in processor architecture!

But I’m not sure if this thread is the ideal place to discuss it. It would be great if you could start a topic about it in the General Development forum though. :yes: I don’t know all that much about Cell myself but I’ll do some research and hopefully we can have an interesting discussion there.

Cheers,

Nick

P.S: Just a thought: Actually a ‘Hardware’ forum could be useful. It would be the place where we can discuss the low level stuff that has little to do with the programming itself but can be essential to create great games (like looking ahead what will be possible with future CPUs and GPUs). Anyone feel the same or is the current list of forums more than sufficient?

@tbp

Eh? I must be living in an alternate reality.

Oh and fxch was made almost ‘free’ by the time of the P3, that is a couple of decades later.

fxch was already almost free on the first generation Pentium. As far as I remember it was the only float instruction that could execute in the V-pipeline, making it a practically free instruction for any sequence of fpu instructions.

Later on (P2) they broke up the fixed UV-pipeline design and made fxch truly free. It became more of a prefix-kind instruction (though it could still consume a cycle or two in extreme cases).

Anyway, I have to admit, writing stack based fpu code was never easy, even with fxch. Otoh one could outperform the compilers a lot back in those days; that was fun.

Hi Nick,

Thank you for your very interesting article.

I have a question though: you say Taylor behaves very nicely up to Pi/2. Since the shape in the range [0 ; Pi/2] is enough for any sine or cosine part, what about using only this part?
I would be interested in seeing a more fair comparison. :-)

@zavie

I have a question though: you say Taylor behaves very nicely up to Pi/2. Since the shape in the range [0 ; Pi/2] is enough for any sine or cosine part, what about using only this part? I would be interested in seeing a more fair comparison. :-)

They are all polynomials, they just have different coefficients. The Taylor series is a way to approximate a function around a certain point (0 in this case), but the coefficients it produces are not ideal for the best approximation in a range ([-pi, pi] in this case).

Using only the [-pi/2, pi/2] range, the Taylor series is more accurate. But that’s not too surprising since it uses more terms, making it slower to compute. Using a generic polynomial with the same number of terms and minimizing the error yields even better results. So again the Taylor series fails.

In fact the Taylor series starts from the same generic polynomial, but instead of using restrictions that minimize the error over a range it requires that the 0th derivative, 1st derivative, …, nth derivative at 0 are equal to those of the function to be approximated.

A few months ago I implemented the method described by Nick using fixed-point math only. This routine works fine for me and maybe someone else will find it useful, so I’ve decided to post it here :)

```
fp16 ne3d_sintp_02pi(fp16 angle)
{
fp16 Xi2,Xi3,Yi,Yi2,Yi3,Yi4;
angle = angle >> 1;
Xi2 = ((angle >> 1) * (angle >> 2)) >> 13;
Xi3 = (6640 * Xi2) >> 12;
Yi = (((20860 * angle) >> 13) - Xi3) >> 1;
Yi2 = (Yi * Yi) >> 14;
Yi3 = (14746 * Yi2) >> 16;
Yi4 = (25395 * Yi ) >> 14;
Yi = Yi4 + Yi3;
return Yi;
}
```

Hi iMalc,

Do a search for MinMax polynomials. These give you the best approximation.

This stuff here is a must read as well: http://www.research.scea.com/gdc2003/fast-math-functions.html

@iMalc

I once tried that, but the gains are very minimal. Furthermore, in the case of sine and cosine it’s faster to use a parabola and the square of -the same- parabola than to use a generic 4’th degree polynomial. Basically it puts a small restriction on the ‘shape’ of the polynomial, making it faster to compute. It also happens to make it efficient to compute the [-pi, 0] part and makes things perfectly symmetrical.

For other situations the more generic minimax algorithm can give excellent results. If you want a Real Challenge™ though try finding a fast approximation for pow and then share it with me. ;)

You might also be interested in this thread: Approximate Math Library.

Exact value for Q when minimizing the absolute error:

Q = (3150/(pi\^3)) - (30240/(pi\^5)) - 2

(Your Q=0.775160898644)

P+Q=1 -> P=1-Q

Please check out the original formula from Nick (first post on page 1). BTW, the external images of his post do not work anymore. Too sad…

@roosp

BTW, the external images of his post do not work anymore.

Thanks for noticing. I still have the images, but no place to host them. :surrender

@roosp

Yeah, I know that. I think I didn’t express myself correctly.

I want to know how to find (with maths) P (or Q) that minimizes the absolute error and the relative one.

@Nick

Thanks for noticing. I still have the images, but no place to host them. :surrender

I can host the images if you want.

@yashez

I can host the images if you want.

Thanks a lot for the offer, but it seems like I can’t edit the original post anyway. I’ve PM’ed our webmaster to see whether he could host the images on-site…

@yashez

Yeah, I know that. I think I didn’t express myself correctly. I want to know how to find (with maths) P (or Q) that minimizes the absolute error and the relative one.

I got the values from this website http://fastsine.tripod.com

Error plots and some code. Excellent list with references.

Ah, I have a couple more formulas for you to ponder.

Originally I calculated values for a generic degree 4 polynomial, but I like symmetry, so all the formulas that follow are symmetric around pi/2.

Starting with minimizing absolute error we get

\~ 0.985893x + 0.052042 x\^2 - 0.232915 x\^3 + 0.037069 x\^4

minimizing relative error

\~ 0.983251 x + 0.056247 x\^2 - 0.235056 x\^3 + 0.037410 x\^4

minimizing maximum error

\~ 0.988023 x + 0.048651 x\^2 - 0.231188 x\^3 + 0.036795 x\^4

And a graph of the errors over the range [0, pi]

green - min absolute error

blue - min maximum error

magenta - min relative error

red - nick’s formula

Then I also calculated some more… so another graph

green, blue, magenta, red - same as above

black - min absolute error of the derivatives

yellow - min sum of absolute error and absolute error of the
derivatives

cyan - min square of the error

And, an interesting note, nick’s formula intersects sin(x) at pi/6 and 5*pi/6.

Hi Nick,

I know this is an old post but I just wanted you to know it has really saved my bacon. I’ve been struggling with a high speed motor control loop and needed a fast sincos().

I’m running on a Luminary Micro at 50MHz. The gcc sincos (presumably taylor series) ran in 104uS. Your polynomial approximation ran at 19.2uS.

Believe it or not this was still too long. So I implemented it in fixed point and got the sincos down to 2.3uS.

Now that’s a spicy meatball!

@evarobotics

I know this is an old post but I just wanted you to know it has really saved my bacon. I’ve been struggling with a high speed motor control loop and needed a fast sincos().

I’m running on a Luminary Micro at 50MHz. The gcc sincos (presumably taylor series) ran in 104uS. Your polynomial approximation ran at 19.2uS.

Believe it or not this was still too long. So I implemented it in fixed point and got the sincos down to 2.3uS.

That’s an impressive result! I’m happy this approximation was of use to you. Recently someone also told me he successfully used it on an AVR microprocessor.

@Sol_HSA

I wonder why this hasn’t been submitted to game programming gems.. =)

On this site it’s accessible to anyone. Besides, an entire book could be dedicated to tricks like this. ;)

@v71

Because it didn’t encounter Nick’s likings

What do you mean?

@Nick

On this site it’s accessible to anyone. Besides, an entire book could be dedicated to tricks like this. ;)

Posting it in the book would make it accessible to a larger amount of people, and the GPG books are filled with nuggets like these, so it would definitely fit in. And would look nice in your CV. =)

@Groove

We could also used Taylor series for exp function.

I did it for my math library(http://glm.g-truc.net) and they are quiet efficient but low precision. My functions are build to use an “half precision”.

I also tryed with the log function but the result was good, VC7.1 is really fastest.

It’s also work well with asin, acos, atan and tan functions:

I have try with SSE instructions and I didn’t kown this “optimisation” so I very interested :p

I’ve used Nick’s Sin, Cosine stuff for an Iphone project that I am working on… those functions work a treat. Thanks Nick.

The reason I’ve resurrected this thread is that I need a fast way to compute the ACOS function. The one in the quotes is impractical on an ARM cpu with a VFP. All those multiplications will stall the VFP badly, so realizing that there are far better people at working out math than me: is there a nice cheap fast way to do this that will give reasonable results?

Thanks for any help.

For the arcsine guy, one big problem is that the derivative of arcsin/arccos (which is pi/2 - arcsin) goes to infinity at the boundary, so polynomials will never work there. One trick is to let yourself take a square root if you’re close to the edge; those evaluations will be slow (but still \~4 times faster than an arcsine), while for most of the range you are much faster. Yes, ugly if, but not much of a way around it…

Sample code, good to <10\^-4, average abs err of 10\^-5, \~10 times faster on average than arcsine, and apologies for the line breaks:

```
typedef double mytype; //try float here too, if you like
const mytype asin4_params1[5]={ 6.32559537178112e-05, 9.97002719101181e-01, 3.23729856176963e-02, 3.89287300071597e-02, 1.93549238398372e-01};
const mytype asin4_params2[7]={ 2.09625797161885e+01, -1.74835553411477e+02, 6.13575281494908e+02, -1.14033116228467e+03, 1.19159992307311e+03, -6.63957441058529e+02, 1.54421991537526e+02};
const mytype asin4_params3[4]={ 1.57080010233116e+00, -1.41437401362252e+00, 1.84777752400778e-03, -1.24625163381900e-01};
const mytype asin4_split1=0.6;
const mytype asin4_split2=0.925;
static inline mytype asin4(mytype x)
{
if (x<asin4_split1)
return asin4_params1[0]+x*(asin4_params1[1]+x*(asin4_params1[2]+x*(asin4_params1[3]+x*(asin4_params1[4]))));
if (x<asin4_split2)
return asin4_params2[0]+x*(asin4_params2[1]+x*(asin4_params2[2]+x*(asin4_params2[3]+x*(asin4_params2[4]+x*(asin4_params2[5]+x*asin4_params2[6])))));
mytype xx=sqrt(1-x);
return asin4_params3[0]+xx*(asin4_params3[1]+xx*(asin4_params3[2]+xx*asin4_params3[3]));
}
```

Am a bit late to the party here. Nicks sin() is great.

I’ve implemented it for the iPhone in ARM assembler using their VFP SIMD unit, getting about 10.5 times the std lib sinf() performance.

You can save some work by combining range reduction with the evaluation, and by making the evaluation operate on the range -0.5…0.5 rather than -pi/2…pi/2.

The ARM VFP unit is heavily pipelined, and one issue with this code is that it’s not very friendly to deep pipelines - lots of dataflow dependencies in there.

```
float fast_sin(float x)
{
    // simplify the range reduction:
    // reduce to a -0.5..0.5 range, then apply the polynomial directly
    const float P = 0.225f; // const float Q = 0.775f;
    x = x * M_1_PI;
    int k = (int)roundf(x);
    x = x - k;
    float y = (4 - 4 * fabsf(x)) * x;  // fabsf, not integer abs
    y = P * (y * fabsf(y) - y) + y;
    return (k & 1) ? -y : y;
}
```

I ended up using this in the end:

```
inline float fastAcos(float x)
{
    return sqrt(1-x)*(1.5707963267948966192313216916398f +
        x*(-0.213300989f + x*(0.077980478f + x*-0.02164095f)));
}
```

When interleaved with other code the Arm/VFP assembly version pipelines pretty good.

@Andy201

1.5707963267948966192313216916398f

LOL. You realize float only has about 7 or 8 decimal digits of precision, right?

Instead of LOL, the compiler will take that into account… The thing is the code works, it gives the correct results. I dare you to try it. Jesus, you are a prized prick.

If you can come up with anything as fast as that, which will pipeline with an ARM vfp, then go ahead - that square root is almost free when pipelined. But I guess you can’t…

@Andy201

the compiler will take that into account…

Of course it will. I’m not claiming the code doesn’t work. Don’t take it so personally, I was just amused by the unnecessary number of digits. ;)

I have a sincos routine using 16bit fractional integers, and using the extra term mentioned here the results are quite impressive.

It was also based on code I found elsewhere that I’ve lost the reference to, using a method to calculate sin and cos at the same time, sharing some of the calculations. I don’t claim to know in detail how the underlying theory of it works.

It gives close to 4 bits of error when measured RMS; in fact, strangely enough, the fractional int scores a lower RMS error than the double version.

As it’s for an FFT on a 10-bit ADC this should be sufficient.

I tested the accuracy of the code using C# for ease. C# is a bit picky about casting: if you multiply two 16-bit numbers the result becomes 32 bits, so you have to cast it back to 16 bits to make sure. I also had the run-time check for overflow turned on for thorough checking.

I’ve not done timing analysis yet on the target, as I’m porting some code to a different CPU.

```
/// <summary>
/// test for fast sincos functions using signed fractional ints,
/// results compared well to using double.
/// fractional int is stored as n*0x8000 where n is from -1 to less than 1
/// ie fixed point where there are no places before the point.
/// +1.0 cant be stored, but -1.0 can , so the functions have to be adjusted slightly so they dont ever return +1.0
/// multiply two fractional ints and you have a 30bit number plus 2 sign bits
/// so you need to left shift to get 31 bits+sign, then take the top 16 bits to get a signed fractional int.
/// </summary>
static bool accurate = true;
/// <summary>
/// based on sin=3/4 -x^2 +x (where x=angle-1/4 turn)
/// </summary>
/// <param name="phase">angle 0-0xffff for 0-360' </param>
/// <returns>maxint-minint</returns>
public static SinCos sin_i(Int16 phase)
{
Int16 x;
x = (Int16)((phase & 0x3fff) - 0x2000);
x *=2;
Int16 x2 = (Int16)((x * (Int32)x)/0x8000);
//instead of dividing by 0x8000 efficiency can be made by shifting left 1 then taking high word.
Int16 y = (Int16)(0x5fff - x2);
SinCos r;
if ((phase & 0x4000) != 0)
r = new SinCos(y - x, -x - y);
else
r = new SinCos(y + x, -x + y);
if (accurate)
{ //efficiences can be made by putting this in the above, but its clearer here
if ((phase & 0x4000) != 0)
r.cos = (int)((r.cos * (((Int16)(0.225 * 0x10000)) * -r.cos / 0x10000 + (Int16)(0x8000 * .775))) / 0x8000);
else
r.cos = (int)((r.cos * (((Int16)(0.225 * 0x10000)) * r.cos / 0x10000 + (Int16)(0x8000 * .775))) / 0x8000);
r.sin = (int)((r.sin * (((int)(0.225 * 0x10000)) * r.sin / 0x10000 + (int)(0x8000 * .775))) / 0x8000);
//instead of dividing by 0x8000 efficiency can be made by shifting right 15 bits
//(not enough bits to do shift left first as above)
}
if ((phase & 0x8000) != 0)
{
r.sin = -r.sin;
r.cos = -r.cos;
}
return r;
}
```

Here is a version of fast_sin in ARM NEON assembly. Please feel free to point out improvements.

```
float scalers2[4] = {-1.f, 1.f / 16.f, 0.225f, M_1_PI};
float fours[4] = {4.f, 4.f, 4.f, 4.f};
int addOne[4] = {1, 1, 1, 1};
int negativeOne[4] = {-1, -1, -1, -1};
float fin[4] = {x, 0.f,0.f,0.f};
float fout[4];
float toRet;
asm volatile(
//Load in data to start
"vld1.32 {q6}, [%[scalers2]] \n\t"
"vld1.32 {q3}, [%[fours]] \n\t"
"vld1.32 {q10}, [%[addone]] \n\t"
"vld1.32 {q11}, [%[negativeOne]] \n\t"
//"vld1.32 {q2}, [%[toRet]] \n\t"
"vld1.32 {q1}, [%[input]] \n\t"
//Need sin of PMt
//P = d13[0]
//M_1_PI = d13[1]
//x = x * M_1_PI; (scalers2[3])
"vmul.f32 q0, q1, d13[1] \n\t"
//int k = (int) x; //Was rounded
//Convert to int and back to float (q9 retains int version)
"vcvt.s32.f32 q9, q0 \n\t"
//Convert to float
"vcvt.f32.s32 q2, q9 \n\t"
//x = x - k;
"vsub.f32 q0, q0, q2 \n\t"
//4 * fabsf2(x) we can ignore this fabsf as we know that our x is always positive
"vmul.f32 q2, q0, q3 \n\t"
//(4 - 4 * fabsf2(x))
"vsub.f32 q2, q3, q2 \n\t"
//(4 - 4 * fabsf2(x)) * x;
"vmul.f32 q0, q0, q2 \n\t"
//fabsf2(y)
"vabs.f32 q2, q0 \n\t"
//y * fabsf2(y)
"vmul.f32 q2, q0, q2 \n\t"
//(y * fabsf2(y) - y)
"vsub.f32 q2, q2, q0 \n\t"
// P * (y * fabsf2(y) - y) (scalers[2]
"vmul.f32 q2, q2, d13[0] \n\t"
//P * (y * fabsf2(y) - y) + y;
"vadd.f32 q0, q0, q2 \n\t"
////If K is odd, then we have a negative number, else positive
//convert to int
//"vcvt.s32.f32 q8, q0 \n\t"
//Perform an And
"vand.I32 q2, q9, q10 \n\t"
//Negate
"vmul.I32 q2, q2, q11 \n\t"
//Make sure it is +1 or -1
"vorr.i32 q2, q2, q10 \n\t"
//if we are 1, we want a negative (convert to float)
"vcvt.f32.s32 q2, q2 \n\t"
//multiply by -1 (scalers2[0])
//"vmul.f32 q9, q9, d12[0] \n\t"
//return (k&1) ? -y : y;
"vmul.f32 q0, q0, q2 \n\t"
"vst1.f32 {q0}, [%[toRet]] \n\t"
:
:[scalers2] "r" (scalers2),
[fours] "r" (fours),
[addone] "r" (addOne),
[negativeOne] "r" (negativeOne),
[toRet] "r" (fout),
[input] "r" (fin)
:"q0","q1", "q2", "q3", "q4", "q6", "q8", "q9", "q10", "q11", "memory", "cc"
);
```

I tried implementing the faster/less accurate cos() function in PHP (5.3.0) and it was actually 5-20% slower.

I used the implementation from: http://lab.polygonal.de/2007/07/18/fast-and-accurate-sinecosine-approximation/

Using Windows Vista. Wampserver (php/mysql/apache). CPU: Intel Q6600.

@akreider

I tried implementing the cos() faster/less accurate function in php (5.3.0) and it was actually 5-20% slower.

If the script is interpreted then that’s to be expected. It takes longer to interpret many fast operations than to interpret one slow operation.

You probably want to look at ASP.NET for higher performance.

Hi Nick,

Thank you very much for posting this great tip. Sorry for this late reply to such an old post, but I just came across it. I’ve been trying to reproduce your results but I’m getting some strange output and I’m hopeful that you will take a look and let me know if you see my error.

When I plug your equations into an Excel spreadsheet, the graph looks great. Here is a link to a screenshot of the Excel graph:

http://www.flickr.com/photos/droidfan1/5120869190/

But when I write to a csv file from a C program, then graph the output, I get this strange output:

http://www.flickr.com/photos/droidfan1/5120869214/

I’m using gcc as my compiler. Here is the C program:

http://www.flickr.com/photos/droidfan1/5120266081/

Any suggestions are very much appreciated.

Best regards,

Rich

@DroidFan1


My previous post included only a screenshot of the C source code. Here is a link to the actual text of the source file:

Rich

Hi Rich, your code works fine here. The csv file contains a beautiful sine.

So something else must be going on. Does your csv file contain 0.5 at 30 degrees, 1.0 at 90 degrees and 0.5 at 150 degrees? If not then something must have gone wrong in this code on your platform. Else it’s an issue with displaying it in Excel.

@Nick

Thanks for the reply Nick.

No my csv file does not contain the expected data. Here it is:

```
0,0.000000
1,0.017222
2,0.034444
3,0.051667
4,0.068889
5,0.086111
6,0.103333
7,0.120556
8,0.137778
9,0.155000
10,0.172222
11,0.189444
12,0.206667
13,0.223889
14,0.241111
15,0.258333
16,0.275556
17,0.292778
18,0.310000
19,0.327222
20,0.344444
21,0.361667
22,0.378889
23,0.396111
24,0.413333
25,0.430556
26,0.447778
27,0.465000
28,0.482222
29,0.499444
30,0.516667
31,0.533889
32,0.551111
33,0.568333
34,0.585556
35,0.602778
36,0.620000
37,0.637222
38,0.654444
39,0.671667
40,0.688889
41,0.706111
42,0.723333
43,0.740556
44,0.757778
45,1.000000
46,1.022222
47,1.044444
48,1.066667
49,1.088889
50,1.111111
51,1.133333
52,1.155556
53,1.177778
54,1.200000
55,1.222222
56,1.244444
57,1.266667
58,0.680933
59,0.692673
60,0.704413
61,0.716153
62,0.727894
63,0.739634
64,0.751374
65,0.763114
66,0.774854
67,1.014961
68,1.030110
69,1.045258
70,1.060407
71,1.075556
72,1.090704
73,1.105853
74,1.121001
75,1.136150
76,1.151299
77,1.166448
78,1.181596
79,1.196745
80,1.211894
81,1.227042
82,1.242191
83,1.257339
84,1.272488
85,1.287637
86,1.302786
87,1.317934
88,1.333083
89,1.348232
90,1.363380
91,1.378529
92,1.393678
93,1.408826
94,1.423975
95,1.439124
96,1.454272
97,1.469421
98,1.484570
99,1.499718
100,1.514867
101,1.530016
102,1.545164
103,1.560313
104,1.575462
105,1.590611
106,1.605759
107,1.620908
108,1.636056
109,1.651205
110,1.666354
111,1.681502
112,1.696651
113,1.711800
114,1.726948
115,0.719695
116,0.725953
117,0.732211
118,0.738469
119,0.744728
120,0.750986
121,0.757244
122,0.763502
123,0.769760
124,1.001315
125,1.009390
126,1.017465
127,1.025540
128,1.033615
129,1.041690
130,1.049765
131,1.057840
132,1.065915
133,1.073990
134,1.082066
135,1.090141
136,1.098216
137,1.106291
138,1.114366
139,1.122441
140,1.130516
141,1.138592
142,1.146667
143,1.154742
144,1.162817
145,1.170892
146,1.178967
147,1.187042
148,1.195117
149,1.203192
150,1.211267
151,1.219343
152,1.227418
153,1.235493
154,1.243568
155,1.251643
156,1.259718
157,1.267793
158,1.275868
159,1.283944
160,1.292019
161,1.300094
162,1.308169
163,1.316244
164,1.324319
165,1.332394
166,1.340469
167,1.348545
168,1.356620
169,1.364695
170,1.372770
171,1.380845
172,0.133508
173,0.134285
174,0.135061
175,0.135837
176,0.136613
177,0.137390
178,0.138166
179,0.138942
180,0.139718
```

Rich

@Nick

Nick,

I changed the two abs() function calls to fabs(), and it’s working great. Thanks for the feedback. It was helpful to know that it was working properly on another machine.

Best regards,

Rich

@Nick

You’ll have to use a debugger to see where the calculation goes wrong on your machine.

Nick,

I’m using gcc. Which compiler are you using?

Rich

@DroidFan1

I changed the two abs() function calls to fabs(), and it’s working great. Thanks for the feedback. It was helpful to know that it was working properly on another machine.

Oh, I get it now. You’re compiling it as C code while I’m compiling as C++. Because C doesn’t support function overloading, abs() is only defined for integers.

I’m using gcc. Which compiler are you using?

I’m using Visual Studio 2005. If I set it to compile as C code I can replicate the results you’re seeing.

I also sometimes use GCC, for which I prefer the Code::Blocks IDE. It conveniently integrates the GDB debugger.

But if you’re developing on and for Windows I would definitely recommend using the free Visual Studio Express compiler.

Thanks Nick, yes that makes sense. I will check out Code::Blocks and Visual Studio Express. Thanks for the tips.

Rich

Nick,

If I understand correctly, your ‘sine’ function is limited to an input range of [-pi, pi]. Do you have any ideas of how to support a greater range?

Rich

@DroidFan1

If I understand correctly, your ‘sine’ function is limited to an input range of [-pi, pi]. Do you have any ideas of how to support a greater range?

All you need is the modulo of the input and pi to extend the range (remembering to flip the sign of the result for odd multiples of pi, as in the range-reduction code earlier in the thread).

Note though that fmod() is typically quite slow, which can defeat the purpose of trying to outperform sin(). But since the denominator is a constant value (pi), it can be greatly optimized:

```
inline float mod_pi(float x)
{
const float pi = 3.14159265f;
float x1 = x * (1.0f / pi);
return pi * (x1 - (int)x1);
}
```

Nick, your approximation is rather amazing. It inspired me to write my own, which uses a similar number of operations but also a (relatively small) lookup table. As it turns out, a lookup table with 128 sets of numbers (A, B, C), each of which describes a parabolic estimation of a very short segment of the function cos from 0 to pi/2, will give a maximum error of no more than 0.000000126 (1.26e-07), and an average error of about 0.0000000117 (1.17e-08).

For testing purposes, I wrote this in Python first; however, I will be porting it to C soon.

The precomputed values can be hard coded, but I will include the code that generated them here anyway:

```
import math

def parabolaFit(p1, p2, p3):
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x12 = x1 - x2
    x13 = x1 - x3
    x23 = x2 - x3
    y1x1213 = y1 / (x12 * x13)
    y2x1223 = y2 / (x12 * x23)
    y3x1323 = y3 / (x13 * x23)
    a = y1x1213 - y2x1223 + y3x1323
    b = -y1x1213 * (x2 + x3) + y2x1223 * (x1 + x3) - y3x1323 * (x1 + x2)
    c = y1x1213 * x2 * x3 - y2x1223 * x1 * x3 + y3x1323 * x1 * x2
    return a, b, c

def precomputeCosineModel(num):
    model = []
    size = math.pi / 2 / num
    for i in range(num):
        x1 = i * size
        x2 = x1 + size * 2 / 3  # These fractions are magic numbers that seem to produce a
        x3 = x1 + size * 7 / 6  # particularly good model. There may be better values.
        pt1 = (x1, math.cos(x1))
        pt2 = (x2, math.cos(x2))
        pt3 = (x3, math.cos(x3))
        model.append(parabolaFit(pt1, pt2, pt3))
    return model

MODEL = precomputeCosineModel(128)
TWOPI = 2 * math.pi
HALFPI = math.pi / 2
SCALE = 2 * (len(MODEL) - 1) / math.pi

def sine(theta):
    theta = theta % TWOPI
    t = abs((theta % math.pi) - HALFPI)
    whichparabola = int(t * SCALE + 0.5)
    a, b, c = MODEL[whichparabola]
    y = a * t * t + b * t + c
    if math.pi < theta:
        return -y
    return y
```

This therefore uses:

2 moduli (one of which yours also needs if it is to have a domain outside of [-pi, pi])

4 additions/subtractions

4 multiplications (maybe one more if I use your absolute value trick)

0 divisions (maybe 1 if I use your absolute value trick)

1 array index

I have not tested whether this ends up being faster or slower once it is in C, but it certainly has more precision. I thought I’d post it here in case it helped anyone who needed a different sort of time/memory/precision trade-off than your solution provided.
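The promised C port never appeared in the thread, so here is a hedged C++ sketch of the same scheme; the names (`Coeff`, `fitCos`, `tableSine`) are mine, and it rebuilds the table at startup with the same 2/3 and 7/6 magic fit points rather than hard coding the brute-forced values:

```cpp
#include <array>
#include <cmath>

// Rough C++ sketch of the Python table approach above (my port, untuned).
struct Coeff { double a, b, c; };

static const double PI_D = 3.141592653589793;
static const int SEGMENTS = 128;

// Fit a parabola through three points on cos(x), as in parabolaFit.
static Coeff fitCos(double x1, double x2, double x3)
{
    double d1 = std::cos(x1) / ((x1 - x2) * (x1 - x3));
    double d2 = std::cos(x2) / ((x1 - x2) * (x2 - x3));
    double d3 = std::cos(x3) / ((x1 - x3) * (x2 - x3));
    Coeff k;
    k.a = d1 - d2 + d3;
    k.b = -d1 * (x2 + x3) + d2 * (x1 + x3) - d3 * (x1 + x2);
    k.c = d1 * x2 * x3 - d2 * x1 * x3 + d3 * x1 * x2;
    return k;
}

static std::array<Coeff, SEGMENTS> buildModel()
{
    std::array<Coeff, SEGMENTS> m;
    double size = (PI_D / 2) / SEGMENTS;
    for (int i = 0; i < SEGMENTS; ++i) {
        double x1 = i * size;
        m[i] = fitCos(x1, x1 + size * 2 / 3, x1 + size * 7 / 6);
    }
    return m;
}

static const std::array<Coeff, SEGMENTS> MODEL = buildModel();

double tableSine(double theta)
{
    // Fold theta into [0, 2*pi), then measure distance from pi/2 so the
    // cos-on-[0, pi/2] table covers the whole period with one sign flip.
    theta = std::fmod(theta, 2 * PI_D);
    if (theta < 0) theta += 2 * PI_D;
    double t = std::fabs(std::fmod(theta, PI_D) - PI_D / 2);
    int i = (int)(t * (2 * (SEGMENTS - 1) / PI_D) + 0.5);
    const Coeff& k = MODEL[i];
    double y = k.a * t * t + k.b * t + k.c;
    return theta > PI_D ? -y : y;
}
```

Whether this beats the table-free version in practice depends on cache behavior, which matches the poster's later note that the array indexing hurt.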

Edit: I left out a vital piece of code when I submitted this the first time: namely, the function parabolaFit, which takes three (x,y) points and returns (a,b,c) describing the parabola through those points! Thank you, linear algebra, for allowing me to compute the inverse of [[x1^2, x1, 1], [x2^2, x2, 1], [x3^2, x3, 1]] :)

Edit 2: Inserted magic numbers into precomputeCosineModel function to make it produce a better model.

For anyone who is interested, optimal values for the parabola models seem to be (after several hours of brute forcing):

```
MODEL=[(-0.49992962755573234, -1.0980232887746368e-07, 1.0),
(-0.49990011985522964, -2.4878253437455193e-06, 1.0000000166453054),
(-0.49974116132402857, -1.0823941353727674e-05, 1.0000001252510031),
(-0.49950695415332391, -2.8806940785000143e-05, 1.0000004693985092),
(-0.49919754344650935, -6.0136395284521466e-05, 1.0000012612471667),
(-0.49881293270371074, -0.0001084815176659382, 1.0000027784910901),
(-0.49835321576722147, -0.00017753753167265685, 1.000005369676503),
(-0.49781843214313687, -0.00027097403126607919, 1.000009448527285),
(-0.49720869023281833, -0.00039245476393783832, 1.0000154964220234),
(-0.49652407307244251, -0.00054565792161023058, 1.0000240641696634),
(-0.49576467665906004, -0.00073421302752367187, 1.0000357652273157),
(-0.49493061808634725, -0.00096176832036338328, 1.0000512821260092),
(-0.49402202278274604, -0.0012319431790140179, 1.0000713623956952),
(-0.49303903481865458, -0.0015483704759287479, 1.0000968225839868),
(-0.49198178996369446, -0.0019146303399401911, 1.000128539616685),
(-0.49085046181016617, -0.0023343176828680761, 1.0001674569167662),
(-0.48964521327061084, -0.0028110083653727414, 1.0002145846954871),
(-0.48836620827730481, -0.0033482432689182556, 1.0002709938950045),
(-0.48701368498920855, -0.00394956446911393, 1.0003378233319153),
(-0.48558780343196667, -0.0046184801987749343, 1.0004162680274522),
(-0.48408879715993564, -0.0053584856502879585, 1.0005075900834677),
(-0.48251688626492351, -0.0061730584702190412, 1.0006131127485045),
(-0.48087231650446333, -0.0070656520075539388, 1.0007342181465122),
(-0.47915532957462753, -0.0080396847605567357, 1.0008723496156233),
(-0.47736617990232022, -0.0090985495690446179, 1.0010290096303327),
(-0.47550514046134701, -0.010245647615796431, 1.0012057623110093),
(-0.47357248148954412, -0.011484306193674508, 1.0014042170498367),
(-0.47156852428969293, -0.012817852341586904, 1.0016260648969775),
(-0.46949353488389106, -0.014249583455672264, 1.0018730240583564),
(-0.46734784650404376, -0.015782759252440458, 1.0021468932578577),
(-0.46513177736966149, -0.017420611641178547, 1.0024495106027058),
(-0.46284566764174451, -0.01916634679983701, 1.002782769585433),
(-0.46048984585324404, -0.021023115827221384, 1.0031486172263804),
(-0.45806467708431764, -0.022994064988368167, 1.0035490570395174),
(-0.4555705320461072, -0.025082293249258678, 1.0039861396905538),
(-0.45300777408188114, -0.027290857611423677, 1.0044619576840477),
(-0.45037679045149637, -0.02962278417162181, 1.0049786605147859),
(-0.4476779912893869, -0.032081062111514465, 1.0055384464339019),
(-0.44491177103147844, -0.034668641633809109, 1.0061435518195829),
(-0.44207854204620495, -0.037388435692257617, 1.0067962658728122),
(-0.43917874375173477, -0.040243306190076523, 1.0074989101028025),
(-0.43621280987273181, -0.043236079956155651, 1.0082538630250726),
(-0.4331811798555088, -0.04636954118358165, 1.0090635264004364),
(-0.4300843121284012, -0.049646442208419662, 1.0099303618596125),
(-0.42692267318612304, -0.053069473410920498, 1.0108568568063054),
(-0.42369674322003342, -0.05664128652545989, 1.0118455358050653),
(-0.4204070152809819, -0.060364479079310737, 1.0128989620100801),
(-0.41705396250101168, -0.06424163186990943, 1.0140197428020863),
(-0.4136381099193685, -0.068275238558734339, 1.0152104948033558),
(-0.41015996374406216, -0.072467769074498214, 1.0164738927350541),
(-0.40662005159372727, -0.076821627637463613, 1.0178126132422067),
(-0.40301890053226014, -0.081339199054868458, 1.0192293988934467),
(-0.39935705763559937, -0.086022775370832805, 1.0207269800326426),
(-0.39563507662267688, -0.090874618532679752, 1.0223081358586299),
(-0.39185350434454341, -0.095896946650047638, 1.0239756612764328),
(-0.38801292916341146, -0.10109190864188038, 1.0257323930961495),
(-0.3841139193898932, -0.10646159924626372, 1.0275811493548772),
(-0.38015706421789958, -0.11200807036593119, 1.0295248035920748),
(-0.37614295623103883, -0.11773331115612513, 1.0315662325219033),
(-0.37207220579791189, -0.12363925689900934, 1.0337083376641059),
(-0.36794541959878263, -0.12972778385811362, 1.0359540222951913),
(-0.36376321906232723, -0.13600070590813484, 1.0383062050045615),
(-0.35952623652866106, -0.14245978309506768, 1.0407678223913326),
(-0.35523511910317362, -0.14910671132742356, 1.0433418256011453),
(-0.35089050208564287, -0.1559431424284427, 1.0460311498294304),
(-0.34649303626703276, -0.16297065205229877, 1.0488387639254326),
(-0.34204339941803802, -0.17019075720021581, 1.0517676321378988),
(-0.33754224450551518, -0.17760491965852759, 1.05482070792313),
(-0.33299025451418018, -0.18521453916967687, 1.0580009560243164),
(-0.32838812622809044, -0.19302093609893275, 1.0613113511064367),
(-0.32373653410714021, -0.20102538631999758, 1.0647548413786618),
(-0.3190361984155517, -0.20922909258710209, 1.0683343904687153),
(-0.31428781044066573, -0.21763319380608323, 1.0720529405371475),
(-0.30949209528370147, -0.22623877295391423, 1.0759134462292925),
(-0.30464976904622643, -0.2350468294914104, 1.0799188302530833),
(-0.29976156716841629, -0.24405831482174203, 1.0840720094933414),
(-0.29482822329375719, -0.25327410038194853, 1.08837589789695),
(-0.28985047676105491, -0.26269499907745197, 1.0928333818222529),
(-0.28482907946195507, -0.27232175825029559, 1.0974473344310056),
(-0.27976478789990811, -0.28215504323308571, 1.1022206081890178),
(-0.27465836464004811, -0.29219546304470062, 1.1071560364357962),
(-0.26951057917979943, -0.30244355589713673, 1.1122564286841423),
(-0.26432220637326725, -0.31289978979147465, 1.1175245691333056),
(-0.25909402813211685, -0.32356456251189009, 1.1229632151263116),
(-0.2538268306430474, -0.33443820288760157, 1.1285750953596239),
(-0.24852140781133444, -0.34552096907579705, 1.1343629081356741),
(-0.24317855847052225, -0.3568130487998406, 1.1403293188758024),
(-0.23779908741491002, -0.36831455910783051, 1.146476958592314),
(-0.23238380463959019, -0.38002554579472397, 1.1528084222169983),
(-0.22693352563814218, -0.39194598304701966, 1.1593262662103987),
(-0.22144907140185352, -0.40407577334137146, 1.1660330069583509),
(-0.21593126765866397, -0.41641474740732159, 1.1729311190910758),
(-0.21038094553894857, -0.42896266355026419, 1.1800230329822328),
(-0.20479894075821739, -0.44171920795161196, 1.187311133343578),
(-0.19918609408161489, -0.45468399404095783, 1.194797757459469),
(-0.19354325062872405, -0.46785656258438196, 1.2024851925830025),
(-0.18787126030005602, -0.48123638150851777, 1.2103756746652103),
(-0.18217097731450277, -0.49482284544661065, 1.2184713867629922),
(-0.17644325998700286, -0.50861527637028747, 1.2267744562424323),
(-0.17068897098484162, -0.5226129224097753, 1.235286953755359),
(-0.16490897680842076, -0.53681495892046982, 1.2440108905285658),
(-0.15910414803094369, -0.55122048726364137, 1.2529482178709352),
(-0.15327535868839398, -0.56582853581659076, 1.2621008239207094),
(-0.14742348663002369, -0.5806380593194338, 1.2714705327438247),
(-0.14154941316204941, -0.59564793884213763, 1.2810591019441424),
(-0.13565402283422315, -0.61085698201099903, 1.2908682213329896),
(-0.12973820357084356, -0.62626392301111322, 1.3008995106724539),
(-0.12380284618377223, -0.64186742276177167, 1.3111545185988021),
(-0.11784884457779558, -0.65766606828371177, 1.3216347199827656),
(-0.11187709527234972, -0.67365837379066851, 1.3323415144869371),
(-0.10588849776263635, -0.68984277981851938, 1.3432762254413739),
(-0.099883953871470657, -0.70621765398798353, 1.3544400972443227),
(-0.093864367741649885, -0.72278129108939171, 1.3658342941198731),
(-0.087830646010325877, -0.73953191252386508, 1.3774598984042659),
(-0.081783697280677761, -0.7564676674722125, 1.3893179084578964),
(-0.075724432306464431, -0.77358663239215042, 1.4014092378730194),
(-0.069653763414849829, -0.79088681134863248, 1.4137347127749704),
(-0.063572605055916415, -0.80836613638985655, 1.4262950711662394),
(-0.057481872727544014, -0.82602246761128251, 1.4390909604548976),
(-0.051382483895497016, -0.84385359345000366, 1.4521229365887647),
(-0.045275357195569438, -0.8618572310235747, 1.4653914621058461),
(-0.039161412026312556, -0.88003102627047547, 1.4788969047361393),
(-0.033041569286910648, -0.89837255444929731, 1.4926395357348825),
(-0.026916750722656545, -0.91687932020993901, 1.5066195285814403),
(-0.020787878528609742, -0.93554875816204097, 1.520836957263483),
(-0.014655875775902497, -0.95437823298836033, 1.5352917953314795),
(-0.0085216658471105426, -0.97336504019687564, 1.5499839135214026),
(-0.0023861727290572753, -0.99250640610146945, 1.5649130793537367)]
```

This improves the average deviation to 1.14e-08, and does not change the worst deviation (still no worse than 1.26e-07).

Edit: Sorry to waste your time. It turns out that this approximation is not very fast after all. It probably has to do with the array indices.

Found this interesting algorithm and wondered: why use an extra multiplication?

It is possible to use only 4 multiplications instead of 5 in the original algorithm.

```
float sine(float x)
{
    const float pi = 3.1415926536f;
    const float P = 0.225f;
    const float A = 16 * sqrt(P);
    const float B = (1 - P) / sqrt(P);

    float y = x / (2 * pi);
    // mulps xmm0, inv2PI

    y = y - floor(y + 0.5); // y in range -0.5..0.5
    // using SSE2 here, RN flag set
    // cvtps2dq xmm1, xmm0
    // cvtdq2ps xmm1, xmm1
    // subps xmm0, xmm1

    y = A * y * (0.5 - abs(y));
    // movaps xmm1, xmm0
    // andps xmm1, abs
    // subps xmm1, half
    // mulps xmm0, xmm1
    // mulps xmm0, minusA

    return y * (B + abs(y));
    // movaps xmm1, xmm0
    // andps xmm1, abs
    // addps xmm1, B
    // mulps xmm0, xmm1
}
```

Found this interesting algorithm and wondered: why use an extra multiplication? It is possible to use only 4 multiplications instead of 5 in the original algorithm.

Finally someone notices! :happy:

Indeed one more multiply can be eliminated. It was a remnant of using the generic parabola formula A + B * x + C * x^2. With A being 0 it can be simplified to x * (B + C * x).

The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.

I thought I’d try to pull this all together into a function that wraps beyond +pi/-pi. I might have missed a few optimizations along the way, but there are no divides and no branching, so I think it lends itself to SSE, and it works with positive and negative inputs. Somebody care to check me on this?

```
inline float sine(float x)
{
    const float B = 4 / PI;
    const float C = -4 / (PI * PI);
    const float recip2 = 1 / (PI * 2);
    const float pi2 = PI * 2;
    float y;

    // convert the input value to a range of 0 to 1
    x = (x + PI) * recip2;
    // make it loop
    x -= (long)(x - (x < 0));
    // convert back from 0-1 to -pi to +pi
    x = (x * pi2) - PI;

    // original function
    y = B * x + C * x * abs(x); // can abs be inlined? It's a matter of turning off the sign bit.
    return (y);
    // left out the higher precision
}

inline float cosine(float x)
{
    const float B = 4 / PI;
    const float C = -4 / (PI * PI);
    const float recip2 = 1 / (PI * 2);
    const float pi2 = PI * 2;
    const float hpi = PI / 2;
    float y;

    x += hpi; // shift for cosine
    x = (x + PI) * recip2;
    x -= (long)(x - (x < 0));
    x = (x * pi2) - PI;

    y = B * x + C * x * abs(x);
    return (y);
}
```

My x86 is a little rusty. I grew up in 68000 land, and I really don’t know how the x86 C compiler works, but it looks like it may be doing a branch. So, is branchless binary logic not possible on x86? I remember doing it in MMX once, but does regular x86 allow for it? Is there anything similar to test-and-set, where you compare two values and set a result?

Anyway, here’s the assembler output from the offending code.

```
; 340 : //make it loop
; 341 : x -= (long)(x-(x<0));

    mov     DWORD PTR tv76[esp-4], 1
    fld     DWORD PTR _PI
    fadd    ST(1), ST(0)
    fxch    ST(1)
    fmul    DWORD PTR __real@3e22f983
    fldz
    fcomp   ST(1)
    fnstsw  ax
    test    ah, 65              ; 00000041H
    je      SHORT $LN4@sine
    mov     DWORD PTR tv76[esp-4], 0
$LN4@sine:

; 342 : //convert back from 0-1 to -pi to +pi.
; 343 : x = (x * pi2) - PI;
```

OK, so after doing some checking, x86 does have a cmov opcode (conditional move), but Visual Studio doesn’t use it. Anyway, I’d love to see if anyone has a better solution. I’m really, really not up for doing any assembler at the moment.

Also, thinking about it, it would be nice to have a SineCosine function that computes both at the same time and saves a few cycles.

Yeah, MSVC doesn’t use the cmovxx instructions, for some reason. Anyway, there are more ways to test whether a number is negative, one of them being:

```
long sign = (((long&)x & 0x80000000) >> 31);
x -= (long)(x) - sign;
```

which results in the following assembly:

```
; 42 : long sign = (((long&)x & 0x80000000) >> 31);
; 43 : x -= (long)(x) - sign;
mov eax, DWORD PTR _x$[ebp]
cvttss2si ecx, xmm0
shr eax, 31 ; 0000001fH
xorps xmm2, xmm2
sub ecx, eax
cvtsi2ss xmm2, ecx
```

No branches in sight, and the output is exactly the same as the original :)

On a completely different note, I noticed that if I compacted the two statements into one:

```
x -= (long)(x) - (((long&)x & 0x80000000) >> 31);
```

MSVC reverts to generating x87 assembly for this function, instead of the SSE2 it was instructed to use… Weird :wacko:

There’s a FRNDINT opcode in x86 that rounds according to the rounding mode in the FP control word. That might also do the trick. I have some code around here somewhere for changing the control word.

IEEE floating point specifies four possible rounding modes:

:nearest

In this mode, inexact results are rounded to the nearer of the two possible result values. If neither possibility is nearer, then the even alternative is chosen. This form of rounding is also called "round to even", and is the form of rounding specified for the Common Lisp round function.

:positive-infinity

This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.

:negative-infinity

This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.

:zero

This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.

x87 code has been deprecated since 2005, so you should stick with SSE. Also, FRNDINT is slow on all processors.

What do you suddenly need rounding for? If you need rounding according to the current rounding mode, you can use cvtss2si instead.

@Dan_Ritchie

IEEE floating point specifies four possible rounding modes:

:nearest

In this mode, inexact results are rounded to the nearer of the two possible result values. If neither possibility is nearer, then the even alternative is chosen. This form of rounding is also called "round to even", and is the form of rounding specified for the Common Lisp round function.

:positive-infinity

This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.

:negative-infinity

This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.

:zero

This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.

Nitpick: IEEE-754 was updated in 2008 to add a variation of the “nearest” rounding mode that ties away from zero.

>>What do you suddenly need rounding for?

You could avoid most of this if you just used the right rounding mode for the cast, …

```
; 42 : long sign = (((long&)x & 0x80000000) >> 31);
; 43 : x -= (long)(x) - sign;
```

and just have x -= (long)x. But it might be slower to set the rounding mode, I don’t know.

The range wrapping can be done by making use of the fact that an IEEE-754 single-precision floating-point number has 23 mantissa bits. Also, you don’t have to convert back to the [-PI, PI] range; just adjust the parabola coefficients:

```
float sine(float x)
{
    // Convert the input value to a range of -1 to 1
    x = x * (1.0f / PI);

    // Wrap around
    volatile float z = (x + 25165824.0f);
    x = x - (z - 25165824.0f);

#if LOW_SINE_PRECISION
    return 4.0f * (x - x * abs(x));
#else
    float y = x - x * abs(x);

    const float Q = 3.1f;
    const float P = 3.6f;

    return y * (Q + P * abs(y));
#endif
}
```
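For readers puzzled by the constant: 25165824 is 1.5 * 2^24, and floats of that magnitude are spaced exactly 2 apart. Adding and subtracting it therefore rounds x to its nearest multiple of 2 (one full period in these normalized units), leaving the remainder in [-1, 1]. A minimal standalone illustration (the function name is mine):

```cpp
#include <cmath>

// Illustration: x is the angle in units of pi. Floats near 1.5 * 2^24
// have a spacing of exactly 2, so the add/subtract pair rounds x to its
// nearest multiple of 2 under the default round-to-nearest mode.
// volatile keeps the compiler from folding the pair away.
float wrap_period(float x)
{
    volatile float z = x + 25165824.0f; // 1.5 * 2^24
    return x - (z - 25165824.0f);
}
```

The same trick with 1.5 * 2^23 (spacing 1) would round to the nearest integer instead; the extra factor of two is what matches the period.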

Of course an AVX version of this code can compute 8 values in parallel, and it can make use of the VROUNDPS instruction to do the wrapping even faster. Future CPUs will also feature fused multiply-add (FMA).

@Nick

Indeed one more multiply can be eliminated. It was a remnant of using the generic parabola formula A + B * x + C * x^2. With A being 0 it can be simplified to x * (B + C * x). The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.

I really appreciate this fast sine algorithm. I was searching for something faster than the hardware fsin and hopefully trying to avoid lookup tables. Just a correction, you forgot the abs(x) in the first part of the equation. It should be:

New Simplified Parabola

y = x * (B + C * abs(x))

Refinement

y = y * (Q + P * abs(y))
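Putting the corrected formulas together, a minimal self-contained sketch for x in [-pi, pi] (constants taken from the posts above; treat it as illustrative, not canonical):

```cpp
#include <cmath>

// Corrected simplified form, input x in [-pi, pi]:
//   parabola:   y = x * (B + C * |x|)
//   refinement: y = y * (Q + P * |y|), with Q + P = 1
float fast_sine(float x)
{
    const float PI = 3.14159265f;
    const float B = 4.0f / PI;
    const float C = -4.0f / (PI * PI);
    const float Q = 0.775f;
    const float P = 0.225f;

    float y = x * (B + C * std::fabs(x)); // parabola, no separate A term
    return y * (Q + P * std::fabs(y));    // refinement, no subtraction
}
```

That is four multiplies plus two absolute values (sign-bit masks in SSE), matching the operation count discussed above.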

ATTENTION FOR THOSE INTERESTED IN ACTUAL EXECUTION SPEEDS:

I did some experimentation writing this in SSE assembly code and as such have actual execution times and also can explain how to achieve even higher rates and also point out that not all SSE vector units are the same. Here’s what I tried.

First I wrote it using scalar instructions to calculate a single sine. My code expects the input to be in the range of -PI to +PI. It uses 12 instructions to load an input number, calculate, then write back the result. Plus there are two instructions for the loop, since I looped it 1 billion times and timed it against the millisecond clock. On my 2.2 GHz Athlon it took 30 clock cycles to calculate that one sine. Even that is much better than the 90 to 100+ I was averaging with the x87 FSIN instruction.

Since the algorithm has a lot of dependencies in it, most of those 30 clock cycles are spent waiting, so I interleaved another scalar sine calculation and it only increased the time by 2 clock cycles: 32 clocks for two sines. Then I added another sine and it went to 36 clocks for three sines. Then I went to four sines and it went to 47 clocks, which is 11.75 clocks per sine. Keep in mind this is running 4 sine calculations through in scalar mode, which is more flexible if your variables aren't 128-bit aligned and all in the same place.

Now here’s where things get more unpredictable. One would assume that just replacing the scalar instructions with vector/packed instructions should give 16 vector sines in the same time as 4 scalar. It depends on the CPU. The earlier CPUs with SSE didn’t process 4 floats at once; they only did two at once and pipelined the second pair behind the first. So that could slow it down to almost half. But it’s possible to chop some clock cycles off of that, because running 4 loads, then 4 movs, then 4 ands, then 4 mults, etc. doesn’t effectively use the different execution units in the vector processor. So I shifted instructions around and interleaved them further. That took several clock cycles off the time. The end result on my old 2.2 GHz Athlon is 70.5 clock cycles for 16 sines, so that’s about 4.4 clocks per sine and a throughput of almost half a billion sines per second.

Then I moved the code to the 2.8 GHz Athlon II that I bought last year. It actually does process 4 floats at once, so you’d think it would be almost twice as fast. It turned out to be even faster, so there’s more going on there than I know about. It calcs 16 sines in 29.5 clocks, so now we’re at 1.84 clock cycles per sine and a throughput of over 1.5 billion sines per second, triple that of my older computer.

So now we have some real-world results. If the code were written for the Motorola AltiVec/Velocity Engine, the 12 instructions could be reduced to 8, because the PowerPC/AltiVec architecture is modern and not some archaic 1975 architecture like the Intel CPU. On the PowerPC, the move instructions disappear because you get a free move with all PowerPC/AltiVec instructions, and the fused multiply/add not only means fewer instructions and double the throughput over separate multiplies and adds, but it removes the latency caused by the add waiting for the multiply to finish (although my code fills in that wasted time because it calculates 16 sines instead of the original 4). Intel/AMD will supposedly finally have those capabilities too later this year. It’s taken them 12 years to catch up. Unfortunately the main CPU instruction set is still crappy and inefficient, and I doubt that will ever change.

@Elhardt

It calcs 16 sines in 29.5 clocks, so now we’re at 1.84 clock cycles per sine and a throughput of over 1.5 billion sines per second, triple that of my older computer.

Thanks for sharing the results of your implementation!

I believe the Athlon II, which uses the K10 architecture, also benefits significantly from the extra instruction fetching capabilities. Load reordering also lessens the impact of cache misses.

On the PowerPC, the move instructions disappear because you get a free move with all PowerPC/Altivec instructions…

The AVX instruction set, supported by Intel’s latest Sandy Bridge architecture, also features independent destination operands.

Intel / AMD will finally have those capabilities too supposedly later this year.

AMD will support FMA4 in the Bulldozer architecture, to be launched this summer.

There’s no confirmation yet when Intel will support it, although it’s part of the AVX spec. The good news is they can theoretically double the performance (again), while AMD chose to share two 128-bit units between two cores (for now).

It’s taken them 12 years to catch up. Unfortunately the main CPU instruction set is still crappy and inefficient, and I doubt that will ever change.

The instruction set may be flawed, but that hardly matters. Decoding the x86 instructions into RISC micro-operations only takes a few percent of the chip space. And it still shrinks with every generation.

The only thing that’s sorely lacking from the instruction set is support for gather/scatter operations (the parallel equivalent of load/store). That would allow loops to be parallelized much more efficiently, and together with FMA it would make the CPU capable of very high effective throughput.

@Nick

Thanks for sharing the results of your implementation!

You’re welcome.

@Nick

The instruction set may be flawed, but that hardly matters. Decoding the x86 instructions into RISC micro-operations only takes a few percent of the chip space. And it still shrinks with every generation. The only thing that’s sorely lacking from the instruction set, is support for gather/scatter operations (the parallel equivalent of load/store). That would allow to parallelize loops much more efficiently and together with FMA would make the CPU capable of very high effective throughput.

There’s a lot more missing than that in the regular instruction set. No 3-operand non-destructive instructions means wasted time duplicating values all the time. No auto-incrementing registers means constantly having to manually increment address pointers. Just those two things alone can increase the size of the code and reduce its speed by a large factor in some cases. Not to mention, no bit field operations, an awful stack-based FPU (still needed for many things), and way too few registers unless you’re programming in and running on a 64-bit OS. It lacks multiple status registers like the PowerPC’s, which can be used to reduce the number of conditional branches. No way to specify whether a branch is more likely to be taken or fall through, like on the PowerPC. No way to specify whether an instruction modifies the status register or not, like on the PowerPC, so compare instructions need to be placed right before conditional branches, giving little time for branch prediction and also limiting flexibility in coding. Even its horrible variable-sized instruction set, with instructions up to 17 bytes in size, has led to my benchmarks on this very topic running at different speeds depending on how the code aligns in memory. My empty loop test can slow down 50% just by its placement in memory.

-Elhardt

@Elhardt

There’s a lot more missing than that in the regular instruction set. No 3 operand non-destructive instructions means wasted time duplicating values all the time.

MOV instructions can execute on port 0, 1 and 5 on Intel’s recent architectures. So the lack of non-destructive instructions hardly has any effect. Also keep in mind that keeping track of additional operand dependencies would complicate out-of-order execution.

The reason they did choose to support non-destructive instructions for AVX is probably mainly a power efficiency decision, and I can also see it affecting the register file implementation.

No auto incrementing registers means constantly having to manually increment address pointers.

Again that’s just a compromise. RISC architectures have to separately load operands, so it makes sense to also make this load instruction do another bit of useful work. x86 can reference memory with advanced addressing modes in every instruction.

Just those two things alone can increase the size of the code and reduce its speed by a large factor in some cases.

Do you have any data to back up just how large that factor really is, taking the hardware costs into account as well?

I think the fact that Apple switched to Intel CPUs speaks for itself. The PowerPC instruction set looks nicer but at the end of the day that’s not the main thing determining performance and cost.

…awful stack based FPU (still needed for many things)…

What would you still need it for? x87 was officially deprecated for all x86-64 ABIs eight years ago.

…way too few registers unless you’re programming in and running on a 64 bit OS.

True, but there’s little reason not to build an x64 version of performance critical software.

Lacks multiple status registers like the PowerPC that can be used to reduce the number of conditional branches.

x86 has conditional move instructions too.

No way to specify whether a branch is more likely to be taken or fall through like on the PowerPC.

Yes there is. The 2Eh and 3Eh prefix bytes can be used as hints. Also, there are well defined rules for branches for which there is no history in the predictor. Forward jumps are assumed not taken, while backward jumps are assumed taken. This allows you (or the compiler) to structure the code more optimally.

No way to specify whether an instruction modifies the status register or not like on the PowerPC, so compare instructions need to be placed right before conditional branches giving little time for branch prediction and also limiting flexibility in coding.

Actually you want them to be placed together, so macro-op fusion can be performed and they get executed as one.

Even its horrible variable-sized instruction set, with instructions up to 17 bytes in size, has led to my benchmarks on this very topic running at different speeds depending on how the code aligns in memory. My empty loop test can slow down 50% just by its placement in memory.

The Sandy Bridge architecture doesn’t suffer from misaligned code. Also, variable-sized instructions actually make the code more dense in some cases, and it has allowed Intel and AMD to keep expanding the capabilities over many decades. So again it’s a mixed bag of advantages and disadvantages.

Since it’s too much of a pain to quote each of your comments, I’ll just address them without quoting.

1) Non-destructive three- and four-operand instructions most certainly make code more efficient and faster. That’s why all modern CPUs have that feature, why Intel and AMD are adding it, and why AMD is demanding that the fused multiply-accumulate have 4 operands. Intel is really talking up that feature in their AVX preview document. Anytime you can reduce the instruction count to do the same amount of work, that’s an improvement in speed, a reduction in code size, and a more straightforward and elegant solution for us assembly programmers.

2) Auto-incrementing isn’t about RISC or advanced addressing modes. The main thing a computer does is scoop up data from memory, process it, and write it back out. That means incrementing pointers, and all modern and not-so-modern CISC CPUs also auto-increment and decrement, including the 68K. On the Intel it means more add or subtract instructions added to your code. I like tight and efficient, and Intel isn’t that.

3) I’ve just backed up what I said with the above two comments. Naturally, when an instruction can do more than one operation at the same time, it’s going to be faster and reduce code. Having written the world’s fastest software 3D renderer in 100% assembly code on the 68K, PowerPC, and Intel chips, sometimes the amount of code, especially in small loops, can be significantly more on the Intel than the PowerPC just because of 1 and 2 above. As for Apple choosing Intel, they certainly didn’t do it because of performance or architecture. After all, Apple, Atari, Commodore, NeXT, Sun, Silicon Graphics, etc. all could have picked the primitive Intel architecture at several points in the past and didn’t. Even modern game machines are using PowerPC or MIPS. Apple has published many benchmarks showing the PowerPC’s superior performance, and that wasn’t even using assembly code where more of the PowerPC’s advanced features can be used. Twice by accident I came across people porting PowerPC AltiVec code to Intel’s SIMD complaining their code was running at about half the speed, even on Intels running at a higher clock rate. That’s what residual 1975 8085 architecture gets you.

4) I don’t know who plays God and determines the x87 is no longer needed. The x87 does a lot more than just +,-,/,* and sqrt. It does trig, logs, floating point compares you can branch on, and more. It also works on double/extended precision. And Intel just recently brought radix-16 division to the x87 for much faster divides than in SSE. And unless the entire world has thrown away all computers and applications prior to SSE2, the x87 is still doing most of the FP computations out there.

5) Performance-critical software has always been important, even before 64-bit CPUs. And again, as a software developer, I don’t like writing software that only runs for the tiny percentage of people with the latest CPUs and 64-bit OSes. Even with 16 registers, the Intel register set isn’t large compared to most modern CPUs, but at least they finally addressed that miserable problem, though way too late and in a limited manner.

6) “x86 has conditional move instructions too.” Yes, and that’s a nice feature, though not very often usable. I’ve heard that Visual Studio doesn’t even use it. Take the ARM CPU: all of its instructions can be executed conditionally.

7) If the Intel has prefixes for branch hints, that must be something new. I’ve never run across any assembly code symbols that would specify such a thing in the past. On the PowerPC one puts a + or - after the opcode to specify. The other rules about forward and backward branches are the same as on the PowerPC.

8) Maybe on the Intel you want Cmps and Jccs placed together, but you really don’t have a choice in the matter.

9) I’m glad AVX code doesn’t suffer from misaligned code, but again, that doesn’t help the hundreds of millions of computers out there now, nor the regular CPU instruction set. And the instruction set they keep expanding is the SIMD stuff, while the archaic main CPU continues. Their constant half-assed changes and additions are making it hell for developers. Instead of taking their time and doing things right in the first place, like Motorola did with Altivec, they just keep spreading their SIMD additions piecemeal over many years. It’s a mess.

So back to my point, it’s Intel’s archaic architecture that’s reducing the CPU’s throughput. Hell, I could do 3D vertex transforms in fewer clock cycles on the PowerPC’s regular old FPU than on Intel’s early SSE using vector instructions. Lack of non-destructive 3- and 4-operand instructions and fused multiply/accumulate really hold/held SSE back. They should have been there on day one. Now we have to wait another 10 years for AVX to fully penetrate into people’s hands. And it looks like Intel may be playing their same old game again: releasing AVX without FMA so that after everybody buys a new computer, a few months later they’ll need to buy another to get FMA. The turmoil never ends.

@Elhardt

Since it’s too much of a pain to quote each of your comments, I’ll just address them without quoting.

It’s pretty easy if you start a reply with the Quote button, and then select some text and hit the button that looks like a text balloon. Doesn’t really matter to me though.

1) Nondestructive three and four operand instructions most certainly make code more efficient and faster. That’s why all modern CPU’s have that feature and that’s why Intel and AMD are adding it, and AMD is demanding the fused multiply accumulate have 4 operands. Intel is really talking up that feature in their AVX preview document. Anytime you can reduce instruction count to do the same amount of work, that’s an improvement in speed and reduction in code size and a more straightforward and elegant solution for us assembly programmers.

No, there is no certainty that it’s more efficient or faster.

First of all there are lots of operations that are destructive. So encoding a non-destructive instruction and analyzing an extra register dependency would be a waste.

Secondly, modern x86 microarchitectures are so wide it’s hard to ever exceed the issue width.

And last but not least, there’s no reason why a move instruction followed by a dependent arithmetic instruction can’t be fused into one micro-instruction, if they wanted to.

So I’m not saying non-destructive instructions are a bad idea, but they’re not without disadvantages and even if it’s a win there’s no reason it can’t be done with existing x86 encodings.

2) Auto Incrementing isn’t about RISC or advanced addressing modes. The main thing a computer does is scoop up data from memory, process it and write it back out. That means incrementing pointers, and all modern and not so modern CISC CPU’s also auto increment and decrement including the 68K. On the Intel it’s more add or subtract instructions added to your code. I like tight and efficient and Intel isn’t that.

Again, it’s more complicated than that and you have to look at the disadvantages too. The majority of instructions don’t need to increment any pointer. So you have an ALU sitting there doing nothing most of the time. Also, you need a write back path for the pointer, and it creates another dependency.

Also, in short loops where the pointer incrementing could be a bottleneck, unrolling can amortize or completely hide that cost. And you can also use the loop counter as an index instead of incrementing the pointers.
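To illustrate that last point (a hypothetical sketch, not code from the thread; the function names are mine): x86 scaled-index addressing lets one loop counter drive both the address computation and the loop test, so no separate pointer-increment instruction is needed.

```c
#include <stddef.h>

/* Pointer-increment style: the pointer bump is an extra dependent add. */
float sum_ptr(const float *p, size_t n)
{
    float s = 0.0f;
    const float *end = p + n;
    while (p != end)
        s += *p++;          /* increment folded in source, still an add in hardware */
    return s;
}

/* Index style: the x86 addressing mode computes p + i*4 for free,
   and the same counter also drives the loop test. */
float sum_idx(const float *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

Both forms compute the same result; the difference is only in how much work the loop-carried bookkeeping costs.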

3) I’ve just backed up what I said with the above two comments. Naturally when an instruction can do more than one operation at the same time it’s going to be faster and reduce code.

In theory, yes, but in practice it comes at a cost. And this cost may force the CPU designers to make the CPU more narrow, clock it lower, or charge more to customers.

x86 CPUs have always had an excellent performance/price ratio. So you really have to look at the bigger picture.

Having written the world’s fastest software 3D renderer in 100% assembly code on the 68K, PowerPC, and Intel chips, sometimes the amount of code, especially in small loops, can be significantly more on the Intel than the PowerPC just because of 1 and 2 above…

For what it’s worth, I’m the lead developer of SwiftShader, and I’m not convinced that the x86 ISA has any limitations that are severe enough to compromise its foreseeable future.

As for Apple choosing Intel, they certainly didn’t do it because of performance or architecture. After all, Apple, Atari, Commodore, NeXT, Sun, Silicon Graphics, etc. all could have picked the primitive Intel architecture at several points in the past and didn’t.

The past tells us very little here. A modern x86 CPU uses RISC micro-instructions, and the decoding takes only a few percent of the die space. The ISA has very little effect on performance/price.

Nowadays performance/power is also an extremely important factor. The PowerPC 970FX at 2.5 GHz consumes up to 100 Watt, while a Pentium 4 at 3.8 GHz consumes 65 Watt (both at 90 nm). And that’s for the old and horribly inefficient NetBurst microarchitecture. The Core, Core 2, and second-generation Core 2 microarchitectures offer vastly more performance/Watt. Whatever the reasons, clearly the PPC ISA wasn’t enough of a benefit to keep up with Intel.

Twice by accident I came across people porting PowerPC AltiVec code to Intel’s SIMD, complaining their code was running at about half the speed even on Intels running at a higher clock rate.

Was that on anything prior to Core 2 or Phenom? They used to have 64-bit execution units for 128-bit SSE operations, so it took two micro-instructions.

That’s what residual 1975 8085 architecture gets you.

Seems like a really premature conclusion to me.

4) I don’t know who plays God and determines the x87 is no longer needed.

I didn’t say it’s no longer needed. I said it’s deprecated (in favor of SSE2).

The x87 does a lot more than just +,-,/,* and sqrt. It does trig, logs…

Those use slow microcode. On x64 they’re implemented as library functions using SSE2 instructions, without loss of performance or precision.

…floating point compares you can branch on…

The SSE comiss instruction does that too.

It also works on double/extended precision.

SSE2 introduced double precision support, and is supported by all x64 implementations. Extended precision is also deprecated.

And Intel just recently brought radix-16 division to the x87 for much faster divides than in SSE.

They’re both the same speed.

And unless the entire world has thrown away all computers and applications prior to SSE2, the x87 is still doing most of the FP computations out there.

The x87 instructions are executed by the same execution units as SSE. Transcendental functions are implemented in microcode. So aside from the instruction decoding and register stack implementation, it is practically completely replaced by SSE.

5) Performance critical software has always been important, even before 64 bit CPU’s. And again, as a software developer, I don’t like writing software that only runs for a tiny percentage of people with the latest CPU’s and 64bit OS’s. Even with 16 registers, the Intel register set isn’t large compared to most modern CPU’s, but at least they finally addressed that miserable problem, though way too late and in a limited manner.

Regardless of the architecture, the move from 32-bit to 64-bit takes many years. And x86’s lack of architectural registers is also compensated for by a large rename register set and relatively fast L1 caches.

6) “x86 has conditional move instructions too.” Yes, and that’s a nice feature, though not very often usable. I’ve heard that Visual Studio doesn’t even use it. Take the ARM CPU: all of its instructions can be executed conditionally.

And once more it doesn’t come for free. It requires more encoding bits, and requires more logic in the critical path, which increases the latency even when it’s not being used.

7) If the Intel has prefixes for branch hints, that must be something new.

It has been supported for over a decade now. Hardly something new.

8) Maybe on the Intel you want Cmps and Jccs placed together, but you really don’t have a choice in the matter.

AMD Bulldozer will also support macro-op fusion for compare and jump instructions. So optimizing compilers all pair them together.

9) I’m glad AVX code doesn’t suffer from misaligned code, but again, that doesn’t help the hundreds of millions of computers out there now, nor the regular CPU instruction set.

So what? It’s not as if talking about PowerPC helps them either.

x86 has always offered high performance at an affordable price. And it doesn’t look like this is about to change.

And the instruction set they keep expanding is the SIMD stuff, while the archaic main CPU continues.

Single-threaded scalar performance has increased tremendously over the years, despite the ISA staying the same. This shows that there are more critical things to real-life performance than the instruction set architecture.

Their constant half-assed changes and additions are making it hell for developers. Instead of taking their time and doing things right in the first place, like Motorola did with Altivec, they just keep spreading their SIMD additions piecemeal over many years. It’s a mess.

SSE2 includes the vast majority of SIMD instructions you’ll use. The rest is largely application specific. So the “mess” isn’t all that horrible really.

Personally I started using LLVM for dynamic code generation. It explicitly supports vector operations and fully abstracts the target architecture. So you hardly have to worry about whether or not a specific instruction is supported.

So back to my point, it’s Intel’s archaic architecture that’s reducing the CPU’s throughput. Hell, I could do 3D vertex transforms in fewer clock cycles on the PowerPC’s regular old FPU than on Intel’s early SSE using vector instructions.

And what about wall clock time, and processor cost?

Lack of non-destructive 3- and 4-operand instructions and fused multiply/accumulate really hold/held SSE back. They should have been there on day one.

That would have been nice for developers, but the economic reality dictated otherwise. It’s a gamble whether a certain ISA extension will offer a good ROI. Other architectures have come and gone, while x86 holds a dominant position after a full three decades.

Now we have to wait another 10 years for AVX to fully penetrate into people’s hands. And it looks like Intel may be playing their same old game again; releasing AVX without FMA so that after everybody buys a new computer, a few months later they’ll need to buy another to get FMA. The turmoil never ends.

You don’t seem to realize the cost. FMA requires both port 0 and port 1 to be equipped with FMA ALUs, and each of them needs an additional 256-bit operand. The register file also has to be capable of providing all these operands to the two pipelines. And that’s not all: because of the increase in throughput, the memory subsystem has to be upgraded as well.

All this really isn’t trivial, and it may take several years for the investment to be worth it. After all, the benefit in practice is much less than the theoretical doubling in performance.

Personally I would love to see it sooner rather than later, and I believe the lack of gather/scatter support is even more critical. But looking at the bigger picture, I really think we’re getting a lot of bang for the buck, and there’s some interesting technology on the roadmap that keeps it exciting.

Hi all,

In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is implementing dynamic subdivision of round surfaces, comparable to those in Quake 3. Another is implementing wave motion when vertex shaders 2.0 are not available.

The standard C sinf() and cosf() functions are terribly slow and offer much more precision than what we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The most well known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for sine becomes: x - x\^3/3! + x\^5/5! - x\^7/7! + …

When we plot this we get: taylor.gif. The green line is the real sine, the red line is the first four terms of the Taylor series. This seems like an acceptable approximation, but let’s have a closer look: taylor_zoom.gif. It behaves very nicely up to pi/2, but after that it quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this for wave simulation will result in jerky motion, which is unacceptable.

We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The Taylor series just can’t give us the precision and performance we’re looking for.
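For reference, the 4-term evaluation being counted above might be written like this (my sketch; the function name is mine, not code from the article):

```c
/* Four-term Maclaurin series for sine: x - x^3/3! + x^5/5! - x^7/7!.
   Evaluated naively this costs 7 multiplications and 3 additions/subtractions. */
float taylor_sin(float x)
{
    float x2 = x * x;   /* mul 1 */
    float x3 = x2 * x;  /* mul 2 */
    float x5 = x3 * x2; /* mul 3 */
    float x7 = x5 * x2; /* mul 4 */
    return x - x3 * (1.0f / 6.0f)     /* mul 5 */
             + x5 * (1.0f / 120.0f)   /* mul 6 */
             - x7 * (1.0f / 5040.0f); /* mul 7 */
}
```

Evaluating it at pi reproduces the -0.075 error mentioned above.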

We did however note that we need sine(pi) = 0. And there’s another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let’s try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x\^2. So this gives us three degrees of freedom. Obvious choices are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations: A = 0, A + B pi/2 + C pi\^2/4 = 1, and A + B pi + C pi\^2 = 0.

Solving these gives A = 0, B = 4/pi, C = -4/pi\^2. So our parabola approximation becomes 4/pi x - 4/pi\^2 x\^2. Plotting this we get: parabola.gif. This looks worse than the 4-term Taylor series, right? Wrong! The maximum absolute error is 0.056. Furthermore, this approximation will give us smooth wave motion, and it can be calculated in only 3 multiplications and 1 addition!

Unfortunately it’s not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It’s quite obvious we want at least a full period. But it’s also clear that it’s just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi\^2 x\^2. So the straightforward (pseudo-C) solution is:
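The branching version presumably looked something like this (my reconstruction of the pseudo-C described here; the function name is mine):

```c
/* One parabola per half-period; valid for x in [-pi, pi]. */
float branch_sin(float x)
{
    const float B = 1.27323954f;   /* 4/pi   */
    const float C = 0.405284735f;  /* 4/pi^2 */

    if (x < 0.0f)
        return B * x + C * x * x;  /* mirrored parabola for x < 0 */
    return B * x - C * x * x;
}
```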

Adding a branch is not a good idea though. It makes the code significantly slower. But look at how similar the two parts really are. A subtraction becomes an addition depending on the sign of x. In a naive first attempt to eliminate the branch we can ‘extract’ the sign of x using x / abs(x). The division is very expensive but look at the resulting formula: 4/pi x - x / abs(x) 4/pi\^2 x\^2. By inverting the division we can simplify this to a very nice and clean 4/pi x - 4/pi\^2 x abs(x). So for just one extra operation we get both halves of our sine approximation! Here’s the graph of this formula confirming the result: abs.gif.
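In C, the branchless formula is simply (a sketch using the constants derived above; the function name is mine):

```c
#include <math.h>

/* y = 4/pi * x - 4/pi^2 * x * |x|, valid for x in [-pi, pi].
   Maximum absolute error is about 0.056. */
float parabola_sin(float x)
{
    const float B =  1.27323954f;   /*  4/pi   */
    const float C = -0.405284735f;  /* -4/pi^2 */
    return B * x + C * x * fabsf(x);
}
```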

Now let’s look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is ‘wrap around’ when x > pi/2. This can be done by subtracting 2 pi. So the code becomes:
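A sketch of that branching cosine (my reconstruction, reusing the parabola sine from above; the function name is mine):

```c
#include <math.h>

/* cos(x) = sin(x + pi/2), valid for input x in [-pi, pi];
   wrap back into range when the shift pushes x past pi. */
float branch_cos(float x)
{
    const float PI = 3.14159265f;
    const float B  =  1.27323954f;   /*  4/pi   */
    const float C  = -0.405284735f;  /* -4/pi^2 */

    x += 0.5f * PI;                  /* phase shift */
    if (x > PI)                      /* the branch we'd like to eliminate */
        x -= 2.0f * PI;

    return B * x + C * x * fabsf(x); /* parabola sine from before */
}
```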

Yet another branch. To eliminate it, we can use binary logic, like this:
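Reconstructed from the description that follows, the snippet was presumably along these lines (pseudo-C, treating the comparison result as an all-ones/all-zeroes bitmask):

```
x += pi / 2;
x -= (x > pi) & (2 * pi);   // mask selects 2*pi or 0
```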

Note that this isn’t valid C code at all. But it should clarify how this can work. When x > pi is false the & operation zeroes the right-hand part so the subtraction does nothing, which is perfectly equivalent. I’ll leave it as an exercise to the reader to create working C code for this (or just read on). Clearly the cosine requires a few more operations than the sine, but there doesn’t seem to be any other way and it’s still extremely fast.

Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn’t there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that’s just not good enough.

Looking at the graphs you’ll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to ‘scale it down’ without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it’s always lower than the real sine. So we can use a weighted average of the two to get a better approximation:

With Q + P = 1. You can use exact minimization of either the absolute or relative error, but I’ll save you the math. The optimal weights are Q = 0.775, P = 0.225 for the absolute error and Q = 0.782, P = 0.218 for the relative error. I will use the former. The resulting graph is: average.gif. Where did the red line go? It’s almost entirely covered by the green line, which instantly shows how good this approximation really is. The maximum error is about 0.001, a 50x improvement! The formula looks lengthy, but the part between parentheses is the same value from the parabola, which needs to be computed only once. In fact only 2 extra multiplications and 2 additions are required to achieve this precision boost.

It shouldn’t come as a big surprise that to make it work for negative x as well we need a second abs() operation. The final C code for the sine becomes:
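Based on the description, the final version presumably looks like this (my sketch; it matches the accuracy-test code posted earlier in the thread, with the function name being mine):

```c
#include <math.h>

/* Extra-precision sine for x in [-pi, pi]:
   first the parabola, then the weighted average with its own square.
   5 multiplications, 3 additions; maximum error about 0.001. */
float fast_sin(float x)
{
    const float B =  1.27323954f;   /*  4/pi   */
    const float C = -0.405284735f;  /* -4/pi^2 */
    const float P =  0.225f;        /* Q = 0.775 is implicit: Q + P = 1 */

    float y = B * x + C * x * fabsf(x);  /* first parabola pass       */
    y = P * (y * fabsf(y) - y) + y;      /* Q*y + P*y*|y|, rearranged */
    return y;
}
```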

So we need merely 5 multiplications and 3 additions; still faster than the 4-term Taylor if we neglect the abs(), and much more precise! The cosine version just needs the extra shift and wrap operations on x.

Last but not least, I wouldn’t be Nick if I didn’t include the SIMD optimized assembly version. It allows us to perform the wrap operation very efficiently, so I’ll give you the cosine:
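The original assembly listing isn’t reproduced here. As an illustration only, a rough equivalent using SSE2 intrinsics (my sketch, not Nick’s code; constants as above) could look like:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Four cosines at once for inputs in [-pi, pi]: shift by pi/2,
   wrap branchlessly with a compare mask, then the refined parabola. */
static __m128 fast_cos4(__m128 x)
{
    const __m128 pi      = _mm_set1_ps(3.14159265f);
    const __m128 half_pi = _mm_set1_ps(1.57079633f);
    const __m128 two_pi  = _mm_set1_ps(6.28318531f);
    const __m128 B       = _mm_set1_ps(1.27323954f);    /*  4/pi   */
    const __m128 C       = _mm_set1_ps(-0.405284735f);  /* -4/pi^2 */
    const __m128 P       = _mm_set1_ps(0.225f);
    const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));

    x = _mm_add_ps(x, half_pi);                      /* cos(x) = sin(x + pi/2) */
    __m128 wrap = _mm_and_ps(_mm_cmpgt_ps(x, pi), two_pi);
    x = _mm_sub_ps(x, wrap);                         /* branchless wrap        */

    __m128 ax = _mm_and_ps(x, absmask);              /* |x|: clear sign bit    */
    __m128 y  = _mm_add_ps(_mm_mul_ps(B, x),
                           _mm_mul_ps(_mm_mul_ps(C, x), ax));
    __m128 ay = _mm_and_ps(y, absmask);              /* |y|                    */
    return _mm_add_ps(_mm_mul_ps(P, _mm_sub_ps(_mm_mul_ps(y, ay), y)), y);
}
```

The compare-and-mask pair plays the role of the `(x > pi) & (2 * pi)` trick from the scalar discussion.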

This code computes four cosines in parallel, resulting in a peak performance of about 9 clock cycles per cosine for most CPU architectures. Sines take only 6 clock cycles ideally. The lower precision version can even run at 3 clock cycles per sine… And let’s not forget that all input between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0, pi/2 and pi.

So, the conclusion is don’t ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I’d love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.

Cheers,

Nick