# Fast and accurate sine/cosine

103 replies to this topic

### #1Nick

Senior Member

• Members
• 1227 posts

Posted 23 April 2006 - 02:00 PM

Hi all,

In some situations you need a good approximation of sine and cosine that runs at very high performance. One example is to implement dynamic subdivision of round surfaces, comparable to those in Quake 3. Another is to implement wave motion when vertex shader 2.0 support isn't available.

The standard C sinf() and cosf() functions are terribly slow and offer much more precision than what we actually need. What we really want is an approximation that offers the best compromise between precision and performance. The most well known approximation method is to use a Taylor series about 0 (also known as a Maclaurin series), which for sine becomes:
x - 1/6 x^3 + 1/120 x^5 - 1/5040 x^7 + ...
When we plot this we get: taylor.gif. The green line is the real sine, the red line is the first four terms of the Taylor series. This seems like an acceptable approximation, but let's have a closer look: taylor_zoom.gif. It behaves very nicely up till pi/2, but after that it quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this for wave simulation will result in jerky motion which is unacceptable.

We could add another term, which in fact reduces the error significantly, but this makes the formula pretty lengthy. We already need 7 multiplications and 3 additions for the 4-term version. The Taylor series just can't give us the precision and performance we're looking for.

We did however note that we need sine(pi) = 0. And there's another thing we can see from taylor_zoom.gif: this looks very much like a parabola! So let's try to find the formula of a parabola that matches it as closely as possible. The generic formula for a parabola is A + B x + C x^2. So this gives us three degrees of freedom. Obvious choices are that we want sine(0) = 0, sine(pi/2) = 1 and sine(pi) = 0. This gives us the following three equations:
A + B 0 + C 0^2 = 0
A + B pi/2 + C (pi/2)^2 = 1
A + B pi + C pi^2 = 0
Solving these gives A = 0, B = 4/pi, C = -4/pi^2. So our parabola approximation becomes 4/pi x - 4/pi^2 x^2. Plotting this we get: parabola.gif. This looks worse than the 4-term Taylor series, right? Wrong! The maximum absolute error is 0.056. Furthermore, this approximation will give us smooth wave motion, and it can be calculated in only 3 multiplications and 1 addition!

Unfortunately it's not very practical yet. This is what we get in the [-pi, pi] range: negative.gif. It's quite obvious we want at least a full period. But it's also clear that it's just another parabola, mirrored around the origin. The formula for it is 4/pi x + 4/pi^2 x^2. So the straightforward (pseudo-C) solution is:
if(x > 0)
{
y = 4/pi x - 4/pi^2 x^2;
}
else
{
y = 4/pi x + 4/pi^2 x^2;
}
Adding a branch is not a good idea though. It makes the code significantly slower. But look at how similar the two parts really are. A subtraction becomes an addition depending on the sign of x. In a naive first attempt to eliminate the branch we can 'extract' the sign of x using x / abs(x). The division is very expensive but look at the resulting formula: 4/pi x - x / abs(x) 4/pi^2 x^2. By inverting the division we can simplify this to a very nice and clean 4/pi x - 4/pi^2 x abs(x). So for just one extra operation we get both halves of our sine approximation! Here's the graph of this formula confirming the result: abs.gif.

Now let's look at cosine. Basic trigonometry tells us that cosine(x) = sine(pi/2 + x). Is that all, add pi/2 to x? No, we actually get the unwanted part of the parabola again: shift_sine.gif. What we need to do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi. So the code becomes:
x += pi/2;

if(x > pi)   // Original x > pi/2
{
x -= 2 * pi;   // Wrap: cos(x) = cos(x - 2 pi)
}

y = sine(x);
Yet another branch. To eliminate it, we can use binary logic, like this:
x -= (x > pi) & (2 * pi);
Note that this isn't valid C code at all. But it should clarify how this can work. When x > pi is false, the & operation zeroes the right hand part so the subtraction does nothing, which is perfectly equivalent. I'll leave it as an exercise to the reader to create working C code for this (or just read on). Clearly the cosine requires a few more operations than the sine, but there doesn't seem to be any other way and it's still extremely fast.

Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series still has a smaller error on average. Recall what our sine looked like: abs.gif. So isn't there anything we can do to further increase precision at minimal cost? Of course the current version is already applicable for many situations where something that looks like a sine is just as good as the real sine. But for other situations that's just not good enough.

Looking at the graphs you'll notice that our approximation always overestimates the real sine, except at 0, pi/2 and pi. So what we need is to 'scale it down' without touching these important points. The solution is to use the squared parabola, which looks like this: squared.gif. Note how it preserves those important points but it's always lower than the real sine. So we can use a weighted average of the two to get a better approximation:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
With Q + P = 1. You can use exact minimization of either the absolute or relative error but I'll save you the math. The optimal weights are Q = 0.775, P = 0.225 for the absolute error and Q = 0.782, P = 0.218 for the relative error. I will use the former. The resulting graph is: average.gif. Where did the red line go? It's almost entirely covered by the green line, which instantly shows how good this approximation really is. The maximum error is about 0.001, a 50x improvement! The formula looks lengthy but the part between parentheses is the same value from the parabola, which needs to be computed only once. In fact only 2 extra multiplications and 2 additions are required to achieve this precision boost.

It shouldn't come as a big surprise that to make it work for negative x as well we need a second abs() operation. The final C code for the sine becomes:
float sine(float x)
{
    const float pi = 3.14159265f;
    const float B = 4 / pi;
    const float C = -4 / (pi * pi);

    float y = B * x + C * x * fabsf(x);   // fabsf() from <math.h>; plain abs() would truncate to int

#ifdef EXTRA_PRECISION
//  const float Q = 0.775;
    const float P = 0.225;

    y = P * (y * fabsf(y) - y) + y;   // Q * y + P * y * abs(y)
#endif

    return y;
}
So we need merely 5 multiplications and 3 additions; still faster than the 4-term Taylor if we neglect the abs(), and much more precise! The cosine version just needs the extra shift and wrap operations on x.

Last but not least, I wouldn't be Nick if I didn't include the SIMD optimized assembly version. It allows the wrap operation to be performed very efficiently, so I'll give you the cosine:
// cos(x) = sin(x + pi/2)
movaps xmm1, xmm0
cmpnltps xmm1, PI
andps xmm1, PIx2
subps xmm0, xmm1

// Parabola
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
mulps xmm0, B
mulps xmm1, C
addps xmm0, xmm1   // y = B x + C x abs(x)

// Extra precision
movaps xmm1, xmm0
andps xmm1, abs
mulps xmm1, xmm0
subps xmm1, xmm0
mulps xmm1, P
addps xmm0, xmm1   // y = P (y abs(y) - y) + y
This code computes four cosines in parallel, resulting in a peak performance of about 9 clock cycles per cosine for most CPU architectures. Sines take only 6 clock cycles ideally. The lower precision version can even run at 3 clock cycles per sine... And let's not forget that all input between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0, pi/2 and pi.

So, the conclusion is don't ever again use a Taylor series to approximate a sine or cosine! To add a useful discussion to this article, I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.

Cheers,

Nick

### #2bramz

Valued Member

• Members
• 189 posts

Posted 23 April 2006 - 03:30 PM

one word: interesting =)
hi, i'm a signature viruz, plz set me as your signature and help me spread :)
Bramz' warehouse | LiAR isn't a raytracer

### #3davepermen

Senior Member

• Members
• 1306 posts

Posted 23 April 2006 - 03:54 PM

hehe, i went through this quite a while ago, while sitting in school, bored by the math lessons i knew already. i tried different ways to approximate sine, cosine, and all the others. but i don't remember. i didn't have your idea to use a lerp between the parabola and the squared parabola, though. i think i used x^3 because it looks more like sine by default.

exp was great, as you could split the value into integer and fractional part, do the integer with bitshifting, and the fraction with some nice approx. something like that i remember..

man, thats years back :D
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

### #4m4x0r

New Member

• Members
• 25 posts

Posted 23 April 2006 - 07:31 PM

Nice job Nick, very clever.

Max

### #5davepermen

Senior Member

• Members
• 1306 posts

Posted 23 April 2006 - 09:02 PM

hm.. you should present a fast way to fit a number into the [-pi/2, +pi/2] range.

is there a fractional() instruction in SSE? or some modulus directly?

### #6Slaru

New Member

• Members
• 1 posts

Posted 24 April 2006 - 02:58 AM

Hello,

This is my first post here, but I've visited here for a while now.

I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision.

Here's the code, if you'd like to have a look to make sure I didn't do something that might be giving false figures.
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

const float PI          = 3.14159265358979323846264338327950288f;
const float PI2         = 6.28318530717958647692528676655900577f;
const float PID2        = 1.57079632679489661923132169163975144f;
const float PI_SQR      = 9.86960440108935861883449099987615114f;

template <typename T, bool accurate>
T FastSin(const T& theta)
{
const T B = 4 / PI;
const T C = -4 / PI_SQR;

T y = B * theta + C * theta * std::abs(theta);   // std::abs: unqualified abs may pick the int overload

if (accurate)
{
//  const T Q = 0.775;
const T P = 0.225;

y = P * (y * std::abs(y) - y) + y;   // Q * y + P * y * abs(y)
}

return y;
}

template <typename T>
T randf()
{
int randi = rand();

return (T)(((double)(randi % RAND_MAX))/((double)RAND_MAX));
}

void CheckAccuracy()
{
const size_t checks = 10000;
typedef double real;

std::cout << "Checking accuracy randomly." << std::endl;
srand((unsigned)std::time(0));

real Sin[checks], fastSin[checks], fastAccurSin[checks];
for (int i = 0; i < checks; i++)
{
real randFloat = randf<real>();
Sin[i] = sin(randFloat);
fastSin[i] = FastSin<real, false>(randFloat);
fastAccurSin[i] = FastSin<real, true>(randFloat);
}

real SinToFastSinAccurs[checks], SinToFastAccurSinAccurs[checks];
for (int i = 0; i < checks; i++)
{
// (measured - accurate)/accurate
SinToFastSinAccurs[i] = (fastSin[i] - Sin[i])/Sin[i];
SinToFastAccurSinAccurs[i] = (fastAccurSin[i] - Sin[i])/Sin[i];
}

real FinalSinToFastSin = ((real)0), FinalSinToFastAccurSin = ((real)0);
for (int i = 0; i < checks; i++)
{
FinalSinToFastSin += SinToFastSinAccurs[i];
FinalSinToFastAccurSin += SinToFastAccurSinAccurs[i];
}
FinalSinToFastSin /= (((real)checks) / ((real)100.0));
FinalSinToFastAccurSin /= (((real)checks) / ((real)100.0));

std::cout << "FinalSinToFastSin %: " << FinalSinToFastSin << "\nFinalSinToFastAccurSin %: " << FinalSinToFastAccurSin << std::endl;
}

int main()
{
CheckAccuracy();
system("Pause");
return 0;
}

Slaru

### #7tbp

Valued Member

• Members
• 135 posts

Posted 24 April 2006 - 07:45 AM

Note: Slaru, i'm sure you could gain a couple ulp by adding more digits to your constants. :blink:

Nick said:

I'd love to hear if anyone knows good approximations for other transcendental functions like the exponential, logarithm and power functions.
See this patch for gcc contributed by AMD. All vectorized. May be too precise and slow for your purpose tho.
http://gcc.gnu.org/m...3/msg01611.html
Further down the thread there's also some rather amusing remarks from Intel drones.

davepermen said:

is there a fractional() instruction in SSE? or some modulus directly?
No. There's some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that's it. Heck, even the usual cond move is missing.

### #8Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 11:20 AM

Slaru said:

I did a quick test on accuracies and found the fast sin function to have about 12.9% error without added precision and -0.2% error with added precision.

Here's the code, if you'd like to have a look to make sure I didn't do something that might be giving false figures...
You're measuring relative errors, which are bigger than absolute errors. You can however improve relative accuracy a tiny bit by using Q = 0.782, P = 0.218, as mentioned in the article.

By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.

### #9bramz

Valued Member

• Members
• 189 posts

Posted 24 April 2006 - 11:29 AM

how can you have a -0.2% error?

### #10bramz

Valued Member

• Members
• 189 posts

Posted 24 April 2006 - 11:58 AM

Nick said:

By the way, the relative error of the Taylor series is infinity at x = pi. :o Absolute error makes more sense here.

Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...

### #11Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 12:12 PM

tbp said:

No. There's some basic arithmetic (+,-,*,/) and bitwise ops (or, and, nand, xor) and that's it. Heck, even the usual cond move is missing.
It also has very fast approximations of the reciprocal and reciprocal square root (12-bit mantissa precision). For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.

### #12bramz

Valued Member

• Members
• 189 posts

Posted 24 April 2006 - 12:45 PM

bramz said:

Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...

In particular, it also makes sense to talk about the total harmonic distortion

### #13Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 12:46 PM

bramz said:

Relative errors do make sense when talking about signals, because the amplitude isn't always equal to one. And it also makes sense to compare the error for different angles ...
Sure, but when the signal is around zero the relative error can explode even when the approximation is almost coincident with the real curve.

So what we really need is a function to weigh the error absolutely near zero, and somewhat relative towards infinity. exp(-x) seems very suited. This way the maximum 'exponential' error of the parabola is 0.037, compared to 0.075 for the Taylor series. The squared parabola gives me 0.0006 with adjusted Q = 0.7775, P = 0.2225.

No matter how you look at it, this is a fast and accurate approximation. :happy:

### #14bramz

Valued Member

• Members
• 189 posts

Posted 24 April 2006 - 01:30 PM

Nick said:

So what we really need is a function to weigh the error absolutely near zero, and somewhat relative towards infinity. exp(-x) seems very suited.

Let us not mix two kinds of error measurements, ok? =) The reason why it goes towards infinity there is because ... well ... because you're making an infinitely large error, no matter how you look at it. However, we don't need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you're using this function to generate signals.

And btw, I _do_ like your approximation ;)

### #15tbp

Valued Member

• Members
• 135 posts

Posted 24 April 2006 - 02:31 PM

Nick said:

It also has very fast approximations of the reciprocal and reciprocal square root (12-bit mantissa precision).
... which are pretty much useless by themselves. You'll need a Newton-Raphson step to get within a couple ulp. And then you'll need to fix NaN blackholes at 0 and infinity.
Even if you only care about ]0, inf[ latency has already gone through the roof.

Anyway, you're right i forgot to mention them :)

Nick said:

For fractions we can convert to integer and convert back fairly fast. Conditional move can be done with bitwise logic.
And someone posted code to get something looking like a sin/cos with a bunch of adds & muls... so that doesn't say much ;)
I stand by my initial point, that is: paying the latency, decoding bandwidth etc. for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with... you know what.

### #16Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 02:51 PM

By the way, for those who think the squared parabola factor appeared out of thin air, expand the formula and compare it with the Taylor series about x = pi/2:

Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
= Q 4/pi x - (Q 4/pi^2 - P 16/pi^2) x^2 - P 32/pi^3 x^3 + P 16/pi^4 x^4
~= 0.986 x + 0.0507 x^2 - 0.232 x^3 + 0.0370 x^4

sin(x ~ pi/2)
~= 1 - 1/2 (x - pi/2)^2 + 1/24 (x - pi/2)^4
~= 0.0200 + 0.925 x + 0.117 x^2 - 0.262 x^3 + 0.0417 x^4

Note they are both polynomials with roughly comparable coefficients. The parabola method just has 'corrected' coefficients to make sure sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 and the errors are minimal.

In fact you can also approach the problem starting from a generic order 4 polynomial:

A + B x + C x^2 + D x^3 + E x^4

Using the restrictions sin(0) = 0, sin(pi/2) = 1, sin(pi) = 0 yields: A = 0, B = 1/(4 pi) (16 + 2 D pi^3 + 3 E pi^4), C = -1/4 (16 + 6 D pi^3 + 7 E pi^4) / pi^2. D and E are still free parameters. So we can choose two more restrictions, for example, the first and second derivative in pi/2 should be 0 and -1 respectively. This yields A = 0, B = -1/2 (-16 + pi^2) / pi, C = 1/2 (-48 + 5 pi^2) / pi^2, D = -4 (-8 + pi^2) / pi^3, E = 2 (-8 + pi^2) / pi^4. Which gives us:

0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4

This is pretty close to the squared parabola method and has some nice properties, but the maximum error is a little bigger. By the way, for order 3 and 4 polynomials we both get the simple parabola.

Anyway, the ad-hoc explanation is a bit easier to understand I think, but this gives it a bit of theoretical background. It's nothing more than fitting a function with a polynomial using some restrictions (exact points and derivatives). Minimizing the error is another very useful restriction...

### #17Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 03:01 PM

bramz said:

The reason why it goes towards infinity there, is because ... well ... because you're making an infinitely large error, no matter how you look at it.

Quote

However, we don't need to panic either. For sines, we can get a nice figure of overall relative error by using the total harmonic distortion as I mentioned in the other post. This is in particular true if you're using this function to generate signals.
How does that work exactly? The Wikipedia explanation is a bit cryptic...

Quote

And btw, I _do_ like your approximation ;)
I wasn't doubting that. :happy:

### #18bramz

Valued Member

• Members
• 189 posts

Posted 24 April 2006 - 03:13 PM

Nick said:

ok, fair enough ;)

Quote

How does that work exactly? The Wikipedia explanation is a bit cryptic...

Well, basically, you'll need to decompose your 'signal' (-pi..pi) in its harmonics ... you know, things like
y(theta) = \sum_{k=1}^\infty A_k sin(k theta).

(Normally, you won't have cosine terms, since you have an odd function. Except maybe for a DC term, though I don't think you have one. If you do have cosine terms, you'd better write it down using the exponential form with complex amplitudes.)

Then, the THD is defined as the ratio of the RMS of the harmonics to the fundamental:
\frac{\sqrt{\sum_{k=2}^\infty A_k^2}}{|A_1|}.

One thingy though: it silently ignores the DC term ... so if you simply would add a number like 0.123456 to y, the THD would not change.

edit: need to use the total RMS, not total power

### #19Nick

Senior Member

• Members
• 1227 posts

Posted 24 April 2006 - 06:23 PM

tbp said:

... which are pretty much useless by themselves. You'll need a Newton-Raphson step to get within a couple ulp. And then you'll need to fix NaN blackholes at 0 and infinity.
Even if you only care about ]0, inf[ latency has already gone through the roof.
That's really the best possible. Remember that a divps/sqrtps is 39 clock cycles non-pipelined. So if you only need 12-bit or 24-bit mantissa precision these instructions are great. What more could you expect? They can't break the laws of semiconductor physics.

Quote

I stand by my initial point, that is: paying the latency, decoding bandwidth etc. for 3 ops (andn, and, or) for something as mundane as a cond move is silly. Even more so when you strive to avoid branches when dealing with SSE to begin with, branches that you remove with... you know what.
Why is that silly? For many situations, like for the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and jump prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.

So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they're doing. x86 ain't perfect but it has legacy (tons of support for it) and it's flexible (new instructions get added every generation). It's certainly possible to create faster chips with a new instruction set tuned for today's applications, but if it's not flexible enough it's not going to survive. Intel beat Alpha and PowerPC with affordable and fast processors.

### #20tbp

Valued Member

• Members
• 135 posts

Posted 24 April 2006 - 08:45 PM

Nick said:

That's really the best possible. ... What more could you expect?
1998 has called and wants the 3DNow pfrcp*/pfrsq* back.

Nick said:

Why is that silly? For many situations, like for the cosine wrap, you can simplify the binary logic. Also, latency by itself is not that big a problem thanks to pipelining, out-of-order execution, and jump prediction. Ok, the Pentium 4 took that idea a little too far, but seriously, if you know a magical architecture that combines high clock frequency with low latency then let me know.
You can spin it till hell freezes over, Intel & AMD optimization manuals clearly say: thou shalt remove thy ugly branches. If you don't lend me any credit, maybe that should get your attention, i think.

So we have to remove branches, and for that more often than not we'll end up with the cond move pattern.
Now let's see what we've had for ages on the integer side of those same CPUs. Surprise! There's both those "binary logic" ops and... tada ... cmov.
I'm sure dudes at Intel thought they had encoding bits and silicon to waste and they were like let's add another instruction just for the fun.

Really i'm asking for consistency. And common sense.

Nick said:

So, I understand your frustration about modern CPU performance characteristics, but Intel and AMD really know what they're doing. x86 ain't perfect but it's legacy (tons of support for it) and it's flexible (new instructions get added every generation).
Err, correct me if i'm wrong but i'm pretty sure there was like 0 legacy issues when Intel first introduced SSE.

And allow me to put the "they know what they're doing" back into proper historical context: that's the same guys that have gone with a segmented memory model at the same time ie Motorola was going the flat way. Or those that later on saw the light again when they thought stack based fpu was the best thing since sliced bread when everyone and his brother was using registers. :whistle:

Please save me that kind of PR, because even if my memory may not be what it used to be, i was there.
