Fast and accurate sine/cosine
#81
Posted 01 November 2010 - 07:31 PM
If I understand correctly, your 'sine' function is limited to an input range of [-pi, pi]. Do you have any ideas of how to support a greater range?
Rich
#82
Posted 01 November 2010 - 10:45 PM
DroidFan1 said:
Note though that fmod() is typically quite slow, which can defeat the purpose of trying to outperform sin(). But since the denominator is a constant value (pi), it can be greatly optimized:
inline float mod_pi(float x)
{
const float pi = 3.14159265f;
float x1 = x * (1.0f / pi);
return pi * (x1 - (int)x1);
}
#83
Posted 09 November 2010 - 11:52 PM
For testing purposes, I wrote this in Python first; however, I will be porting it to C soon.
The precomputed values can be hard coded, but I will include the code that generated them here anyway:
def parabolaFit(p1,p2,p3):
x1,y1=p1
x2,y2=p2
x3,y3=p3
x12=x1-x2
x13=x1-x3
x23=x2-x3
y1x1213=y1/(x12*x13)
y2x1223=y2/(x12*x23)
y3x1323=y3/(x13*x23)
a=y1x1213-y2x1223+y3x1323
b=-y1x1213*(x2+x3)+y2x1223*(x1+x3)-y3x1323*(x1+x2)
c=y1x1213*x2*x3-y2x1223*x1*x3+y3x1323*x1*x2
return a,b,c
def precomputeCosineModel(num):
model=[]
size=math.pi/2/num
for i in range(num):
x1=i*size
x2=x1+size*2/3 # These fractions are magic numbers that seem to produce a
x3=x1+size*7/6 # particularly good model. There may be better values.
pt1=(x1,math.cos(x1))
pt2=(x2,math.cos(x2))
pt3=(x3,math.cos(x3))
model.append(parabolaFit(pt1,pt2,pt3))
return model
MODEL=precomputeCosineModel(128)
TWOPI=2*math.pi
HALFPI=math.pi/2
SCALE=2*(len(MODEL)-1)/math.pi
def sine(theta):
global MODEL,TWOPI,HALFPI,SCALE
theta=theta%TWOPI
t=abs((theta%math.pi)-HALFPI)
whichparabola=int(t*SCALE+0.5)
a,b,c=MODEL[whichparabola]
y=a*t*t+b*t+c
if math.pi<theta:
return -y
return y
This therefore uses:2 moduli (one of which yours also needs if it is to have a domain outside of [-pi,pi])
4 additions/subtractions
4 multiplications (maybe one more if I use your absolute value trick)
0 divisions (maybe 1 if I use your absolute value trick)
1 array index
I have not tested whether this ends up being faster or slower once it is in C, but it certainly has more precision. I thought I'd post it here in case it helped anyone who needed a different sort of time/memory/precision trade-off than your solution provided.
Edit: I left out a vital piece of code when I submitted this the first time: namely, the function parabolaFit, which takes three (x,y) points and returns (a,b,c) which describe the parabola through those points! Thank you, linear algebra, for allowing me to compute the inverse of [[x1^2, x1, 1], [x2^2, x2, 1], [x3^2, x3, 1]]
Edit 2: Inserted magic numbers into precomputeCosineModel function to make it produce a better model.
#84
Posted 11 November 2010 - 04:18 AM
MODEL=[(-0.49992962755573234, -1.0980232887746368e-07, 1.0), (-0.49990011985522964, -2.4878253437455193e-06, 1.0000000166453054), (-0.49974116132402857, -1.0823941353727674e-05, 1.0000001252510031), (-0.49950695415332391, -2.8806940785000143e-05, 1.0000004693985092), (-0.49919754344650935, -6.0136395284521466e-05, 1.0000012612471667), (-0.49881293270371074, -0.0001084815176659382, 1.0000027784910901), (-0.49835321576722147, -0.00017753753167265685, 1.000005369676503), (-0.49781843214313687, -0.00027097403126607919, 1.000009448527285), (-0.49720869023281833, -0.00039245476393783832, 1.0000154964220234), (-0.49652407307244251, -0.00054565792161023058, 1.0000240641696634), (-0.49576467665906004, -0.00073421302752367187, 1.0000357652273157), (-0.49493061808634725, -0.00096176832036338328, 1.0000512821260092), (-0.49402202278274604, -0.0012319431790140179, 1.0000713623956952), (-0.49303903481865458, -0.0015483704759287479, 1.0000968225839868), (-0.49198178996369446, -0.0019146303399401911, 1.000128539616685), (-0.49085046181016617, -0.0023343176828680761, 1.0001674569167662), (-0.48964521327061084, -0.0028110083653727414, 1.0002145846954871), (-0.48836620827730481, -0.0033482432689182556, 1.0002709938950045), (-0.48701368498920855, -0.00394956446911393, 1.0003378233319153), (-0.48558780343196667, -0.0046184801987749343, 1.0004162680274522), (-0.48408879715993564, -0.0053584856502879585, 1.0005075900834677), (-0.48251688626492351, -0.0061730584702190412, 1.0006131127485045), (-0.48087231650446333, -0.0070656520075539388, 1.0007342181465122), (-0.47915532957462753, -0.0080396847605567357, 1.0008723496156233), (-0.47736617990232022, -0.0090985495690446179, 1.0010290096303327), (-0.47550514046134701, -0.010245647615796431, 1.0012057623110093), (-0.47357248148954412, -0.011484306193674508, 1.0014042170498367), (-0.47156852428969293, -0.012817852341586904, 1.0016260648969775), (-0.46949353488389106, -0.014249583455672264, 1.0018730240583564), (-0.46734784650404376, -0.015782759252440458, 1.0021468932578577), (-0.46513177736966149, -0.017420611641178547, 1.0024495106027058), (-0.46284566764174451, -0.01916634679983701, 1.002782769585433), (-0.46048984585324404, -0.021023115827221384, 1.0031486172263804), (-0.45806467708431764, -0.022994064988368167, 1.0035490570395174), (-0.4555705320461072, -0.025082293249258678, 1.0039861396905538), (-0.45300777408188114, -0.027290857611423677, 1.0044619576840477), (-0.45037679045149637, -0.02962278417162181, 1.0049786605147859), (-0.4476779912893869, -0.032081062111514465, 1.0055384464339019), (-0.44491177103147844, -0.034668641633809109, 1.0061435518195829), (-0.44207854204620495, -0.037388435692257617, 1.0067962658728122), (-0.43917874375173477, -0.040243306190076523, 1.0074989101028025), (-0.43621280987273181, -0.043236079956155651, 1.0082538630250726), (-0.4331811798555088, -0.04636954118358165, 1.0090635264004364), (-0.4300843121284012, -0.049646442208419662, 1.0099303618596125), (-0.42692267318612304, -0.053069473410920498, 1.0108568568063054), (-0.42369674322003342, -0.05664128652545989, 1.0118455358050653), (-0.4204070152809819, -0.060364479079310737, 1.0128989620100801), (-0.41705396250101168, -0.06424163186990943, 1.0140197428020863), (-0.4136381099193685, -0.068275238558734339, 1.0152104948033558), (-0.41015996374406216, -0.072467769074498214, 1.0164738927350541), (-0.40662005159372727, -0.076821627637463613, 1.0178126132422067), (-0.40301890053226014, -0.081339199054868458, 1.0192293988934467), (-0.39935705763559937, -0.086022775370832805, 1.0207269800326426), (-0.39563507662267688, -0.090874618532679752, 1.0223081358586299), (-0.39185350434454341, -0.095896946650047638, 1.0239756612764328), (-0.38801292916341146, -0.10109190864188038, 1.0257323930961495), (-0.3841139193898932, -0.10646159924626372, 1.0275811493548772), (-0.38015706421789958, -0.11200807036593119, 1.0295248035920748), (-0.37614295623103883, -0.11773331115612513, 1.0315662325219033), (-0.37207220579791189, -0.12363925689900934, 1.0337083376641059), (-0.36794541959878263, -0.12972778385811362, 1.0359540222951913), (-0.36376321906232723, -0.13600070590813484, 1.0383062050045615), (-0.35952623652866106, -0.14245978309506768, 1.0407678223913326), (-0.35523511910317362, -0.14910671132742356, 1.0433418256011453), (-0.35089050208564287, -0.1559431424284427, 1.0460311498294304), (-0.34649303626703276, -0.16297065205229877, 1.0488387639254326), (-0.34204339941803802, -0.17019075720021581, 1.0517676321378988), (-0.33754224450551518, -0.17760491965852759, 1.05482070792313), (-0.33299025451418018, -0.18521453916967687, 1.0580009560243164), (-0.32838812622809044, -0.19302093609893275, 1.0613113511064367), (-0.32373653410714021, -0.20102538631999758, 1.0647548413786618), (-0.3190361984155517, -0.20922909258710209, 1.0683343904687153), (-0.31428781044066573, -0.21763319380608323, 1.0720529405371475), (-0.30949209528370147, -0.22623877295391423, 1.0759134462292925), (-0.30464976904622643, -0.2350468294914104, 1.0799188302530833), (-0.29976156716841629, -0.24405831482174203, 1.0840720094933414), (-0.29482822329375719, -0.25327410038194853, 1.08837589789695), (-0.28985047676105491, -0.26269499907745197, 1.0928333818222529), (-0.28482907946195507, -0.27232175825029559, 1.0974473344310056), (-0.27976478789990811, -0.28215504323308571, 1.1022206081890178), (-0.27465836464004811, -0.29219546304470062, 1.1071560364357962), (-0.26951057917979943, -0.30244355589713673, 1.1122564286841423), (-0.26432220637326725, -0.31289978979147465, 1.1175245691333056), (-0.25909402813211685, -0.32356456251189009, 1.1229632151263116), (-0.2538268306430474, -0.33443820288760157, 1.1285750953596239), (-0.24852140781133444, -0.34552096907579705, 1.1343629081356741), (-0.24317855847052225, -0.3568130487998406, 1.1403293188758024), (-0.23779908741491002, -0.36831455910783051, 1.146476958592314), (-0.23238380463959019, -0.38002554579472397, 1.1528084222169983), (-0.22693352563814218, -0.39194598304701966, 1.1593262662103987), (-0.22144907140185352, -0.40407577334137146, 1.1660330069583509), (-0.21593126765866397, -0.41641474740732159, 1.1729311190910758), (-0.21038094553894857, -0.42896266355026419, 1.1800230329822328), (-0.20479894075821739, -0.44171920795161196, 1.187311133343578), (-0.19918609408161489, -0.45468399404095783, 1.194797757459469), (-0.19354325062872405, -0.46785656258438196, 1.2024851925830025), (-0.18787126030005602, -0.48123638150851777, 1.2103756746652103), (-0.18217097731450277, -0.49482284544661065, 1.2184713867629922), (-0.17644325998700286, -0.50861527637028747, 1.2267744562424323), (-0.17068897098484162, -0.5226129224097753, 1.235286953755359), (-0.16490897680842076, -0.53681495892046982, 1.2440108905285658), (-0.15910414803094369, -0.55122048726364137, 1.2529482178709352), (-0.15327535868839398, -0.56582853581659076, 1.2621008239207094), (-0.14742348663002369, -0.5806380593194338, 1.2714705327438247), (-0.14154941316204941, -0.59564793884213763, 1.2810591019441424), (-0.13565402283422315, -0.61085698201099903, 1.2908682213329896), (-0.12973820357084356, -0.62626392301111322, 1.3008995106724539), (-0.12380284618377223, -0.64186742276177167, 1.3111545185988021), (-0.11784884457779558, -0.65766606828371177, 1.3216347199827656), (-0.11187709527234972, -0.67365837379066851, 1.3323415144869371), (-0.10588849776263635, -0.68984277981851938, 1.3432762254413739), (-0.099883953871470657, -0.70621765398798353, 1.3544400972443227), (-0.093864367741649885, -0.72278129108939171, 1.3658342941198731), (-0.087830646010325877, -0.73953191252386508, 1.3774598984042659), (-0.081783697280677761, -0.7564676674722125, 1.3893179084578964), (-0.075724432306464431, -0.77358663239215042, 1.4014092378730194), (-0.069653763414849829, -0.79088681134863248, 1.4137347127749704), (-0.063572605055916415, -0.80836613638985655, 1.4262950711662394), (-0.057481872727544014, -0.82602246761128251, 1.4390909604548976), (-0.051382483895497016, -0.84385359345000366, 1.4521229365887647), (-0.045275357195569438, -0.8618572310235747, 1.4653914621058461), (-0.039161412026312556, -0.88003102627047547, 1.4788969047361393), (-0.033041569286910648, -0.89837255444929731, 1.4926395357348825), (-0.026916750722656545, -0.91687932020993901, 1.5066195285814403), (-0.020787878528609742, -0.93554875816204097, 1.520836957263483), (-0.014655875775902497, -0.95437823298836033, 1.5352917953314795), (-0.0085216658471105426, -0.97336504019687564, 1.5499839135214026), (-0.0023861727290572753, -0.99250640610146945, 1.5649130793537367)]This improves the average deviation to 1.14e-08, and does not change the worst deviation (still no worse than 1.26e-07).
Edit: Sorry to waste your time. It turns out that this approximation is not very fast after all. It probably has to do with the array indices.
#85
Posted 24 January 2011 - 03:16 PM
It is possible to use only 4 multiplication instead of 5 in the original algorithm.
float sine(float x)
{
const float pi = 3.1415926536f;
const float P = 0.225f;
const float A = 16 * sqrt(P);
const float B = (1 - P) / sqrt(P);
float y = x / (2 * pi);
// mulps xmm0, inv2PI
y = y - floor(y + 0.5); // y in range -0.5..0.5
// using SSE2 here, RN flag set
// cvtps2dq xmm1, xmm0
// cvtdq2ps xmm1, xmm1
// subps xmm0, xmm1
y = A * y * (0.5 - abs(y));
// movaps xmm1, xmm0
// andps xmm1, abs
// subps xmm1, half
// mulps xmm0, xmm1
// mulps xmm0, minusA
return y * (B + abs(y));
// movaps xmm1, xmm0
// andps xmm1, abs
// addps xmm1, B
// mulps xmm0, xmm1
}
#86
Posted 24 January 2011 - 11:12 PM
'}:+()___ [Smile said:
It is possible to use only 4 multiplication instead of 5 in the original algorithm.
Indeed one more multiply can be eliminated. It was a remnant of using the generic parabola formula A + B * x + C * x^2. With A being 0 it can be simplified to x * (B + C * x).
The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.
#87
Posted 24 February 2011 - 09:16 PM
inline float sine(float x)
{
const float B = 4/PI;
const float C = -4/(PI*PI);
const float recip2 = 1 / (PI * 2);
const float pi2 = PI * 2;
float y;
//convert the input value to a range of 0 to 1
x = (x + PI) * recip2;
//make it loop
x -= (long)(x-(x<0));
//convert back from 0-1 to -pi to +pi.
x = (x * pi2) - PI;
//original function
y = B * x + C * x * abs(x); //can abs be inlined? It's a matter of turning off the sign bit.
return (y);
//left out the higher precision.
}
inline float cosine(float x)
{
const float B = 4/PI;
const float C = -4/(PI*PI);
float y;
const float recip2 = 1 / (PI * 2);
const float pi2 = PI * 2;
const float hpi=PI/2;
x+=hpi; //shift for cosine
x = (x + PI) * recip2;
x -= (long)(x-(x<0));
x = (x * pi2) - PI;
y = B * x + C * x * abs(x);
return (y);
}
#88
Posted 25 February 2011 - 07:52 AM
The deviation also get pretty high, up to 0.056.
#89
Posted 25 February 2011 - 06:18 PM
Anyway, here's the assembler output from the offending code.
; 340 : //make it loop
; 341 : x -= (long)(x-(x<0));
mov DWORD PTR tv76[esp-4], 1
fld DWORD PTR _PI
fadd ST(1), ST(0)
fxch ST(1)
fmul DWORD PTR __real@3e22f983
fldz
fcomp ST(1)
fnstsw ax
test ah, 65 ; 00000041H
je SHORT $LN4@sine
mov DWORD PTR tv76[esp-4], 0
$LN4@sine:
; 342 : //convert back from 0-1 to -pi to +pi.
; 343 : x = (x * pi2) - PI;
#90
Posted 25 February 2011 - 07:24 PM
Also, thinking about it, it would be nice to have an SineCosine function that does both at the same time and saves a few cycles.
#91
Posted 26 February 2011 - 12:40 AM
long sign = (((long&)x & 0x80000000) >> 31); x -= (long)(x) - sign;which results in the following assembly:
; 42 : long sign = (((long&)x & 0x80000000) >> 31); ; 43 : x -= (long)(x) - sign; mov eax, DWORD PTR _x$[ebp] cvttss2si ecx, xmm0 shr eax, 31 ; 0000001fH xorps xmm2, xmm2 sub ecx, eax cvtsi2ss xmm2, ecxNo branches in sight, and the output is exactly the same as the original :)
On an a completely different note, I noticed that if I compacted the two statements to one
x -= (long)(x) - (((long&)x & 0x80000000) >> 31);
MSVC reverts to generating x87 assembly for this function, instead of SSE2, that it was instructed to use... Weird :wacko:
#92
Posted 26 February 2011 - 11:38 PM
IEEE floating point specifies four possible rounding modes:
:nearest
In this mode, the inexact results are rounded to the nearer of the two possible result values. If the neither possibility is nearer, then the even alternative is chosen. This form of rounding is also called ``round to even'', and is the form of rounding specified for the Common Lisp round function.
:positive-infinity
This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.
:negative-infinity
This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.
:zero
This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.
#93
Posted 26 February 2011 - 11:57 PM
What do you suddenly need rounding for? If you need rounding according to the current rounding mode, you can use cvtss2si instead.
#94
Posted 27 February 2011 - 05:35 PM
Dan_Ritchie said:
:nearest
In this mode, the inexact results are rounded to the nearer of the two possible result values. If the neither possibility is nearer, then the even alternative is chosen. This form of rounding is also called ``round to even'', and is the form of rounding specified for the Common Lisp round function.
:positive-infinity
This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.
:negative-infinity
This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.
:zero
This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.
Nitpick: IEEE-754 was updated in 2008 to add a variation of the "nearest" rounding mode that ties away from zero.
http://en.wikipedia....i/IEEE_754-2008
#95
Posted 28 February 2011 - 11:11 PM
You could avoid most of this if you just used the right mode for the cast, ...
; 42 : long sign = (((long&)x & 0x80000000) >> 31);
; 43 : x -= (long)(x) - sign;
and just have x-=(long)x. But it might be slower to set the rounding mode, I don't know.
#96
Posted 02 March 2011 - 10:45 AM
float sine(float x)
{
// Convert the input value to a range of -1 to 1
x = x * (1.0f / PI);
// Wrap around
volatile float z = (x + 25165824.0f);
x = x - (z - 25165824.0f);
#if LOW_SINE_PRECISION
return 4.0f * (x - x * abs(x));
#else
float y = x - x * abs(x);
const float Q = 3.1f;
const float P = 3.6f;
return y * (Q + P * abs(y));
#endif
}
Of course an AVX version of this code can compute 8 values in parallel, and it can make use of the VROUNDPS instruction to do the wrapping even faster. Future CPUs will also feature fused multiply-add (FMA).
#97
Posted 02 March 2011 - 08:21 PM
#98
Posted 02 March 2011 - 08:31 PM
#99
Posted 14 March 2011 - 01:06 AM
Nick said:
The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.
I really appreciate this fast sine algorithm. I was searching for something faster than the hardware fsin and hopefully trying to avoid lookup tables. Just a correction, you forgot the abs(x) in the first part of the equation. It should be:
New Simplified Parabola
y = x * (B + C * abs(x))
Refinement
y = y * (Q + P * abs(y))
ATTENTION FOR THOSE INTERESTED IN ACTUAL EXECUTION SPEEDS:
I did some experimentation writing this in SSE assembly code and as such have actual execution times and also can explain how to achieve even higher rates and also point out that not all SSE vector units are the same. Here's what I tried.
First I wrote it using scalar instructions to calculate a single sine. My code expects the input to be in the range of -PI to +PI as input. It uses 12 instructions to load an input number, calculate, then write back the result. Plus there's two instructions for the loop since I looped it 1 billion times to time against the ms clock. On my 2.2 GHz Athlon it was taking 30 clock cycles to calculate that one sine. Even that is much better than the average 90 to 100+ I was getting using the x87 FSIN instruction.
Since the algorithm has a lot of dependencies in it, most of those 30 clock cycles are spent waiting, so I interleaved another scalar sine calc in there and it only increased the time by 2 clock cycles: 32 clocks for two sines. Then I added another sine and it went to 36 clocks for three sines. Then I went to four sines and it went to 47 clocks which is 11.75 clocks per sine. Keep in mind this is running 4 sine calculations through in scalar mode which is more flexible if you're not working with variable 128 bit aligned and all in the same place.
Now here's where things get more unpredictable. One would assume that just replacing the scalar instructions with vector / packed instructions should give 16 vector sines in the same time as 4 scalar. It depends on the CPU. The earlier CPU's with SSE didn't process 4 floats at once. They only did two at once and pipelined the second pair behind the first. So that could slow it down almost to half. But it's possible to chop some clock cycles off of that because running 4 loads, then 4 movs, then 4 ands, then 4 mults, etc. doesn't effectly use the different execution units in the vector processor. So I shifted instructions around and interleaved them further. That took several clock cycles off the time. The end result on my old 2.2GHz Athlon is 70.5 clock cycles for 16 sines, so that's about 4.4 clocks per sine and a throughput of almost a half billion sines per second.
Then I moved the code to my 2.8GHz Athlon II that I bought last year. It actually does process 4 floats at once, so you'd think it would be almost twice as fast. Turns out it was even faster so there's more going on there that I know about. It calcs 16 sines in 29.5 clocks, so now we're at 1.84 clock cycles per sine and a throughput of over 1.5 billion sines per second, triple that of my older computer.
So now we have some realworld results. If the code were written for the Motorola Altivec/Velocity engine, the 12 instructions can be reduced to 8 because the PowerPC/Altivec architecture is modern and not some archaic 1975 architecture like the Intel CPU. On the PowerPC, the move instructions disappear because you get a free move with all PowerPC/Altivec instructions, and the fused Multiply/Add not only means reduced instructions and double the speed for over separate multiplies and adds, but it removes the latency caused by the add waiting for the multiply to finish (although my code fills in that wasted time because it calculates 16 sines instead of the original 4). Intel / AMD will finally have those capabilities too supposedly later this year. It's taken them 12 years to catch up. Unfortunately the main CPU instruction set is still crappy and inefficient, and I doubt that will ever change.
#100
Posted 14 March 2011 - 11:44 PM
Elhardt said:
I believe the Athlon II, which uses the K10 architecture, also benefits significantly from the extra instruction fetching capabilities. Load reordering also lessens the impact of cache misses.
Quote
Quote
There's no confirmation yet when Intel will support it, although it's part of the AVX spec. The good news is they can theoretically double the performance (again), while AMD chose to share two 128-bit units between two cores (for now).
Quote
The only thing that's sorely lacking from the instruction set, is support for gather/scatter operations (the parallel equivalent of load/store). That would allow to parallelize loops much more efficiently and together with FMA would make the CPU capable of very high effective throughput.
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users











