Fast and accurate sine/cosine

103 replies to this topic

#81 DroidFan1

New Member

• Members
• 7 posts

Posted 01 November 2010 - 07:31 PM

Nick,

If I understand correctly, your 'sine' function is limited to an input range of [-pi, pi]. Do you have any ideas of how to support a greater range?

Rich

#82 Nick

Senior Member

• Members
• 1227 posts

Posted 01 November 2010 - 10:45 PM

DroidFan1 said:

If I understand correctly, your 'sine' function is limited to an input range of [-pi, pi]. Do you have any ideas of how to support a greater range?
All you need is to reduce the input modulo pi to extend the range.

Note though that fmod() is typically quite slow, which can defeat the purpose of trying to outperform sin(). But since the denominator is a constant value (pi), it can be greatly optimized:

inline float mod_pi(float x)
{
    const float pi = 3.14159265f;
    float x1 = x * (1.0f / pi);
    return pi * (x1 - (int)x1);
}



#83 collinstocks

New Member

• Members
• 2 posts

Posted 09 November 2010 - 11:52 PM

Nick, your estimation is rather amazing. It inspired me to write my own version, which uses a similar number of operations but also a (relatively small) lookup table. As it turns out, a lookup table with 128 sets of numbers (A, B, C), each of which describes a parabolic estimation of a very short segment of the cosine function from 0 to pi/2, will give a maximum error of no more than 0.000000126 (1.26e-07) and an average error of about 0.0000000117 (1.17e-08).

For testing purposes, I wrote this in Python first; however, I will be porting it to C soon.

The precomputed values can be hard coded, but I will include the code that generated them here anyway:

import math

def parabolaFit(p1, p2, p3):
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x12 = x1 - x2
    x13 = x1 - x3
    x23 = x2 - x3
    y1x1213 = y1 / (x12 * x13)
    y2x1223 = y2 / (x12 * x23)
    y3x1323 = y3 / (x13 * x23)
    a = y1x1213 - y2x1223 + y3x1323
    b = -y1x1213 * (x2 + x3) + y2x1223 * (x1 + x3) - y3x1323 * (x1 + x2)
    c = y1x1213 * x2 * x3 - y2x1223 * x1 * x3 + y3x1323 * x1 * x2
    return a, b, c

def precomputeCosineModel(num):
    model = []
    size = math.pi / 2 / num
    for i in range(num):
        x1 = i * size
        x2 = x1 + size * 2 / 3  # These fractions are magic numbers that seem to produce a
        x3 = x1 + size * 7 / 6  # particularly good model. There may be better values.
        pt1 = (x1, math.cos(x1))
        pt2 = (x2, math.cos(x2))
        pt3 = (x3, math.cos(x3))
        model.append(parabolaFit(pt1, pt2, pt3))
    return model

MODEL = precomputeCosineModel(128)
TWOPI = 2 * math.pi
HALFPI = math.pi / 2
SCALE = 2 * (len(MODEL) - 1) / math.pi

def sine(theta):
    theta = theta % TWOPI
    t = abs((theta % math.pi) - HALFPI)
    whichparabola = int(t * SCALE + 0.5)
    a, b, c = MODEL[whichparabola]
    y = a * t * t + b * t + c
    if math.pi < theta:
        return -y
    return y
This therefore uses:

• 2 moduli (one of which yours also needs if it is to have a domain outside of [-pi, pi])
• 4 multiplications (maybe one more if I use your absolute value trick)
• 0 divisions (maybe 1 if I use your absolute value trick)
• 1 array index

I have not tested whether this ends up being faster or slower once it is in C, but it certainly has more precision. I thought I'd post it here in case it helped anyone who needed a different sort of time/memory/precision trade-off than your solution provided.

Edit: I left out a vital piece of code when I submitted this the first time: namely, the function parabolaFit, which takes three (x,y) points and returns (a,b,c) which describe the parabola through those points! Thank you, linear algebra, for allowing me to compute the inverse of [[x1^2, x1, 1], [x2^2, x2, 1], [x3^2, x3, 1]]

Edit 2: Inserted magic numbers into precomputeCosineModel function to make it produce a better model.

#84 collinstocks

New Member

• Members
• 2 posts

Posted 11 November 2010 - 04:18 AM

For anyone who is interested, optimal values for the parabola models seem to be (after several hours of brute forcing):

MODEL=[(-0.49992962755573234, -1.0980232887746368e-07, 1.0),
(-0.49990011985522964, -2.4878253437455193e-06, 1.0000000166453054),
(-0.49974116132402857, -1.0823941353727674e-05, 1.0000001252510031),
(-0.49950695415332391, -2.8806940785000143e-05, 1.0000004693985092),
(-0.49919754344650935, -6.0136395284521466e-05, 1.0000012612471667),
(-0.49881293270371074, -0.0001084815176659382, 1.0000027784910901),
(-0.49835321576722147, -0.00017753753167265685, 1.000005369676503),
(-0.49781843214313687, -0.00027097403126607919, 1.000009448527285),
(-0.49720869023281833, -0.00039245476393783832, 1.0000154964220234),
(-0.49652407307244251, -0.00054565792161023058, 1.0000240641696634),
(-0.49576467665906004, -0.00073421302752367187, 1.0000357652273157),
(-0.49493061808634725, -0.00096176832036338328, 1.0000512821260092),
(-0.49402202278274604, -0.0012319431790140179, 1.0000713623956952),
(-0.49303903481865458, -0.0015483704759287479, 1.0000968225839868),
(-0.49198178996369446, -0.0019146303399401911, 1.000128539616685),
(-0.49085046181016617, -0.0023343176828680761, 1.0001674569167662),
(-0.48964521327061084, -0.0028110083653727414, 1.0002145846954871),
(-0.48836620827730481, -0.0033482432689182556, 1.0002709938950045),
(-0.48701368498920855, -0.00394956446911393, 1.0003378233319153),
(-0.48558780343196667, -0.0046184801987749343, 1.0004162680274522),
(-0.48408879715993564, -0.0053584856502879585, 1.0005075900834677),
(-0.48251688626492351, -0.0061730584702190412, 1.0006131127485045),
(-0.48087231650446333, -0.0070656520075539388, 1.0007342181465122),
(-0.47915532957462753, -0.0080396847605567357, 1.0008723496156233),
(-0.47736617990232022, -0.0090985495690446179, 1.0010290096303327),
(-0.47550514046134701, -0.010245647615796431, 1.0012057623110093),
(-0.47357248148954412, -0.011484306193674508, 1.0014042170498367),
(-0.47156852428969293, -0.012817852341586904, 1.0016260648969775),
(-0.46949353488389106, -0.014249583455672264, 1.0018730240583564),
(-0.46734784650404376, -0.015782759252440458, 1.0021468932578577),
(-0.46513177736966149, -0.017420611641178547, 1.0024495106027058),
(-0.46284566764174451, -0.01916634679983701, 1.002782769585433),
(-0.46048984585324404, -0.021023115827221384, 1.0031486172263804),
(-0.45806467708431764, -0.022994064988368167, 1.0035490570395174),
(-0.4555705320461072, -0.025082293249258678, 1.0039861396905538),
(-0.45300777408188114, -0.027290857611423677, 1.0044619576840477),
(-0.45037679045149637, -0.02962278417162181, 1.0049786605147859),
(-0.4476779912893869, -0.032081062111514465, 1.0055384464339019),
(-0.44491177103147844, -0.034668641633809109, 1.0061435518195829),
(-0.44207854204620495, -0.037388435692257617, 1.0067962658728122),
(-0.43917874375173477, -0.040243306190076523, 1.0074989101028025),
(-0.43621280987273181, -0.043236079956155651, 1.0082538630250726),
(-0.4331811798555088, -0.04636954118358165, 1.0090635264004364),
(-0.4300843121284012, -0.049646442208419662, 1.0099303618596125),
(-0.42692267318612304, -0.053069473410920498, 1.0108568568063054),
(-0.42369674322003342, -0.05664128652545989, 1.0118455358050653),
(-0.4204070152809819, -0.060364479079310737, 1.0128989620100801),
(-0.41705396250101168, -0.06424163186990943, 1.0140197428020863),
(-0.4136381099193685, -0.068275238558734339, 1.0152104948033558),
(-0.41015996374406216, -0.072467769074498214, 1.0164738927350541),
(-0.40662005159372727, -0.076821627637463613, 1.0178126132422067),
(-0.40301890053226014, -0.081339199054868458, 1.0192293988934467),
(-0.39935705763559937, -0.086022775370832805, 1.0207269800326426),
(-0.39563507662267688, -0.090874618532679752, 1.0223081358586299),
(-0.39185350434454341, -0.095896946650047638, 1.0239756612764328),
(-0.38801292916341146, -0.10109190864188038, 1.0257323930961495),
(-0.3841139193898932, -0.10646159924626372, 1.0275811493548772),
(-0.38015706421789958, -0.11200807036593119, 1.0295248035920748),
(-0.37614295623103883, -0.11773331115612513, 1.0315662325219033),
(-0.37207220579791189, -0.12363925689900934, 1.0337083376641059),
(-0.36794541959878263, -0.12972778385811362, 1.0359540222951913),
(-0.36376321906232723, -0.13600070590813484, 1.0383062050045615),
(-0.35952623652866106, -0.14245978309506768, 1.0407678223913326),
(-0.35523511910317362, -0.14910671132742356, 1.0433418256011453),
(-0.35089050208564287, -0.1559431424284427, 1.0460311498294304),
(-0.34649303626703276, -0.16297065205229877, 1.0488387639254326),
(-0.34204339941803802, -0.17019075720021581, 1.0517676321378988),
(-0.33754224450551518, -0.17760491965852759, 1.05482070792313),
(-0.33299025451418018, -0.18521453916967687, 1.0580009560243164),
(-0.32838812622809044, -0.19302093609893275, 1.0613113511064367),
(-0.32373653410714021, -0.20102538631999758, 1.0647548413786618),
(-0.3190361984155517, -0.20922909258710209, 1.0683343904687153),
(-0.31428781044066573, -0.21763319380608323, 1.0720529405371475),
(-0.30949209528370147, -0.22623877295391423, 1.0759134462292925),
(-0.30464976904622643, -0.2350468294914104, 1.0799188302530833),
(-0.29976156716841629, -0.24405831482174203, 1.0840720094933414),
(-0.29482822329375719, -0.25327410038194853, 1.08837589789695),
(-0.28985047676105491, -0.26269499907745197, 1.0928333818222529),
(-0.28482907946195507, -0.27232175825029559, 1.0974473344310056),
(-0.27976478789990811, -0.28215504323308571, 1.1022206081890178),
(-0.27465836464004811, -0.29219546304470062, 1.1071560364357962),
(-0.26951057917979943, -0.30244355589713673, 1.1122564286841423),
(-0.26432220637326725, -0.31289978979147465, 1.1175245691333056),
(-0.25909402813211685, -0.32356456251189009, 1.1229632151263116),
(-0.2538268306430474, -0.33443820288760157, 1.1285750953596239),
(-0.24852140781133444, -0.34552096907579705, 1.1343629081356741),
(-0.24317855847052225, -0.3568130487998406, 1.1403293188758024),
(-0.23779908741491002, -0.36831455910783051, 1.146476958592314),
(-0.23238380463959019, -0.38002554579472397, 1.1528084222169983),
(-0.22693352563814218, -0.39194598304701966, 1.1593262662103987),
(-0.22144907140185352, -0.40407577334137146, 1.1660330069583509),
(-0.21593126765866397, -0.41641474740732159, 1.1729311190910758),
(-0.21038094553894857, -0.42896266355026419, 1.1800230329822328),
(-0.20479894075821739, -0.44171920795161196, 1.187311133343578),
(-0.19918609408161489, -0.45468399404095783, 1.194797757459469),
(-0.19354325062872405, -0.46785656258438196, 1.2024851925830025),
(-0.18787126030005602, -0.48123638150851777, 1.2103756746652103),
(-0.18217097731450277, -0.49482284544661065, 1.2184713867629922),
(-0.17644325998700286, -0.50861527637028747, 1.2267744562424323),
(-0.17068897098484162, -0.5226129224097753, 1.235286953755359),
(-0.16490897680842076, -0.53681495892046982, 1.2440108905285658),
(-0.15910414803094369, -0.55122048726364137, 1.2529482178709352),
(-0.15327535868839398, -0.56582853581659076, 1.2621008239207094),
(-0.14742348663002369, -0.5806380593194338, 1.2714705327438247),
(-0.14154941316204941, -0.59564793884213763, 1.2810591019441424),
(-0.13565402283422315, -0.61085698201099903, 1.2908682213329896),
(-0.12973820357084356, -0.62626392301111322, 1.3008995106724539),
(-0.12380284618377223, -0.64186742276177167, 1.3111545185988021),
(-0.11784884457779558, -0.65766606828371177, 1.3216347199827656),
(-0.11187709527234972, -0.67365837379066851, 1.3323415144869371),
(-0.10588849776263635, -0.68984277981851938, 1.3432762254413739),
(-0.099883953871470657, -0.70621765398798353, 1.3544400972443227),
(-0.093864367741649885, -0.72278129108939171, 1.3658342941198731),
(-0.087830646010325877, -0.73953191252386508, 1.3774598984042659),
(-0.081783697280677761, -0.7564676674722125, 1.3893179084578964),
(-0.075724432306464431, -0.77358663239215042, 1.4014092378730194),
(-0.069653763414849829, -0.79088681134863248, 1.4137347127749704),
(-0.063572605055916415, -0.80836613638985655, 1.4262950711662394),
(-0.057481872727544014, -0.82602246761128251, 1.4390909604548976),
(-0.051382483895497016, -0.84385359345000366, 1.4521229365887647),
(-0.045275357195569438, -0.8618572310235747, 1.4653914621058461),
(-0.039161412026312556, -0.88003102627047547, 1.4788969047361393),
(-0.033041569286910648, -0.89837255444929731, 1.4926395357348825),
(-0.026916750722656545, -0.91687932020993901, 1.5066195285814403),
(-0.020787878528609742, -0.93554875816204097, 1.520836957263483),
(-0.014655875775902497, -0.95437823298836033, 1.5352917953314795),
(-0.0085216658471105426, -0.97336504019687564, 1.5499839135214026),
(-0.0023861727290572753, -0.99250640610146945, 1.5649130793537367)]
This improves the average deviation to 1.14e-08, and does not change the worst deviation (still no worse than 1.26e-07).

Edit: Sorry to waste your time. It turns out that this approximation is not very fast after all. It probably has to do with the array indices.

#85 }:+()___ (Smile)

Member

• Members
• 169 posts

Posted 24 January 2011 - 03:16 PM

Found this interesting algorithm and wonder, why use extra multiplication?

It is possible to use only 4 multiplications instead of 5 in the original algorithm.


float sine(float x)
{
    const float pi = 3.1415926536f;
    const float P = 0.225f;
    const float A = 16 * sqrt(P);
    const float B = (1 - P) / sqrt(P);

    float y = x / (2 * pi);
    //   mulps xmm0, inv2PI

    y = y - floor(y + 0.5);  // y in range -0.5..0.5
    // using SSE2 here, RN flag set
    //   cvtps2dq xmm1, xmm0
    //   cvtdq2ps xmm1, xmm1
    //   subps xmm0, xmm1

    y = A * y * (0.5 - abs(y));
    //   movaps xmm1, xmm0
    //   andps xmm1, abs
    //   subps xmm1, half
    //   mulps xmm0, xmm1
    //   mulps xmm0, minusA

    return y * (B + abs(y));
    //   movaps xmm1, xmm0
    //   andps xmm1, abs
    //   mulps xmm0, xmm1
}



Sorry my broken english!

#86 Nick

Senior Member

• Members
• 1227 posts

Posted 24 January 2011 - 11:12 PM

}:+()___ (Smile) said:

Found this interesting algorithm and wonder, why use extra multiplication?

It is possible to use only 4 multiplications instead of 5 in the original algorithm.
Finally someone notices! :happy:

Indeed one more multiply can be eliminated. It was a remnant of using the generic parabola formula A + B * x + C * x^2. With A being 0 it can be simplified to x * (B + C * x).

The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.
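Putting both simplifications together, and assuming the input is already reduced to [-pi, pi], the whole approximation comes down to four multiplies. A sketch using the constants from this thread (B = 4/pi, C = -4/pi², P = 0.225, Q = 1 - P); `fast_sine` is an invented name:

```c
#include <math.h>

/* Sketch: 4-multiply sine for x in [-pi, pi].
   y = x * (B + C*|x|) is the parabola x * (B + C*x) with |x| restoring
   symmetry; y * (Q + P*|y|) is the refinement, with Q = 1 - P so no
   subtraction is needed. */
static inline float fast_sine(float x)
{
    const float B = 1.2732395447f;   /*  4 / pi      */
    const float C = -0.4052847346f;  /* -4 / (pi*pi) */
    const float P = 0.225f;
    const float Q = 0.775f;          /* 1 - P */

    float y = x * (B + C * fabsf(x));  /* parabola: 2 multiplies */
    return y * (Q + P * fabsf(y));     /* refinement: 2 multiplies */
}
```

The maximum error of this refined form is about 0.001 over the full [-pi, pi] range, per the earlier posts in the thread.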

#87 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 24 February 2011 - 09:16 PM

I thought I'd try to pull this all together into a function that wraps around beyond pi/-pi. I might have missed a few optimizations along the way, but there are no divides and no branching, so I think it lends itself to SSE, and it works with positive/negative inputs. Somebody care to check me on this?

inline float sine(float x)
{
const float B = 4/PI;
const float C = -4/(PI*PI);
const float recip2 = 1 / (PI * 2);
const float pi2 = PI * 2;
float y;

//convert the input value to a range of 0 to 1
x = (x + PI) * recip2;
//make it loop
x -= (long)(x-(x<0));
//convert back from 0-1 to -pi to +pi.
x = (x * pi2) - PI;
//original function
y = B * x + C * x * abs(x); //can abs be inlined? It's a matter of turning off the sign bit.
return (y);
//left out the higher precision.
}

inline float cosine(float x)
{
const float B = 4/PI;
const float C = -4/(PI*PI);
float y;
const float recip2 = 1 / (PI * 2);
const float pi2 = PI * 2;
const float hpi=PI/2;
x+=hpi; //shift for cosine
x = (x + PI) * recip2;
x -= (long)(x-(x<0));
x = (x * pi2) - PI;
y = B * x + C * x * abs(x);
return (y);
}

#88 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 25 February 2011 - 07:52 AM

(x<0) creates a branch :)
The deviation also gets pretty high, up to 0.056.
"Stupid bug! You go squish now!!" - Homer Simpson

#89 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 25 February 2011 - 06:18 PM

My x86 is a little rusty. I grew up in 68000 land and I really don't know how the x86 C compiler works, but it looks like it may be generating a branch. So, is branchless logic not possible on x86? I remember doing it in MMX once, but does plain x86 allow for it? Is there anything similar to test-and-set, where you compare two values and set a result?

Anyway, here's the assembler output from the offending code.

; 340 : //make it loop
; 341 : x -= (long)(x-(x<0));

mov    DWORD PTR tv76[esp-4], 1
fld    DWORD PTR _PI
fxch   ST(1)
fmul   DWORD PTR __real@3e22f983
fldz
fcomp  ST(1)
fnstsw ax
test   ah, 65          ; 00000041H
je     SHORT $LN4@sine
mov    DWORD PTR tv76[esp-4], 0
$LN4@sine:

; 342 : //convert back from 0-1 to -pi to +pi.
; 343 : x = (x * pi2) - PI;

#90 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 25 February 2011 - 07:24 PM

OK, so after doing some checking, x86 does have a CMOV opcode (conditional move), but Visual Studio doesn't use it. Anyway, I'd love to see if anyone has a better solution. I'm really, really not up for doing any assembler at the moment.

Also, thinking about it, it would be nice to have a SineCosine function that computes both at the same time and saves a few cycles.

#91 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 26 February 2011 - 12:40 AM

Yeah, MSVC doesn't use the cmovxx instructions, for some reason. Anyway, there are more ways to test whether a number is negative, one of them being:
	long sign = (((long&)x & 0x80000000) >> 31);

x -= (long)(x) - sign;


which results in the following assembly:
; 42   : 	long sign = (((long&)x & 0x80000000) >> 31);

; 43   : 	x -= (long)(x) - sign;

mov	eax, DWORD PTR _x$[ebp]

cvttss2si ecx, xmm0

shr	eax, 31					; 0000001fH

xorps	xmm2, xmm2

sub	ecx, eax

cvtsi2ss xmm2, ecx


No branches in sight, and the output is exactly the same as the original :)
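The same sign-bit trick can be written in portable C with memcpy instead of the reference cast, staying clear of strict-aliasing problems (a sketch; `wrap01` is an invented name):

```c
#include <stdint.h>
#include <string.h>

/* Branchless wrap of x into [0, 1): truncate toward zero, then subtract
   the extracted sign bit so negative inputs effectively floor toward
   negative infinity. */
static inline float wrap01(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined type pun */
    long sign = (long)(bits >> 31);  /* 1 if x is negative, else 0 */
    return x - (float)((long)x - sign);
}
```

Compilers turn the memcpy into a plain register move, so this costs the same as the reference-cast version while remaining legal C.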

On a completely different note, I noticed that if I compacted the two statements into one

x -= (long)(x) - (((long&)x & 0x80000000) >> 31);

MSVC reverts to generating x87 assembly for this function instead of the SSE2 it was instructed to use... Weird :wacko:
"Stupid bug! You go squish now!!" - Homer Simpson

#92 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 26 February 2011 - 11:38 PM

There's a FRNDINT opcode in x86 that rounds according to the rounding mode in the FP control word. That might also do the trick. I have some code around here somewhere for changing the control word.

IEEE floating point specifies four possible rounding modes:

:nearest
In this mode, inexact results are rounded to the nearer of the two possible result values. If neither possibility is nearer, the even alternative is chosen. This form of rounding is also called "round to even", and is the form of rounding specified for the Common Lisp round function.

:positive-infinity
This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.

:negative-infinity
This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.

:zero
This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.

#93 Kenneth Gorking

Senior Member

• Members
• 939 posts

Posted 26 February 2011 - 11:57 PM

x87 code has been deprecated since 2005, so you should stick with SSE. Also, FRNDINT is slow on all processors.

What do you suddenly need rounding for? If you need rounding according to the current rounding mode, you can use cvtss2si instead.
"Stupid bug! You go squish now!!" - Homer Simpson

#94 kusma

Valued Member

• Members
• 163 posts

Posted 27 February 2011 - 05:35 PM

Dan_Ritchie said:

IEEE floating point specifies four possible rounding modes:

:nearest
In this mode, inexact results are rounded to the nearer of the two possible result values. If neither possibility is nearer, the even alternative is chosen. This form of rounding is also called "round to even", and is the form of rounding specified for the Common Lisp round function.

:positive-infinity
This mode rounds inexact results to the possible value closer to positive infinity. This is analogous to the Common Lisp ceiling function.

:negative-infinity
This mode rounds inexact results to the possible value closer to negative infinity. This is analogous to the Common Lisp floor function.

:zero
This mode rounds inexact results to the possible value closer to zero. This is analogous to the Common Lisp truncate function.

Nitpick: IEEE-754 was updated in 2008 to add a variation of the "nearest" rounding mode that ties away from zero.
http://en.wikipedia....i/IEEE_754-2008

#95 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 28 February 2011 - 11:11 PM

>>What do you suddenly need rounding for?

You could avoid most of this if you just used the right mode for the cast, ...

; 42 : long sign = (((long&)x & 0x80000000) >> 31);
; 43 : x -= (long)(x) - sign;

and just have x-=(long)x. But it might be slower to set the rounding mode, I don't know.

#96 Nick

Senior Member

• Members
• 1227 posts

Posted 02 March 2011 - 10:45 AM

The range wrapping can be done by making use of the fact that an IEEE-754 single-precision floating-point number has 23 mantissa bits. Also, you don't have to convert back to the [-PI, PI] range; just adjust the parabola coefficients:

float sine(float x)
{
    // Convert the input value to a range of -1 to 1
    x = x * (1.0f / PI);

    // Wrap around
    volatile float z = (x + 25165824.0f);
    x = x - (z - 25165824.0f);

#if LOW_SINE_PRECISION
    return 4.0f * (x - x * abs(x));
#else
    float y = x - x * abs(x);

    const float Q = 3.1f;
    const float P = 3.6f;

    return y * (Q + P * abs(y));
#endif
}
Of course an AVX version of this code can compute 8 values in parallel, and it can make use of the VROUNDPS instruction to do the wrapping even faster. Future CPUs will also feature fused multiply-add (FMA).
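The wrapping constant can be checked in isolation: 25165824 = 1.5 · 2^24, and at that magnitude a single-precision float has a spacing of 2 between representable values, so round-to-nearest snaps x + 25165824 to the nearest even integer, i.e. the nearest multiple of the normalized period. A sketch (`wrap_even` is an invented name):

```c
/* Round-trip through the magic constant: the add forces rounding to the
   nearest multiple of 2, the subtract recovers that multiple, and x minus
   it lands in [-1, 1]. volatile keeps the compiler from folding the add
   and subtract away at full precision. */
static float wrap_even(float x)
{
    volatile float z = x + 25165824.0f;  /* 1.5 * 2^24 */
    return x - (z - 25165824.0f);
}
```

This assumes round-to-nearest is in effect and that the intermediate really is computed in single precision (SSE, not 80-bit x87).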

#97 Dan_Ritchie

New Member

• Members
• 16 posts

Posted 02 March 2011 - 08:21 PM

Couldn't you make 1/pi a const to avoid the division, or does the compiler do that?

#98 Reedbeta

DevMaster Staff

• 5344 posts
• Location: Santa Clara, CA

Posted 02 March 2011 - 08:31 PM

Any compiler worth its salt should do constant folding, especially the way Nick wrote it as an explicit multiplication by the reciprocal.
reedbeta.com - developer blog, OpenGL demos, and other projects

#99 Elhardt

New Member

• Members
• 3 posts

Posted 14 March 2011 - 01:06 AM

Nick said:

Indeed one more multiply can be eliminated. It was a remnant of using the generic parabola formula A + B * x + C * x^2. With A being 0 it can be simplified to x * (B + C * x).

The second part can also be optimized. Instead of computing Q * y + P * y * abs(y) as P * (y * abs(y) - y) + y, which is possible since Q + P = 1, it can also be rewritten as y * (Q + P * abs(y)). This eliminates the subtraction.

I really appreciate this fast sine algorithm. I was searching for something faster than the hardware fsin and hopefully trying to avoid lookup tables. Just a correction, you forgot the abs(x) in the first part of the equation. It should be:

New Simplified Parabola
y = x * (B + C * abs(x))
Refinement
y = y * (Q + P * abs(y))

ATTENTION FOR THOSE INTERESTED IN ACTUAL EXECUTION SPEEDS:

I did some experimentation writing this in SSE assembly code, so I have actual execution times. I can also explain how to achieve even higher rates, and point out that not all SSE vector units are the same. Here's what I tried.

First I wrote it using scalar instructions to calculate a single sine. My code expects the input to be in the range of -PI to +PI. It uses 12 instructions to load an input number, calculate, then write back the result. Plus there are two instructions for the loop, since I looped it 1 billion times to time it against the ms clock. On my 2.2 GHz Athlon it was taking 30 clock cycles to calculate that one sine. Even that is much better than the average 90 to 100+ I was getting using the x87 FSIN instruction.

Since the algorithm has a lot of dependencies in it, most of those 30 clock cycles are spent waiting, so I interleaved another scalar sine calc in there and it only increased the time by 2 clock cycles: 32 clocks for two sines. Then I added another sine and it went to 36 clocks for three sines. Then I went to four sines and it went to 47 clocks, which is 11.75 clocks per sine. Keep in mind this is running 4 sine calculations through in scalar mode, which is more flexible if you're not working with variables that are 128-bit aligned and all in the same place.

Now here's where things get more unpredictable. One would assume that just replacing the scalar instructions with vector/packed instructions should give 16 vector sines in the same time as 4 scalar. It depends on the CPU. The earlier CPUs with SSE didn't process 4 floats at once. They only did two at once and pipelined the second pair behind the first. So that could slow it down almost to half. But it's possible to chop some clock cycles off of that, because running 4 loads, then 4 movs, then 4 ands, then 4 mults, etc. doesn't effectively use the different execution units in the vector processor. So I shifted instructions around and interleaved them further. That took several clock cycles off the time. The end result on my old 2.2 GHz Athlon is 70.5 clock cycles for 16 sines, so that's about 4.4 clocks per sine and a throughput of almost half a billion sines per second.

Then I moved the code to my 2.8 GHz Athlon II that I bought last year. It actually does process 4 floats at once, so you'd think it would be almost twice as fast. It turns out it was even faster, so there's more going on there than I know about. It calcs 16 sines in 29.5 clocks, so now we're at 1.84 clock cycles per sine and a throughput of over 1.5 billion sines per second, triple that of my older computer.

So now we have some real-world results. If the code were written for the Motorola Altivec/Velocity engine, the 12 instructions could be reduced to 8, because the PowerPC/Altivec architecture is modern and not some archaic 1975 architecture like the Intel CPU. On the PowerPC, the move instructions disappear because you get a free move with all PowerPC/Altivec instructions, and the fused multiply-add not only means fewer instructions and double the speed over separate multiplies and adds, but it removes the latency caused by the add waiting for the multiply to finish (although my code fills in that wasted time anyway, because it calculates 16 sines instead of the original 4). Intel / AMD will finally have those capabilities too supposedly later this year. It's taken them 12 years to catch up. Unfortunately the main CPU instruction set is still crappy and inefficient, and I doubt that will ever change.

#100 Nick

Senior Member

• Members
• 1227 posts

Posted 14 March 2011 - 11:44 PM

Elhardt said:

It calcs 16 sines in 29.5 clocks, so now we're at 1.84 clock cycles per sine and a throughput of over 1.5 billion sines per second, triple that of my older computer.
Thanks for sharing the results of your implementation!

I believe the Athlon II, which uses the K10 architecture, also benefits significantly from the extra instruction fetching capabilities. Load reordering also lessens the impact of cache misses.

Quote

On the PowerPC, the move instructions disappear because you get a free move with all PowerPC/Altivec instructions...
The AVX instruction set, supported by Intel's latest Sandy Bridge architecture, also features independent destination operands.

Quote

Intel / AMD will finally have those capabilities too supposedly later this year.
AMD will support FMA4 in the Bulldozer architecture, to be launched this summer.

There's no confirmation yet when Intel will support it, although it's part of the AVX spec. The good news is they can theoretically double the performance (again), while AMD chose to share two 128-bit units between two cores (for now).

Quote

It's taken them 12 years to catch up. Unfortunately the main CPU instruction set is still crappy and inefficient, and I doubt that will ever change.
The instruction set may be flawed, but that hardly matters. Decoding the x86 instructions into RISC micro-operations only takes a few percent of the chip space. And it still shrinks with every generation.

The only thing that's sorely lacking from the instruction set is support for gather/scatter operations (the parallel equivalent of load/store). That would allow loops to be parallelized much more efficiently and, together with FMA, would make the CPU capable of very high effective throughput.
