# Fast and accurate sine/cosine

103 replies to this topic

### #21Nick

Senior Member

• Members
• 1227 posts

Posted 25 April 2006 - 01:06 AM

tbp said:

1998 has called and wants the 3DNow pfrcp*/pfrsq* back.
pfrcp has 14-bit mantissa precision, pfrsqrt 15-bit. That's not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn't matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.

Quote

You can spin it till hell freezes over; the Intel & AMD optimization manuals clearly say: thou shall remove thy ugly branches. If you don't lend me any credit, maybe that should get your attention, I think.
Sorry, what credit are you asking for then? :huh: And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4), so it's hard to get my attention with that, sorry. I also took several courses on processor architecture and digital design, if that matters...

Quote

So we have to remove branches, and for that more often than not we'll get into the cond move pattern.
Now let's see what we had for ages on the integer side of those same CPUs. Surprise! There's both those "binary logic" ops and... tada... cmov.
I'm sure the dudes at Intel thought they had encoding bits and silicon to waste and were like, let's add another instruction just for the fun.

Really I'm asking for consistency. And common sense.
My point was that using a couple of binary logic instructions is quite effective at removing some branches. Sure, there is no cmov equivalent, but that's not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that's a good thing, because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.
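For reference, the standard branchless select built from those logic ops looks like this (a minimal sketch with raw SSE intrinsics; select_ps is just an illustrative name, not an actual instruction):

#include <xmmintrin.h>

// Branchless select: r[i] = mask[i] ? a[i] : b[i]. The mask comes from a
// vector compare such as _mm_cmplt_ps, which yields all-ones or all-zeros
// per component, so no flags register is involved.
static inline __m128 select_ps(const __m128 mask, const __m128 a, const __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// usage: per-component r = (x < y) ? a : b
// __m128 r = select_ps(_mm_cmplt_ps(x, y), a, b);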

Implementing cmov in the integer pipeline is easy since flags have always been there. They did the right thing for SSE by giving it a cleaner modern architecture. That means less consistency but I don't mind. A single scalar SSE version of cmov would probably have been doable but then there is no vector equivalent (no consistency) and it's pretty useless anyway. For predictable branches you can still use comiss.

So I'm not sure what you're asking for exactly... :surrender

Quote

And allow me to put the "they know what they're doing" back into proper historical context: those are the same guys who went with a segmented memory model at the very time Motorola was going the flat way. Or who later saw the light again, having thought a stack-based FPU was the best thing since sliced bread when everyone and his brother was using registers. :whistle:
Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain't perfect, but it's what makes my PC tick, and almost everyone else's. As for x87, yes, the register stack can be annoying, but its design started in 1976 and it uses the robust IEEE 754 standard (also created by Intel) that we still use today.

Quote

Please save me that kind of PR, because even if my memory may not be what it used to be, I was there.
I'm sorry, but what PR? I have an AMD desktop and an Intel laptop. And I'm willing to change to something else if it's fast, affordable, easy to develop for and flexible enough for the future.

Anyway, I have high hopes for the future of x86. Intel finally realized clock frequency isn't the only thing that determines performance. By increasing issue width, making SIMD units 4-wide instead of only 2-wide, and using reverse Hyper-Threading, floating-point performance can still increase a lot.

### #22tbp

Valued Member

• Members
• 135 posts

Posted 25 April 2006 - 03:03 AM

Nick said:

pfrcp has 14-bit mantissa precision, pfrsqrt 15-bit. That's not a whole lot better than the SSE versions. I mean, especially for the purposes these instructions are used for it shouldn't matter that much (e.g. vector normalization). And the 3DNow! instructions have quite high latencies on Athlon 64 as well.
Reciprocals and square roots do happen outside of vector normalization, you know.
Fact is I need both of them, fast and robust, and if possible with a bit pattern that doesn't look like it came from an RNG.

Plus I wasn't addressing the relative latency of 3DNow! & SSE but functionalities. It seems you've missed the wildcard in my previous post, so please check the "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co.
Compare that to:
static __m128 rcp_sqrt_nr_robust_ps(const __m128 arg) {
    // one Newton-Raphson step on the rsqrtps estimate:
    // y' = 0.5 * y * (3 - arg * y * y)
    const __m128
        nr    = rsqrtps(arg),
        muls  = mulps(mulps(nr, nr), arg),
        beta  = mulps(rt::cst::section.half, nr),
        gamma = subps(rt::cst::section.three, muls),
        final = mulps(beta, gamma);
    return final;
}
I could have spammed the same kind of ugliness for related ops like 1/x etc... Not so attractive anymore, eh? (or robust, in fact)
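For the 1/x case, the analogous refinement is one Newton-Raphson step on rcpps, y' = y * (2 - x * y); a sketch with raw intrinsics (rcp_nr_ps is a made-up name):

#include <xmmintrin.h>

// One Newton-Raphson step on the rcpps estimate: y' = y * (2 - x * y).
// Roughly doubles the ~12 bits of mantissa precision rcpps delivers.
static inline __m128 rcp_nr_ps(const __m128 x)
{
    const __m128 y   = _mm_rcp_ps(x);
    const __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));
}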

Nick said:

Sorry, what credit are you asking for then? :huh: And I read almost every page of the Intel manuals (plus some chapters in The Unabridged Pentium 4), so it's hard to get my attention with that, sorry. I also took several courses on processor architecture and digital design, if that matters...
That's fine & dandy, but one has to wonder if we're reading the same manuals.

From a quick search through the IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-012:
2-85, Vectorization
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches"

2-89, User/Source Coding Rules
"User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. 2-86"

3-37, Instruction Selection
"The following section gives some guidelines for choosing instructions to complete a task. One barrier to SIMD computation can be the existence of data-dependent branches. Conditional moves can be used to eliminate data-dependent branches. Conditional moves can be emulated in SIMD computation by using masked compares and logicals, as shown in Example3-21."

4-2, General Rules on SIMD Integer Code
"Emulate conditional moves by using masked compares and logicals instead of using conditional branches."

Should I quote some more, or did you get the point by now?

Nick said:

My point was that using a couple binary logic instructions is quite effective to remove some branches. Sure, there is no cmov equivalent, but that's not so terrible. Besides, what would that instruction look like anyway? There is no vector of compare flags. And that's a good thing because they would add a whole lot of hardware complexity (making the latencies even bigger). The current SSE architecture is pretty clean in my opinion.
Excuse me, but isn't one of the main points of a CISC arch to provide code compression via byzantine instruction encoding?
I was kind enough to restrict my example to x86 and I've never ever alluded to flags. I've merely pointed out the inconsistency where on one hand you have cmov* on the ALU side, fcmov* on the FPU side, and nothing like that for SSE. You'll notice those cmov* and fcmov* could also be emulated.

I would have expected a single op doing exactly what an andn/and/or sequence does; admittedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator), or encode it as a prefix, or whatever.
It makes no sense, when decoding bandwidth is scarce and you're struggling with throughput, to say the norm is to replace conditional branches with a 3 freaking op sequence.

Nick said:

Intel has always chosen the budget solution, and look where they are now. Sure, x86 ain't perfect, but it's what makes my PC tick, and almost everyone else's. As for x87, yes, the register stack can be annoying, but its design started in 1976 and it uses the robust IEEE 754 standard (also created by Intel) that we still use today.
Revisionist kids these days... :)
You know, there was a (computing) world before Intel, and it will still be there when everybody has forgotten about them.

Repeat after me: stack-based FPUs stink, they never ever made sense. Ever.
Again, it might come as a surprise, but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882

Now, rather than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rosy tinted glasses.
http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

### #23Nick

Senior Member

• Members
• 1227 posts

Posted 25 April 2006 - 06:26 AM

tbp said:

Reciprocals and square roots do happen outside of vector normalization, you know.
Indeed I know.

Quote

Fact is I need both of them, fast and robust, and if possible with a bit pattern that doesn't look like it came from an RNG.
Then please explain what you're using them for exactly and what's the source of your frustration. Because right now you're barking up the wrong tree.

Quote

Plus I wasn't addressing the relative latency of 3DNow! & SSE but functionalities. It seems you've missed the wildcard in my previous post, so please check the "3DNow! Technology Manual" for PFRCPIT1, PFRCPIT2, PFRSQIT1 & co. Compare that to: [the code above] I could have spammed the same kind of ugliness for related ops like 1/x etc... Not so attractive anymore, eh? (or robust, in fact)
I already know how Newton-Raphson refinement works and the related 3DNow! instructions; I use it myself. Now what's your point? That SSE intrinsics can be unattractive? I've got hundreds of lines of run-time intrinsics (http://www.devmaster.net/articles/linear-optimization/) that are very tricky, and it's not all SSE code.
Quote

Should I quote some more, or did you get the point by now?
No need to quote anything, I know the optimization manual by heart. I still don't get your point though. You want an SSE cmov? Show me how that would work.

Quote

Excuse me, but isn't one of the main points of a CISC arch to provide code compression via byzantine instruction encoding?
As I'm sure you know, every modern processor is RISC behind the decoder. So even if they added every exotic instruction you can think of, they wouldn't execute much faster than when you implement them yourself. That's also why SSE is very RISC-like, and adding more instructions would just be a waste of opcodes and microcode ROM. Compact code ain't that important any more.

Besides, just because you see a use for certain instructions in your specific application doesn't mean they need to be added to the instruction set. They constantly have to weigh transistor cost and overhead against the speedup over all existing software. It simply pays off more to improve the architecture and not touch the instruction set. Even though it's CISC, you can't go adding instructions like mad.

Quote

I would have expected a single op doing exactly what an andn/and/or sequence does; admittedly ternary ops aren't the norm on x86, but anything would have done the trick: use a blessed register as the implicit mask (aka accumulator), or encode it as a prefix, or whatever.
Sorry, that's not a good idea. You would be giving a nice modern architecture a special-purpose accumulator register and instructions with side effects. That's actually worse than having an FPU with a register stack.

Ternary operations would require register files with extra read ports. The area of a register file is proportional to the square of the number of ports. It also has a direct influence on latencies. The only time that can be acceptable is if all operations are ternary (which saves move operations).
Quote

It makes no sense, when decoding bandwidth is scarce and you're struggling with throughput, to say the norm is to replace conditional branches with a 3 freaking op sequence.
There is nothing we can change about the masking sequence. But that's not the actual problem anyway. Future x86 CPUs will just need better decoders, higher issue width and wider execution units, like I said before. That's infinitely better than breaking the architecture for that one application where it could add a few percent of performance.

Quote

Revisionist kids these days... :)
Don't mock me. I'm only giving explanations, not excuses. I too wish that x86 were a cleaner architecture, or that something superior took over. For the time being we're just going to have to live with it and understand its limitations so we still get the best out of it.

Quote

You know, there was a (computing) world before Intel, and it will still be there when everybody has forgotten about them.
Indeed I know.

Quote

Repeat after me: stack-based FPUs stink, they never ever made sense. Ever.
Early compiler technology was stack based for the most part, so the FPU's register stack made it easier to write compilers. Nowadays it makes no sense to write FPU code manually; compilers generate perfect code, so why bother. Also, fxch has zero latency, so it's pretty much like accessing a flat register file anyway.

Quote

Again, it might come as a surprise, but others got it mostly right: http://en.wikipedia.org/wiki/Motorola_68882
That's not a surprise at all, considering it was designed 10 years later.

Quote

Now, rather than rephrasing what the marketing department wrote, get the story about IEEE 754 from the Man himself. Please. And remove those rosy tinted glasses.
http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
I have already read that article. Not so long ago, actually: I was looking for a way to detect problems caused by NaNs, remember. The answer is not in the article, but it was an interesting read. But what's your point? It's Intel who initiated the standard. It would be foolish to ignore their influence on the digital world.

As a matter of fact I got cool new glasses last Saturday, black with copper orange. Not rosy tinted though, crystal clear.

### #24Reedbeta

DevMaster Staff

• 5305 posts
• LocationBellevue, WA

Posted 25 April 2006 - 06:40 AM

Guys,

This is rapidly becoming pointless. Let's stay on the topic please.

### #25bramz

Valued Member

• Members
• 189 posts

Posted 25 April 2006 - 11:35 AM

So, I did a quick test on the THD, and this is what I've found:

taylor: 2.4%
parabolic: 3.8%
extra precision: 0.078%

So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the Taylor one. That's of course because you're making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it's continuous.

The parabolic one also has only odd harmonics (k = 2n+1), while Taylor has both even and odd harmonics. The latter, however, might be in favour of the Taylor series when it comes to psychoacoustics (how we experience the signal when we hear it), because the third harmonic gives a 'covered' sound, while having the 2nd, 3rd, 4th, ... all together gives a 'brassy' sound. It's a bit like transistors vs tubes...

It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD.

Bramz

BTW, that other one you proposed, 0.976 x + 0.0683 x^2 - 0.241 x^3 + 0.0384 x^4: its THD is 0.32%.
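(For reference, a THD figure like the ones above can be measured along these lines. This is a hypothetical sketch, not the harness I actually used; it takes THD = sqrt(sum of b_k^2 for k >= 2) / |b_1|, computing the Fourier coefficients b_k of the approximation by numerical integration over one period.)

#include <cmath>
#include <cstdio>

// Approximation under test: the quartic above, valid on [0, pi],
// extended to [-pi, pi] by odd symmetry.
static double approx_sin(double x)
{
    const double a = std::fabs(x);
    const double y = 0.976 * a + 0.0683 * a * a - 0.241 * a * a * a
                   + 0.0384 * a * a * a * a;
    return x < 0.0 ? -y : y;
}

int main()
{
    const double pi = 3.14159265358979323846;
    const int N = 1 << 16;    // integration samples over one period
    const int K = 32;         // harmonics to accumulate
    double b[K + 1] = { 0.0 };
    for (int i = 0; i < N; ++i)   // midpoint rule: b_k = (1/pi) * integral of f(x)*sin(k*x) dx
    {
        const double x = -pi + (i + 0.5) * (2.0 * pi / N);
        const double f = approx_sin(x);
        for (int k = 1; k <= K; ++k)
            b[k] += f * std::sin(k * x) * (2.0 / N);
    }
    double harmonics = 0.0;
    for (int k = 2; k <= K; ++k)
        harmonics += b[k] * b[k];
    std::printf("THD = %.3f%%\n", 100.0 * std::sqrt(harmonics) / std::fabs(b[1]));
    return 0;
}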

### #26siddord

New Member

• Members
• 23 posts

Posted 25 April 2006 - 12:01 PM

This is definitely the most interesting and detailed post I have seen on any forum :)

I'm going to try this out myself and will keep you posted if I get something for the other functions :)

### #27Nick

Senior Member

• Members
• 1227 posts

Posted 25 April 2006 - 12:59 PM

bramz said:

So, I did a quick test on the THD, and this is what I've found:

taylor: 2.4%
parabolic: 3.8%
extra precision: 0.078%

So, you can see that your version with extra precision performs much better than the others. Your parabolic version is actually worse than the Taylor one. That's of course because you're making rather large errors over almost the complete domain. Taylor only gets (very) bad towards pi. There is however one major fact in favour of the parabolic one: it's continuous. It also has only odd harmonics (k = 2n+1), while Taylor has both even and odd harmonics.
Thanks! :yes: THD really is an interesting error measure. Looking at the graphs it makes sense that the single parabola is worse than the Taylor series, except towards x = pi. Anyway, the Taylor series is totally useless compared to the squared parabola method. :happy:

Quote

It might be interesting to take the general formula with four coefficients, and to optimize it for minimal THD.
I'll give that a try when I find the time.

### #28tbp

Valued Member

• Members
• 135 posts

Posted 25 April 2006 - 07:35 PM

Nick said:

I already know how Newton-Raphson refinement works and the related 3DNow! instructions; I use it myself. Now what's your point?
My point was that not providing instructions to refine results with SSE was a mistake, and that, as AMD demonstrated in 1998, it was possible to provide that kind of functionality on x86 in an elegant and effective fashion; see the posted code up there, where it takes 8 ops and 2 memory accesses to get within a couple ulp over a restrained domain. That doesn't sound like "really the best possible" in my book.

While it makes me all warm inside to know you've read all the books - "la chair est triste, hélas! et j'ai lu tous les livres" ("the flesh is sad, alas! and I have read all the books") -, written awesome software and are still on track to cure cancer, I was kinda expecting after 3 posts that you would answer the point as opposed to waving hands and making broad claims.

Nick said:

<...>
There is nothing we can change about the masking sequence. But that's not the actual problem anyway. Future x86 CPUs will just need better decoders, higher issue width and wider execution units, like I said before. That's infinitely better than breaking the architecture for that one application where it could add a few percent of performance.
Which can be summed up as: let's comfortably forget there was nothing to break when SSE was introduced (introduction, as in it didn't exist before) and keep faith in Moore's law to save us all.

By the way, I'm glad you decreed that wasn't a problem, but could you convince my current CPU to comply with your ruling? Thank you.

Nick said:

Early compiler technology was stack based for the most part, so the FPU's register stack made it easier to write compilers.
Eh? I must be living in an alternate reality.
Oh and fxch was made almost 'free' by the time of the P3, that is a couple of decades later.

Nick said:

Then, why do you say "they've created it"? It makes no sense.
x87 wasn't the first FPU ever designed, and as a matter of fact there was no PC to speak of before some wacko at IBM picked the worst processor design of the time back in 1980.

No, I'm not frustrated. No, I'm not barking. I just find it incredibly funny that one could put "they know what they're doing" and Intel in the same sentence, given their track record. Circumstances & fortunate events paving the way for a monopoly in a given sector are not to be confused with engineering prowess.
Note that I'm not saying they systematically mess it up. I'm just asking for some critical judgement.

So please compare, gah, 4 iterations of SSE to 3DNow!, AltiVec etc etc...
You could start here: http://www.eecg.toro.../svx/index.html

On that note, since the Moore's law argument was made, there's no need for further discussion really. Forgive me if I selectively remember your code and not your amusing remarks about books you've read or courses you've taken.

### #29Groove

New Member

• Members
• 26 posts

Posted 26 April 2006 - 05:45 PM

We could also use a Taylor series for the exp function.

I did it for my math library (http://glm.g-truc.net) and the functions are quite efficient but low precision. My functions are built to use 'half precision'.

Quote

inline float fast(float x)
{
    // The Horner form below looks nicer and performs the same in release
    // mode, but it's slower in debug mode:
    // return 1.0f + x * (1.0f + x * 0.5f * (1.0f + x * 0.3333333333f * (1.0f + x * 0.25f * (1.0f + x * 0.2f))));
    float x2 = x * x;
    float x3 = x2 * x;
    float x4 = x3 * x;
    float x5 = x4 * x;
    return 1.0f + x + (x2 * 0.5f) + (x3 * 0.1666666667f) + (x4 * 0.041666667f) + (x5 * 0.008333333333f);
}

I also tried it with the log function; the result was good, but VC7.1's is really faster.

It also works well for the asin, acos, atan and tan functions:

Quote

inline float fastAsin(float x)
{
return x + (x * x * x * 0.166666667f) + (x * x * x * x * x * 0.075f) + (x * x * x * x * x * x * x * 0.0446428571f) + (x * x * x * x * x * x * x * x * x * 0.0303819444f);// + (x * x * x * x * x * x * x * x * x * x * x * 0.022372159f);
}

Quote

inline float fastAcos(float x)
{
return 1.5707963267948966192313216916398f - glm::fastAsinGTX(x); //(PI / 2)
}

Quote

inline float fastAtan(float x)
{
return x - (x * x * x * 0.333333333333f) + (x * x * x * x * x * 0.2f) - (x * x * x * x * x * x * x * 0.1428571429f) + (x * x * x * x * x * x * x * x * x * 0.111111111111f) - (x * x * x * x * x * x * x * x * x * x * x * 0.0909090909f);
}

Quote

inline float fastTan(float x)
{
return x + (x * x * x * 0.3333333333f) + (x * x * x * x * x * 0.1333333333333f) + (x * x * x * x * x * x * x * 0.0539682539f);
}
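By the way, a Horner rearrangement of fastAtan uses the same coefficients with fewer multiplications (a sketch; fastAtanHorner is just an illustrative name, and like the series above it's only accurate for small |x|):

inline float fastAtanHorner(float x)
{
    // same truncated series as fastAtan above, evaluated in Horner form
    const float x2 = x * x;
    return x * (1.0f + x2 * (-0.333333333333f + x2 * (0.2f + x2 * (-0.1428571429f
             + x2 * (0.111111111111f + x2 * -0.0909090909f)))));
}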

I have tried SSE instructions before, but I didn't know this 'optimisation', so I'm very interested :p

### #30Nick

Senior Member

• Members
• 1227 posts

Posted 28 April 2006 - 01:11 PM

Quote

Note that I'm not saying they systematically mess it up. I'm just asking for some critical judgement.
I think this is the essential part of our discussion. I fully agree that x86 ain't perfect. SIMD performance is definitely below par compared to several other CPUs. So yes I'm critical about that.

Despite that, these processors are cheap and they are everywhere. You may regret that, but frankly I don't care much how this came to be. I just want to get the maximum out of them. There's little point complaining about lack of instructions or the 'right' register layout. Heck, most people don't even know assembly; they only care about 'C++' performance. And that's understandably what Intel focuses on: including mainly (but not exclusively) instructions that can be used by compilers, fixing old mistakes (fxch for free), improving jump prediction, removing the decoder bottleneck (trace cache), doubling integer throughput (rapid execution engine), Hyper-Threading, etc. And considering that developing a new architecture takes 4-6 years and they're limited by the legacy instruction set, I still think they're doing the best possible. It's always easy to criticise afterwards...

Apple's switch from IBM PowerPC to Intel also sounds to me like an admission that Intel produces great processors. They even claim it's 2x faster. And they must be looking forward too, knowing that Core will be very cost-effective. They're not switching for no reason. And I'm sure AMD is working on a great new architecture as well. Reverse Hyper-Threading is a great idea, and an integrated memory controller for XDR could blow away Intel's memory performance.

So I guess I'm a realistic optimist. x86 still has great potential and clearly the instruction set doesn't matter all that much. Both Intel and AMD could beat AltiVec if they wanted to. They're just waiting for the right moment to make this economical, which might be sooner than expected...

### #31geon

Senior Member

• Members
• 939 posts

Posted 28 April 2006 - 05:38 PM

tbp & Nick:

If you'll forgive me for disturbing while the masters are talking, and for being completely off topic...

First of all, I have never done any asm coding myself, and I am by no means literate in low-level processor architecture. But I am interested!

What do you think of the Cell architecture for low-level coding? Will it be a nightmare to work with, with a totally messed-up design incorporating every flaw you could make... Or will it be pure pleasure, with total control and possibilities for writing high-performing code?

It seemed neat to me, although I wouldn't really know.

### #32Nick

Senior Member

• Members
• 1227 posts

Posted 28 April 2006 - 07:01 PM

geon said:

What do you think of the Cell architecture for low-level coding? Will it be a nightmare to work with, with a totally messed-up design incorporating every flaw you could make... Or will it be pure pleasure, with total control and possibilities for writing high-performing code?
Hi geon,

It's nice to hear you have an interest in processor architecture!

But I'm not sure if this thread is the ideal place to discuss it. It would be great if you could start a topic about it in the General Development forum though. :yes: I don't know all that much about Cell myself but I'll do some research and hopefully we can have an interesting discussion there.

Cheers,

Nick

P.S.: Just a thought: a 'Hardware' forum could actually be useful. It would be the place where we can discuss the low-level stuff that has little to do with programming itself but can be essential for creating great games (like looking ahead at what will be possible with future CPUs and GPUs). Anyone feel the same, or is the current list of forums more than sufficient?

### #33Reedbeta

DevMaster Staff

• 5305 posts
• LocationBellevue, WA

Posted 29 April 2006 - 02:07 AM

I don't know how many people besides yourself would ever post there.

### #34davepermen

Senior Member

• Members
• 1306 posts

Posted 29 April 2006 - 03:37 AM

i could.. i would.. i guess.. then again, it's .. 05:37 and i'm sitting in the club right now.. so i have no clue.. really..

### #35Nils Pipenbrinck

Senior Member

• Members
• 597 posts

Posted 29 April 2006 - 04:29 PM

tbp said:

Eh? I must be living in an alternate reality.
Oh and fxch was made almost 'free' by the time of the P3, that is a couple of decades later.

fxch was already almost free on the first-generation Pentium. As far as I remember, it was the only float instruction that could execute in the V-pipeline, making it a practically free instruction for any sequence of FPU instructions.

Later on (P2) they broke up the fixed U/V-pipeline design and made fxch truly free. It was more of a prefix-kind instruction (although it could still consume a cycle or two in extreme cases).

Anyway, I have to admit, writing stack-based FPU code was never easy, even with fxch. On the other hand, one could outperform the compilers a lot back in those days; that was fun.

### #36zavie

Member

• Members
• 91 posts

Posted 16 May 2006 - 11:15 AM

Hi Nick,

Thank you for your very interesting article.
I have a question though: you say Taylor behaves very nicely up to pi/2. Since the shape over the range [0; pi/2] is enough to reconstruct any part of sine or cosine, what about using only this part?

I would be interested in seeing a fairer comparison. :-)
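To illustrate what I mean, the folding could look like this (just a sketch; sin_folded is a made-up name and approx is any approximation that's only valid on [0; pi/2]):

#include <cmath>

// Fold any angle into [0, pi/2] using the symmetries of sine, then call
// an approximation that is only valid on that range.
float sin_folded(float x, float (*approx)(float))
{
    const float pi = 3.14159265f;
    x = std::fmod(x, 2.0f * pi);                           // wrap into (-2pi, 2pi)
    if (x < 0.0f)      return -sin_folded(-x, approx);     // sin(-x)     = -sin(x)
    if (x > pi)        return -sin_folded(x - pi, approx); // sin(x + pi) = -sin(x)
    if (x > pi * 0.5f) x = pi - x;                         // sin(pi - x) =  sin(x)
    return approx(x);
}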

### #37Nick

Senior Member

• Members
• 1227 posts

Posted 16 May 2006 - 01:54 PM

zavie said:

I have a question though: you say Taylor behaves very nicely up to pi/2. Since the shape over the range [0; pi/2] is enough to reconstruct any part of sine or cosine, what about using only this part?

I would be interested in seeing a fairer comparison. :-)
They are all polynomials, they just have different coefficients. The Taylor series is a way to approximate a function around a certain point (0 in this case), but the coefficients it produces are not ideal for the best approximation over a range ([-pi, pi] in this case).

Using only the [-pi/2, pi/2] range, the Taylor series is more accurate. But that's not too surprising, since it uses more terms, making it slower to compute. Using a generic polynomial with the same number of terms and minimizing the error yields even better results. So again the Taylor series loses.

In fact the Taylor series starts from the same generic polynomial, but instead of using constraints that minimize the error over a range, it requires that the 0th, 1st, ..., nth derivatives at 0 are equal to those of the function being approximated.
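To make that concrete, here's a small comparison sketch. The 0.16605 / 0.00761 pair is a classic range-fitted set of coefficients I'm quoting from memory, so treat them as illustrative rather than optimal:

#include <cmath>
#include <cstdio>

int main()
{
    const double pi = 3.14159265358979323846;
    double errTaylor = 0.0, errFitted = 0.0;
    for (int i = 0; i <= 10000; ++i)
    {
        const double x = -pi / 2.0 + i * (pi / 10000.0);
        const double t = x - x * x * x / 6.0 + x * x * x * x * x / 120.0;        // Taylor around 0
        const double m = x - 0.16605 * x * x * x + 0.00761 * x * x * x * x * x;  // fitted to the range
        errTaylor = std::fmax(errTaylor, std::fabs(t - std::sin(x)));
        errFitted = std::fmax(errFitted, std::fabs(m - std::sin(x)));
    }
    std::printf("max error, Taylor: %g\n", errTaylor);  // error piles up at the endpoints
    std::printf("max error, fitted: %g\n", errFitted);  // error spread over the whole range
    return 0;
}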

### #38bramz

Valued Member

• Members
• 189 posts

Posted 16 May 2006 - 06:06 PM

Nick,

did you have a chance to optimize for minimal THD yet?

Bramz

### #39jirzynek

New Member

• Members
• 15 posts

Posted 01 June 2007 - 12:59 PM

A few months ago I implemented the method described by Nick using fixed-point math only. The routine works fine for me and maybe someone else will find it useful, so I've decided to post it here :)


fp16 ne3d_sintp_02pi(fp16 angle)
{
    fp16 Xi2, Xi3, Yi, Yi2, Yi3, Yi4;

    angle = angle >> 1;
    Xi2 = ((angle >> 1) * (angle >> 2)) >> 13;
    Xi3 = (6640 * Xi2) >> 12;
    Yi  = (((20860 * angle) >> 13) - Xi3) >> 1;
    Yi2 = (Yi * Yi) >> 14;
    Yi3 = (14746 * Yi2) >> 16;
    Yi4 = (25395 * Yi) >> 14;
    Yi  = Yi4 + Yi3;
    return Yi;
}
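The post doesn't define fp16, so here is some hypothetical glue assuming a 16.16 fixed-point value in a 32-bit int (the shifts inside the routine suggest mixed intermediate formats, so the exact input scaling may differ):

#include <stdint.h>

typedef int32_t fp16;   // assumption: 16.16 fixed point

static fp16  fp16_from_float(float f) { return (fp16)(f * 65536.0f); }
static float fp16_to_float(fp16 v)    { return (float)v / 65536.0f; }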




### #40iMalc

New Member

• Members
• 11 posts

Posted 07 August 2007 - 06:42 AM

It might be interesting to see whether an even better 4th-degree polynomial approximation can be obtained, either by brute-force trying of combinations or by using a genetic algorithm of some kind.
