Optimization

tobeythorn 101 Aug 08, 2011 at 02:54

I’ve heard the advice many times to not bother with low level optimization. In the past, I’ve taken the position that clarity and organization of code and algorithm design is far more important. However, recently I’ve been working on a project that involves many-body interactions where performance is already starting to be an issue.

I spent a good half a day exploring and benchmarking various ways to make the functions for those interactions faster (they are called many millions of times per second). For example, I tried the fast inverse square root function that was used in Quake, and the fast sine/cosine functions that Nick shared with us here. I also tried several fast 3D distance approximations of varying crudeness.

Each fast math function was tested in isolation or near isolation, called 10 million times, and compared. The results surprised me. A few of the functions were slightly faster than their “slow” counterparts, and quite a few of the “fast” functions were slower. Supposing I haven’t made some stupid systematic mistake in my benchmarks, it seems that the GCC compiler produces really fast code on its own. I’m wondering if you have had similar experiences. Has anyone else actually benchmarked Nick’s fast sine and cosine code?

35 Replies


rouncer 103 Aug 08, 2011 at 04:17

What are you actually doing, an atomic reaction simulator? :) (Just jesting, this is way out of my league hehe.)
Reedbeta, delete my post.

tobeythorn 101 Aug 08, 2011 at 05:20

@Rouncer, your post got deleted? My project isn’t anything exotic, just a simulation of many interacting “people”. The AI for each “person” will (if I can work it out) be fairly computationally expensive. At the moment, I just have a bunch of cubes with collision forces between them, drag forces, and some simple biased Brownian motion to make them swarm. 1000 of them on my single core machine runs 20-25 frames a second. As mundane as that sounds, that’s already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.

Nick 102 Aug 08, 2011 at 09:01

@tobeythorn

I’ve heard the advice many times to not bother with low level optimization.

That is totally wrong advice. You should not bother with low level optimization, 90% of the time. That 10% is critical for the overall performance of your application. The reason some people have stopped bothering about performance altogether, is that other libraries (and drivers) take care of some of the low-level work.

Has anyone else actually benchmarked Nick’s fast sine and cosine code?

I have. :D It’s the fastest algorithm I know of. Make sure you’re not using any branch operations though.

Have you considered SIMD code (SSE/AVX)? Future x86 CPUs will feature AVX2 with gather and FMA support.

Nick 102 Aug 08, 2011 at 09:03

@tobeythorn

1000 of them on my single core machine runs 20-25 frames a second. As mundane as that sounds, that’s already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.

Sort the objects in the x, y and z direction, to quickly determine which potentially collide.
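
For what it’s worth, a minimal sketch of that sort-and-sweep idea along one axis; the Body struct and field names are assumptions, not code from this thread:

// Sketch: 1D sort-and-sweep broad phase. Sort by the left edge of each body's
// x-interval, then only pairs whose x-intervals overlap become candidates.
#include <algorithm>
#include <utility>
#include <vector>

struct Body { float x, y, z, radius; };

void broadPhaseX(const std::vector<Body>& bodies,
                 std::vector<std::pair<int, int>>& candidatePairs)
{
    std::vector<int> order(bodies.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return bodies[a].x - bodies[a].radius < bodies[b].x - bodies[b].radius;
    });

    for (size_t i = 0; i < order.size(); ++i)
    {
        const Body& a = bodies[order[i]];
        for (size_t j = i + 1; j < order.size(); ++j)
        {
            const Body& b = bodies[order[j]];
            if (b.x - b.radius > a.x + a.radius)
                break;                     // later bodies start even further right
            candidatePairs.push_back(std::make_pair(order[i], order[j]));
        }
    }
}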

JarkkoL 102 Aug 08, 2011 at 12:12

When designing efficient algorithms, you should consider both big-O notation and the low-level details of the algorithm. While a faster sqrt/sin/cos may result in some improvement (maybe even a few times the performance, if the algorithm is very heavy on those operations), big-O and good cache coherency can result in several orders of magnitude of improvement in comparison. Those are the big fish you should try to catch first, because other optimizations pale in comparison and may be completely useless if you have to change the entire algorithm for the sake of better big-O and cache coherency.

It sounds to me like you are struggling because of bad big-O, and focusing on faster sqrt/sin/cos implementations will likely yield only quite modest improvements. You should try to figure out how you could improve the algorithm at a high level instead. For example, do you have to consider all objects for each object, or is it enough to use only the closest objects? You mentioned space partitioning for dynamic objects, so could you put the objects into a grid, for example, for faster queries and updates (see the sketch below)? Furthermore, since we are talking about a simulation, could you exploit frame-to-frame coherency somehow?
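
As a rough illustration of the grid idea, a minimal sketch that hashes objects into uniform cells; the Object struct, cell size and hash constants are assumptions, not code from this thread:

// Sketch: hash objects into a uniform grid so neighbour queries only have to
// look at nearby cells. Everything here (Object, cellSize, hash constants) is
// illustrative only.
#include <cmath>
#include <unordered_map>
#include <vector>

struct Object { float x, y, z; };

struct UniformGrid
{
    float cellSize;
    std::unordered_map<long long, std::vector<int>> cells;

    long long key(float x, float y, float z) const
    {
        long long ix = (long long)std::floor(x / cellSize);
        long long iy = (long long)std::floor(y / cellSize);
        long long iz = (long long)std::floor(z / cellSize);
        return (ix * 73856093LL) ^ (iy * 19349663LL) ^ (iz * 83492791LL);
    }

    void clear() { cells.clear(); }                    // rebuild every frame
    void insert(int id, const Object& o) { cells[key(o.x, o.y, o.z)].push_back(id); }
};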

Cheers, Jarkko

tobeythorn 101 Aug 08, 2011 at 15:41

Thanks all for your replies.

@Nick,
How much faster did you find your sine and cosine to be? I’ll give your functions another look.

I’ve never tried writing assembly before, but I would consider it. My machine is an old AMD64 with SSE2.

I really like your dimensional sorting idea. It would avoid a lot of the conditional statements (which in my experience are quite slow) that a space-partitioning tree would require, and it could maintain some coherency from frame to frame and not need too much resorting. I will definitely give that a shot.

@Jarkkol,
You’re absolutely right about the importance of making my algorithm more efficient. There are still things I need to explore in that regard. This is the first time I’ve delved into a low-level approach, and I was surprised at just how ineffective it was. Still, at the end of the day, I’m also going to be calling a handful of math functions a very large number of times, so at some point, faster math functions would be nice.

alphadog 101 Aug 09, 2011 at 18:07

It’s not “don’t bother with low level optimization”, it’s “don’t bother with low level optimization first, above all or obsessively in all parts of your app.”

You should code iteratively, refactoring sections for performance only after benchmarking your whole program shows it to be necessary.

The raison d’etre for that motto is that many coders are perfectionists when it comes to their code. Some of us will spend days optimizing the number of angels on the head of a pin if given half the chance… B)

So, you tested each function with 10 million calls. The question I’d ask is will your app call that function 10 million times in such a way that the difference in timings would affect the end goal of the app?

__________Smile_ 101 Aug 09, 2011 at 22:30

You should try some spatial subdivision algorithms first before going to low-level optimization. An ASM-written bubble sort will be much slower than quick/heap/tree/subdivision sorting written in C/C++.

rouncer 103 Aug 09, 2011 at 23:14

If you run on the GPU, it’s going to go faster, too.

tobeythorn 101 Aug 10, 2011 at 03:21

@Nick,
After carefully testing your fast sine function again (the C version, not the assembly version), it does seem to be about twice as fast as the standard function. =)

So I started implementing your dimensional sorting suggestion. Using insertion sort to sort the objects in each dimension is super fast since there’s good coherency between frames. The idea was then to take each object and use the sorted lists to figure out which other objects are nearby in each dimension. Across the 3 dimensions there would be some duplicate nearby objects, so I’d cull those and then finally do interactions for what’s left over.
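
The insertion sort mentioned above is close to O(n) when almost nothing changes order between frames; a minimal sketch (the Entry struct is assumed):

// Sketch: insertion sort of objects by x coordinate. With good frame-to-frame
// coherency the list is already nearly sorted, so very few shifts happen.
#include <vector>

struct Entry { int id; float x; };

void insertionSortByX(std::vector<Entry>& v)
{
    for (size_t i = 1; i < v.size(); ++i)
    {
        Entry e = v[i];
        size_t j = i;
        while (j > 0 && v[j - 1].x > e.x)
        {
            v[j] = v[j - 1];   // shift larger entries one slot to the right
            --j;
        }
        v[j] = e;
    }
}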

However, dimensional sorting doesn’t seem to work if my people/objects are different sizes, which is desirable for my project. Still thinking about a workaround…

Further, for each dimension (say x), each object potentially collides with all the objects inside a slice in the other dimensions (all the objects within collision range in x, but with any y or z). This is really expensive if the object density is high. It isn’t an issue for a 1D system, and less of one for a 2D system, but it seems like it would be crippling in 3D. Is there something I’m missing?

When I get a chance I may finish up the sort thing and get some real numbers anyway.

@alphadog
Well, I’ve already established that yes, certain functions are being called upwards of 10 million times a second. Even if I can reduce the number of interaction checks with spatial partitioning, those functions will still be called a lot. For example, I just benchmarked my collision algorithm, and commenting out its contents doubles the performance of my application. Performance is OK now, but once I add artificial intelligence for each object, it’s going to get slow.

@Smile,
The idea of writing a sort-algorithm in assembly never even crossed my mind. I don’t think I will try it. Do you have a suggestion for a spatial subdivision algorithm that works well for many dynamic objects? My intuition says the cost would outweigh the benefit because of all the updating, but I’m not so experienced with these things.

---- other stuff ----
Adding an initial box-test to my collision algorithm boosted performance by about 15%, which is not groundbreaking, but nice for 3 extra lines of code.
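
That box test is presumably something along these lines; the struct and field names here are assumed, not tobeythorn’s actual code:

// Sketch: axis-aligned box early-out before the expensive collision force.
struct Bounds { float x, y, z, halfSize; };

inline bool boxesOverlap(const Bounds& a, const Bounds& b)
{
    float r = a.halfSize + b.halfSize;
    if (a.x - b.x > r || b.x - a.x > r) return false;
    if (a.y - b.y > r || b.y - a.y > r) return false;
    if (a.z - b.z > r || b.z - a.z > r) return false;
    return true;   // only now run the full force calculation
}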

I also benched the circular, doubly-linked list that I was storing my objects/people in versus ordinary arrays. Iterating through the objects for interactions and rendering is only between 0 and 20% faster (total application performance) with arrays. I’ll stick with the convenience of the linked-list for now.

Nick 102 Aug 10, 2011 at 07:49

@JarkkoL

While a faster sqrt/sin/cos may result in some improvement (maybe even a few times the performance, if the algorithm is very heavy on those operations), big-O and good cache coherency can result in several orders of magnitude of improvement in comparison. Those are the big fish you should try to catch first…

I don’t think you should perform one or the other optimization first. Algorithmic optimization may result in a significant performance improvement, but there are very few guarantees. I’ve known cases where people rewrote an entire library over the course of months, only to find out it was all in vain and they could have concentrated their effort elsewhere…

It’s really a cost/gain/risk analysis. Sometimes low-level optimizations are quite straightforward and after implementing them you get a much clearer picture of the algorithmic complexity. At other times it’s obvious that first rewriting things in assembly and then trying to optimize at a higher level is asking for trouble.

The simple truth is that there are no golden hammers in software optimization. You always have to evaluate every approach, and there’s no substitute for prototyping and profiling.

Nick 102 Aug 10, 2011 at 08:35

@rouncer

If you run on the GPU, it’s going to go faster, too.

There’s no guarantee of that. The brute-force approach may run faster on a powerful GPU, but with spatial subdivision the CPU is a lot more efficient. Also note that the round-trip communication overhead can make GPGPU processing inferior.

And with AVX2 future CPUs will become a lot more powerful at throughput computing so you don’t have to make any compromises.

geon 101 Aug 10, 2011 at 12:41

“that’s already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.”

Well, I guess you just compare all pairs:

1000^2 persons / 2 = 500,000 pairs

With spatial subdivision into 10 * 10 cells and a uniform distribution (the ideal case, I know, but still), you get:

(1000 persons / 10^2 cells)^2 * 10^2 cells / 2 = 5,000 pairs

An improvement of two orders of magnitude. That should cover the added complexity.

edit: Messed up the last formula.

tobeythorn 101 Aug 10, 2011 at 15:03

@geon,
You make a convincing argument.

tobeythorn 101 Aug 18, 2011 at 00:00

@geon,
I implemented and tested the uniform spatial partitioning method you suggested. With 1000 uniformly sized, randomly moving objects, I tuned the partition size for best performance and required area. The best grid had 20x20 partitions, where each partition was 4 times as wide as an object.

The partitioning method performs about twice as fast as the N*(N-1)/2 method (total application performance). Certainly, this is a decent gain, but not nearly as high as you predicted, and it comes at some cost in code clarity.

Determining interactions requires higher complexity than your calculation predicts:

numberOfCells * averageNumberOfCollisionsPerCell * averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated
= 400 * ~(1000/400)^2 / 2 * ~50 ≈ 60,000

an order of magnitude slower than your estimate.

The various ideas here have got me thinking. There’s another partitioning strategy I thought up today that should be faster, and require no boundaries. I will report back when that’s done.

Stainless 151 Aug 24, 2011 at 12:29

I’m unclear about what you are trying to do.

You talk about people, then talk about 3d.

People cannot fly, so taking a projection through the view volume simplifies your collision handling, i.e., a plan view of the volume.

If you have some kind of terrain with massive vertical displacements, you can then cull the 2d collisions based on a difference in the vertical displacement squared.

In general you should never need a square root in collision detection; it is only needed once you decide a collision has occurred and you move into collision resolution.
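
For example, a sphere-style overlap test only needs squared quantities (a trivial sketch):

// Sketch: detection with no square root; compare squared distance against the
// squared sum of radii.
inline bool spheresOverlap(float dx, float dy, float dz, float radiusSum)
{
    float distSq = dx * dx + dy * dy + dz * dz;
    return distSq <= radiusSum * radiusSum;
}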

If the “people” are of more or less a uniform size then you can treat the plan of the volume as a grid and then you have loads of techniques you can use to reduce the computational tasks.

You can treat the grid as a texture and move nearly everything into the gpu.

Use image recognition tricks

quadtrees

lots of things.

Can you give a better description of what you are trying to do?

Then maybe we can give targeted advice rather than general case advice.

tobeythorn 101 Aug 24, 2011 at 17:21

@Stainless,
I’m building a collision system for a game that will involve a large number of both movable objects and moving “beings”. Beings will come in various sizes. Although I have not made a final decision regarding terrain, my plan at this point is for the terrain to be a roughly spherical world (like in the game Populous). Regardless of whether I go that route or with a more planar terrain, I have a lot of heterogeneous moving objects that need to collide with each other in either 2 or 3 dimensions, and soon they will also need to have AI interactions between objects and beings in proximity.

@everyone
Since my creatures interact from a more macroscopic perspective, I’ve decided to do force-based interactions. I’m not doing traditional rigid-body collision detection followed by collision response. My collision forces are based on a fast s-curve formula, but they also involve a single vector unitization (normalization), which means a square root. Bounding-box and distance-squared checks are used to cull non-collisions first.
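
Something in this spirit, sketched with a placeholder smoothstep-style falloff rather than the actual s-curve formula from the project:

// Sketch: force-based interaction. Cull with a squared-distance test, then pay
// for a single square root to get the unit direction. The falloff curve below
// is a placeholder, not the formula from the thread.
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 collisionForce(const Vec3& a, const Vec3& b, float range, float strength)
{
    Vec3 d = { b.x - a.x, b.y - a.y, b.z - a.z };
    float distSq = d.x * d.x + d.y * d.y + d.z * d.z;
    if (distSq >= range * range || distSq == 0.0f)
        return Vec3{ 0.0f, 0.0f, 0.0f };            // culled: out of range (or coincident)

    float dist = std::sqrt(distSq);                 // the single square root
    float inv  = 1.0f / dist;                       // unitize the direction
    float t    = dist / range;                      // 0 at contact, 1 at the edge of range
    float s    = strength * (1.0f - t * t * (3.0f - 2.0f * t));

    return Vec3{ -d.x * inv * s, -d.y * inv * s, -d.z * inv * s };  // pushes a away from b
}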

I’ve spent the past week trying out different partitioning strategies to find something that works well for my application. I just roughed out a new partitioning scheme this morning and the results look great. You guys were right: the performance advantage (of the right partitioning method) is indeed well beyond what any math optimization could accomplish.

Basically, I take all my objects and find the center-of-position (i.e., the unweighted center-of-mass) in the dimension where the objects are most spread out. Then I split at the center-of-position into two partitions, and put any objects that lie on the partition edge into a separate bin. Doing this recursively, the partitions quickly get small. There is no penalty for objects moving or for what order they are in, there are no limits to how spread out the objects can be, and this strategy finds good split planes on its own. It’s also quite fast: 4000 moving objects (brute force, that would be an impossible 8 million collision checks per frame) run at playable frame rates with just 8 smallest-level partitions. I haven’t done any tuning yet for what the best number of partition levels is.
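
A rough sketch of that recursive splitting scheme, simplified to point objects and without the separate bin for objects that straddle the split plane:

// Sketch of the recursive split described above, simplified to points:
// pick the axis with the largest spread, split at the mean coordinate
// (centre of position), recurse.
#include <vector>

struct Point { float p[3]; };

void splitRecursive(const std::vector<Point>& pts, int depth,
                    std::vector<std::vector<Point>>& leaves)
{
    if (depth == 0 || pts.size() <= 1) { leaves.push_back(pts); return; }

    float lo[3]   = {  1e30f,  1e30f,  1e30f };
    float hi[3]   = { -1e30f, -1e30f, -1e30f };
    float mean[3] = { 0.0f, 0.0f, 0.0f };
    for (const Point& q : pts)
        for (int a = 0; a < 3; ++a)
        {
            if (q.p[a] < lo[a]) lo[a] = q.p[a];
            if (q.p[a] > hi[a]) hi[a] = q.p[a];
            mean[a] += q.p[a];
        }

    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (hi[a] - lo[a] > hi[axis] - lo[axis]) axis = a;
    float split = mean[axis] / (float)pts.size();   // centre of position

    std::vector<Point> left, right;
    for (const Point& q : pts)
        (q.p[axis] < split ? left : right).push_back(q);

    splitRecursive(left, depth - 1, leaves);
    splitRecursive(right, depth - 1, leaves);
}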

Thank you all for your advice and ideas!

geon 101 Aug 24, 2011 at 18:39

@tobeythorn

It is possible you have too many cells in your partition. There is always an optimal number of cells, depending on the implementation, the distribution of your objects and the number of objects.

Also, what is “averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated”?

tobeythorn 101 Aug 25, 2011 at 00:07

@geon,
What I mean is that with the scheme I’m using, the algorithm does a nice job of choosing how to split partitions, and I’m going to make it dynamically choose how many times to split each frame for best performance. The only situation this doesn’t work well in is when all objects occupy the same space.

With the previous grid scheme you suggested, the averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated term is there because some objects may intersect more than one grid cell, so each object must store a list of all the objects it has already done collision with, and that list needs to be checked and updated for every collision. I suppose there are other ways of dealing with this.
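
One common way to do that bookkeeping is a single set of already-processed pairs keyed on the two object ids; a sketch, not the code used in the project:

// Sketch: remember which pairs have already been processed this frame, so an
// object stored in several cells isn't paired with the same partner twice.
#include <cstdint>
#include <unordered_set>
#include <utility>

struct PairSet
{
    std::unordered_set<std::uint64_t> done;

    // Returns true the first time the (a, b) pair is seen this frame.
    bool firstTime(std::uint32_t a, std::uint32_t b)
    {
        if (a > b) std::swap(a, b);                        // order-independent key
        std::uint64_t key = (std::uint64_t(a) << 32) | b;
        return done.insert(key).second;
    }

    void clear() { done.clear(); }                         // call once per frame
};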

geon 101 Aug 25, 2011 at 01:10

@tobeythorn

I suppose there are other ways of dealing with this.

I would check all adjacent cells, or keep objects at the border in multiple cells.

tobeythorn 101 Aug 25, 2011 at 03:23

Not quite sure what you mean. I was keeping objects at the border in multiple cells, but the idea was to make sure that I didn’t do collision interactions for the same pair multiple times, so I was having objects remember who they had already interacted with that frame and check that.

Stainless 151 Aug 25, 2011 at 08:39

Sounds like you are close to a good solution.

One thing I would think about doing is changing the coordinate system to a fixed-point one so you can use integer maths.

There is an incredibly fast integer square root algorithm that just needs one loop iteration for every 2 bits of the integer, with a shift and an add in the loop.
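
That is presumably the classic bit-by-bit integer square root; a minimal sketch:

// Sketch: classic bit-by-bit integer square root. One loop iteration per 2 bits,
// only shifts, adds and compares inside the loop. Returns floor(sqrt(n)).
unsigned int isqrt(unsigned int n)
{
    unsigned int root = 0;
    unsigned int bit  = 1u << 30;      // highest power of 4 that fits in 32 bits

    while (bit > n)
        bit >>= 2;

    while (bit != 0)
    {
        if (n >= root + bit)
        {
            n    -= root + bit;
            root  = (root >> 1) + bit;
        }
        else
        {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}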

tobeythorn 101 Aug 26, 2011 at 05:03

Finally got the bugs out of the partitioning system. Partitions are regenerated from scratch each frame and split down until no partition contains more than 100 objects. With 2000 moving and colliding objects in the view of the camera it goes at 30 FPS. The cool thing is that performance scales nicely. For example, 8000 objects runs at about 5-10 FPS depending on whether all 8000 objects are in the view of the camera. Here’s a screenshot.

[screenshot: partitioning.png]

@Stainless,
I’ll keep the integer coordinate idea in mind for when I run into performance issues again and need more speed. I feel comfortable moving on to the AI at this point.

__________Smile_ 101 Aug 26, 2011 at 10:37

100 objects per node seems like too much to me. Did you try to optimize the stopping criterion? If objects have a fixed interaction radius, then a size-based stopping criterion would be better, or a regular grid-based approach with spacing equal to the interaction size.

Mihail121 102 Aug 26, 2011 at 11:14

@Nick

That is totally wrong advice. You should not bother with low level optimization, 90% of the time. That 10% is critical for the overall performance of your application. The reason some people have stopped bothering about performance altogether, is that other libraries (and drivers) take care of some of the low-level work.

No, the real reason is that we have to meet deadlines and the boss is sitting behind my back asking what the hell I am optimizing for. If it does not crash, it’s fine.

tobeythorn 101 Aug 26, 2011 at 14:58

I actually made a mistake: I had it set to a max of 50 objects per partition. There seems to be relatively little difference in performance between having 50 and 2 objects max per partition. I’m not really sure what the reason is at this point. Performance does seem to depend on how dense the objects are, too. This is probably because fewer objects get caught on the split planes when the density is low. In any case, performance is good enough for now.

@Mihail,
There is such a disconnect in perspective between engineers and business people. The business people are sometimes right about keeping priorities on getting a product done, but also don’t seem to care about the quality of a product as long as they think they can sell it.

Stainless 151 Aug 27, 2011 at 09:00

A good programmer knows when to optimise code instead of algorithms.

It should be left to the engineers to decide when it is required, that’s why we have software development bastar… errr … managers.

It is easy to get trapped looking at a loop and trying to work out how to speed it up when a change in algo could mean the loop is removed completely.

These days you really should only look at algorithmic optimisation.

The code produced by a compiler is highly specific to platform and compiler issues. I can look at a block of assembler and tell if it was compiled with visual c++, gcc, or hand coded.

If you start dropping in x86 code, it won’t work on an arm platform, and these days you should not be targeting a single platform.

There are too many devices out there not based on x86 that you would be deserting.

All those sales gone because of a few nanoseconds in a tight loop ?

tobeythorn 101 Sep 01, 2011 at 00:56

Following JarkkoL’s advice to pay more attention to big O, I’ve started measuring big O in my application to see where it’s worst. This has helped me understand inefficiencies better.

For example, I realized that although I was partitioning nodes effectively, I was not dealing with collisions involving objects lying on partition split-planes as efficiently as I could. Objects lying on split-planes were accounting for two orders of magnitude more collision calls than the rest of the objects were. Fixing that inefficiency reduced the total collision calls by a factor of 4. When I started this thread, my application could handle collisions between about 1000 objects before framerates dropped to unacceptable levels. Now it can do 10000!

Also, I installed a profiling plugin. This is the first time I’ve used a profiler, but it is a very nice tool indeed.

Reedbeta 167 Sep 01, 2011 at 01:05

Very nice, Tobey! Out of curiosity, what profiler did you use?

tobeythorn 101 Sep 01, 2011 at 01:24

Code::Blocks has a plugin which uses GNU gprof.

EvilEnt 101 Sep 01, 2011 at 11:08

edit: disregard the stuff below this line, std::abs is even faster

Stumbled on this an hour ago and benchmarked this function:

// As posted; PI is assumed to be defined elsewhere, e.g.:
//     const float PI = 3.14159265f;
// and abs refers to whichever absolute-value routine is being compared
// (abs, fabs, the fastAbs macro below, or std::abs).
float sine(float x)
{
    // Convert the input value to a range of -1 to 1
    x = x * (1.0f / PI);

    // Wrap around
    volatile float z = (x + 25165824.0f);
    x = x - (z - 25165824.0f);

    #if LOW_SINE_PRECISION
        return 4.0f * (x - x * abs(x));
    #else
        float y = x - x * abs(x);

        const float Q = 3.1f;
        const float P = 3.6f;

        return y * (Q + P * abs(y));
    #endif
}

things were 10-15% faster for me (gcc) when replacing abs/fabs with

#define fastAbs(f) ((f<0.0f) ? (-f) : (f))

The inline equivalent was just as slow as fabs or clearing the sign bit.
Note that without compiler optimization, fabs was the faster option.
Also, the results were different.

(Does anyone have a clue how this “Wrap around” works?)

__________Smile_ 101 Sep 01, 2011 at 11:31

@EvilEnt

things were 10-15% faster for me (gcc) when replacing abs/fabs with

Which flags do you use? Also try std::abs instead of abs/fabs.
@EvilEnt

anyone has a clue how this “Wrap around” works?

“x - (x + 25165824 - 25165824)” is something like “x % 2” for float (IEEE 754 single precision) x in range about -8388608..8388607. It is elaborate bit manipulation of internal float representation.
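
A tiny demo of the trick, assuming IEEE 754 single precision and round-to-nearest (the printed values all land in [-1, 1]):

// Demo: adding and subtracting 25165824.0f (1.5 * 2^24) forces x to be rounded
// to a multiple of 2, so x minus that rounded value is roughly "x mod 2".
#include <cstdio>

int main()
{
    float xs[] = { 0.3f, 1.7f, 2.5f, -3.9f, 111111.25f };
    for (float x : xs)
    {
        volatile float z = x + 25165824.0f;      // rounded to a 24-bit mantissa
        float wrapped = x - (z - 25165824.0f);   // x mod 2, in [-1, 1]
        std::printf("x = %12.2f  ->  wrapped = %5.2f\n", x, wrapped);
    }
    return 0;
}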

EvilEnt 101 Sep 01, 2011 at 12:00

@__________Smile_

Which flags do you use? Also try std::abs instead of abs/fabs.

-O3, -s and -Os

fabs: 2.5 sec
#define: 2.0 sec
std::abs: 1.7 sec

so it’s totally awesome (at least in my far-from-reality benchmark)
@__________Smile_

“x - (x + 25165824 - 25165824)” is something like “x % 2” for float (IEEE 754 single precision) x in range about -8388608..8388607. It is elaborate bit manipulation of internal float representation.

now I know what it does but still not how it works
evil black magic o_O

__________Smile_ 101 Sep 01, 2011 at 18:00

@EvilEnt

evil black magic o_O

Well, it is not so hard…

A float basically consists of the sign bit, the position of the most significant bit in the number (the exponent), and the values of the next 23 bits (forget about NaN, infinity and underflow).

Example calculation of 25165824 + x - 25165824 with x = 111111.25f:

25165824.0f =
1 1000 0000 0000 0000 0000 0000 . 0000 0000
^----------------------------^ first 24 bits stored in float

x = 111111.25f
          1 1011 0010 0000 0111 . 0100 0000
          ^------------------------------^ 24 bits

25165824 + x = 
1 1000 0001 1011 0010 0000 0111 . 0100 0000
                              ^----- will be truncated ---
float(25165824 + x) = 
1 1000 0001 1011 0010 0000 0110 . 0000 0000
^----------------------------^ 24 bits

float(25165824 + x) - 25165824 =
          1 1011 0010 0000 0110 . 0000 0000

So, x rounded to even integer 111110.

Vilem_Otte 117 Nov 30, 2011 at 14:46

Okay, I haven’t read through all the stuff here (just went quickly through it). Anyway, first thing: regarding Carmack’s black magic (meaning the fast rsqrt from Quake III, etc.), the problem with this algorithm is that it is going to be slower (or only slightly faster) on today’s CPUs compared to other approaches, especially when you can call the RSQRTPS or RSQRTSS instructions through SSE.
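
For reference, a sketch of using the SSE reciprocal square root with one Newton-Raphson refinement step (the hardware estimate alone gives roughly 12 bits of precision; the refinement brings it close to full single precision):

// Sketch: scalar SSE rsqrt plus one Newton-Raphson iteration.
#include <xmmintrin.h>

inline float fast_rsqrt(float x)
{
    __m128 v    = _mm_set_ss(x);
    __m128 est  = _mm_rsqrt_ss(v);                    // hardware approximation
    // Newton-Raphson: est' = est * (1.5 - 0.5 * x * est * est)
    __m128 est2 = _mm_mul_ss(est, est);
    __m128 half = _mm_mul_ss(_mm_set_ss(0.5f), v);
    __m128 corr = _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(half, est2));
    return _mm_cvtss_f32(_mm_mul_ss(est, corr));
}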

As for fast abs, you will get best performance with:

_mm_max_ps(mValue, _mm_mul_ps(mValue, _mm_set1_ps(-1.0f)));

The same goes for a fast sin (although that code is a bit longer, so I won’t write it here). You can also go with lookup tables (although I don’t think you’ll get much out of them today).

Anyway, as for checking collisions between lots of stuff, I’d also recommend you try a BVH or BIH (a BIH could really shine in this case). You can rebuild/refit them dynamically very quickly (even with SAH-like partitioning heuristics, though I think equal-counts partitioning (with a sort) would be better for this case, and it’s even faster, btw). So see it this way: if you’re doing some operation 1,000,000 times and you hand-optimize it in assembly so it runs twice as fast, you get 200% speed; but if you avoid doing that operation in the first place and do it just 100 times, you end up with 1,000,000% speed. (Of course that assumes you can instantly tell when the operation can be avoided, which is impossible, but you’ll still get much more speed from avoiding the operation than from brute-forcing it.)

Huh, hope you got the point (I’m not that good at explaining things).