Optimization
#1
Posted 08 August 2011 - 02:54 AM
I spent a good half a day exploring and benchmarking various ways to make the functions for those interactions faster (they are called many millions of times per second). For example, I tried the fast inverse square root function that was used in Quake, and the fast sine/cosine functions the Nick shared with us here. I also tried several fast 3d distance equations of various crudeness.
Each fast math function was tested in isolation or near isolation and called 10 million times and compared. The results surprised me. A few of the functions were slightly faster that their "slow" counterparts, and quite a few of the "fast" functions were slower. Supposing I haven't made some stupid systematic mistake in my benchmarks, it seems that the GCC compiler produces really fast code on its own. I'm wondering if you have had similar experiences. Has anyone else actually benchmarked Nick's fast sine and cosine code?
#2
Posted 08 August 2011 - 04:17 AM
reedbeta delete my post.
#3
Posted 08 August 2011 - 05:20 AM
#4
Posted 08 August 2011 - 09:01 AM
tobeythorn said:
Quote
Have you considered SIMD code (SSE/AVX)? Future x86 CPUs will feature AVX2 with gather and FMA support.
#5
Posted 08 August 2011 - 09:03 AM
tobeythorn said:
#6
Posted 08 August 2011 - 12:12 PM
It sounds to me that you are struggling because of bad big-O and focusing on faster sqrt/sin/cos implementations potentially result only in quite modest improvements. You should try to figure out how you could improve the algorithm in the high-level instead. For example do you have to consider all objects for each object, or is it enough to use only the closest objects? You mentioned space partitioning for dynamic objects, so could you put the objects into a grid for example for faster queries & updates? Furthermore, since we are talking about a simulation, could you exploit frame-to-frame coherency somehow?
Cheers, Jarkko
#7
Posted 08 August 2011 - 03:41 PM
@Nick,
How much faster did you find your sine and cose to be? I'll give your functions another look.
I've never tried writing assembly before, but would consider it. My machine is an old amd64 with sse2.
I really like your dimensional sorting idea. That would avoid a lot of conditional statements (which from my experience are quite slow) that a space partitioning trees would require, and could maintain some coherency from frame to frame and not need too much resorting. Will definitely give that a shot.
@Jarkkol,
You're absolutely right about the importance of making my algorithm more efficient. There are still things I need to explore in that regard. This is the first time I've delved into a low-level approach, and I was surprised at just how ineffective it was. Still, at the end of the day, I'm also going to be calling a handful of math functions a very large number of times, so at some point, faster math functions would be nice.
#8
Posted 09 August 2011 - 06:07 PM
You should code iteratively, refactoring sections for performance and only after benchmarking your whole program shows it to be necessary.
The raison d'etre for that motto is that many coders are perfectionists when it comes to their code. Some of us will spend days optimizing the number of angels on the head of a pin if given half the chance...
So, you tested each function with 10 million calls. The question I'd ask is will your app call that function 10 million times in such a way that the difference in timings would affect the end goal of the app?
#9
Posted 09 August 2011 - 10:30 PM
#10
Posted 09 August 2011 - 11:14 PM
#11
Posted 10 August 2011 - 03:21 AM
After carefully testing your fast sine function again (the c version, not the assembly version), it does seem to be about twice as fast as the standard function. =)
So starting implementing your dimensional sorting suggestion. Using insertion-sort to sort the objects in each dimension is super fast since there's good coherency between frames. Then the idea was to take each object and use the sorted lists to figure out which other objects are nearby in each dimension. Across the 3 dimensions, there would be some duplicate nearby objects, so I'd cull those and then finally do interactions for what's left over.
However, dimensional sorting doesn't seem to work if my people/objects are different sizes, which is desirable for my project. Still thinking about workaround...
Further, for each dimension (say x), each object potentially collides with all the objects inside of a slice in the other dimensions (all the objects in the collision range in x, but with any y or z). This is really expensive if the object density is high. This isn't an issue for a 1D system, and less for a 2D system, but seems like it would be crippling in 3D. Is there something I'm missing?
When I get a chance I may finish up the sort thing and get some real numbers anyway.
@alphadog
Well, I've already established that yes, certain functions are being called upwards of 10million times a second. Even if I can reduce the number of interaction checks I can do with spatial partitioning, those functions will be called a lot. For example, I just benchmarked my collision algorithm, and commenting out its contents doubles the performance of my application. Performance is ok now, but once I add artificial intelligence for each object, its going to get slow.
@Smile,
The idea of writing a sort-algorithm in assembly never even crossed my mind. I don't think I will try it. Do you have a suggestion for a spatial subdivision algorithm that works well for many dynamic objects? My intuition says the cost would outweigh the benefit because of all the updating, but I'm not so experienced with these things.
----other stuff----
Adding an initial box-test to my collision algorithm boosted performance by about 15%, which is not groundbreaking, but nice for 3 extra lines of code.
I also benched the circular, doubly-linked list that I was storing my objects/people in versus ordinary arrays. Iterating through the objects for interactions and rendering is only between 0 and 20% faster (total application performance) with arrays. I'll stick with the convenience of the linked-list for now.
#12
Posted 10 August 2011 - 07:49 AM
JarkkoL said:
It's really a cost/gain/risk analysis. Sometimes low-level optimizations are quite straightforward and after implementing them you get a much clearer picture of the algorithmic complexity. At other times it's obvious that first rewriting things in assembly and then trying to optimize at a higher level is asking for trouble.
The simple truth is that there are no golden hammers in software optimization. You always have to evaluate every approach, and there's no substitute for prototyping and profiling.
#13
Posted 10 August 2011 - 08:35 AM
rouncer said:
And with AVX2 future CPUs will become a lot more powerful at throughput computing so you don't have to make any compromises.
#14
Posted 10 August 2011 - 12:41 PM
Well. I guess you just compare all pairs:
1000 persons ^2 / 2 = 500000
With spatial subdivision of 10 * 10 cells, and uniform disribution (the ideal case, I know, but still), you get:
(1000 persons / 10^2 cells)^2 * 10^2 cells / 2 = 5000
An improvement by 2 orders of magnitude. Should cover the added complexity.
edit: Messed up the last formula.
#15
Posted 10 August 2011 - 03:03 PM
you make a convincing argument
#16
Posted 18 August 2011 - 12:00 AM
I tried out and tested the uniform spatial partitioning method you suggested. With 1000 uniformly sized randomly moving objects, I tuned the partition size for best performance and required area. The best grid had 20x20 partitions, where each partition was 4 times as wide as an object.
The partitioning method performs about twice as fast as the N*(N-1)/2 method (total application performance). Certainly, this is a decent gain, but not nearly as high as you predicted, and it comes at some cost in code clarity.
Determining interactions requires higher complexity than your calculation predicts:
numberOfCells * averageNumberOfCollisionsPerCell * averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated
= 400* ~(1000/400)^2/2 * ~50 = ~60,000
an order of magnitude slower than your estimate.
The various ideas here have got me thinking. There’s another partitioning strategy I thought up today that should be faster, and require no boundaries. I will report back when that’s done.
#17
Posted 24 August 2011 - 12:29 PM
You talk about people, then talk about 3d.
People cannot fly, so taking a projection through the view volume simplifies your collision handling. I.e a plan of the volume.
If you have some kind of terrain with massive vertical displacements, you can then cull the 2d collisions based on a difference in the vertical displacement squared.
In general you should never need a square root in collision detection, it is only needed once you decide a collision has occured and you move into collision resolution.
If the "people" are of more or less a uniform size then you can treat the plan of the volume as a grid and then you have loads of techniques you can use to reduce the computational tasks.
You can treat the grid as a texture and move nearly everything into the gpu.
Use image recognition tricks
quadtrees
lot's of things.
Can you give a better description of what you are trying to do?
Then maybe we can give targeted advice rather than general case advice.
#18
Posted 24 August 2011 - 05:21 PM
I'm building a collision system for a game that will involve a large number of both movable objects and moving "beings". Beings will come in various sizes. Although I have not made a final decision regarding terrain, my plan at this point is for the terrain to be a roughly spherical world (like in the game Populous). Regardless or whether I go that route, or with a more planar terrain, I have a lot of heterogeneous moving objects that need to collide with each other in either 2 or 3 dimensions, and soon, will also need to have AI interactions between objects and beings in proximity.
@everyone
Since my creatures interact from a more macroscopic perspect, I've decided to do force-based interactions. I'm not doing traditional rigid-body collision detection followed by collision response. My collision forces are based on a fast s-curve formula, but also involves a single vector unitization, which means a square-root. Bounding box and distance squared checks are used to cull non-collisions first.
I've spent the past week trying out different partitioning strategies and to find something that works well for my application. I just roughed out a new partitioning scheme this morning and results look great. You guys were right: the performance advantage (of the right partitioning method) is indeed well beyond what any math optimization could accomplish.
Basically, I take all my objects and find the center-of-position (ie, unweighted center-of-mass) in the dimension where the objects are most spread out. Then I split at the center-of-position into two partitions, and put any object that lie on the partition edge into a separate bin. Doing this recursively, the partitions quickly get small . There is no penalty for objects moving, or what order they are in, no limits to how spread out the objects are, and this strategy finds optimal split planes. It's also quite fast. 4000 moving objects (brute-force that would be an impossible 8 million collisions per frame) goes at playable frame rates with just 8 smallest-level partitions. Haven't done any tuning yet for what the best number of partition levels is.
Thank you all for your advice and ideas!
#19
Posted 24 August 2011 - 06:39 PM
It is possible you have too many cells in your partition. There is always an optimal number of cells, depending on the implementation, the distrubution of your objects and then number of objects.
Also, what is "averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated"?
#20
Posted 25 August 2011 - 12:07 AM
What I mean is the scheme I'm using, the algorithm does a nice job of choosing how to split partitions, and I'm going to make it so that it dynamically chooses how many times to split each frame for best performance. The only situation that this doesn't work well in is when all objects occupy the same space.
With the previous grid scheme you suggested, averageNumberObjectsThatEachObjectNeedsToCheckWhe therCollisionWithHasAlreadyBeenCalculated"? is because some objects may intersect more than one grid cell, so each object must store a list of all the objects it has already done collision with, and that list needs to be checked and updated for every collision. I suppose there are other ways of dealing with this.
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












