Hi,
I was profiling some code, and noticed that with some project settings Visual C++ 2005 compiler added npad asm instruction to some part of my code (unintuitively when I disabled "whole program optimizations"). This instruction made the specific piece of the code to run around 20% faster and I was wondering is there way in Visual C++ to make some part of the code to be aligned to a given boundary (some #pragma maybe)?
I would need this for my code performance unit test framework to have more consistent results, because it's quite misleading if I for example change some algorithm to be more optimal it may still result in worse performance in the test framework due to code alignment changes.
Code performance & aligning (npad)
Started by J22, Dec 06 2007 12:51 AM
7 replies to this topic
#1
Posted 06 December 2007 - 12:51 AM
#2
Posted 06 December 2007 - 09:08 AM
Okay, found it. For anyone interested: __asm align 16; does the trick.
#3
Posted 06 December 2007 - 04:04 PM
If performance is that critical and you need binary level tricks to get it then I'd highly advise to write the entire algorithm in assembly.
There is no guarantee that the C++ compiler won't insert extra instructions after the align. When you write it all in assembly you get full control.
What specific algorithm are you working on?
There is no guarantee that the C++ compiler won't insert extra instructions after the align. When you write it all in assembly you get full control.
What specific algorithm are you working on?
#4
Posted 06 December 2007 - 06:13 PM
Not any specific algorithm, but a code performance testing framework for my hobby project. I have written performance test loops for various functions (e.g. matrix*matrix, vector*matrix, etc. operations) and the intention is to use this to see how changes (e.g. compiler settings, algorithm changes, etc.) affect to the performance of the functions.
However, there are some robustness issues with the profiling, and I'm confused of their origins now. For example I tried to compile the project with link time code generation on and off, and I got 65% performance improvement for a function when turning LTCG on, but when comparing the assembly, these functions are almost equal (just some slight differences). There is no function calls within the loop and the data used within the loop is cache-line aligned and fits well into L1 cache. There is also a warp-up iteration over the loop so everything should be in cache as well.
However, there are some robustness issues with the profiling, and I'm confused of their origins now. For example I tried to compile the project with link time code generation on and off, and I got 65% performance improvement for a function when turning LTCG on, but when comparing the assembly, these functions are almost equal (just some slight differences). There is no function calls within the loop and the data used within the loop is cache-line aligned and fits well into L1 cache. There is also a warp-up iteration over the loop so everything should be in cache as well.
#5
Posted 06 December 2007 - 06:25 PM
What timing method do you use? QueryPerformanceCounter has a high resolution, but it takes a round-trip to kernel mode so the call itself is quite slow. Also make sure that you set your process and thread priority a little higher.
#6
Posted 06 December 2007 - 07:32 PM
Duh, found the problem. I was calling a placeholder function without implemention which returned some invalid float values causing the slowdown.
Yes, I'm using QueryPerformanceCounter but it's called only twice per test (before & after each loop) and the loop takes around quarter of a sec, so the overhead of the call is relatively small. I might switch to RDTSC later though as I don't think speed-step would cause issues with this CPU intensive task. Good idea about boosting the thread priority, though I have had pretty constant results between tests.
There is still some other confusing weirdness going on though. I have a function which normalizes a 4d vector and when I have LTCG on, I don't have the fsqrt intrinsic inlined but it calls a _sqrt function and does some other extra memory operations as opposed to build which has LTCG off. However, "LTCG on" build still performs around 35% better, which doesn't make any sense.
Overall, I don't quite understand why when LTCG is enable the compiler refuses to inline these intrinsics.
Yes, I'm using QueryPerformanceCounter but it's called only twice per test (before & after each loop) and the loop takes around quarter of a sec, so the overhead of the call is relatively small. I might switch to RDTSC later though as I don't think speed-step would cause issues with this CPU intensive task. Good idea about boosting the thread priority, though I have had pretty constant results between tests.
There is still some other confusing weirdness going on though. I have a function which normalizes a 4d vector and when I have LTCG on, I don't have the fsqrt intrinsic inlined but it calls a _sqrt function and does some other extra memory operations as opposed to build which has LTCG off. However, "LTCG on" build still performs around 35% better, which doesn't make any sense.
Overall, I don't quite understand why when LTCG is enable the compiler refuses to inline these intrinsics.
#7
Posted 06 December 2007 - 07:52 PM
J22,
I'm pretty sure you've been hit by a branch prediction collision. All the effects you've described calls for that. Benchmarks are very vulnerable to this because they benchmark a very small piece of code a lot of times. If things collide here they will collide big time!
If the code you're benchmarking contains a branch itself chances are that the loop branch (which is very predictable) and your branch (which might as well be very predictable) end up in the same branch target and branch history buffer.
No big deal except that it the loop branch is usually not taken (for-style loop) and yours nearly always taken the two branches will work against each other. If that happends your code will confuse the shit out of the branch prediction logic.
It does not always need a very short loop to see this effect. You can construct a worst case scenario if you put two branches into the same aligned 32 bytes and make sure one jumps and the other does not, but it can as well happen in code that spawns several kilobytes. There simply is not that much internal storage for branch prediction.
It's like optimizing for cache coherence.. Just the opposite way, and with a *much* smaller cache (the branch prediction history).
You can do nothing about it since it's in the hand of the compiler/linker how to transform your loops do decissions and all that stuff. The only thing you can do is to use the performance counter to get the predicted/mispredicted ratio and print it out along with your benchmarks. This way you at least get a sign that you have to interpret the performance values with a grain of salt or two.
If changing alignment (and thus the branch prediction slot) makes such a difference you can't reliable benchmark anymore because your outer loop may mess up the timing in an untypical way. Sad but true.
But there are good news as well: If your code is that tight and fast that branch prediction collisions contribute 20% to your performance numbers you wrote some damn tight and fast code, and you should stop all micro optimizations now.
Pick out the next thing and optimize it!
Regarding the branch prediction thing Agner Fog wrote some very good stuff how todays CPU's like or dislike it... http://www.agner.org/
Nils
I'm pretty sure you've been hit by a branch prediction collision. All the effects you've described calls for that. Benchmarks are very vulnerable to this because they benchmark a very small piece of code a lot of times. If things collide here they will collide big time!
If the code you're benchmarking contains a branch itself chances are that the loop branch (which is very predictable) and your branch (which might as well be very predictable) end up in the same branch target and branch history buffer.
No big deal except that it the loop branch is usually not taken (for-style loop) and yours nearly always taken the two branches will work against each other. If that happends your code will confuse the shit out of the branch prediction logic.
It does not always need a very short loop to see this effect. You can construct a worst case scenario if you put two branches into the same aligned 32 bytes and make sure one jumps and the other does not, but it can as well happen in code that spawns several kilobytes. There simply is not that much internal storage for branch prediction.
It's like optimizing for cache coherence.. Just the opposite way, and with a *much* smaller cache (the branch prediction history).
You can do nothing about it since it's in the hand of the compiler/linker how to transform your loops do decissions and all that stuff. The only thing you can do is to use the performance counter to get the predicted/mispredicted ratio and print it out along with your benchmarks. This way you at least get a sign that you have to interpret the performance values with a grain of salt or two.
If changing alignment (and thus the branch prediction slot) makes such a difference you can't reliable benchmark anymore because your outer loop may mess up the timing in an untypical way. Sad but true.
But there are good news as well: If your code is that tight and fast that branch prediction collisions contribute 20% to your performance numbers you wrote some damn tight and fast code, and you should stop all micro optimizations now.
Pick out the next thing and optimize it!
Regarding the branch prediction thing Agner Fog wrote some very good stuff how todays CPU's like or dislike it... http://www.agner.org/
Nils
My music: http://myspace.com/planetarchh <-- my music
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
#8
Posted 06 December 2007 - 08:47 PM
This particular normalization test loop is ~550 bytes of intense FPU code and there's only single branch (forward branch of the for loop and unconditional jmp in the end of the loop), so it's not really a tiny loop that's very prone to performance differences due to branching. Anyway, I also tried to change the loop to backward branch (do-while) that even static branch prediction should predict correctly, but there was no difference in performance.
I'm not really optimizing things now, but rather building a reliable test framework which allows me to optimize code easily. I.e. if I decide to change an algorithm or compiler settings or such, I can run through the tests to see what's the actual impact on the performance without having to rely on my gut feeling. The problem is that the framework isn't really reliable yet due to the unexplained stuff that's going on.
Anyway, I'll gotta check the Agners link you were referring to. Thanks!
I'm not really optimizing things now, but rather building a reliable test framework which allows me to optimize code easily. I.e. if I decide to change an algorithm or compiler settings or such, I can run through the tests to see what's the actual impact on the performance without having to rely on my gut feeling. The problem is that the framework isn't really reliable yet due to the unexplained stuff that's going on.
Anyway, I'll gotta check the Agners link you were referring to. Thanks!
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












