SSE Code question
#1
Posted 20 October 2009 - 10:19 PM
Is there someone out there who can take a look at my code and hopefully tell what's wrong with it? Here is link:
http://codepad.org/plz5aEW6
I'm using MS Visual Studio 2005 with all optimization set to the highest.
Thanks in advance,
Christian
#3
Posted 21 October 2009 - 09:58 AM
FPU result: 4571 ms
SSE result: 2543 ms
Although there is indeed some overhead from the loop you can clearly see significant benefit from using SSE.
Anyway, what CPU do you have? Pentium 4 and older Athlons don't have 128-bit SSE execution units, so it's a lot harder to get any speedup out of them.
#4
Posted 21 October 2009 - 10:06 AM
FPU results: 4587 ms
SSE results: 2277 ms
#5
Posted 21 October 2009 - 06:53 PM
I have an Intel Core 2 Quad Q9400. But I'm using an old MS compiler from 2005 which probably isn't good in optimizing code anymore.
Thanks for your remarks,
Christian
#6
Posted 21 October 2009 - 07:28 PM
#7
Posted 22 October 2009 - 04:10 PM
Here are the results.
NV - Number of Vertices; NL - Number of Loops;
NV = 1,000; NL = 100,000,000
CPU - 394sec
SSE - 65sec
Ratio: 6.1
NV = 10,000; NL = 10,000,000
CPU - 461sec
SSE - 146sec
Ratio: 3.2
NV = 100,000; NL = 1,000,000
CPU - 461sec
SSE - 146sec
Ratio: 1.04
Basically, I think I'm just testing the cache misses. But how does the math go? CPU-Z tells me that my L1 D-Cache is 32Kbyte where each line is 64byte. I can squeeze in 32 * 1024byte / 4 byte per float = 8192 floats. That would mean I can have 8192 floats / 3 floats per vertex = 2730 vertices in my L1 Cache.
Now, I have 2 float arrays v_1 and v_2. Does it makes sense to compute the dot product in batches of 1300 floats from each array? Is that a recipe for maximum performance?
Thanks,
Christian
#8
Posted 02 November 2009 - 11:58 AM
chhenning said:
In my experience it's not really worth trying to achieve absolute maximum performance. Change one small thing and it no longer performs optimally. So just make sure you have tightly packed SSE-friendly data structures and you access them linearly whenever possible.
#9
Posted 02 November 2009 - 04:43 PM
Let the compiler decide whether to inline or not, inlined function can pollute the execution cache, even when not executed. Inlining or not is not relevant in your code, but might be very relevant in real applications.
Further: Before diving into making everything SSE, find your bottlenecks first, and see what the compiler already does for you. Often, you'll find that the compiler generates code that is already unrolled, autovectorized and whatnot, and better than what you would have code:
#include <cstdlib>
#include <iostream>
struct __attribute__((aligned(16)) Vector {
float a,b,c,d;
};
int main () {
const size_t size = 1024;
float __attribute__((aligned(16)) dots[size];
Vector __attribute__((aligned(16)) b[size];
Vector __attribute__((aligned(16)) c[size];
// spread randomness to prevent folding actions
int some_seed;
std::cin >> some_seed;
srand (some_seed);
for (size_t i=0; i<size; ++i) {
dots [i] = rand () / (RAND_MAX+1.0f);
b [i].a = rand () / (RAND_MAX+1.0f);
b [i].b = rand () / (RAND_MAX+1.0f);
b [i].c = rand () / (RAND_MAX+1.0f);
b [i].d = rand () / (RAND_MAX+1.0f);
c [i].a = rand () / (RAND_MAX+1.0f);
c [i].b = rand () / (RAND_MAX+1.0f);
c [i].c = rand () / (RAND_MAX+1.0f);
c [i].d = rand () / (RAND_MAX+1.0f);
}
// do some number crunching
for (size_t i=0; i<size; ++i) {
dots [i] = b[i].a*c[i].a
+ b[i].b*c[i].b
+ b[i].c*c[i].c
+ b[i].d*c[i].d;
}
// output it all to prevent folding
for (size_t i=0; i<size; ++i) {
std::cout << dots [i];
}
}
Quote
L7: movaps (%esi,%eax,4), %xmm3 movaps 16(%esi,%eax,4), %xmm2 movaps 32(%esi,%eax,4), %xmm1 movaps 48(%esi,%eax,4), %xmm0 movaps %xmm3, %xmm6 shufps $136, %xmm2, %xmm6 shufps $221, %xmm2, %xmm3 movaps %xmm1, %xmm5 shufps $136, %xmm0, %xmm5 shufps $221, %xmm0, %xmm1 movaps %xmm1, 16(%esp) movaps (%edi,%eax,4), %xmm2 movaps 16(%edi,%eax,4), %xmm1 movaps 32(%edi,%eax,4), %xmm0 movaps 48(%edi,%eax,4), %xmm7 movaps %xmm2, %xmm4 shufps $136, %xmm1, %xmm4 shufps $221, %xmm1, %xmm2 movaps %xmm0, %xmm1 shufps $136, %xmm7, %xmm1 movaps %xmm1, 32(%esp) shufps $221, %xmm7, %xmm0 movaps %xmm0, 48(%esp) movaps %xmm6, %xmm0 shufps $136, %xmm5, %xmm0 movaps %xmm4, %xmm1 shufps $136, 32(%esp), %xmm1 mulps %xmm1, %xmm0 movaps %xmm3, %xmm1 shufps $136, 16(%esp), %xmm1 movaps %xmm2, %xmm7 shufps $136, 48(%esp), %xmm7 mulps %xmm7, %xmm1 addps %xmm1, %xmm0 shufps $221, %xmm5, %xmm6 shufps $221, 32(%esp), %xmm4 mulps %xmm4, %xmm6 addps %xmm6, %xmm0 shufps $221, 16(%esp), %xmm3 shufps $221, 48(%esp), %xmm2 mulps %xmm2, %xmm3 addps %xmm3, %xmm0 movaps %xmm0, 32832(%esp,%eax) addl $16, %eax cmpl $4096, %eax jne L7
This vectorized and loop-unrolled code is not necessarily better than your version (but I haven't benchmarked it), but often people underestimate what modern compilers are able to do, possibly enough to leverage bottlenecks to remote and unrelated points in your application, making optimizing wannabe-bottlenecks an unneeded loss of time.
edit: Little update after tweaking above code from array-of-struct to struct-of-arrays:
Quote
L8: movaps (%ebx,%eax), %xmm0 mulps (%esi,%eax), %xmm0 movaps 4000(%ebx,%eax), %xmm1 mulps 4000(%esi,%eax), %xmm1 addps %xmm1, %xmm0 movaps 8000(%ebx,%eax), %xmm1 mulps 8000(%esi,%eax), %xmm1 addps %xmm1, %xmm0 movaps 12000(%ebx,%eax), %xmm1 mulps 12000(%esi,%eax), %xmm1 addps %xmm1, %xmm0 movaps %xmm0, (%edi,%eax) addl $16, %eax cmpl $4000, %eax jne L8 incl %edx cmpl $250000, %edx jne L7 xorl %ebx, %ebx
I guess performance is now pretty much equivalent to your explicit SSE code (care to post some asm-dumps?).
~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--
#10
Posted 02 November 2009 - 11:22 PM
In any event, thanks for your reply.
Thanks,
Christian
#11
Posted 03 November 2009 - 01:02 AM
-
Currently working on: the 3D engine for Tomb Raider.
#13
Posted 03 November 2009 - 10:14 AM
#include <cstdlib>
#include <iostream>
#include <ctime>
namespace {
template <size_t i> struct afloat_array {
typedef float type[i] __attribute__((aligned (16)));
};
template <size_t len>
struct Vector2 {
typename afloat_array<len>::type a, b, c, d;
};
const size_t size = 1000000;
const size_t more = 1000;
afloat_array<size>::type dots;
Vector2<size> b, c;
}
int main () {
// spread randomness to prevent folding actions
int some_seed;
std::cin >> some_seed;
srand (some_seed);
for (size_t i=0; i<size; ++i) {
dots [i] = rand () / (RAND_MAX+1.0f);
b.a [i]= rand () / (RAND_MAX+1.0f);
b.b [i]= rand () / (RAND_MAX+1.0f);
b.c [i]= rand () / (RAND_MAX+1.0f);
b.d [i]= rand () / (RAND_MAX+1.0f);
c.a [i]= rand () / (RAND_MAX+1.0f);
c.b [i]= rand () / (RAND_MAX+1.0f);
c.c [i]= rand () / (RAND_MAX+1.0f);
c.d [i]= rand () / (RAND_MAX+1.0f);
}
const clock_t begin = clock();
// do some number crunching
for (size_t m=0; m<more; ++m) {
for (size_t i=0; i<size; ++i) {
dots [i] = b.a[i]*c.a[i]
+ b.b[i]*c.b[i]
+ b.c[i]*c.c[i]
+ b.d[i]*c.d[i];
}}
const clock_t end = clock();
// output it all to prevent folding
for (size_t i=0; i<size; ++i) {
std::cout << dots [i];
}
std::cout << "\ntime for " << (long)more*size << " dp4's: "
<< (end-begin)/(float)(CLOCKS_PER_SEC) << '\n';
}
For size=1000, more=1000000:
time for 1000000000 dp4's: 6.297
Running your explicit sse code, ported to g++, yields:
CPU Test: 7000 ms Sum of results: 750267.569484 SSE Test: 6578 ms Sum of results: 750267.569504
To ensure that the outer loop is not a nop, I looked at the assemblies.
Yours:
L47: xorl %eax, %eax movl %edx, 84(%esp) .p2align 4,,10 L48: movaps (%esi,%eax,4), %xmm1 movl 100(%esp), %edx movaps (%edi,%eax,4), %xmm2 mulps (%edx,%eax,4), %xmm1 mulps (%ecx,%eax,4), %xmm2 movl 104(%esp), %edx movaps (%edx,%eax,4), %xmm0 movl 84(%esp), %edx mulps (%edx,%eax,4), %xmm0 addps %xmm2, %xmm0 addps %xmm1, %xmm0 movaps %xmm0, (%ebx,%eax,4) addl $4, %eax cmpl $1000000, %eax jne L48 incl 96(%esp) movl 84(%esp), %edx cmpl $1000, 96(%esp) jne L47
Mine:
L7: xorl %eax, %eax .p2align 4,,10 L8: movaps __ZN12_GLOBAL__N_11cE+4000000(%eax), %xmm0 movaps __ZN12_GLOBAL__N_11cE(%eax), %xmm1 mulps __ZN12_GLOBAL__N_11bE+4000000(%eax), %xmm0 mulps __ZN12_GLOBAL__N_11bE(%eax), %xmm1 addps %xmm1, %xmm0 movaps __ZN12_GLOBAL__N_11cE+8000000(%eax), %xmm1 mulps __ZN12_GLOBAL__N_11bE+8000000(%eax), %xmm1 addps %xmm1, %xmm0 movaps %xmm0, __ZN12_GLOBAL__N_14dotsE(%eax) addl $16, %eax cmpl $4000000, %eax jne L8 incl %edx cmpl $1000, %edx jne L7
g++ 4.4 wins this match, and all we had to do was to align the data, all else is john-doe-readable & standards-conforming c++. And when I look at the latest result, I think it is hard to code or generate relevantly better codepaths :yes:
~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--
#14
Posted 03 November 2009 - 06:19 PM
Christian
#15
Posted 05 November 2009 - 11:12 AM
chhenning said:
Christian
I don't see where there shall be cache misses on a modern cpu. Then there is a bug:
rhs_a = reinterpret_cast< const sse_t* >( &c.a[i] ); rhs_b = reinterpret_cast< const sse_t* >( &c.b[i] ); rhs_c = reinterpret_cast< const sse_t* >( &c.c[i] ); rhs_d = reinterpret_cast< const sse_t* >( &c.c[i] );// << you save a load and rhs_b and rhs_d alias each other
Further: Have you looked at the assemby? What does MSVC produce for those simple dot-products?
Then: On g++, your code crashes. I try to fix it.
~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--
#16
Posted 05 November 2009 - 02:07 PM
#include <cstdlib>
#include <iostream>
#include <ctime>
#include <xmmintrin.h>
#include <emmintrin.h>
namespace {
template <size_t i> struct afloat_array {
typedef float type[i] __attribute__((aligned (16)));
};
template <size_t len>
struct Vector2 {
typename afloat_array<len>::type a, b, c, d;
};
template <size_t len>
struct SSEVector {
__m128 a[len], b[len], c[len], d[len];
};
const size_t size = 1000000;
const size_t more = 1000;
}
// normally I'd put those in the anonymous namespace,
// but execution breaks then (seems to be an issue in g++-4.4)
afloat_array<size>::type dots;
Vector2<size> b, c;
__m128 sse_dots[size/4];
SSEVector<size/4> sse_lhs, sse_rhs;
// report bug when above data put in anonymous namespace
int main () {
// spread randomness to prevent folding actions
int some_seed;
std::cin >> some_seed;
srand (some_seed);
for (size_t i=0; i<size; ++i) {
dots [i] = rand () / (RAND_MAX+1.0f);
b.a [i]= rand () / (RAND_MAX+1.0f);
b.b [i]= rand () / (RAND_MAX+1.0f);
b.c [i]= rand () / (RAND_MAX+1.0f);
b.d [i]= rand () / (RAND_MAX+1.0f);
c.a [i]= rand () / (RAND_MAX+1.0f);
c.b [i]= rand () / (RAND_MAX+1.0f);
c.c [i]= rand () / (RAND_MAX+1.0f);
c.d [i]= rand () / (RAND_MAX+1.0f);
// dirty init
if (i%4 == 0) {
sse_dots[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_lhs.a[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_lhs.b[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_lhs.c[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_lhs.d[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_rhs.a[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_rhs.b[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_rhs.c[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
sse_rhs.d[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
}
}
std::cout << "sse test" << std::endl;
const clock_t sse_begin = clock();
// do some SSE number crunching
for( size_t m=0; m<more; ++m ) {
for( size_t i=0; i<size/4; ++i) {
sse_dots[i] = _mm_add_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( sse_lhs.a[i], sse_rhs.a[i] )
, _mm_mul_ps( sse_lhs.b[i], sse_rhs.b[i] )
)
, _mm_mul_ps( sse_lhs.c[i], sse_rhs.c[i] )
)
, _mm_mul_ps( sse_lhs.d[i], sse_rhs.d[i] )
);
} }
const clock_t sse_end = clock();
// i tried the following to see how your explicit sse performs
// after a bit of scheduling. but gcc seems already good enough
// at scheduling.
/*std::cout << "sse test 2" << std::endl;
const clock_t sse_begin2 = clock();
// do some SSE number crunching
for( size_t m=0; m<more; ++m ) {
for( size_t i=0; i<size/4; ++i) {
const __m128
a = _mm_mul_ps( sse_lhs.a[i], sse_rhs.a[i]),
b = _mm_mul_ps( sse_lhs.b[i], sse_rhs.b[i]),
c = _mm_mul_ps( sse_lhs.c[i], sse_rhs.c[i]),
d = _mm_mul_ps( sse_lhs.d[i], sse_rhs.d[i]),
a_ = _mm_add_ps (a,b),
b_ = _mm_add_ps (c,d)
;
sse_dots[i] = _mm_add_ps (a_,b_);
} }
const clock_t sse_end2 = clock();
*/
std::cout << "scalar test" << std::endl;
// do some number crunching
const clock_t begin = clock();
for (size_t m=0; m<more; ++m) {
for (size_t i=0; i<size; ++i) {
dots [i] = b.a[i]*c.a[i]
+ b.b[i]*c.b[i]
+ b.c[i]*c.c[i]
+ b.d[i]*c.d[i];
}}
const clock_t end = clock();
// prevent folding
float accu = 0.0f;
for (size_t i=0; i<size; ++i)
accu += dots[i];
for (size_t i=0; i<size/4; ++i) {
union {__m128 m; float f[4];} const x = {sse_dots[i]}; //evil
accu += x.f[0]+x.f[1]+x.f[2]+x.f[3];
}
std::cout << accu << std::endl;
std::cout << "\ntime for " << (long)more*size << " dp4's: "
<< (end-begin)/(float)(CLOCKS_PER_SEC) << '\n';
std::cout << "\ntime for " << (long)more*size << " sse-dp4's: "
<< (sse_end-sse_begin)/(float)(CLOCKS_PER_SEC) << '\n';
// sse: 7.892 7.953
// scalar: 7.922 7.937
}
In one test, your code (+ tweaks by me) is faster (sse: 7.892, scalar: 7.922), in another, mine is faster (scalar: 7.922, sse: 7.953). The reason becomes clear when we look at the assemblies:
Here is what g++-4.4 gives me for your (tweaked by me) code:
L11: xorl %eax, %eax .p2align 4,,10 L12: movaps _sse_lhs+12000000(%eax), %xmm1 movaps _sse_lhs+8000000(%eax), %xmm2 mulps _sse_rhs+12000000(%eax), %xmm1 mulps _sse_rhs+8000000(%eax), %xmm2 movaps _sse_lhs+4000000(%eax), %xmm3 movaps _sse_lhs(%eax), %xmm0 mulps _sse_rhs+4000000(%eax), %xmm3 mulps _sse_rhs(%eax), %xmm0 addps %xmm3, %xmm0 addps %xmm2, %xmm0 addps %xmm1, %xmm0 movaps %xmm0, _sse_dots(%eax) addl $16, %eax cmpl $4000000, %eax jne L12 incl %edx cmpl $1000, %edx jne L11
For my john-doe-c++ I reap:
L14: xorl %eax, %eax .p2align 4,,10 L15: movaps _c+4000000(%eax), %xmm0 movaps _c(%eax), %xmm1 mulps _b+4000000(%eax), %xmm0 mulps _b(%eax), %xmm1 addps %xmm1, %xmm0 movaps _c+8000000(%eax), %xmm1 mulps _b+8000000(%eax), %xmm1 addps %xmm1, %xmm0 movaps _c+12000000(%eax), %xmm1 mulps _b+12000000(%eax), %xmm1 addps %xmm1, %xmm0 movaps %xmm0, _dots(%eax) addl $16, %eax cmpl $4000000, %eax jne L15 incl %edx cmpl $1000, %edx jne L14
Both are nearly identical, the only real difference is that your code runs in xmm0-xmm3, whereas mine runs in only xmm0 and xmm1.
I don't know what MSVC produces, but it is clear that optimising such microbenchmark code with explicit sse is a pure waste of time with g++, because the latter does it all for you.
~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--
#17
Posted 05 November 2009 - 02:41 PM
#18
Posted 06 November 2009 - 12:04 AM
Nick said:
-
Currently working on: the 3D engine for Tomb Raider.
#19
Posted 06 November 2009 - 08:27 AM
.oisyn said:
So it's not that simple to get a cache miss out of a modern CPU. :happy: I suspect you might be running into bandwidth limitations, or your random test isn't fully random, or a significant part of the array still fits in L2 cache.
#20
Posted 06 November 2009 - 12:08 PM
Here's my code:
#include <iostream>
#include <utility>
#include <set>
#include <intrin.h>
#include <xmmintrin.h>
#include <emmintrin.h>
static const int cSize = 4*1024*1024;
__m128i arr[cSize];
__m128i calc(__m128i v, __m128i a)
{ // merely wasting some cycles here...
for (int j = 0; j < 256; j++)
v = _mm_add_epi32(v, a);
return v;
}
long long __declspec(naked) gettime()
{
__asm rdtsc;
__asm ret;
}
int main()
{
long long t;
long long t1 = 0, t2 = 0, t3 = 0;
__m128i add = _mm_set1_epi32(10);
// make sure the OS commits the pages to mem
for (int i = cSize-1; i >= 0; i--)
arr[i] = _mm_set1_epi32(0);
for (int run = 0; run < 10; run++)
{
// forward
t = -gettime();
for (int i = 0; i < cSize; i++)
arr[i] = calc(arr[i], add);
t += gettime();
t1 += t;
std::cout << "forward time: " << t << std::endl;
// reverse
t = -gettime();
for (int i = cSize-1; i >= 0; i--)
arr[i] = calc(arr[i], add);
t += gettime();
t2 += t;
std::cout << "reverse time: " << t << std::endl;
// random
t = -gettime();
for (int i = 0; i < cSize; i += 16)
{
arr[i+ 9] = calc(arr[i+ 9], add);
arr[i+ 5] = calc(arr[i+ 5], add);
arr[i+14] = calc(arr[i+14], add);
arr[i+ 2] = calc(arr[i+ 2], add);
arr[i+13] = calc(arr[i+13], add);
arr[i+ 0] = calc(arr[i+ 0], add);
arr[i+15] = calc(arr[i+15], add);
arr[i+ 6] = calc(arr[i+ 6], add);
arr[i+ 3] = calc(arr[i+ 3], add);
arr[i+11] = calc(arr[i+11], add);
arr[i+ 7] = calc(arr[i+ 7], add);
arr[i+ 1] = calc(arr[i+ 1], add);
arr[i+ 8] = calc(arr[i+ 8], add);
arr[i+12] = calc(arr[i+12], add);
arr[i+ 4] = calc(arr[i+ 4], add);
arr[i+10] = calc(arr[i+10], add);
}
t += gettime();
t3 += t;
std::cout << "random time : " << t << "\n" << std::endl;
}
std::cout << "total forward: " << t1 << std::endl;
std::cout << "total reverse: " << t2 << std::endl;
std::cout << "total random : " << t3 << std::endl;
}
-
Currently working on: the 3D engine for Tomb Raider.
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users











