Jump to content


SSE Code question


20 replies to this topic

#1 chhenning

    New Member

  • Members
  • Pip
  • 9 posts

Posted 20 October 2009 - 10:19 PM

Hi there, I have been trying to learn more about writing SSE code. But for some reason I cannot get my code much faster as with normal CPU code.

Is there someone out there who can take a look at my code and hopefully tell what's wrong with it? Here is link:

http://codepad.org/plz5aEW6

I'm using MS Visual Studio 2005 with all optimization set to the highest.

Thanks in advance,
Christian

#2 JarkkoL

    Senior Member

  • Members
  • PipPipPipPip
  • 458 posts

Posted 20 October 2009 - 11:30 PM

It's because in your test code most of the overhead comes from cache misses and looping. Unroll your test loop few times and read from a single index and you'll see the expected improvement.

#3 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1225 posts

Posted 21 October 2009 - 09:58 AM

Looks fairly ok to me:

FPU result: 4571 ms
SSE result: 2543 ms

Although there is indeed some overhead from the loop you can clearly see significant benefit from using SSE.

Anyway, what CPU do you have? Pentium 4 and older Athlons don't have 128-bit SSE execution units, so it's a lot harder to get any speedup out of them.

#4 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1225 posts

Posted 21 October 2009 - 10:06 AM

64-bit version:

FPU results: 4587 ms
SSE results: 2277 ms

#5 chhenning

    New Member

  • Members
  • Pip
  • 9 posts

Posted 21 October 2009 - 06:53 PM

Thanks for all of your replies. I have been significantly improving the SSE code. Basically rearranging the loops have helped a lot.

I have an Intel Core 2 Quad Q9400. But I'm using an old MS compiler from 2005 which probably isn't good in optimizing code anymore.

Thanks for your remarks,
Christian

#6 JarkkoL

    Senior Member

  • Members
  • PipPipPipPip
  • 458 posts

Posted 21 October 2009 - 07:28 PM

For the SSE code there isn't much for compiler to optimize though. Anyway, the operation you perform on SOA is very simple (only dot product) thus cache misses contribute quite a bit on my PC at least to the figure, so there isn't close to 4x improvement what you would expect. If I change the code so that it performs the operation only at fixed index essentially making all the reads from L1 cache, then the improvement is close to 4x, so you should really try to make more work in a single pass over the data to get better SSE improvements.

#7 chhenning

    New Member

  • Members
  • Pip
  • 9 posts

Posted 22 October 2009 - 04:10 PM

Jarkkol, I have been doing so benchmarking.

Here are the results.

NV - Number of Vertices; NL - Number of Loops;

NV = 1,000; NL = 100,000,000
CPU - 394sec
SSE - 65sec
Ratio: 6.1

NV = 10,000; NL = 10,000,000
CPU - 461sec
SSE - 146sec
Ratio: 3.2

NV = 100,000; NL = 1,000,000
CPU - 461sec
SSE - 146sec
Ratio: 1.04

Basically, I think I'm just testing the cache misses. But how does the math go? CPU-Z tells me that my L1 D-Cache is 32Kbyte where each line is 64byte. I can squeeze in 32 * 1024byte / 4 byte per float = 8192 floats. That would mean I can have 8192 floats / 3 floats per vertex = 2730 vertices in my L1 Cache.

Now, I have 2 float arrays v_1 and v_2. Does it makes sense to compute the dot product in batches of 1300 floats from each array? Is that a recipe for maximum performance?

Thanks,
Christian

#8 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1225 posts

Posted 02 November 2009 - 11:58 AM

chhenning said:

Basically, I think I'm just testing the cache misses. But how does the math go?
It's a complex tale because the CPU will also automatically prefetch data into the cache when it detects a linear access pattern. And out-of-order execution also hides some of the latency caused by cache misses.

In my experience it's not really worth trying to achieve absolute maximum performance. Change one small thing and it no longer performs optimally. So just make sure you have tightly packed SSE-friendly data structures and you access them linearly whenever possible.

#9 phresnel

    Member

  • Members
  • PipPip
  • 47 posts

Posted 02 November 2009 - 04:43 PM

What you have written here is essentially a micro benchmark, where in real world code calculating a compile time constant amount of dot products in a row is not quite common. And even when, you should organize your data so that you don't need so many pointer-casts and produce aliasing (which is massively in the way of compiler optimization).

Let the compiler decide whether to inline or not, inlined function can pollute the execution cache, even when not executed. Inlining or not is not relevant in your code, but might be very relevant in real applications.

Further: Before diving into making everything SSE, find your bottlenecks first, and see what the compiler already does for you. Often, you'll find that the compiler generates code that is already unrolled, autovectorized and whatnot, and better than what you would have code:

#include <cstdlib>

#include <iostream>


struct __attribute__((aligned(16)) Vector {

        float a,b,c,d;

};


int main () {

        const size_t size = 1024;


        float __attribute__((aligned(16)) dots[size];

        Vector __attribute__((aligned(16)) b[size];

        Vector __attribute__((aligned(16)) c[size];


        // spread randomness to prevent folding actions

        int some_seed;

        std::cin >> some_seed;

        srand (some_seed);

        for (size_t i=0; i<size; ++i) {

                dots [i] = rand () / (RAND_MAX+1.0f);

                b [i].a = rand () / (RAND_MAX+1.0f);

                b [i].b = rand () / (RAND_MAX+1.0f);

                b [i].c = rand () / (RAND_MAX+1.0f);

                b [i].d = rand () / (RAND_MAX+1.0f);

                c [i].a = rand () / (RAND_MAX+1.0f);

                c [i].b = rand () / (RAND_MAX+1.0f);

                c [i].c = rand () / (RAND_MAX+1.0f);

                c [i].d = rand () / (RAND_MAX+1.0f);

        }


        // do some number crunching

        for (size_t i=0; i<size; ++i) {

                dots [i] = b[i].a*c[i].a

                         + b[i].b*c[i].b

                         + b[i].c*c[i].c

                         + b[i].d*c[i].d;

        }


        // output it all to prevent folding

        for (size_t i=0; i<size; ++i) {

                std::cout << dots [i];

        }

}

Quote

Originally Posted by g++ 4.4 -O3 -msse3 -S

L7:

	movaps	(%esi,%eax,4), %xmm3

	movaps	16(%esi,%eax,4), %xmm2

	movaps	32(%esi,%eax,4), %xmm1

	movaps	48(%esi,%eax,4), %xmm0

	movaps	%xmm3, %xmm6

	shufps	$136, %xmm2, %xmm6

	shufps	$221, %xmm2, %xmm3

	movaps	%xmm1, %xmm5

	shufps	$136, %xmm0, %xmm5

	shufps	$221, %xmm0, %xmm1

	movaps	%xmm1, 16(%esp)

	movaps	(%edi,%eax,4), %xmm2

	movaps	16(%edi,%eax,4), %xmm1

	movaps	32(%edi,%eax,4), %xmm0

	movaps	48(%edi,%eax,4), %xmm7

	movaps	%xmm2, %xmm4

	shufps	$136, %xmm1, %xmm4

	shufps	$221, %xmm1, %xmm2

	movaps	%xmm0, %xmm1

	shufps	$136, %xmm7, %xmm1

	movaps	%xmm1, 32(%esp)

	shufps	$221, %xmm7, %xmm0

	movaps	%xmm0, 48(%esp)

	movaps	%xmm6, %xmm0

	shufps	$136, %xmm5, %xmm0

	movaps	%xmm4, %xmm1

	shufps	$136, 32(%esp), %xmm1

	mulps	%xmm1, %xmm0

	movaps	%xmm3, %xmm1

	shufps	$136, 16(%esp), %xmm1

	movaps	%xmm2, %xmm7

	shufps	$136, 48(%esp), %xmm7

	mulps	%xmm7, %xmm1

	addps	%xmm1, %xmm0

	shufps	$221, %xmm5, %xmm6

	shufps	$221, 32(%esp), %xmm4

	mulps	%xmm4, %xmm6

	addps	%xmm6, %xmm0

	shufps	$221, 16(%esp), %xmm3

	shufps	$221, 48(%esp), %xmm2

	mulps	%xmm2, %xmm3

	addps	%xmm3, %xmm0

	movaps	%xmm0, 32832(%esp,%eax)

	addl	$16, %eax

	cmpl	$4096, %eax

	jne	L7

This vectorized and loop-unrolled code is not necessarily better than your version (but I haven't benchmarked it), but often people underestimate what modern compilers are able to do, possibly enough to leverage bottlenecks to remote and unrelated points in your application, making optimizing wannabe-bottlenecks an unneeded loss of time.


edit: Little update after tweaking above code from array-of-struct to struct-of-arrays:


Quote

Originally Posted by g++ 4.4 -O3 -msse3 -S
L8:

	movaps	(%ebx,%eax), %xmm0

	mulps	(%esi,%eax), %xmm0

	movaps	4000(%ebx,%eax), %xmm1

	mulps	4000(%esi,%eax), %xmm1

	addps	%xmm1, %xmm0

	movaps	8000(%ebx,%eax), %xmm1

	mulps	8000(%esi,%eax), %xmm1

	addps	%xmm1, %xmm0

	movaps	12000(%ebx,%eax), %xmm1

	mulps	12000(%esi,%eax), %xmm1

	addps	%xmm1, %xmm0

	movaps	%xmm0, (%edi,%eax)

	addl	$16, %eax

	cmpl	$4000, %eax

	jne	L8

	incl	%edx

	cmpl	$250000, %edx

	jne	L7

	xorl	%ebx, %ebx

I guess performance is now pretty much equivalent to your explicit SSE code (care to post some asm-dumps?).

~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--


#10 chhenning

    New Member

  • Members
  • Pip
  • 9 posts

Posted 02 November 2009 - 11:22 PM

Hi phresnel, qq. How can I get an assembler output with VC++ 8?

In any event, thanks for your reply.

Thanks,
Christian

#11 .oisyn

    DevMaster Staff

  • Moderators
  • 1810 posts

Posted 03 November 2009 - 01:02 AM

It's somewhere in the compiler options for each file, but what I usually do is just put a breakpoint on the line of code I want to view the assembly for, then run, and when it breaks do ctrl-F11 to go to the disassembly window.
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#12 JarkkoL

    Senior Member

  • Members
  • PipPipPipPip
  • 458 posts

Posted 03 November 2009 - 01:03 AM

File (or project) properties => C/C++ => Output Files => Assembler Output => Assembly With Source Code (/FAs)

#13 phresnel

    Member

  • Members
  • PipPip
  • 47 posts

Posted 03 November 2009 - 10:14 AM

Little update:

#include <cstdlib>

#include <iostream>

#include <ctime>


namespace {

        template <size_t i> struct afloat_array {

                typedef float type[i] __attribute__((aligned (16)));

        };


        template <size_t len>

        struct Vector2 {

                typename afloat_array<len>::type a, b, c, d;

        };

        const size_t size = 1000000;

        const size_t more = 1000;

        afloat_array<size>::type dots;

        Vector2<size> b, c;

}


int main () {


        // spread randomness to prevent folding actions

        int some_seed;

        std::cin >> some_seed;

        srand (some_seed);

        for (size_t i=0; i<size; ++i) {

                dots [i] = rand () / (RAND_MAX+1.0f);

                b.a [i]= rand () / (RAND_MAX+1.0f);

                b.b [i]= rand () / (RAND_MAX+1.0f);

                b.c [i]= rand () / (RAND_MAX+1.0f);

                b.d [i]= rand () / (RAND_MAX+1.0f);

                c.a [i]= rand () / (RAND_MAX+1.0f);

                c.b [i]= rand () / (RAND_MAX+1.0f);

                c.c [i]= rand () / (RAND_MAX+1.0f);

                c.d [i]= rand () / (RAND_MAX+1.0f);

        }


        const clock_t begin = clock();

        // do some number crunching

        for (size_t m=0; m<more; ++m) {

        for (size_t i=0; i<size; ++i) {

                dots [i] = b.a[i]*c.a[i]

                         + b.b[i]*c.b[i]

                         + b.c[i]*c.c[i]

                         + b.d[i]*c.d[i];

        }}


        const clock_t end = clock();


        // output it all to prevent folding

        for (size_t i=0; i<size; ++i) {

                std::cout << dots [i];

        }


        std::cout << "\ntime for " << (long)more*size << " dp4's: "

                  << (end-begin)/(float)(CLOCKS_PER_SEC) << '\n';

}

For size=1000, more=1000000:
time for 1000000000 dp4's: 6.297


Running your explicit sse code, ported to g++, yields:
CPU Test: 7000 ms

Sum of results: 750267.569484

SSE Test: 6578 ms

Sum of results: 750267.569504



To ensure that the outer loop is not a nop, I looked at the assemblies.

Yours:
L47:

	xorl	%eax, %eax

	movl	%edx, 84(%esp)

	.p2align 4,,10

L48:

	movaps	(%esi,%eax,4), %xmm1

	movl	100(%esp), %edx

	movaps	(%edi,%eax,4), %xmm2

	mulps	(%edx,%eax,4), %xmm1

	mulps	(%ecx,%eax,4), %xmm2

	movl	104(%esp), %edx

	movaps	(%edx,%eax,4), %xmm0

	movl	84(%esp), %edx

	mulps	(%edx,%eax,4), %xmm0

	addps	%xmm2, %xmm0

	addps	%xmm1, %xmm0

	movaps	%xmm0, (%ebx,%eax,4)

	addl	$4, %eax

	cmpl	$1000000, %eax

	jne	L48

	incl	96(%esp)

	movl	84(%esp), %edx

	cmpl	$1000, 96(%esp)

	jne	L47


Mine:
L7:

	xorl	%eax, %eax

	.p2align 4,,10

L8:

	movaps	__ZN12_GLOBAL__N_11cE+4000000(%eax), %xmm0

	movaps	__ZN12_GLOBAL__N_11cE(%eax), %xmm1

	mulps	__ZN12_GLOBAL__N_11bE+4000000(%eax), %xmm0

	mulps	__ZN12_GLOBAL__N_11bE(%eax), %xmm1

	addps	%xmm1, %xmm0

	movaps	__ZN12_GLOBAL__N_11cE+8000000(%eax), %xmm1

	mulps	__ZN12_GLOBAL__N_11bE+8000000(%eax), %xmm1

	addps	%xmm1, %xmm0

	movaps	%xmm0, __ZN12_GLOBAL__N_14dotsE(%eax)

	addl	$16, %eax

	cmpl	$4000000, %eax

	jne	L8

	incl	%edx

	cmpl	$1000, %edx

	jne	L7


g++ 4.4 wins this match, and all we had to do was to align the data, all else is john-doe-readable & standards-conforming c++. And when I look at the latest result, I think it is hard to code or generate relevantly better codepaths :yes:

~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--


#14 chhenning

    New Member

  • Members
  • Pip
  • 9 posts

Posted 03 November 2009 - 06:19 PM

Hi there, I have add my sse calculation to your code ( http://codepad.org/nBhb3C23 ). It it's slightly faster than your code. It could be made a lot faster without the cache misses. Restructuring the for loops would help a lot. I'll let you know how it goes.

Christian

#15 phresnel

    Member

  • Members
  • PipPip
  • 47 posts

Posted 05 November 2009 - 11:12 AM

chhenning said:

Hi there, I have add my sse calculation to your code ( http://codepad.org/nBhb3C23 ). It it's slightly faster than your code. It could be made a lot faster without the cache misses. Restructuring the for loops would help a lot. I'll let you know how it goes.

Christian

I don't see where there shall be cache misses on a modern cpu. Then there is a bug:

rhs_a = reinterpret_cast< const sse_t* >( &c.a[i] );
rhs_b = reinterpret_cast< const sse_t* >( &c.b[i] );
rhs_c = reinterpret_cast< const sse_t* >( &c.c[i] );
rhs_d = reinterpret_cast< const sse_t* >( &c.c[i] );// << you save a load and rhs_b and rhs_d alias each other 

Further: Have you looked at the assemby? What does MSVC produce for those simple dot-products?

Then: On g++, your code crashes. I try to fix it.

~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--


#16 phresnel

    Member

  • Members
  • PipPip
  • 47 posts

Posted 05 November 2009 - 02:07 PM

There seems to be an issue w.r.t. alignment in g++-4.4 ((edit: http://gcc.gnu.org/b...ug.cgi?id=41950)), but I got it compile again with small tweaks:

#include <cstdlib>
#include <iostream>
#include <ctime>

#include <xmmintrin.h>
#include <emmintrin.h>

namespace {
        template <size_t i> struct afloat_array {
                typedef float type[i] __attribute__((aligned (16)));
        };

        template <size_t len>
        struct Vector2 {
                typename afloat_array<len>::type a, b, c, d;
        };


        template <size_t len>
        struct SSEVector {
                __m128 a[len], b[len], c[len], d[len];
        };

        const size_t size = 1000000;
        const size_t more = 1000;
}

// normally I'd put those in the anonymous namespace,
// but execution breaks then (seems to be an issue in g++-4.4)
afloat_array<size>::type dots;
Vector2<size> b, c;

__m128 sse_dots[size/4];
SSEVector<size/4> sse_lhs, sse_rhs;
// report bug when above data put in anonymous namespace

int main () {
        // spread randomness to prevent folding actions
        int some_seed;
        std::cin >> some_seed;
        srand (some_seed);

        for (size_t i=0; i<size; ++i) {
                dots [i] = rand () / (RAND_MAX+1.0f);
                b.a [i]= rand () / (RAND_MAX+1.0f);
                b.b [i]= rand () / (RAND_MAX+1.0f);
                b.c [i]= rand () / (RAND_MAX+1.0f);
                b.d [i]= rand () / (RAND_MAX+1.0f);
                c.a [i]= rand () / (RAND_MAX+1.0f);
                c.b [i]= rand () / (RAND_MAX+1.0f);
                c.c [i]= rand () / (RAND_MAX+1.0f);
                c.d [i]= rand () / (RAND_MAX+1.0f);

                // dirty init
                if (i%4 == 0) {
                        sse_dots[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_lhs.a[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_lhs.b[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_lhs.c[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_lhs.d[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_rhs.a[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_rhs.b[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_rhs.c[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                        sse_rhs.d[i/4] = _mm_set_ps1(rand () / (RAND_MAX+1.0f));
                }
        }


        std::cout << "sse test" << std::endl;
        const clock_t sse_begin = clock();
        // do some SSE number crunching
        for( size_t m=0; m<more; ++m ) {
        for( size_t i=0; i<size/4; ++i) {
                sse_dots[i] = _mm_add_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( sse_lhs.a[i], sse_rhs.a[i] )
                                                                , _mm_mul_ps( sse_lhs.b[i], sse_rhs.b[i] )
                                   )
                                   , _mm_mul_ps( sse_lhs.c[i], sse_rhs.c[i] )
                                   )
                       , _mm_mul_ps( sse_lhs.d[i], sse_rhs.d[i] )
                );
        } }
        const clock_t sse_end = clock();


        // i tried the following to see how your explicit sse performs
        // after a bit of scheduling. but gcc seems already good enough
        // at scheduling.
        /*std::cout << "sse test 2" << std::endl;
        const clock_t sse_begin2 = clock();
        // do some SSE number crunching
        for( size_t m=0; m<more; ++m ) {
        for( size_t i=0; i<size/4; ++i) {
                const __m128
                        a = _mm_mul_ps( sse_lhs.a[i], sse_rhs.a[i]),
                        b = _mm_mul_ps( sse_lhs.b[i], sse_rhs.b[i]),
                        c = _mm_mul_ps( sse_lhs.c[i], sse_rhs.c[i]),
                        d = _mm_mul_ps( sse_lhs.d[i], sse_rhs.d[i]),
                        a_ = _mm_add_ps (a,b),
                        b_ = _mm_add_ps (c,d)
                ;
                sse_dots[i] = _mm_add_ps (a_,b_);
        } }
        const clock_t sse_end2 = clock();
        */



        std::cout << "scalar test" << std::endl;
        // do some number crunching
        const clock_t begin = clock();
        for (size_t m=0; m<more; ++m) {
        for (size_t i=0; i<size; ++i) {
                dots [i] = b.a[i]*c.a[i]
                         + b.b[i]*c.b[i]
                         + b.c[i]*c.c[i]
                         + b.d[i]*c.d[i];
        }}
        const clock_t end = clock();


        // prevent folding
        float accu = 0.0f;
        for (size_t i=0; i<size; ++i)
                accu += dots[i];
        for (size_t i=0; i<size/4; ++i) {
                union {__m128 m; float f[4];} const x = {sse_dots[i]}; //evil
                accu += x.f[0]+x.f[1]+x.f[2]+x.f[3];
        }
        std::cout << accu << std::endl;

        std::cout << "\ntime for " << (long)more*size << " dp4's: "
                  << (end-begin)/(float)(CLOCKS_PER_SEC) << '\n';
        std::cout << "\ntime for " << (long)more*size << " sse-dp4's: "
                  << (sse_end-sse_begin)/(float)(CLOCKS_PER_SEC) << '\n';

        // sse:    7.892  7.953
        // scalar: 7.922  7.937

}

In one test, your code (+ tweaks by me) is faster (sse: 7.892, scalar: 7.922), in another, mine is faster (scalar: 7.922, sse: 7.953). The reason becomes clear when we look at the assemblies:

Here is what g++-4.4 gives me for your (tweaked by me) code:
L11:
	xorl	%eax, %eax
	.p2align 4,,10
L12:
	movaps	_sse_lhs+12000000(%eax), %xmm1
	movaps	_sse_lhs+8000000(%eax), %xmm2
	mulps	_sse_rhs+12000000(%eax), %xmm1
	mulps	_sse_rhs+8000000(%eax), %xmm2
	movaps	_sse_lhs+4000000(%eax), %xmm3
	movaps	_sse_lhs(%eax), %xmm0
	mulps	_sse_rhs+4000000(%eax), %xmm3
	mulps	_sse_rhs(%eax), %xmm0
	addps	%xmm3, %xmm0
	addps	%xmm2, %xmm0
	addps	%xmm1, %xmm0
	movaps	%xmm0, _sse_dots(%eax)
	addl	$16, %eax
	cmpl	$4000000, %eax
	jne	L12
	incl	%edx
	cmpl	$1000, %edx
	jne	L11


For my john-doe-c++ I reap:
L14:
	xorl	%eax, %eax
	.p2align 4,,10
L15:
	movaps	_c+4000000(%eax), %xmm0
	movaps	_c(%eax), %xmm1
	mulps	_b+4000000(%eax), %xmm0
	mulps	_b(%eax), %xmm1
	addps	%xmm1, %xmm0
	movaps	_c+8000000(%eax), %xmm1
	mulps	_b+8000000(%eax), %xmm1
	addps	%xmm1, %xmm0
	movaps	_c+12000000(%eax), %xmm1
	mulps	_b+12000000(%eax), %xmm1
	addps	%xmm1, %xmm0
	movaps	%xmm0, _dots(%eax)
	addl	$16, %eax
	cmpl	$4000000, %eax
	jne	L15
	incl	%edx
	cmpl	$1000, %edx
	jne	L14


Both are nearly identical, the only real difference is that your code runs in xmm0-xmm3, whereas mine runs in only xmm0 and xmm1.

I don't know what MSVC produces, but it is clear that optimising such microbenchmark code with explicit sse is a pure waste of time with g++, because the latter does it all for you.

~/www/ | picogen.org | metatrace
- --- / .-.. . - / -- -.-- / ... .. --. -. .- - ..- .-. . / .- .--. .--. . .- .-. / --. . . -.- -.--


#17 martinsm

    Member

  • Members
  • PipPip
  • 88 posts

Posted 05 November 2009 - 02:41 PM

MSVC produces bad code for such loop. It doesn't use packed instructions, it just generates bunch of movss, mulss, addss instructions.

#18 .oisyn

    DevMaster Staff

  • Moderators
  • 1810 posts

Posted 06 November 2009 - 12:04 AM

Nick said:

It's a complex tale because the CPU will also automatically prefetch data into the cache when it detects a linear access pattern.
Which CPU does that? I've ran a quick test on my Core 2 Duo using both forward, backward and randomized (but full cache line per access) array access, and it reveals that all three methods perform exactly the same.
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#19 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1225 posts

Posted 06 November 2009 - 08:27 AM

.oisyn said:

Which CPU does that? I've ran a quick test on my Core 2 Duo using both forward, backward and randomized (but full cache line per access) array access, and it reveals that all three methods perform exactly the same.
It has been supported since at least the Pentium III. A Core 2 can detect something like eight data streams, both forward and backward. On top of that it can also do speculative loads...

So it's not that simple to get a cache miss out of a modern CPU. :happy: I suspect you might be running into bandwidth limitations, or your random test isn't fully random, or a significant part of the array still fits in L2 cache.

#20 .oisyn

    DevMaster Staff

  • Moderators
  • 1810 posts

Posted 06 November 2009 - 12:08 PM

Well I'm purposely wasting cycles to not be bandwith limited, and my random access pattern basically involves unrolling the forward loop using a 1:16 divider and then taking a random permutation of statements. Also, the array I'm working on is 64MB. Does the prefetching involve larger prefetches than a single cache line?

Here's my code:
#include <iostream>
#include <utility>
#include <set>
#include <intrin.h>
#include <xmmintrin.h>
#include <emmintrin.h>

static const int cSize = 4*1024*1024;
__m128i arr[cSize];

__m128i calc(__m128i v, __m128i a)
{	// merely wasting some cycles here...
	for (int j = 0; j < 256; j++)
		v = _mm_add_epi32(v, a);
	return v;
}

long long __declspec(naked) gettime()
{
	__asm rdtsc;
	__asm ret;
}

int main()
{
	long long t;
	long long t1 = 0, t2 = 0, t3 = 0;
	__m128i add = _mm_set1_epi32(10);

	// make sure the OS commits the pages to mem
	for (int i = cSize-1; i >= 0; i--)
		arr[i] = _mm_set1_epi32(0);

	for (int run = 0; run < 10; run++)
	{
		// forward
		t = -gettime();
		for (int i = 0; i < cSize; i++)
			arr[i] = calc(arr[i], add);
		t += gettime();
		t1 += t;
		std::cout << "forward time: " << t << std::endl;

		// reverse
		t = -gettime();
		for (int i = cSize-1; i >= 0; i--)
			arr[i] = calc(arr[i], add);
		t += gettime();
		t2 += t;
		std::cout << "reverse time: " << t << std::endl;

		// random
		t = -gettime();
		for (int i = 0; i < cSize; i += 16)
		{
			arr[i+ 9] = calc(arr[i+ 9], add);
			arr[i+ 5] = calc(arr[i+ 5], add);
			arr[i+14] = calc(arr[i+14], add);
			arr[i+ 2] = calc(arr[i+ 2], add);
			arr[i+13] = calc(arr[i+13], add);
			arr[i+ 0] = calc(arr[i+ 0], add);
			arr[i+15] = calc(arr[i+15], add);
			arr[i+ 6] = calc(arr[i+ 6], add);
			arr[i+ 3] = calc(arr[i+ 3], add);
			arr[i+11] = calc(arr[i+11], add);
			arr[i+ 7] = calc(arr[i+ 7], add);
			arr[i+ 1] = calc(arr[i+ 1], add);
			arr[i+ 8] = calc(arr[i+ 8], add);
			arr[i+12] = calc(arr[i+12], add);
			arr[i+ 4] = calc(arr[i+ 4], add);
			arr[i+10] = calc(arr[i+10], add);
		}
		t += gettime();
		t3 += t;
		std::cout << "random time : " << t << "\n" << std::endl;
	}

	std::cout << "total forward: " << t1 << std::endl;
	std::cout << "total reverse: " << t2 << std::endl;
	std::cout << "total random : " << t3 << std::endl;
}

C++ addict
-
Currently working on: the 3D engine for Tomb Raider.





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users