C# Managed code too fast?

Qlone 101 Jun 27, 2006 at 09:41

Hello all,

I’m working on a little side project for my big project (an RTS in managed code, just for fun, nothing serious) and have been benchmarking several early approaches to the math problem.

Managed DirectX contains classes for matrices, quaternions, vectors, etc., and uses D3DX internally to get things done. Since I want my project to run on several target platforms but still get the best possible performance, I’ve been benchmarking both plain .NET math code and Managed DirectX’s own implementations. Here I get some odd benchmark results…

The easiest way, using the methods on the MDX Matrix type:

public bool Execute()
{
    o = Matrix.Multiply(m, n);
    return true;
}

…………….!
Took: 0h, 0m, 1.656s. for 8000000 iterations.
Iters per ms: 4830,18867924528 - Ms per iter: 0,00020703125

The slightly less easy way, using MDX unsafe native methods:

public bool Execute()
{
    unsafe
    {
        // Pin all three matrices so the native routine can take raw pointers to them.
        fixed (Matrix* pm = &m)
        fixed (Matrix* pn = &n)
        fixed (Matrix* po = &o)
        {
            UnsafeNativeMethods.Matrix.Multiply(po, pm, pn);
        }
    }
    return true;
}

…………….!
Took: 0h, 0m, 0.750s. for 8000000 iterations.
Iters per ms: 10666,6666666667 - Ms per iter: 9,375E-05

And the final way, using some very simple .NET matrix code:

public bool Execute()
{
    Multiply(ref o, ref m, ref n);
    return true;
}

private void Multiply(ref DNMatrix o, ref DNMatrix m, ref DNMatrix n)
{
    o.m11 = m.m11 * n.m11 + m.m12 * n.m21 + m.m13 * n.m31 + m.m14 * n.m41;
    o.m12 = m.m11 * n.m12 + m.m12 * n.m22 + m.m13 * n.m32 + m.m14 * n.m42;
    o.m13 = m.m11 * n.m13 + m.m12 * n.m23 + m.m13 * n.m33 + m.m14 * n.m43;
    o.m14 = m.m11 * n.m14 + m.m12 * n.m24 + m.m13 * n.m34 + m.m14 * n.m44;
    o.m21 = m.m21 * n.m11 + m.m22 * n.m21 + m.m23 * n.m31 + m.m24 * n.m41;
    o.m22 = m.m21 * n.m12 + m.m22 * n.m22 + m.m23 * n.m32 + m.m24 * n.m42;
    o.m23 = m.m21 * n.m13 + m.m22 * n.m23 + m.m23 * n.m33 + m.m24 * n.m43;
    o.m24 = m.m21 * n.m14 + m.m22 * n.m24 + m.m23 * n.m34 + m.m24 * n.m44;
    o.m31 = m.m31 * n.m11 + m.m32 * n.m21 + m.m33 * n.m31 + m.m34 * n.m41;
    o.m32 = m.m31 * n.m12 + m.m32 * n.m22 + m.m33 * n.m32 + m.m34 * n.m42;
    o.m33 = m.m31 * n.m13 + m.m32 * n.m23 + m.m33 * n.m33 + m.m34 * n.m43;
    o.m34 = m.m31 * n.m14 + m.m32 * n.m24 + m.m33 * n.m34 + m.m34 * n.m44;
    o.m41 = m.m41 * n.m11 + m.m42 * n.m21 + m.m43 * n.m31 + m.m44 * n.m41;
    o.m42 = m.m41 * n.m12 + m.m42 * n.m22 + m.m43 * n.m32 + m.m44 * n.m42;
    o.m43 = m.m41 * n.m13 + m.m42 * n.m23 + m.m43 * n.m33 + m.m44 * n.m43;
    o.m44 = m.m41 * n.m14 + m.m42 * n.m24 + m.m43 * n.m34 + m.m44 * n.m44;
}

…………….!
Took: 0h, 0m, 0.750s. for 8000000 iterations.
Iters per ms: 10666,6666666667 - Ms per iter: 9,375E-05
(no, this is not a copy/paste error)

My question, seeing these results, is the following: am I doing something wrong with the MDX calls, since a simple implementation in .NET code seems to be ‘just as fast’ as the MDX implementation? I expected the .NET code to be quite a bit slower than the native implementations…

10 Replies


_oisyn 101 Jun 27, 2006 at 10:51

lousy cross-forum-poster ;)

Nick 102 Jun 27, 2006 at 11:13

The ‘unsafe’ code is just a C++ routine that looks exactly like your C# implementation. The only difference is that the former is already compiled and the latter is compiled at run-time (JIT). But they should produce exactly the same code. It doesn’t use arrays and such so there’s no overhead from safety checks either.

In other words, C# can perform on par with C++ in many cases. It does have its limitations, but if you’re aware of them you can get really good performance. It doesn’t have inline assembly or intrinsics support though, so you can’t access the blazing fast MMX/SSE instructions. But that’s not a big problem for the average project, and you can still write an external C++ function…
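
The “external C++ function” route would normally go through P/Invoke. A rough sketch of the managed side only, assuming a hypothetical native library: the names NativeMath.dll and MatrixMultiply are made up for illustration, and the C++ side is not shown.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeMath
{
    // Hypothetical export: "NativeMath.dll" and "MatrixMultiply" are invented names.
    // Each array is expected to hold 16 floats (a 4x4 matrix, row by row).
    // Nothing is loaded until the method is actually called.
    [DllImport("NativeMath.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void MatrixMultiply(
        [Out] float[] result, [In] float[] a, [In] float[] b);
}
```

Every call like this pays for a managed-to-native transition, so for something as small as a single 4x4 multiply it probably only pays off when the native function does a whole batch of work per call.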

Qlone 101 Jun 27, 2006 at 11:22

@.oisyn

lousy cross-forum-poster ;)

Hey, 2 know more than one, right :p

_oisyn 101 Jun 27, 2006 at 11:43

@Nick

The ‘unsafe’ code is just a C++ routine that looks exactly like your C# implementation.

In essence, yes, but I believe the d3dx matrix mul uses SSE code. And somehow I don’t think the jitter actually produces SSE code from that piece of C#.

roel 101 Jun 27, 2006 at 11:47

What is the resolution of your timer? And maybe you can compare the resulting (both JIT generated and C++ compiler generated) asm code to be sure that they are identical. VS2005 allows you to switch to asm view when you are debugging C# code, if I’m correct.

Qlone 101 Jun 27, 2006 at 22:36

@roel

What is the resolution of your timer? And maybe you can compare the resulting (both JIT generated and C++ compiler generated) asm code to be sure that they are identical. VS2005 allows you to switch to asm view when you are debugging C# code, if I’m correct.

The resolution of my timer is the standard Windows (16ms?) resolution. I know it’s not very accurate for this kind of measurement, but increasing the number of iterations sort of solves that problem.

If I do that, the numbers don’t really change. On some hardware (AMD) the .NET implementation is even slightly faster than the MDX calls. On Intel hardware, MDX seems to be a bit faster, but not much (about 1 to 2 percent).

Comparing the generated asm is not something I have the time for at the moment. However, if people are interested in playing with this, I can post my little benchmark app’s source…
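
On the timer question: a minimal sketch of a timing loop built on System.Diagnostics.Stopwatch (available since .NET 2.0), which uses the high-resolution performance counter where the hardware provides one. The MicroBenchmark class and BenchmarkBody delegate are made-up names for illustration and are not taken from the benchmark app mentioned above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical delegate type matching the shape of the Execute() methods above.
public delegate bool BenchmarkBody();

public static class MicroBenchmark
{
    // Runs 'body' the given number of times and returns the average
    // cost per call in nanoseconds.
    public static double NanosecondsPerIteration(BenchmarkBody body, int iterations)
    {
        // One untimed call first, so JIT compilation is not part of the measurement.
        body();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            body();
        }
        sw.Stop();

        // ElapsedTicks is counted in units of 1 / Stopwatch.Frequency seconds.
        double seconds = (double)sw.ElapsedTicks / Stopwatch.Frequency;
        return seconds * 1e9 / iterations;
    }
}
```

Something like MicroBenchmark.NanosecondsPerIteration(test.Execute, 8000000) would be the call, where test is whichever benchmark object is being measured. Going through a delegate adds a small fixed cost per call, so this is better for comparing variants than for absolute numbers. For reference, 0.750 s over 8,000,000 iterations works out to 93.75 ns per iteration.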

Edit: OK, so I checked and this is the assembly view for the .NET implementation:

o.m11 = m.m11 * n.m11 + m.m12 * n.m21 + m.m13 * n.m31 + m.m14 * n.m41;
00000000  push        edi  
00000001  push        esi  
00000002  push        ebx  
00000003  mov         ebx,ecx 
00000005  mov         esi,edx 
00000007  mov         edi,dword ptr [esp+10h] 
0000000b  cmp         dword ptr ds:[035AD030h],0 
00000012  je          00000019 
00000014  call        7943FEDE 
00000019  fld         dword ptr [esi] 
0000001b  fmul        dword ptr [edi] 
0000001d  fld         dword ptr [esi+4] 
00000020  fmul        dword ptr [edi+10h] 
00000023  faddp       st(1),st 
00000025  fld         dword ptr [esi+8] 
00000028  fmul        dword ptr [edi+20h] 
0000002b  faddp       st(1),st 
0000002d  fld         dword ptr [esi+0Ch] 
00000030  fmul        dword ptr [edi+30h] 
00000033  faddp       st(1),st 
00000035  fstp        dword ptr [ebx] 
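
Reading the listing above: the result is stored through ebx (o), and the operands are loaded through esi (m) and edi (n). The offsets 4, 8, 0Ch on m and 10h, 20h, 30h on n are what you would expect if DNMatrix is sixteen consecutive 4-byte floats laid out row by row, and fld/fmul/faddp are plain x87 instructions, not SSE. The DNMatrix definition was never posted, so the struct below is an assumption; it is only a minimal sketch for checking that such a layout gives those offsets. LayoutCheck is a made-up helper name.

```csharp
using System;
using System.Runtime.InteropServices;

// Assumed definition: sixteen floats, row by row. Field names follow the post;
// the actual DNMatrix from the benchmark is not shown in the thread.
[StructLayout(LayoutKind.Sequential)]
public struct DNMatrix
{
    public float m11, m12, m13, m14;
    public float m21, m22, m23, m24;
    public float m31, m32, m33, m34;
    public float m41, m42, m43, m44;
}

public static class LayoutCheck
{
    // Byte offset of a named field within DNMatrix (marshalled layout).
    public static int OffsetOf(string field)
    {
        return Marshal.OffsetOf(typeof(DNMatrix), field).ToInt32();
    }

    // Total size of the struct in bytes (marshalled layout).
    public static int Size()
    {
        return Marshal.SizeOf(typeof(DNMatrix));
    }
}
```

With this layout, LayoutCheck.OffsetOf("m21") is 16 (10h), LayoutCheck.OffsetOf("m41") is 48 (30h) and LayoutCheck.Size() is 64, which lines up with the addressing in the listing. Marshal reports the marshalled layout, which for a sequential struct of plain floats should match what the JIT uses.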
tbp 101 Jun 28, 2006 at 00:36

“best possible performance” and d3dx in the same sentence?
http://math-atlas.sourceforge.net would make more sense to bench against.

Nick 102 Jun 28, 2006 at 08:55

@.oisyn

In essence, yes, but I believe the d3dx matrix mul uses SSE code. And somehow I don’t think the jitter actually produces SSE code from that piece of C#.

I verified that it doesn’t use SSE. So the code being produced for the C# version is practically equivalent to D3DX for C++. :mellow:

_oisyn 101 Jun 29, 2006 at 08:54

Hmm, that just plain sucks. I reckon they’d have optimized the d3dx library. Are you sure you’re not looking at the debug version?

On the other hand, you should be able to target every reasonable x86 platform with that library, including the pre-XP Athlons that don’t have SSE support yet. So maybe it’s no surprise after all.

Then again, what serious game developer is using the d3dx math functions anyway? ;)

Qlone 101 Jun 29, 2006 at 09:20

Its performance slightly disappointed me too. It’s very tempting to try to do all the math in .NET code now, knowing (assuming) it will be as fast or nearly as fast as D3DX’s functions, which should be fast enough for what I’m trying to do.

The only real advantage d3dx has now is that it’s a pre-built and (hopefully) debugged math library, which makes for a nice quick start on that front. Since platform independence is a target for my project, eventually I’ll need a platform-independent math solution anyway…