performance problem with my renderer

Xcrypt 101 May 13, 2012 at 13:18

I’m wondering if anyone can help me with a performance problem I’m having.

I have a renderer class and a mesh class; the mesh class contains all the information necessary to draw a mesh.
You can add and remove meshes to and from the renderer. As meshes are added, they are sorted by material so I don’t need to recommit material settings for each different mesh.

However, against all expectations, when I compare the performance of my renderer to a renderer that does no optimisations at all, mine seems to run about five times slower. I’m pretty puzzled by this, and I hope someone is able to point me in the right direction.

I have done some profiling with Intel VTune, and this seems to be my bottleneck:

http://oi45.tinypic.com/2q8999k.jpg
These are the machine instructions for that single line of source code (ID3D10EffectPass::Apply()): http://pastebin.com/p44mr06K

And this is my rendering code:

typedef std::multiset<const Mesh3D*, MeshCompare3D> BUCKET3D; //meshes sharing one material
typedef std::map<Material*, BUCKET3D > MESHLIST3D;            //buckets keyed by material
typedef std::pair<Material*, BUCKET3D > BUCKETPAIR3D;
MESHLIST3D m_Meshes3D;

void GFX::Renderer::Draw()
{
   //reset the RendererOptimalisationInfo (needed so external functions can do their job)
   m_OpInfo.passes = -1;
   m_OpInfo.pGeometry = nullptr;
   m_OpInfo.pLayout = nullptr;
   m_OpInfo.pMaterial = nullptr;
   m_OpInfo.topology = -1;

   GFX::PerFrameInfo perFrameInfo;
   perFrameInfo.pMatP = m_pCam3D->GetMatP();
   perFrameInfo.pMatV = m_pCam3D->GetMatV();
   perFrameInfo.pScene = &m_Scene;
   perFrameInfo.pExtraInfo = m_pPerFrameExtraInfo;

   GFX::PerObjectInfo perObjectInfo;

   MESHLIST3D::const_iterator it;
   BUCKET3D::const_iterator bucketIter;

//DRAW
   for (it = m_Meshes3D.begin(); it != m_Meshes3D.end(); ++it)
   {
   //MESHLIST-level checks (material-specific)
      it->first->m_pEffect->Commit_Material(it->first, perFrameInfo);

      D3D10_TECHNIQUE_DESC tDesc;
      GFX::Technique* pTech = it->first->m_pEffect->GetTechnique(it->first->m_sTechnique);
      pTech->GetD3DTechnique()->GetDesc(&tDesc);
      m_OpInfo.passes = tDesc.Passes;

      if( pTech->GetInputLayout() != m_OpInfo.pLayout )
      {
         m_OpInfo.pLayout = pTech->GetInputLayout();
         m_DxCore.pDevice->IASetInputLayout(m_OpInfo.pLayout);
      }

      for(bucketIter = (it->second.begin()); bucketIter != (it->second.end()); ++bucketIter)
      {
      //BUCKET-level checks (mesh-specific)
         if(!((*bucketIter)->m_Desc.bActive))
         {
            continue;
         }

         if( (*bucketIter)->m_Desc.pGeometry != m_OpInfo.pGeometry )
         {
            m_OpInfo.pGeometry = (*bucketIter)->m_Desc.pGeometry;
            UINT offset = 0;
            UINT stride = m_OpInfo.pGeometry->GetVertexSize();
            m_DxCore.pDevice->
                   IASetVertexBuffers(0,1,m_OpInfo.pGeometry->GetppVBuffer(), &stride, &offset);

            if(m_OpInfo.pGeometry->GetpIBuffer())
            {
               m_DxCore.pDevice->
                     IASetIndexBuffer(m_OpInfo.pGeometry->GetpIBuffer(), DXGI_FORMAT_R32_UINT, 0);
            }
         }

         if( (*bucketIter)->m_Desc.pGeometry->GetTopology() != m_OpInfo.topology )
         {
            m_OpInfo.topology = (*bucketIter)->m_Desc.pGeometry->GetTopology();
            m_DxCore.pDevice->IASetPrimitiveTopology( (*bucketIter)->m_Desc.pGeometry->GetTopology() );
         }

         perObjectInfo.pMatW = &((*bucketIter)->m_Desc.matW);
         perObjectInfo.pExtraInfo = (*bucketIter)->m_Desc.pExtraInfo;
         it->first->m_pEffect->Commit_Object(perObjectInfo);

         //Finally, draw this mesh
         if(m_OpInfo.pGeometry->GetpIBuffer())
         {
            for (int p = 0; p < m_OpInfo.passes; ++p)
            {
               pTech->GetD3DTechnique()->GetPassByIndex(p)->Apply(0);
               m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos(),0);
            }
         } else {
            for (int p = 0; p < m_OpInfo.passes; ++p)
            {
               pTech->GetD3DTechnique()->GetPassByIndex(p)->Apply(0);
               m_DxCore.pDevice->Draw((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos());
            }
         }//end if

      }//end for
   }//end for
}
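
The MeshCompare3D comparator isn’t shown above; as a rough illustration, a comparator of this kind could order meshes within a bucket by geometry so that meshes sharing buffers end up adjacent. This is a simplified, hypothetical sketch rather than my actual comparator:

struct MeshCompare3D
{
   //Hypothetical ordering: group meshes by geometry pointer so that the
   //buffer/topology checks against m_OpInfo in Draw() change as rarely as
   //possible; tie-break on draw start position for a stable order.
   bool operator()(const Mesh3D* a, const Mesh3D* b) const
   {
      if (a->m_Desc.pGeometry != b->m_Desc.pGeometry)
         return a->m_Desc.pGeometry < b->m_Desc.pGeometry;
      return a->GetDrawStartPos() < b->GetDrawStartPos();
   }
};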

22 Replies


Stainless 151 May 14, 2012 at 08:04

I can’t see anything wrong, but I haven’t done any D3D coding for a while.

Things to think about though.

Are you using dynamic vertex buffers? The Apply method may be the point where D3D has to send the buffer over the AGP bus. Try using a static vertex buffer, or if you can’t manage that, use a CPU-managed buffer.
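
Creating a static buffer in D3D10 looks something like this (just a sketch; Vertex, vertices, vertexCount and pDevice are placeholder names, not from your code):

//Sketch: an immutable (static) vertex buffer in D3D10. The data is uploaded
//once at creation and the CPU never touches it again.
D3D10_BUFFER_DESC desc = {};
desc.ByteWidth      = sizeof(Vertex) * vertexCount; //Vertex is your vertex struct
desc.Usage          = D3D10_USAGE_IMMUTABLE;
desc.BindFlags      = D3D10_BIND_VERTEX_BUFFER;
desc.CPUAccessFlags = 0;

D3D10_SUBRESOURCE_DATA init = {};
init.pSysMem = vertices;                            //initial contents

ID3D10Buffer* pVB = nullptr;
HRESULT hr = pDevice->CreateBuffer(&desc, &init, &pVB);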

Do you have geometry acceleration on your graphics card?

Are you using D3DPOOL_MANAGED instead of D3DPOOL_DEFAULT? That will cause two copies of the buffers to exist in memory, and a very slow CPU-based copy to occur. I’m not sure whether it happens in the Apply method, but that’s a logical place for it.

Are you ending up with a load of small buffers? If so, try packing them into fewer, larger buffers and using a byte offset into them; you may be triggering a buffer swap, which can be slow.
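
Something like this, as a sketch (placeholder names): bind the shared buffers once, then select each sub-mesh with offsets at draw time instead of rebinding:

//Sketch: several small meshes packed into one shared vertex/index buffer pair.
UINT stride = sizeof(Vertex), offset = 0;
pDevice->IASetVertexBuffers(0, 1, &pSharedVB, &stride, &offset);
pDevice->IASetIndexBuffer(pSharedIB, DXGI_FORMAT_R32_UINT, 0);

//DrawIndexed(indexCount, firstIndex, baseVertex)
pDevice->DrawIndexed(meshA.indexCount, meshA.firstIndex, meshA.baseVertex);
pDevice->DrawIndexed(meshB.indexCount, meshB.firstIndex, meshB.baseVertex);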

Are you updating high volumes of shader constants? This can add a considerable amount of overhead to the drivers.

As I said, I am rusty at DX coding, but these are the first things I would look at.

Xcrypt 101 May 14, 2012 at 11:27

I’m using some dynamic vertex buffers, but only where needed.

There’s (likely) nothing about this that is specific to my computer, as my partner has the same problem on his machine with my engine.

I’m not sure what D3DPOOL is for; after some googling it seems to be a Direct3D 9 thing, and I’m working with Direct3D 10.

Yes, sometimes I have a lot of draw calls. This is definitely an area that needs improving, but it is not the cause of my engine running five times slower than the unoptimised engine, since that engine does a lot of draw calls with small buffers too.
For example, the unoptimised engine can draw about 1 million rects at 60+ fps, while my engine can only draw about 2000 rects at 60+ fps. The gap in 2D seems to be even way worse. However, for 2D I sort with my own algorithm rather than relying on the depth buffer; this is to allow partial transparency in 2D.
In 3D, my renderer is about 5 times slower than the unoptimised engine, but in 2D it’s about 500 times slower.

I’m not updating a lot of shader constants. However, there is one thing: when compiling effects, I use this option:

DWORD dwFXFlags = 0;
dwFXFlags |= D3D10_EFFECT_COMPILE_ALLOW_SLOW_OPS; //needed for setting the sampler filter

ID3D10Effect* pE;
D3DX10CreateEffectFromFileA(file.c_str(),
                            NULL,
                            NULL,
                            "fx_4_0",
                            dwShaderFlags,
                            dwFXFlags,
                            LOADER->GetCoreInfo()->pDxCore->pDevice,
                            NULL,
                            NULL,
                            &pE,
                            &ErrBlob,
                            &hr);

As the comment says, this is needed for setting the sampler filter. In my HLSL effects:

cbuffer g_BufferInitInfo
{
    uint g_SamplerFilter;
};

SamplerState g_Sampler
{
    Filter = g_SamplerFilter;
    AddressU = WRAP;
    AddressV = WRAP;
    MaxAnisotropy = 8;
};

If I don’t do this, then I can’t set the sampler filter at runtime (for example, in the menu the user could select his texture filtering style). However, this doesn’t seem to have a big impact on performance; I have tried without it as well, and I can’t see any noticeable difference.

EDIT: another thing comes to mind.

My mesh class has the methods Activate() and Deactivate(). These determine whether a mesh that has already been added to the renderer should temporarily not be drawn, or the opposite. I remember my partner saying that he tried to compute which rects are visible and which aren’t, and activated/deactivated meshes accordingly, which seemed to give quite a performance increase. This only helped when zooming in, though.

This seemed weird, because I thought DirectX would determine which geometry is (not) visible through view frustum clipping and the rasterization stage. Am I not correct?

__________Smile_ 101 May 14, 2012 at 15:27

Maybe you have a problem not in the drawing code but in the optimizations (i.e. a bad sorting algorithm)?

Xcrypt 101 May 14, 2012 at 16:58

There’s nothing wrong with the sorting algorithm. Actually, it’s not really a sorting algorithm, just a binary tree data structure with O(log n) insertion. On top of that, even if it were slow it wouldn’t really matter, because it only sorts when meshes are added to the renderer, not while drawing.
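
In other words, adding a mesh boils down to something like this (a simplified sketch of the insert path, not my exact code):

//Sketch: the only "sorting" happens on insertion. std::map and std::multiset
//are balanced trees, so the bucket lookup and the insert are both O(log n).
void GFX::Renderer::AddMesh(const Mesh3D* pMesh, Material* pMat)
{
   m_Meshes3D[pMat].insert(pMesh); //bucket is ordered by MeshCompare3D
}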

Reedbeta 167 May 14, 2012 at 17:22

@Xcrypt

I remember my partner saying that he tried to compute which rects are visible and which aren’t, and activated/deactivated meshes accordingly, which seemed to give quite a performance increase… I thought that DirectX would determine which geometry is (not) visible through view frustum clipping and the rasterization stage. Am I not correct?

The GPU will indeed cull triangles that are outside the frustum, but only after running the vertex shader on them, so you will still pay the cost of vertex shading, plus anything further up in the pipeline, e.g. the shader/texture state changes & draw calls needed for those rectangles. If you can manage to get rid of the rectangles more cheaply, that can provide a performance boost - especially if you can get rid of a lot of rectangles at once. That’s what spatial partitioning data structures are all about, like BSP trees and octrees.
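
A minimal sketch of the kind of CPU-side test your partner was probably doing, assuming each mesh can report a world-space bounding rectangle (the Rect type and GetBounds() are made up for the example; they aren’t part of your engine as shown):

//Sketch: cull rects on the CPU before submission, using the renderer's
//existing Activate()/Deactivate() mechanism. Rect and GetBounds() are
//assumed helpers.
void CullRects(const Rect& cameraRect, std::vector<Mesh3D*>& meshes)
{
   for (size_t i = 0; i < meshes.size(); ++i)
   {
      const Rect& b = meshes[i]->GetBounds();
      bool visible = b.left < cameraRect.right  && b.right  > cameraRect.left &&
                     b.top  < cameraRect.bottom && b.bottom > cameraRect.top;
      if (visible) meshes[i]->Activate();
      else         meshes[i]->Deactivate();
   }
}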

Xcrypt 101 May 14, 2012 at 18:15

@Reedbeta

The GPU will indeed cull triangles that are outside the frustum, but only after running the vertex shader on them, so you will still pay the cost of vertex shading, plus anything further up in the pipeline, e.g. the shader/texture state changes & draw calls needed for those rectangles. If you can manage to get rid of the rectangles more cheaply, that can provide a performance boost - especially if you can get rid of a lot of rectangles at once. That’s what spatial partitioning data structures are all about, like BSP trees and octrees.

Good to know; I might do that, but again, that’s probably not the cause of my problem. Do you have any ideas? I’ve been searching for a whole week for what might be wrong with my renderer, but I just can’t find the problem… I don’t think my algorithms are causing the overhead; it must be some way I’m abusing DirectX.

Reedbeta 167 May 14, 2012 at 18:34

Sorry, I don’t know why that Apply() call would be taking so much time. It’s a long shot, but one thing to check would be to count how many Apply() calls you’re doing per frame, on the off chance that something elsewhere is busted, causing you to call it a bazillion times, or something crazy like that.

You also might consider the optimization that when a technique has just one pass, you don’t need to re-Apply() for every mesh; you can just Apply() once for the shader and then draw all the meshes. That’s more of a workaround than a solution, but I’d think it would be an optimization you’d want to make anyway.
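
Roughly like this, reusing the names from your Draw() listing (just a sketch):

//Sketch: for single-pass techniques, Apply() once per material instead of
//once per mesh, then draw every mesh in the bucket.
//NB: if per-object effect variables then stop taking effect, a re-Apply()
//after Commit_Object may still be needed with the effects framework.
if (tDesc.Passes == 1)
{
   pTech->GetD3DTechnique()->GetPassByIndex(0)->Apply(0); //once per material

   for (bucketIter = it->second.begin(); bucketIter != it->second.end(); ++bucketIter)
   {
      it->first->m_pEffect->Commit_Object(perObjectInfo);  //per-mesh constants
      m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(),
                                    (*bucketIter)->GetDrawStartPos(), 0);
   }
}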

Kenneth_Gorking 101 May 15, 2012 at 14:24

Does the DX runtime tell you anything when your program is running? It usually has some hints as to what could be slowing down your code

Stainless 151 May 15, 2012 at 15:35

try running it through pixwin and look at what it’s actually doing

Xcrypt 101 May 15, 2012 at 17:43

@Kenneth Gorking

Does the DX runtime tell you anything when your program is running? It usually has some hints as to what could be slowing down your code

No, nothing at all
@Stainless

try running it through pixwin and look at what it’s actually doing

What’s pixwin? I can’t find anything on Google; it only shows me that it causes errors lol

Kenneth_Gorking 101 May 15, 2012 at 18:44

@Xcrypt

No, nothing at all

Are you using the debug runtime? If not, I suggest turning it on.
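
If you’re creating the device yourself, it’s just a flag at creation time (a sketch; variable names are placeholders):

//Sketch: create the device with the debug layer enabled so the runtime
//prints warnings and errors to the debugger output window.
UINT flags = 0;
#ifdef _DEBUG
flags |= D3D10_CREATE_DEVICE_DEBUG;
#endif
ID3D10Device* pDevice = nullptr;
HRESULT hr = D3D10CreateDevice(nullptr, D3D10_DRIVER_TYPE_HARDWARE,
                               nullptr, flags, D3D10_SDK_VERSION, &pDevice);
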
@Xcrypt

What’s pixwin? Can’t find anything on google, only showing me that it causes errors lol

A short GDC talk on PIX for Windows: http://www.microsoft.com/en-us/download/confirmation.aspx?id=15096

Xcrypt 101 May 15, 2012 at 19:08

@Kenneth Gorking

Are you using the debug runtime? If not, I suggest turning it on.

A short GDC talk on PIX for Windows: http://www.microsoft…n.aspx?id=15096

I’ve run it in debug mode, yes. Looking into PIX atm.

Xcrypt 101 May 15, 2012 at 19:44

All right, so I took a look at PIX, and I’m seeing some very strange stuff indeed…
Take a look at this:

http://oi47.tinypic.com/auxlzl.jpg

Apparently ClearRTV() is a hell of an expensive function, or I’m doing something wrong :P

Well, after some checking, I’m still not sure what’s going on. PIX is certainly giving me the right information, but maybe the long durations are caused by interrupts, i.e. the OS suspending my app and giving CPU time to another program?

Anyway, I’m still uncertain what’s going on. The Apply is certainly where my problem is.
Can anyone tell me exactly what it does?

From msdn:
Set the state contained in a pass to the device.

So it has nothing to do with setting the effect variables? I thought Apply was for both passes and FX variables.

If it’s purely for setting pass state, then maybe the problem is not in my renderer but in my effects.

Here’s some sort of a template for my effects:

//STATES
RasterizerState RState { CullMode = BACK; };

BlendState AlphaBlend
{
    BlendEnable[0] = TRUE;
    SrcBlend = SRC_ALPHA;
    DestBlend = INV_SRC_ALPHA;
    BlendOp = ADD;
    SrcBlendAlpha = ZERO;
    DestBlendAlpha = ZERO;
    BlendOpAlpha = ADD;
    RenderTargetWriteMask[0] = 0x0F;
};

DepthStencilState NoDepthWrites
{
    DepthEnable = false;
    DepthWriteMask = Zero;
    StencilEnable = true;
    StencilReadMask = 0xff;
    StencilWriteMask = 0xff;
    FrontFaceStencilFunc = Always;
    FrontFaceStencilPass = Incr;
    FrontFaceStencilFail = Keep;
    BackFaceStencilFunc = Always;
    BackFaceStencilPass = Incr;
    BackFaceStencilFail = Keep;
};

//GLOBALS
cbuffer g_BufferPerObject
{
    float4x4 g_MatWVP : WORLDVIEWPROJECTION;
};

Texture2D g_TexDiffuse;

cbuffer g_BufferInitInfo
{
    uint g_SamplerFilter;
};

SamplerState g_Sampler
{
    Filter = g_SamplerFilter;
    AddressU = WRAP;
    AddressV = WRAP;
    MaxAnisotropy = 8;
};

//STRUCTS
//VSInput
struct VSInput
{
    float3 pos : POSITION;
    float2 tex : TEXCOORD0;
};

//PSInput
struct PSInput
{
    float4 pos : SV_POSITION; //system value
    float2 tex : TEXCOORD0;
};

//------------------------------------------------------------------------------------------------------

//VERTEX SHADER
PSInput MainVS(VSInput input)
{
    PSInput output = (PSInput)0;
    output.pos = mul(float4(input.pos.xyz, 1.0), g_MatWVP);
    output.tex = input.tex;
    return output;
}

//PIXEL SHADER
float4 MainPS(PSInput input) : SV_TARGET
{
    return g_TexDiffuse.Sample(g_Sampler, input.tex);
}

//------------------------------------------------------------------------------------------------------

//DX10 TECHNIQUES
technique10 t0
{
    pass p0
    {
        SetRasterizerState(RState);
        SetBlendState(AlphaBlend, float4(0.0f, 0.0f, 0.0f, 0.0f), 0xffffffff);
        SetDepthStencilState(NoDepthWrites, 0);

        SetVertexShader(CompileShader(vs_4_0, MainVS()));
        SetGeometryShader(NULL);
        SetPixelShader(CompileShader(ps_4_0, MainPS()));
    }
}

This is a simple PosTex effect, but it shows how nearly all my effects are made.
Anything wrong with it?

Xcrypt 101 May 16, 2012 at 17:20

@Reedbeta

Sorry, I don’t know why that Apply() call would be taking so much time. It’s a long shot, but one thing to check would be to count how many Apply() calls you’re doing per frame, on the off chance that something elsewhere is busted, causing you to call it a bazillion times, or something crazy like that. You also might consider the optimization that when a technique has just one pass, you don’t need to re-Apply() for every mesh; you can just Apply() once for the shader and then draw all the meshes. That’s more of a workaround than a solution, but I’d think it would be an optimization you’d want to make anyway.

I tried applying per material instead of per mesh, but one problem: how do I commit my object info (per mesh) to the shader?

Reedbeta 167 May 16, 2012 at 18:46

You mean shader parameters like object-to-world matrix and suchlike? You just set the parameter through the effect API. Parameter setting can be done at any time, whether the shader is currently bound / applied or not.

Xcrypt 101 May 16, 2012 at 18:49

Are you certain about that? That’s what I’m doing, and it’s not working. I guess you need Apply() to register the changes made to the effect? If not, then I’m doing something wrong…

ID3D10EffectMatrixVariable* m_pMatWVP_fx;
ID3D10EffectMatrixVariable* m_pMatW_fx;

//cached once, at effect load time:
m_pMatW_fx = m_pD3DEffect->GetVariableByName("g_MatW")->AsMatrix();
ASSERT(m_pMatW_fx->IsValid());

m_pMatWVP_fx = m_pD3DEffect->GetVariableByName("g_MatWVP")->AsMatrix();
ASSERT(m_pMatWVP_fx->IsValid());

//called once per mesh:
void GFX::Effect_PosTexNorm::Commit_Object(const PerObjectInfo& info)
{
   m_MatW = *(info.pMatW);
   m_MatWVP = m_MatW * m_MatVP;

   m_pMatW_fx->SetMatrix(m_MatW);
   m_pMatWVP_fx->SetMatrix(m_MatWVP);
}

//in the renderer's Draw loop:
m_pEffect->Commit_Material(it->first, perFrameInfo);

D3D10_TECHNIQUE_DESC tDesc;
GFX::Technique* pTech = it->first->m_pEffect->GetTechnique(it->first->m_sTechnique);
pTech->GetD3DTechnique()->GetDesc(&tDesc);
m_OpInfo.passes = tDesc.Passes;

//single-pass technique: apply once per material, up front
if (m_OpInfo.passes < 2)
{
   double applyTime = GLOBAL_nsTIMER->TimePassed();
   pTech->GetD3DTechnique()->GetPassByIndex(0)->Apply(0);
   applyTime = GLOBAL_nsTIMER->TimePassed() - applyTime;
   m_ApplyTime += applyTime;
   ++m_ApplyCalls;
}

//OTHER CODE [COLLAPSED FOR VISIBILITY]

perObjectInfo.pMatW = &((*bucketIter)->m_Desc.matW);
perObjectInfo.pExtraInfo = (*bucketIter)->m_Desc.pExtraInfo;
it->first->m_pEffect->Commit_Object(perObjectInfo); //<=== commit object

//Finally, draw this mesh already!
if (m_OpInfo.pGeometry->GetpIBuffer())
{
   if (m_OpInfo.passes < 2)
   {
      //single-pass: no re-Apply() before the draw
      m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(), (*bucketIter)->GetDrawStartPos(), 0);
   }
   else
   {
      for (int p = 0; p < m_OpInfo.passes; ++p)
      {
         ID3D10EffectTechnique* pFxTech = pTech->GetD3DTechnique();
         ID3D10EffectPass* pPass = pFxTech->GetPassByIndex(p);
         double applyTime = GLOBAL_nsTIMER->TimePassed();
         pPass->Apply(0);
         applyTime = GLOBAL_nsTIMER->TimePassed() - applyTime;
         m_ApplyTime += applyTime;
         ++m_ApplyCalls;
         m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(), (*bucketIter)->GetDrawStartPos(), 0);
      }
   }
}

Reedbeta 167 May 16, 2012 at 19:03

Hmm, that’s odd. I would think that setting the parameter and doing a draw call would be enough. I don’t have my D3D engine handy to test that right now, though.

Xcrypt 101 May 16, 2012 at 19:20

@Reedbeta

Hmm, that’s odd. I would think that setting the parameter and doing a draw call would be enough. I don’t have my D3D engine handy to test that right now, though.

No problem; is it possible for you to check that later?
Thanks for all the help btw, all of you guys.


EDIT: The performance issue with my renderer is “solved”. In fact, there never was a performance issue.
I’ll explain…

My partner claimed that the other engine, with no optimisations, was able to draw 50k rects at 30 fps, while my engine could only draw 1-2k rects at 30 fps.

The mistake I made was that I was stupid enough to believe him and never actually tested it myself. That way I lost two weeks of precious development time. Never take anyone’s word for it; always test it yourself!
I don’t blame him at all, it was my mistake for not testing. He was simply misinformed.

It turns out, after testing, that the other engine could only draw 500-1k rects at 30 fps. My day was made!


Well, the main issue is solved. But 1-2k rects at 30 fps is slow (Intel i5 quad-core, 1 GB video memory), even without instancing or any other kind of optimisation, so I’m left with some questions.
Does that imply that a draw call is so expensive that you can only do about 1k of them per frame in order to sustain a smooth framerate?
I profiled the other engine too, and it seems that Apply() nearly always takes longer than the draw call itself!

Is this normal?

Reedbeta 167 May 16, 2012 at 20:20

1K draw calls per frame is very low. I think that was the limit around 7-10 years ago. I don’t know about PC games, but on current-gen consoles it’s not uncommon to hit 10K draw calls per frame for real game scenes. D3D9 had a good deal of per-call overhead that restricted the number of draw calls well below that, but in D3D10-11 draw calls should be much cheaper, so I’d guess 10K-20K draw calls should be achievable today.
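
And if your rects really are all alike, instancing is the standard D3D10 way to collapse thousands of those draw calls into one. A rough sketch, reusing your pTech name (the per-instance data in the input layout and shader is omitted here):

//Sketch: draw rectCount quads in a single call instead of rectCount calls.
//Assumes a second vertex stream with per-instance data (e.g. one world
//transform per rect); that setup is omitted.
pTech->GetD3DTechnique()->GetPassByIndex(0)->Apply(0);
pDevice->DrawIndexedInstanced(6,         //indices per rect (two triangles)
                              rectCount, //number of instances
                              0, 0, 0);  //first index, base vertex, first instance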

Xcrypt 101 May 16, 2012 at 21:02

Hmm, that’s weird. I don’t know what part of my code would make draw calls so slow.
Of course, I’m still learning and all, but I’m pretty performance-conscious as a general coding style. I wouldn’t say I perform premature optimisation, but I always think about the containers and algorithms I use before I actually write my code, even before the optimisation process.
Of course, I have no in-depth knowledge of hardware or non-trivial containers yet (I nearly always use vector, list, (multi)set, (multi)map, my own hash table, priority queues, indexed containers, heaps, stacks, and so on), but still, this is weird.

Vilem_Otte 117 May 18, 2012 at 00:38

@Reedbeta - From what I see, one can go even to 100K draw calls without a problem (and maybe even further; I just haven’t tried more). Note my current machine, just to give some more overview: Core i7 + HD 6870 + 8 GB RAM.

Xcrypt 101 May 18, 2012 at 10:53

You’re testing without instancing or culling or multithreading then? That’s fucking weird.