

performance problem with my renderer


22 replies to this topic

#1 Xcrypt

    New Member

  • Members
  • 144 posts
  • Location: Belgium

Posted 13 May 2012 - 01:18 PM

I'm wondering if anyone can help me with a performance problem I'm having.

I have a renderer class and a mesh class. The mesh class contains all the information necessary to draw a mesh.
You can add meshes to the renderer and remove them again. As meshes are added, they are sorted by material so that I don't need to recommit material settings for every mesh.

However, against all expectations, when I compare the performance of my renderer to a renderer which does no optimisations at all, mine seems to run about five times slower. I'm pretty puzzled about this, and hope someone is able to point me in the right direction.

I have done some profiling with Intel VTune, and this seems to be my bottleneck:
http://oi45.tinypic.com/2q8999k.jpg
These are the machine instructions for that single line of source code (ID3D10EffectPass::Apply()): http://pastebin.com/p44mr06K

And this is my rendering code:
typedef std::multiset<const Mesh3D*, MeshCompare3D> BUCKET3D;
typedef std::map<Material*, BUCKET3D > MESHLIST3D;
typedef std::pair<Material*, BUCKET3D > BUCKETPAIR3D;
MESHLIST3D m_Meshes3D;

void GFX::Renderer::Draw()
{
   //reset the RendererOptimalisationInfo (is needed for letting external functions do their job)
   m_OpInfo.passes = -1;
   m_OpInfo.pGeometry = nullptr;
   m_OpInfo.pLayout = nullptr;
   m_OpInfo.pMaterial = nullptr;
   m_OpInfo.topology = -1;

   GFX::PerFrameInfo perFrameInfo;
   perFrameInfo.pMatP = m_pCam3D->GetMatP();
   perFrameInfo.pMatV = m_pCam3D->GetMatV();
   perFrameInfo.pScene = &m_Scene;
   perFrameInfo.pExtraInfo = m_pPerFrameExtraInfo;

   GFX::PerObjectInfo perObjectInfo;

   MESHLIST3D::const_iterator it;
   BUCKET3D::const_iterator bucketIter;

//DRAW
   for (it = m_Meshes3D.begin(); it != m_Meshes3D.end(); ++it)
   {
   //MESHLIST-level checks (material-specific)
	  it->first->m_pEffect->Commit_Material(it->first, perFrameInfo);

	  D3D10_TECHNIQUE_DESC tDesc;
	  GFX::Technique* pTech = it->first->m_pEffect->GetTechnique(it->first->m_sTechnique);
	  pTech->GetD3DTechnique()->GetDesc(&tDesc);
	  m_OpInfo.passes = tDesc.Passes;

	  if( pTech->GetInputLayout() != m_OpInfo.pLayout )
	  {
		 m_OpInfo.pLayout = pTech->GetInputLayout();
		 m_DxCore.pDevice->IASetInputLayout(m_OpInfo.pLayout);
	  }

	  for(bucketIter = (it->second.begin()); bucketIter != (it->second.end()); ++bucketIter)
	  {
	  //BUCKET-level checks (mesh-specific)
		 if(!((*bucketIter)->m_Desc.bActive))
		 {
			continue;
		 }

		 if( (*bucketIter)->m_Desc.pGeometry != m_OpInfo.pGeometry )
		 {
			m_OpInfo.pGeometry = (*bucketIter)->m_Desc.pGeometry;
			UINT offset = 0;
			UINT stride = m_OpInfo.pGeometry->GetVertexSize();
			m_DxCore.pDevice->
				   IASetVertexBuffers(0,1,m_OpInfo.pGeometry->GetppVBuffer(), &stride, &offset);

			if(m_OpInfo.pGeometry->GetpIBuffer())
			{
			   m_DxCore.pDevice->
					 IASetIndexBuffer(m_OpInfo.pGeometry->GetpIBuffer(), DXGI_FORMAT_R32_UINT, 0);
			}
		 }

		 if( (*bucketIter)->m_Desc.pGeometry->GetTopology() != m_OpInfo.topology )
		 {
			m_OpInfo.topology = (*bucketIter)->m_Desc.pGeometry->GetTopology();
			m_DxCore.pDevice->IASetPrimitiveTopology( (*bucketIter)->m_Desc.pGeometry->GetTopology() );
		 }

		 perObjectInfo.pMatW = &((*bucketIter)->m_Desc.matW);
		 perObjectInfo.pExtraInfo = (*bucketIter)->m_Desc.pExtraInfo;
		 it->first->m_pEffect->Commit_Object(perObjectInfo);

		 //Finally, draw this mesh
		 if(m_OpInfo.pGeometry->GetpIBuffer())
		 {
			for (int p = 0; p < m_OpInfo.passes; ++p)
			{
			   pTech->GetD3DTechnique()->GetPassByIndex(p)->Apply(0);
			   m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos(),0);
			}
		 } else {
			for (int p = 0; p < m_OpInfo.passes; ++p)
			{
			   pTech->GetD3DTechnique()->GetPassByIndex(p)->Apply(0);
			   m_DxCore.pDevice->Draw((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos());
			}
		 }//end if

	  }//end for
   }//end for
}


#2 Stainless

    Member

  • Members
  • 582 posts
  • Location: Southampton

Posted 14 May 2012 - 08:04 AM

I can't see anything wrong, but I haven't done any D3D coding for a while.

Things to think about though.

Are you using dynamic vertex buffers? The Apply() method may be the point where D3D has to send the buffer over the AGP bus. Try using a static vertex buffer or, if you can't manage that, a CPU-managed buffer.

Do you have geometry acceleration on your graphics card?

Are you using D3DPOOL_MANAGED instead of D3DPOOL_DEFAULT? This will cause two copies of the buffers to exist in memory and a very slow CPU-based copy to occur. I'm not sure whether it will happen in the Apply() method, but it's a logical place for it to happen.

Are you ending up with a load of small buffers? If so, try packing them into fewer, larger buffers and using a byte offset into the buffers. You may be triggering a buffer swap, which can be slow.

Are you updating high volumes of shader constants? This can add a considerable amount of overhead to the drivers.

As I said, I am rusty at dx coding, but these are the first things I would look at
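
To illustrate the suggestion above about packing many small buffers into fewer, larger ones, here is a minimal sketch of the bookkeeping only. `MeshRange`, `PackMeshes` and `TotalVertices` are hypothetical names, not part of D3D or of any engine in this thread. The sketch assumes every mesh uses the same vertex layout and tracks offsets in vertices rather than bytes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one mesh's slice inside a shared vertex buffer.
struct MeshRange {
    std::size_t startVertex;  // first vertex of this mesh in the shared buffer
    std::size_t vertexCount;  // number of vertices belonging to this mesh
};

// Lays the meshes out back to back and returns where each one starts.
std::vector<MeshRange> PackMeshes(const std::vector<std::size_t>& vertexCounts)
{
    std::vector<MeshRange> ranges;
    ranges.reserve(vertexCounts.size());
    std::size_t cursor = 0;
    for (std::size_t count : vertexCounts) {
        ranges.push_back({cursor, count});
        cursor += count;
    }
    return ranges;
}

// Total number of vertices the shared buffer has to hold.
std::size_t TotalVertices(const std::vector<MeshRange>& ranges)
{
    if (ranges.empty()) {
        return 0;
    }
    return ranges.back().startVertex + ranges.back().vertexCount;
}
```

Each `startVertex` would then be passed as the start position of the draw call, so consecutive meshes can share one bound vertex buffer instead of rebinding per mesh.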

#3 Xcrypt

Posted 14 May 2012 - 11:27 AM

I'm using some dynamic vertex buffers, but only where needed.

There's (likely) nothing about this that is specific to my computer as my partner has the same problem on his computer with my engine

I'm not sure what D3DPOOL is for. After some googling, it seems to be a Direct3D 9 thing; I'm working with Direct3D 10.

Yes, sometimes I have a lot of draw calls. This is definitely an area that needs improving, but it is not the reason my engine runs five times slower than the engine without optimisations, since that engine also does a lot of draw calls with small buffers.
For example, the unoptimised engine can draw about 1 million rects at 60+ fps, while my engine can draw only about 2000 rects at 60+ fps. The performance in 2D seems to be even worse. However, for 2D I sort with my own algorithm rather than relying on the depth buffer, to allow partial transparency in 2D.
In 3D, my renderer is about 5 times slower than the unoptimised engine, but in 2D it is about 500 times slower.

I'm not updating a lot of shader constants. However, there is one thing: when compiling effects, I set this option:


DWORD dwFXFlags = 0;
dwFXFlags |= D3D10_EFFECT_COMPILE_ALLOW_SLOW_OPS; //needed for setting samplerFilter


ID3D10Effect* pE;
D3DX10CreateEffectFromFileA(file.c_str(),
NULL,
NULL,
"fx_4_0"
,dwShaderFlags
,dwFXFlags
,LOADER->GetCoreInfo()->pDxCore->pDevice
,NULL
,NULL
,&pE
,&ErrBlob
,&hr);

As the comment says, this is needed for setting the sampler Filter. In my HLSL effects:


cbuffer g_BufferInitInfo
{
uint g_SamplerFilter;
};

SamplerState g_Sampler
{
Filter = g_SamplerFilter;
AddressU = WRAP;
AddressV = WRAP;
MaxAnisotropy = 8;
};

If I don't do this, then I can't change the sampler filter at run time (for example, the user could select a texture filtering mode in the menu). However, this doesn't seem to have a big impact on performance: I have tried without it as well and couldn't see any noticeable difference.

EDIT: another thing comes to mind.

My mesh class has Activate() and Deactivate() methods. These determine whether a mesh that has already been added to the renderer should temporarily not be drawn. I remember my partner saying that he tried computing which rects are visible and which aren't, and activating/deactivating meshes accordingly, which seemed to give quite a performance increase. This only helped when zooming in, though.

This seemed weird, because I thought that DirectX would determine which geometry is (not) visible through view frustum clipping and the rasterization stage. Am I not correct?

#4 }:+()___ (Smile)

    Member

  • Members
  • 169 posts

Posted 14 May 2012 - 03:27 PM

Maybe you have a problem not in the drawing code but in the optimizations (i.e. bad sorting algorithm)?
Sorry my broken english!

#5 Xcrypt

Posted 14 May 2012 - 04:58 PM

There's nothing wrong with the sorting algorithm. Actually it's not really a sorting algorithm just a binary tree data structure with O(log(n)) complexity for the insert. On top of that: even if it was slow, it wouldn't really matter because it only sorts when meshes are added to the renderer, not while drawing.
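
A minimal sketch of that kind of structure, for readers who want to see why insertion stays O(log n) and why nothing is sorted at draw time. The `Material` and `Mesh` structs, the integer `geometryId`, and the comparison by geometry are assumptions made for illustration; the real MeshCompare3D is not shown in this thread.

```cpp
#include <cstddef>
#include <map>
#include <set>

struct Material { int id; };      // hypothetical stand-in
struct Mesh { int geometryId; };  // hypothetical stand-in, ids assumed >= 0

// Orders meshes so that those sharing geometry end up next to each other.
struct MeshCompare {
    bool operator()(const Mesh* a, const Mesh* b) const
    {
        return a->geometryId < b->geometryId;
    }
};

using Bucket = std::multiset<const Mesh*, MeshCompare>;
using MeshList = std::map<const Material*, Bucket>;

// O(log n) lookup of the material bucket plus O(log n) insert into it.
void AddMesh(MeshList& list, const Material* material, const Mesh* mesh)
{
    list[material].insert(mesh);
}

// Counts how often a draw loop walking the list would have to rebind geometry.
std::size_t CountGeometryBinds(const MeshList& list)
{
    std::size_t binds = 0;
    int current = -1;  // no geometry bound yet
    for (const auto& entry : list) {
        for (const Mesh* mesh : entry.second) {
            if (mesh->geometryId != current) {
                current = mesh->geometryId;
                ++binds;
            }
        }
    }
    return binds;
}
```

Because the containers keep themselves ordered on insert, the draw loop only iterates; the cost of ordering is paid once, when a mesh is added.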

#6 Reedbeta

    DevMaster Staff

  • Administrators
  • 5308 posts
  • Location: Santa Clara, CA

Posted 14 May 2012 - 05:22 PM

Xcrypt, on 14 May 2012 - 11:27 AM, said:

I remember him saying that he tried to compute which rects are visible and which aren't, and Activating/Deactivating meshes accordingly, which seemed to give quite some performance increase... I thought that DirectX would determine which geometry is (not) visible through view frustum clipping and rasterization stage. Am I not correct?

The GPU will indeed cull triangles that are outside the frustum, but only after running the vertex shader on them, so you will still pay the cost of vertex shading, plus anything further up in the pipeline, e.g. the shader/texture state changes & draw calls needed for those rectangles. If you can manage to get rid of the rectangles more cheaply, that can provide a performance boost - especially if you can get rid of a lot of rectangles at once. That's what spatial partitioning data structures are all about, like BSP trees and octrees.
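
As a rough illustration of getting rid of rectangles cheaply on the CPU, here is a brute-force 2D visibility test. It is only a sketch: `Rect`, `Sprite`, `Overlaps` and `CullAgainstView` are hypothetical names, the rectangles are assumed to be axis-aligned in world space, and a spatial partitioning structure would replace the linear loop once the rectangle count gets large.

```cpp
#include <cstddef>
#include <vector>

// Axis-aligned rectangle in world space.
struct Rect { float minX, minY, maxX, maxY; };

// True if the two rectangles overlap or touch.
bool Overlaps(const Rect& a, const Rect& b)
{
    return a.minX <= b.maxX && a.maxX >= b.minX &&
           a.minY <= b.maxY && a.maxY >= b.minY;
}

// Hypothetical drawable with an active flag like Activate()/Deactivate().
struct Sprite {
    Rect bounds;
    bool active;
};

// Activates only the sprites that touch the view; returns how many remain.
std::size_t CullAgainstView(std::vector<Sprite>& sprites, const Rect& view)
{
    std::size_t visible = 0;
    for (Sprite& sprite : sprites) {
        sprite.active = Overlaps(sprite.bounds, view);
        if (sprite.active) {
            ++visible;
        }
    }
    return visible;
}
```

Sprites left inactive skip their state changes and draw calls entirely, which is where the saving comes from when the camera is zoomed in.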
reedbeta.com - developer blog, OpenGL demos, and other projects

#7 Xcrypt

Posted 14 May 2012 - 06:15 PM

Reedbeta, on 14 May 2012 - 05:22 PM, said:

The GPU will indeed cull triangles that are outside the frustum, but only after running the vertex shader on them, so you will still pay the cost of vertex shading, plus anything further up in the pipeline, e.g. the shader/texture state changes & draw calls needed for those rectangles. If you can manage to get rid of the rectangles more cheaply, that can provide a performance boost - especially if you can get rid of a lot of rectangles at once. That's what spatial partitioning data structures are all about, like BSP trees and octrees.

Good to know, I might do that, but again, that's probably not the cause of my problem. Do you have any other ideas? I've spent a whole week searching for what might be wrong with my renderer, but I just can't find the problem... I don't think my algorithms are causing the overhead; it must be some way I'm abusing DirectX.

#8 Reedbeta

Posted 14 May 2012 - 06:34 PM

Sorry, I don't know why that Apply() call would be taking so much time. It's a long shot, but one thing to check would be to count how many Apply() calls you're doing per frame, on the off chance that something elsewhere is busted, causing you to call it a bazillion times, or something crazy like that.

You also might consider the optimization that when a technique has just one pass, you don't need to re-Apply() for every mesh; you can just Apply() once for the shader and then draw all the meshes. That's more of a workaround than a solution, but I'd think it would be an optimization you'd want to make anyway.
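
Counting the Apply() calls per frame, as suggested above, only needs a counter and a timer around the call site. The following is a sketch using std::chrono; `CallStats` and `ScopedCall` are hypothetical helper names, not part of D3D. Note that a CPU timer like this only sees time spent on the CPU side of the call, since GPU work runs asynchronously.

```cpp
#include <chrono>
#include <cstdint>

// Accumulates how often a call site ran and how much CPU time it took.
struct CallStats {
    std::uint64_t calls = 0;
    std::chrono::nanoseconds total{0};

    // Call once per frame after reading the numbers.
    void Reset()
    {
        calls = 0;
        total = std::chrono::nanoseconds{0};
    }

    // Mean CPU time per call in microseconds, or 0 if nothing was recorded.
    double AverageMicroseconds() const
    {
        if (calls == 0) {
            return 0.0;
        }
        const std::chrono::duration<double, std::micro> us(total);
        return us.count() / static_cast<double>(calls);
    }
};

// RAII helper: records one call and its elapsed time when it leaves scope.
class ScopedCall {
public:
    explicit ScopedCall(CallStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~ScopedCall()
    {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats_.total +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
        ++stats_.calls;
    }

    ScopedCall(const ScopedCall&) = delete;
    ScopedCall& operator=(const ScopedCall&) = delete;

private:
    CallStats& stats_;
    std::chrono::steady_clock::time_point start_;
};
```

Putting a `ScopedCall` in a small scope around each Apply() and printing `calls` at the end of the frame shows immediately whether the count matches the number of meshes times passes.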

#9 Kenneth Gorking

    Senior Member

  • Members
  • 939 posts

Posted 15 May 2012 - 02:24 PM

Does the DX runtime tell you anything when your program is running? It usually has some hints as to what could be slowing down your code
"Stupid bug! You go squish now!!" - Homer Simpson

#10 Stainless

Posted 15 May 2012 - 03:35 PM

try running it through pixwin and look at what it's actually doing

#11 Xcrypt

Posted 15 May 2012 - 05:43 PM

Kenneth Gorking, on 15 May 2012 - 02:24 PM, said:

Does the DX runtime tell you anything when your program is running? It usually has some hints as to what could be slowing down your code
No, nothing at all

Stainless, on 15 May 2012 - 03:35 PM, said:

try running it through pixwin and look at what it's actually doing
What's pixwin? Can't find anything on google, only showing me that it causes errors lol

#12 Kenneth Gorking

Posted 15 May 2012 - 06:44 PM

Xcrypt, on 15 May 2012 - 05:43 PM, said:

No, nothing at all
Are you using the debug runtime? If not, I suggest turning it on.

Xcrypt, on 15 May 2012 - 05:43 PM, said:

What's pixwin? Can't find anything on google, only showing me that it causes errors lol
A short GDC talk on PIX for Windows: http://www.microsoft...n.aspx?id=15096
"Stupid bug! You go squish now!!" - Homer Simpson

#13 Xcrypt

Posted 15 May 2012 - 07:08 PM

Kenneth Gorking, on 15 May 2012 - 06:44 PM, said:

Are you using the debug runtime? If not, I suggest turning it on.


A short GDC talk on PIX for Windows: http://www.microsoft...n.aspx?id=15096

I've run it in debug mode, yes. Looking into PIX at the moment.

#14 Xcrypt

Posted 15 May 2012 - 07:44 PM

All right, so I took a look at PIX, and I'm seeing some very strange stuff indeed...
Take a look at this:
http://oi47.tinypic.com/auxlzl.jpg

Apparently ClearRTV() is a hell of an expensive function, or I'm doing something wrong :P

Well, after some checking, I'm not sure what's going on. PIX is certainly giving me the right information, but maybe the long durations are caused by interrupts, i.e. the OS taking CPU time away from my app and giving it to another program?

Anyway, I'm still uncertain what's going on. The Apply() call is certainly where my problem is.
Can anyone tell me exactly what it does?

From MSDN:
Set the state contained in a pass to the device.

So it has nothing to do with setting the effect variables? I thought Apply() was for both passes and fx variables.

If it's purely for setting pass state, then maybe the problem is not in my renderer, but in my effects.

Here's some sort of a template for my effects:



//STATES
RasterizerState RState { CullMode = BACK; };

BlendState AlphaBlend
{
BlendEnable[0] = TRUE;
SrcBlend = SRC_ALPHA;
DestBlend = INV_SRC_ALPHA;
BlendOp = ADD;
SrcBlendAlpha = ZERO;
DestBlendAlpha = ZERO;
BlendOpAlpha = ADD;
RenderTargetWriteMask[0] = 0x0F;
};

DepthStencilState NoDepthWrites
{
DepthEnable=false;
DepthWriteMask=Zero;
StencilEnable = true;
StencilReadMask = 0xff;
StencilWriteMask = 0xff;
FrontFaceStencilFunc = Always;
FrontFaceStencilPass = Incr;
FrontFaceStencilFail = Keep;
BackFaceStencilFunc = Always;
BackFaceStencilPass = Incr;
BackFaceStencilFail = Keep;
};

//GLOBALS
cbuffer g_BufferPerObject
{
float4x4 g_MatWVP : WORLDVIEWPROJECTION;
};

Texture2D g_TexDiffuse;

cbuffer g_BufferInitInfo
{
uint g_SamplerFilter;
};

SamplerState g_Sampler
{
Filter = g_SamplerFilter;
AddressU = WRAP;
AddressV = WRAP;
MaxAnisotropy = 8;
};

//STRUCTS
//VSInput
struct VSInput{
float3 pos: POSITION;
float2 tex: TEXCOORD0;
};

//PSInput
struct PSInput{
float4 pos: SV_POSITION; //system value
float2 tex: TEXCOORD0;
};

//------------------------------------------------------------------------------------------------------

//VERTEX SHADER
PSInput MainVS(VSInput input) {
PSInput output = (PSInput)0;
output.pos = mul(float4(input.pos.xyz, 1.0), g_MatWVP);
output.tex = input.tex;
return output;
}

//PIXEL SHADER
float4 MainPS(PSInput input) : SV_TARGET {
return g_TexDiffuse.Sample(g_Sampler, input.tex);
}

//------------------------------------------------------------------------------------------------------

//DX10 TECHNIQUES
technique10 t0 {
pass p0 {
SetRasterizerState(RState);
SetBlendState(AlphaBlend, float4(0.0f, 0.0f, 0.0f, 0.0f), 0xffffffff);
SetDepthStencilState(NoDepthWrites,0);

SetVertexShader(CompileShader(vs_4_0, MainVS()));
SetGeometryShader(NULL);
SetPixelShader(CompileShader(ps_4_0, MainPS()));
}
}

This is a simple postex effect, but it shows you how nearly all my effects are made.
Anything wrong with it?

#15 Xcrypt

Posted 16 May 2012 - 05:20 PM

Reedbeta, on 14 May 2012 - 06:34 PM, said:

Sorry, I don't know why that Apply() call would be taking so much time. It's a long shot, but one thing to check would be to count how many Apply() calls you're doing per frame, on the off chance that something elsewhere is busted, causing you to call it a bazillion times, or something crazy like that.

You also might consider the optimization that when a technique has just one pass, you don't need to re-Apply() for every mesh; you can just Apply() once for the shader and then draw all the meshes. That's more of a workaround than a solution, but I'd think it would be an optimization you'd want to make anyway.

I tried applying per material, not per mesh. But one problem: how do I commit my object info (per mesh) to the shader?

#16 Reedbeta

Posted 16 May 2012 - 06:46 PM

You mean shader parameters like object-to-world matrix and suchlike? You just set the parameter through the effect API. Parameter setting can be done at any time, whether the shader is currently bound / applied or not.

#17 Xcrypt

Posted 16 May 2012 - 06:49 PM

Are you certain about that? That's what I'm doing, and it's not working. I guess you need apply for registering the changes made to the effect? If not, then I'm doing something wrong...


ID3D10EffectMatrixVariable* m_pMatWVP_fx;
ID3D10EffectMatrixVariable* m_pMatW_fx;


m_pMatW_fx = m_pD3DEffect->GetVariableByName("g_MatW")->AsMatrix();
ASSERT(m_pMatW_fx->IsValid());

m_pMatWVP_fx = m_pD3DEffect->GetVariableByName("g_MatWVP")->AsMatrix();
ASSERT(m_pMatWVP_fx->IsValid());

void GFX::Effect_PosTexNorm::Commit_Object(const PerObjectInfo& info)
{
m_MatW = *(info.pMatW);
m_MatWVP = m_MatW*m_MatVP;

m_pMatW_fx->SetMatrix(m_MatW);
m_pMatWVP_fx->SetMatrix(m_MatWVP);
}


m_pEffect->Commit_Material(it->first, perFrameInfo);

D3D10_TECHNIQUE_DESC tDesc;
GFX::Technique* pTech = it->first->m_pEffect->GetTechnique(it->first->m_sTechnique);
pTech->GetD3DTechnique()->GetDesc(&tDesc);
m_OpInfo.passes = tDesc.Passes;

if(m_OpInfo.passes < 2) {
   double applyTime=GLOBAL_nsTIMER->TimePassed();
   pTech->GetD3DTechnique()->GetPassByIndex(0)->Apply(0);
   applyTime= GLOBAL_nsTIMER->TimePassed()-applyTime;
   m_ApplyTime+=applyTime;
   ++m_ApplyCalls;
}

//OTHER CODE [COLLAPSED FOR VISIBILITY]

perObjectInfo.pMatW = &((*bucketIter)->m_Desc.matW);
perObjectInfo.pExtraInfo = (*bucketIter)->m_Desc.pExtraInfo;
it->first->m_pEffect->Commit_Object(perObjectInfo); //<================================commit object

//Finally, draw this mesh already!
if(m_OpInfo.pGeometry->GetpIBuffer())
{
   if(m_OpInfo.passes < 2)
   {
      m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos(),0);
   }
   else
   {
      for (int p = 0; p < m_OpInfo.passes; ++p)
      {
         ID3D10EffectTechnique* pFxTech = pTech->GetD3DTechnique();
         ID3D10EffectPass* pPass = pFxTech->GetPassByIndex(p);
         double applyTime=GLOBAL_nsTIMER->TimePassed();
         pPass->Apply(0);
         applyTime= GLOBAL_nsTIMER->TimePassed()-applyTime;
         m_ApplyTime+=applyTime;
         ++m_ApplyCalls;
         m_DxCore.pDevice->DrawIndexed((*bucketIter)->GetDrawCount(),(*bucketIter)->GetDrawStartPos(),0);
      }
   }
}
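
One possible reading of the behaviour described above, offered as an assumption and not something confirmed in this thread: the effect keeps variable values in a CPU-side copy and only pushes them to the device when Apply() runs, so a value set after the last Apply() never reaches the draw call. The sketch below models that with mock types. `MockDevice` and `MockPass` are hypothetical stand-ins, the matrices are reduced to integer ids, and nothing here calls a D3D API.

```cpp
#include <vector>

// Stand-in for the device: remembers which matrix id each draw call saw.
struct MockDevice {
    int boundMatrix = -1;        // -1 means nothing was ever uploaded
    std::vector<int> drawnWith;  // one entry per draw call

    void Draw() { drawnWith.push_back(boundMatrix); }
};

// Stand-in for an effect pass that (by assumption) stages values until Apply().
struct MockPass {
    int stagedMatrix = -1;
    int applyCalls = 0;

    // Only changes the CPU-side copy.
    void SetMatrix(int id) { stagedMatrix = id; }

    // Pushes the staged value to the device.
    void Apply(MockDevice& device)
    {
        device.boundMatrix = stagedMatrix;
        ++applyCalls;
    }
};

// Apply once per material, then set per-object data and draw.
std::vector<int> DrawApplyOncePerMaterial(const std::vector<int>& meshMatrices)
{
    MockDevice device;
    MockPass pass;
    pass.Apply(device);
    for (int matrix : meshMatrices) {
        pass.SetMatrix(matrix);
        device.Draw();
    }
    return device.drawnWith;
}

// Set per-object data, Apply, then draw, for every mesh.
std::vector<int> DrawApplyPerMesh(const std::vector<int>& meshMatrices)
{
    MockDevice device;
    MockPass pass;
    for (int matrix : meshMatrices) {
        pass.SetMatrix(matrix);
        pass.Apply(device);
        device.Draw();
    }
    return device.drawnWith;
}
```

Under that assumption, `DrawApplyOncePerMaterial({10, 20, 30})` records -1 for every draw because no per-object value was ever uploaded, while `DrawApplyPerMesh({10, 20, 30})` records 10, 20, 30. If the real framework behaves this way, per-object data would need an Apply() (or some other upload path) between setting the variable and drawing.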
			


#18 Reedbeta

Posted 16 May 2012 - 07:03 PM

Hmm, that's odd. I would think that setting the parameter and doing a draw call would be enough. I don't have my D3D engine handy to test that right now, though.

#19 Xcrypt

Posted 16 May 2012 - 07:20 PM

Reedbeta, on 16 May 2012 - 07:03 PM, said:

Hmm, that's odd. I would think that setting the parameter and doing a draw call would be enough. I don't have my D3D engine handy to test that right now, though.
No problem, is it possible to check that later?
Thanks for all the help btw, all of you guys.

---

The performance issue with my renderer is "solved". In fact, there never was a performance issue.
I'll explain...

My partner claimed that the other engine, with no optimisations, was able to draw 50k rects at 30 fps.
My engine could only draw 1-2k rects at 30 fps.

The mistake I made here was that I was stupid enough to believe him and never actually tested it myself. That way, I lost 2 weeks of precious development time. Never take anyone's word for it! Always test it yourself!
I don't blame him at all; it was my mistake for not testing. He was simply misinformed.

After testing, it turns out that the other engine could only draw 500-1k rects at 30 fps. My day was made!

---

Well, the main issue is solved. But 1-2k rects at 30 fps is slow (Intel i5 quad-core, 1 GB of video memory), even without instancing or any kind of optimisation, so I'm left with some questions.
Does that imply that a draw call is so damn expensive that you can only do about 1k of them per frame in order to sustain a smooth framerate?
I profiled the other engine too. It seems that Apply() nearly always takes longer than the draw call itself!

Is this normal?

#20 Reedbeta

Posted 16 May 2012 - 08:20 PM

1K draw calls per frame is very low. I think that was the limit around 7-10 years ago. I don't know about PC games, but on current-gen consoles it's not uncommon to hit 10K draw calls per frame for real game scenes. D3D9 had a good deal of per-call overhead that restricted the number of draw calls well below that, but in D3D10-11 draw calls should be much cheaper, so I'd guess 10K-20K draw calls should be achievable today.
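
To put the numbers in this thread side by side, the time available per draw call can be worked out from the frame rate and the call count. The helper below is a hypothetical one-liner. It assumes the whole frame is spent submitting draw calls, so it gives an upper bound, and the 30 fps used for the 10K figure is an assumption, since no frame rate was stated for it.

```cpp
// Upper bound on the time available per draw call, in microseconds,
// if the entire frame were spent on draw calls.
double MicrosecondsPerDraw(double framesPerSecond, double drawsPerFrame)
{
    return 1000000.0 / framesPerSecond / drawsPerFrame;
}
```

At 30 fps, 1,000 draws per frame leaves roughly 33 microseconds per draw, 2,000 leaves about 17, and 10,000 leaves about 3.3. So the renderer discussed here is spending on the order of ten times more per draw (including its Apply() call) than a 10K-draw budget would allow.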




