raster going slow

rouncer 104 Jan 07, 2012 at 19:51


These 20 guys are only getting to the screen at 10 fps, and I’m wondering why… it’s not the vertex count, because if I use a LOD I get the exact same fps.

The thing is, each guy is 9 draw primitive calls (armour, body and weapons), and I think that might be why it’s going slow.

How can I get the guys to the screen without overloading the draw primitives calls?

If I just render the body and head by itself I get a much better frame rate, so how do I draw clothes and armour without slowing down?

12 Replies


rouncer 104 Jan 08, 2012 at 01:02

I sorta got around the problem by baking all the clothing, accessories, etc. into the same model and drawing them all at once, and it worked.
The only problem is that materials are now going to be a little bit harder… oh well, thanks anyway.
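
For anyone wanting to try the same baking trick, here is a minimal sketch of merging several parts into one vertex/index buffer. The `Vertex`, `Mesh` and `mergeMeshes` names are made up for illustration, and the per-vertex `materialId` is just one assumed way of keeping materials apart after merging; it is not what rouncer's code necessarily does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal vertex and mesh types, for illustration only.
struct Vertex {
    float x, y, z;
    std::uint32_t materialId;  // assumed: lets a shader pick a material per vertex
};

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;  // triangle list, indexes into vertices
};

// Append every part into one mesh so it can be submitted with a single draw call.
// Indices are rebased by the number of vertices already written to the output.
Mesh mergeMeshes(const std::vector<Mesh>& parts) {
    Mesh merged;
    for (std::size_t part = 0; part < parts.size(); ++part) {
        const std::uint32_t base = static_cast<std::uint32_t>(merged.vertices.size());
        for (Vertex v : parts[part].vertices) {
            v.materialId = static_cast<std::uint32_t>(part);  // tag vertex with its source part
            merged.vertices.push_back(v);
        }
        for (std::uint32_t index : parts[part].indices) {
            assert(index < parts[part].vertices.size());  // index must be valid in its own part
            merged.indices.push_back(base + index);
        }
    }
    return merged;
}
```

Calling `mergeMeshes` once at load time with the body, armour and weapon parts gives one buffer per character, so 9 draws per guy become 1.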

Reedbeta 167 Jan 08, 2012 at 01:39

9*20 = 180 draw calls shouldn’t be a problem. Are you using an optimized build?

rouncer 104 Jan 08, 2012 at 04:04

Ahhh! I changed it to release mode and yeah, it started working properly. Why was it going slow under the debug build? I can’t remember it ever happening before, that’s why I was really confused…

Reedbeta 167 Jan 08, 2012 at 05:50

Debug builds aren’t optimized, so a lot of things run slower. I’ve seen this myself recently with text; I wrote a simple font renderer for my engine and drawing a couple paragraphs of text took something like 15 ms! Release build runs nice and quick, though. :)
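
A quick way to check which kind of build you are actually running is to look at the compiler's predefined macros. This is only a sketch: `describeBuild` is a made-up helper, `__OPTIMIZE__` is a GCC/Clang macro (MSVC does not define it), `NDEBUG` is the standard macro that switches off `assert()`, and `_DEBUG` is what MSVC defines when a debug runtime (/MDd, /MTd, /LDd) is selected.

```cpp
#include <cassert>  // assert() is the macro that NDEBUG switches off
#include <string>

// Hypothetical helper: describe how this translation unit was compiled.
std::string describeBuild() {
    std::string description;
#if defined(__OPTIMIZE__)
    // GCC and Clang define this whenever an optimizing level (-O1 and up, -Os) is used.
    description += "compiler optimizations on";
#else
    description += "compiler optimizations off or not detectable";
#endif
#if defined(NDEBUG)
    description += ", NDEBUG set (assert disabled)";
#else
    description += ", NDEBUG not set (assert enabled)";
#endif
#if defined(_DEBUG)
    // MSVC only: a debug runtime library was selected.
    description += ", MSVC debug runtime";
#endif
    return description;
}
```

Printing the result of `describeBuild()` once at startup makes it harder to profile a debug build by accident.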

JarkkoL 102 Jan 08, 2012 at 15:45

Rendering only 20 characters in debug build shouldn’t be that slow though. It’s not that CPU taxing.

Vilem_Otte 117 Jan 08, 2012 at 17:21


JarkkoL wrote: “Rendering only 20 characters in debug build shouldn’t be that slow though. It’s not that CPU taxing.”

You’re only half-right: it also heavily depends on the compiler settings (e.g. running an application built with -ggdb3 -O0 and without -fomit-frame-pointer can be really slow), target configuration, etc. That said, you’re right that 20 characters should be okay even in a debug build.

[Now comes heavy wizardry, black magic and compiler-related stuff]
On some basic compiler flags:

Basically, for debug you want debugging symbols (so that during debugging you see e.g. a variable name instead of an address, and can tell what’s going on); -ggdb3 and -O0 are really good for this. Also don’t use -fomit-frame-pointer. As the GCC documentation describes that flag: “Don’t keep the frame pointer in a register for functions that don’t need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.”

For a release version, if you don’t care about size, you don’t need debugging symbols, nor do you need to keep the frame pointer, so instead of -ggdb3 -O0 it is better to use -O1 or -O2, optionally together with -fomit-frame-pointer (O stands for an optimization set: -O1 is the basic optimization set, -O2 is -O1 plus some others).

For a release version, if you care about size, it is best to use -Os (optimization set for size).

And at last (and my favourite) you can go “roarrrrr!”, hit your compiler with a club and use -O3 (don’t know whether it is available in MSVC; also, hitting with a club was meant literally - don’t do it, your compiler might not be tough enough to survive :D ).

Anyway, for a detailed description see http://gcc.gnu.org/o…ze-Options.html. Most compilers offer comparable optimization levels (dunno where the MSVC reference is, though lots of its flags are similar).

JarkkoL 102 Jan 09, 2012 at 20:36

That’s my point: regardless of how bad your compiler settings are, rendering 20 characters shouldn’t be that CPU taxing ;)

Reedbeta 167 Jan 09, 2012 at 21:35

Yeah, I would tend to agree, but it seems to be the case that it is sometimes much slower. I don’t understand why myself. Like I said, in my case I saw text rendering being several times slower (on the GPU) when I was in a debug build, despite the fact that what the GPU was doing should’ve been exactly the same. It wasn’t due to using the debug runtime of D3D either; using the debug runtime in a release build caused no measurable slowdown.

JarkkoL 102 Jan 09, 2012 at 23:02

Text rendering is kind of a special case though, where you feed the GPU with dynamic data (I assume you use a dynamic vertex/index buffer to send the text quads to the GPU). If you use the debug D3D DLL, it could verify the vertex/index data after unlock() or something like that, making things appear much slower on the GPU. But yeah, I have seen close to 10x slower debug builds in some f’ed up game engines where every single thing is accessed through an accessor function or something like that. Still, 10 fps sounds pretty low, but meh.
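
To get a feel for the accessor overhead mentioned above, here is a tiny sketch that can be built once with -O0 and once with -O2 and compared. The `Positions` class and the two functions are hypothetical, `std::chrono::steady_clock` is used as a portable timer, and the assumption is that the compiler does not inline the accessors at -O0 (which is the usual GCC behaviour).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical container that hides its data behind small accessor functions.
class Positions {
public:
    explicit Positions(std::size_t count) : values_(count, 1.0f) {}
    std::size_t size() const { return values_.size(); }
    float get(std::size_t i) const { return values_[i]; }
private:
    std::vector<float> values_;
};

// Sum all values through the accessors. Without inlining, every
// size() and get() is a real function call inside the loop.
float sumThroughAccessors(const Positions& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) {
        sum += p.get(i);
    }
    return sum;
}

// Time one summation in milliseconds with a monotonic clock.
// The sum is returned through 'result' so the work cannot be discarded.
double timeSumMs(const Positions& p, float& result) {
    const auto start = std::chrono::steady_clock::now();
    result = sumThroughAccessors(p);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Running `timeSumMs` on a few million elements in both builds shows how much of a debug slowdown comes purely from call overhead rather than from the rendering API.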

Vilem_Otte 117 Jan 09, 2012 at 23:11

Okay, so let’s do some benchmarking on my side.

Testing machine - Core i3 + 3 GiB RAM + Radeon HD 5470 (i.e. my current laptop, which I’m sitting at). Operating system - Debian Squeeze.

Testing pipeline - Deferred renderer with fast CPU ray tracer for computing VPL positions, using just single spotlight, fully dynamic, resolution @ 1280x720. Note numbers might not be the peak performance of the PC, as I’ll be playing music during the test :D (but during all the tests!)

Testing scenes:
1.) simple scene, just some 150 triangles, 4 different materials
2.) Sibenik cathedral

Testing cases:
1.) Debug build (compiler flags are -O0 -ggdb3 … means no optimisations and we want debugging info for gdb level 3)
2.) Release build (compiler flags are -O1)
3.) Release build v2 (compiler flags are -O2)
4.) Release build v3 (compiler flags are -O3)
5.) Black magic build (my own compiler flags, starting with -O3, containing lots of compiler magic stuff)

Scene/Build    Simple scene   Sibenik cathedral
Debug          57.050750      112.404224
Release v1     56.718257      110.835902
Release v2     56.137256      110.731450
Release v3     55.950382      110.438174
Black magic    55.079345      109.939037

As you can see, debug/release doesn’t have that much impact for small scenes; the impact is slightly larger for large scenes, though it is not THAT huge (of course, I would need to profile how much time the application spends rendering on the GPU and waiting on the CPU to actually say how much it gives/takes). Gimme a sec…

Reedbeta 167 Jan 09, 2012 at 23:21

Interesting. Certainly doesn’t make much difference in your case. And Jarkko, yeah, I guess dynamic-vertex-buffer-related overhead could be an issue. It’s still odd that the poor performance was showing up on the GPU instead of the CPU, though (I time CPU with QueryPerformanceCounter and GPU with D3D11 timestamp queries). It could be something’s wrong with my timing code, though, or maybe the OS was consistently interrupting the GPU at the same point in my frame, or something crazy like that.

Vilem_Otte 117 Jan 10, 2012 at 00:52

Okay, so here comes the profiling…

I’ve written my own profiler for my game engine, and it reports quite a lot of detail (it can measure percentages, milliseconds, even some engine calls, etc.). Let’s sum everything into two buckets: CPU and GPU. What goes where?
In CPU I’m counting the ray-tracing part of course, draw calls, state changes (e.g. changing FBO or shader), et cetera, i.e. CPU calls and CPU instructions generally.
In GPU I’m counting just rendering time on the GPU (i.e. in the first case the time we’re waiting until the GPU actually finishes computing something, in the second case the time we’re actually doing GPU work).

EDIT: Why count it this way? Well, basically we want to spend as much time on the GPU as possible, because there is a whole lot more stuff to do on the CPU ;) (AI, physics, etc.). Also, the speed of GPUs grows a lot more quickly than the speed of CPUs.

Basically, let’s use the same scenes and compiler settings as in the first test … and measure where we are spending most time… (Again, the first two values are for the simple scene, the next two for Sibenik.)

Build          Simple scene                      Sibenik cathedral
Debug          CPU 50.900325%, GPU 49.099675%    CPU 48.323266%, GPU 51.676734%
Release v1     CPU 50.612472%, GPU 46.387528%    CPU 45.059028%, GPU 54.940972%
Release v2     CPU 48.958393%, GPU 51.041607%    CPU 41.787474%, GPU 58.212526%
Release v3     CPU 48.578865%, GPU 51.421135%    CPU 41.175955%, GPU 58.824045%
Black magic    CPU 48.492081%, GPU 51.507919%    CPU 40.038613%, GPU 59.961387%

So here we see that while optimisations won’t give us much for simple scenes (some 2% more time waiting for GPU stuff; there is a boost, but it is less visible in overall performance), they give a noticeably bigger boost for large scenes (about 8 percentage points more of the time is spent waiting for the GPU, i.e. it is time to optimize the GPU side :D). If I used some really large and very complex scene, something like Power Plant or so, we would see an even larger boost on the CPU side with optimizations.

If we always wait until the GPU finishes (i.e. we compare absolute time spent on the GPU to absolute time on the CPU, forcing a wait for the GPU through glFinish();), the results are:
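
For reference, a CPU/GPU split like the one in the table can be derived from two accumulated timings. This is only a sketch of the arithmetic with made-up names (`FrameTimes`, `Share`, `computeShare`); it is an assumption about how such percentages are typically computed, not the actual profiler code from my engine.

```cpp
#include <cassert>

// Hypothetical accumulated timings for one frame (or averaged over many frames).
struct FrameTimes {
    double cpuMs;  // time spent in CPU-side work (ray tracing, draw calls, state changes)
    double gpuMs;  // time spent waiting for / doing GPU work
};

struct Share {
    double cpuPercent;
    double gpuPercent;
};

// Convert absolute times into percentages of the combined total.
// The two shares add up to 100 by construction.
Share computeShare(const FrameTimes& t) {
    const double total = t.cpuMs + t.gpuMs;
    assert(total > 0.0);  // a frame with no measured time has no meaningful split
    Share s;
    s.cpuPercent = 100.0 * t.cpuMs / total;
    s.gpuPercent = 100.0 * t.gpuMs / total;
    return s;
}
```

For example, a frame with 25 ms on the CPU and 75 ms on the GPU comes out as 25% CPU and 75% GPU.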

Build          Simple scene                      Sibenik cathedral
Debug          CPU 14.074966%, GPU 85.925034%    CPU  8.012333%, GPU 91.987667%
Release v1     CPU 13.981151%, GPU 86.018849%    CPU  7.988983%, GPU 92.011017%
Release v2     CPU 13.893738%, GPU 86.106262%    CPU  7.966127%, GPU 92.033873%
Release v3     CPU 13.869523%, GPU 86.130477%    CPU  7.946651%, GPU 92.053349%
Black magic    CPU 13.503755%, GPU 86.496245%    CPU  7.845981%, GPU 92.154019%

Here it is pretty simple to see that in the simple scene the GPU is mostly relaxing (as there is more CPU utilization), while in the more complex scene it is having a much harder time. It can also be seen (here in the simple scene) that optimizations gave us some 0.5% in total (which means some 3.6% in CPU-only performance).

Also note that my whole engine is heavily optimized (it uses dynamic BVHs, scenegraphs, visibility culling, intrinsics (especially in the ray-tracing part), etc.), so the difference would be even more visible in a less optimized application.

Anyway, so technically what can we see from this (apart from the fact that I’m a heavy-optimizing guy)?

Basically we can see that compiler optimizations give us the ability to write more high-level and/or better-structured code; they do not mean we can write algorithms with complexity O(n^3) and expect -O3 to solve the speed for us (it won’t, it wasn’t even designed to). We should rather consider whether to use an O(n log n) algorithm instead (even though we will spend a few more days on it).

I’m also not saying that MSVC applications shouldn’t be much faster when switched from Debug to Release mode (I’d also like to note that MSVC stores a huge amount of debugging symbols in Win32 applications).
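
To illustrate the point about algorithms versus compiler flags, here is a small sketch with a deliberately different (and simpler) pair of complexities than the O(n^3) case above: a duplicate check done the O(n^2) way and the O(n log n) way. Both function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// O(n^2): compare every pair of elements.
bool hasDuplicateQuadratic(const std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (std::size_t j = i + 1; j < values.size(); ++j) {
            if (values[i] == values[j]) {
                return true;
            }
        }
    }
    return false;
}

// O(n log n): sort a copy, then equal elements must sit next to each other.
bool hasDuplicateSorted(std::vector<int> values) {  // taken by value: the caller's data stays untouched
    std::sort(values.begin(), values.end());
    return std::adjacent_find(values.begin(), values.end()) != values.end();
}
```

Both give the same answers, but no optimization level turns the first into the second; for large inputs the choice of algorithm matters far more than -O2 versus -O3.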

And the last note: in this testing I’ve used GCC 4.5 on Linux. The whole application is written directly on Xlib + OpenGL (graphics part).

And no post is complete without the image shining out of it (Simple scene - showing the GI stuff):