clock cycles per operation: any reference?

starstutter 101 Jul 26, 2009 at 23:22

Hey guys, quick question. I know it depends on the type of hardware/driver/etc., but is there any sort of general reference for how much power each type of operation (such as a texture sample or a dot3 operation) takes up, or how much time it takes to execute on modern hardware? It seems a bit farfetched, but any type of reference would be godly to me.

Thanks :)

Oh, and while I’m on the subject, I was wondering how fast texture samples are compared to pure math operations (i.e., when storing a complex lighting model in a lookup table, at what point is the speed trade-off worth it?).

6 Replies

Reedbeta 167 Jul 27, 2009 at 01:32

I don’t know of a location where you can find that information for graphics chips, but maybe someone else does.

As for the second question, it depends a lot on the size/format of the texture and the pattern of accesses. GPUs have a cache for texture lookups, where they store recently-accessed areas of textures. If the texels you need are already in the cache, lookups are very fast; otherwise, they’re very slow. So, generally speaking, if pixels near each other on-screen all look up the same texels, or texels near each other in the texture, you’ll get better performance than if your texture lookups are scattered. Also, if you have too many different textures going into one shader, texture lookup speed will deteriorate due to cache thrashing. Narrower formats (e.g. DXT compressed or monochrome textures) will benefit more from the cache than wide ones (uncompressed RGBA, or gawd forbid, floating point). Also, if a texture is small enough to fit entirely in the cache, lookups to it will be very fast. Finally, some (many?) cards have separate caches for main memory and video memory, so you can actually improve performance sometimes by judiciously placing some textures in main memory.
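To make the coherent-versus-scattered point concrete, here is a toy sketch of an LRU texture cache. It is not a model of any real GPU: the cache size, the 4x4 texel block per cache line, the 1024x1024 texture, and the function name `hit_rate` are all assumptions picked for illustration.

```python
from collections import OrderedDict
import random

def hit_rate(accesses, cache_lines=64, line_texels=4):
    """Fraction of texel accesses served by a toy LRU cache.

    accesses: iterable of (x, y) texel coordinates.
    cache_lines, line_texels: made-up sizes (assumptions); each cache
    line holds a line_texels x line_texels block of texels.
    """
    cache = OrderedDict()  # block id -> None, ordered by recency of use
    hits = 0
    total = 0
    for x, y in accesses:
        block = (x // line_texels, y // line_texels)
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0

# Coherent: neighbouring pixels read neighbouring texels (32x32 region).
coherent = [(x, y) for y in range(32) for x in range(32)]

# Scattered: every pixel reads a random texel of a 1024x1024 texture.
rng = random.Random(0)
scattered = [(rng.randrange(1024), rng.randrange(1024)) for _ in range(1024)]

print("coherent hit rate: ", hit_rate(coherent))
print("scattered hit rate:", hit_rate(scattered))
```

With these made-up sizes the coherent pattern misses once per 4x4 block and hits on everything else, while the scattered pattern almost never finds its block in the cache. Real hardware differs in the details, but the trend is the one described above.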

starstutter 101 Jul 27, 2009 at 03:52

Very good information to know, and that explains a lot.

Still looking for a table, however. Would nVidia or ATI (although I am working with nVidia cards) have this information or (better yet) papers about it?

JarkkoL 102 Jul 27, 2009 at 07:17

Your best bet is to use nVidia ShaderPerf.

Nick 102 Jul 27, 2009 at 07:36

Here’s an in-depth analysis of the NVIDIA G80 architecture, which still represents their newer chips quite well.

Basically, they have shader clusters that work with vectors that are 16 elements wide. Each cluster has 16 scalar multiply-add units, so a multiply-add takes just one clock cycle per vector. Note that something like a dp4 instruction takes 4 cycles to compute for 16 different pixels or vertices. Each cluster also has 4 special function units (SFUs). These have two roles in one: interpolation to compute input values (colors and texture coordinates), and computing transcendental operations (anything not mul or add). Since there are only 4 SFUs, it takes 4 clock cycles to process a 16-element vector.
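The arithmetic in that paragraph can be written down as a small sketch. The unit counts come from the post; the function names are hypothetical, and treating dp4 as 4 scalar multiply-add slots and m4x4 as 4 dp4s is an assumption that matches the cycle counts quoted here.

```python
import math

VECTOR_WIDTH = 16  # elements processed together (from the post)
MAD_UNITS = 16     # scalar multiply-add units per cluster (from the post)
SFU_UNITS = 4      # special function units per cluster (from the post)

def mad_cycles(scalar_mads_per_element):
    """Cycles to run N scalar multiply-adds on every element of a vector."""
    passes = math.ceil(VECTOR_WIDTH / MAD_UNITS)  # 1 pass covers all 16
    return scalar_mads_per_element * passes

def sfu_cycles(sfu_ops_per_element):
    """Cycles to run N special-function ops on every element of a vector."""
    passes = math.ceil(VECTOR_WIDTH / SFU_UNITS)  # 4 passes of 4 elements
    return sfu_ops_per_element * passes

print("mad :", mad_cycles(1))   # 1
print("dp4 :", mad_cycles(4))   # 4
print("m4x4:", mad_cycles(16))  # 16
print("sfu :", sfu_cycles(1))   # 4
```

These are issue-rate estimates only. They ignore latency, register pressure and scheduling, so they are a way to compare instruction mixes, not a prediction of real timings.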

The actual cost really depends on your shader’s ratio between multiply-add instructions and other instructions. For instance, an m4x4 instruction keeps the multiply-add units busy for 16 cycles, which gives you room for doing 4 special operations (4 cycles each) in parallel, for free! The same principle is true for texture lookup:

Each cluster is further equipped with four texture address units and eight texture filter units. These run at a lower clock frequency, though. But if you can keep the number of texture lookups low, they can execute in parallel with the arithmetic operations.

As Reedbeta already mentioned though, texture performance also depends on cache behaviour. Also, the G80’s register file is relatively cramped, so this can influence performance in complex ways as well. And to top it off, other chips behave somewhat differently too.

But anyway, try not to think in terms of clock cycles, but in terms of the right instruction mixture to avoid bottlenecks. A balanced shader has several texture lookups, several transcendental operations, and many more multiplies and additions. Oh and don’t panic if one shader doesn’t have the right mixture; the GPU will try to run several shaders concurrently and maximize the use of each unit independently.
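One rough way to think about "the right instruction mixture" is to assume the multiply-add, special-function and texture units overlap perfectly, so the busiest unit sets the cost. The sketch below does that. The function name and the `tex_cycles_per_lookup` default are hypothetical placeholders, since the thread gives no texture clock ratio or cache figures.

```python
def shader_cycles(mads, sfu_ops, tex_lookups, tex_cycles_per_lookup=8):
    """Estimate cycles per 16-element vector and name the bottleneck unit.

    mads:        scalar multiply-adds per element (1 cycle each per vector)
    sfu_ops:     special-function ops per element (4 cycles each per vector)
    tex_lookups: texture lookups per element
    tex_cycles_per_lookup: ASSUMED placeholder, not a measured value
    """
    busy = {
        "mad": mads * 1,
        "sfu": sfu_ops * 4,
        "tex": tex_lookups * tex_cycles_per_lookup,
    }
    bottleneck = max(busy, key=busy.get)  # ties go to the first entry
    return busy[bottleneck], bottleneck

# An m4x4 (16 multiply-adds) alone, then with 4 and 5 special ops.
print(shader_cycles(16, 0, 0))  # (16, 'mad')
print(shader_cycles(16, 4, 0))  # (16, 'mad')  the 4 special ops are free
print(shader_cycles(16, 5, 0))  # (20, 'sfu')  now the SFUs are the limit
```

Under these assumptions, adding work to a unit that is not the bottleneck costs nothing, which is why the ratio between instruction types matters more than any per-instruction cycle count.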

Phlex 101 Jul 27, 2009 at 10:17

As far as I’m aware, as a general rule:

old hardware = texture lookups faster
new hardware = math operations faster

starstutter 101 Jul 28, 2009 at 21:56

Thanks a lot for the info guys, this is really helping.