Hey guys, quick question. I know it depends on the hardware, driver, etc., but is there any general reference for how much processing power each type of operation (such as a texture sample or a dot3 operation) takes up, or how long it takes to execute on modern hardware? It seems a bit far-fetched, but any kind of reference would be a godsend.
Thanks :)
Oh, and while I'm on the subject, I was wondering how fast texture samples are compared to pure math operations (e.g., if I store a complex lighting model in a lookup table, when is the speed trade-off worth it?).
clock cycles per operation: any reference?
Started by starstutter, Jul 26 2009 11:22 PM
6 replies to this topic
#1
Posted 26 July 2009 - 11:22 PM
(\__/)
(='.'=) This is Bunny. Copy and paste bunny into
(")_(") your signature to help him gain world domination.
bunny also wants to fight spam: Click Here Bots!
#2
Posted 27 July 2009 - 01:32 AM
I don't know of a location where you can find that information for graphics chips, but maybe someone else does.
As for the second question, it depends a lot on the size/format of the texture and the pattern of accesses. GPUs have a cache for texture lookups, where they store recently-accessed areas of textures. If the texels you need are already in the cache, lookups are very fast; otherwise, they're very slow. So, generally speaking, if pixels near each other on-screen all look up the same texels, or texels near each other in the texture, you'll get better performance than if your texture lookups are scattered. Also, if you have too many different textures going into one shader, texture lookup speed will deteriorate due to cache thrashing. Narrower formats (e.g. DXT compressed or monochrome textures) will benefit more from the cache than wide ones (uncompressed RGBA, or gawd forbid, floating point). Also, if a texture is small enough to fit entirely in the cache, lookups to it will be very fast. Finally, some (many?) cards have separate caches for main memory and video memory, so you can actually improve performance sometimes by judiciously placing some textures in main memory.
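To make the locality point concrete, here is a toy sketch (not how any real GPU cache works). It assumes a hypothetical 256x256 texture, a cache that holds 16 tiles of 4x4 texels, and LRU eviction; all of those numbers and the `hit_rate` helper are made up for illustration. It only shows why nearby fetches hit the cache far more often than scattered ones.

```python
from collections import OrderedDict

def hit_rate(accesses, tile=4, capacity=16):
    """Fraction of texel fetches served by a toy LRU cache of tile x tile blocks."""
    cache = OrderedDict()  # tile coordinate -> None, ordered by recency of use
    hits = 0
    total = 0
    for u, v in accesses:
        key = (u // tile, v // tile)  # which tile this texel lives in
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark tile as most recently used
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used tile
    return hits / total if total else 0.0

SIZE = 256  # hypothetical texture width and height, in texels

# Coherent pattern: neighbouring pixels read neighbouring texels
# (a row-major walk over one 16x16 block of the texture).
coherent = [(x, y) for y in range(16) for x in range(16)]

# Scattered pattern: consecutive fetches land far apart in the texture.
scattered = [((i * 97) % SIZE, (i * 61) % SIZE) for i in range(256)]

if __name__ == "__main__":
    print("coherent hit rate: ", hit_rate(coherent))
    print("scattered hit rate:", hit_rate(scattered))
```

Both patterns issue the same 256 fetches; only the ordering and spread differ. Real hardware adds filtering footprints, multiple cache levels, and swizzled memory layouts, so treat the numbers as a qualitative illustration only.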
reedbeta.com - developer blog, OpenGL demos, and other projects
#3
Posted 27 July 2009 - 03:52 AM
Very good information to know, and that explains a lot.
Still looking for a table, however. Would NVIDIA or ATI (I am working with NVIDIA cards) have this information or, better yet, papers about it?
#5
Posted 27 July 2009 - 07:36 AM
Here's an in-depth analysis of the NVIDIA G80 architecture, which still represents their newer chips quite well.
Basically, they have shader clusters that work with vectors that are 16 elements wide. Each cluster has 16 scalar multiply-add units, so a scalar multiply-add takes just one clock cycle for the whole vector. Note that something like a dp4 instruction therefore takes 4 cycles to compute for 16 different pixels or vertices. Each cluster also has 4 special function units (SFUs). These have two roles in one: interpolation to compute input values (colors and texture coordinates), and computing transcendental operations (anything that is not a mul or add). Since there are only 4 SFUs, it takes 4 clock cycles to process a 16-element vector.
The actual cost really depends on your shader's ratio between multiply-add instructions and other instructions. For instance, if you have an m4x4 instruction (four dp4s, so 16 cycles of multiply-add work), that gives you room for doing 4 special operations in parallel, for free! The same principle is true for texture lookup:
Each cluster is further equipped with four texture address units and eight texture filter units. These run at a lower clock frequency, though. But if you can keep the number of texture lookups low, they can execute in parallel with the arithmetic operations.
As Reedbeta already mentioned, though, texture performance also depends on cache behaviour. Also, the G80's register file is relatively cramped, so this can influence performance in complex ways as well. And to top it off, other chips behave somewhat differently too.
But anyway, try not to think in terms of clock cycles, but in terms of the right instruction mixture to avoid bottlenecks. A balanced shader has several texture lookups, several transcendental operations, and many more multiplies and additions. Oh and don't panic if one shader doesn't have the right mixture; the GPU will try to run several shaders concurrently and maximize the use of each unit independently.
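As a rough back-of-the-envelope sketch of the "instruction mixture" idea, the snippet below uses only the unit counts quoted above (16-wide vectors, 16 multiply-add units, 4 SFUs per cluster). The `batch_cycles` helper is hypothetical, and it assumes the two unit types overlap perfectly. It ignores texture units, latency, register pressure, and scheduling, so it is an upper bound on how well things can go, not a prediction.

```python
VECTOR_WIDTH = 16  # elements processed together, per the description above
MAD_UNITS = 16     # scalar multiply-add units per cluster
SFU_UNITS = 4      # special function units per cluster

def batch_cycles(mad_ops, sfu_ops):
    """Estimate cycles for one 16-element batch, given scalar op counts per element.

    Assumes (optimistically) that MAD and SFU work overlaps perfectly,
    so the busier unit type sets the cost.
    """
    mad = mad_ops * VECTOR_WIDTH // MAD_UNITS  # 1 cycle per scalar MAD
    sfu = sfu_ops * VECTOR_WIDTH // SFU_UNITS  # 4 cycles per special op
    return {
        "mad": mad,
        "sfu": sfu,
        "bound": max(mad, sfu),
        "limited_by": "mad" if mad >= sfu else "sfu",
    }

if __name__ == "__main__":
    # A dp4 is 4 scalar multiply-adds per element.
    print("dp4:           ", batch_cycles(mad_ops=4, sfu_ops=0))
    # An m4x4 is 16 scalar multiply-adds; 4 special ops fit alongside it.
    print("m4x4 + 4 SFU:  ", batch_cycles(mad_ops=16, sfu_ops=4))
    # Too many special ops for the amount of math: the SFUs become the bottleneck.
    print("dp4 + 4 SFU:   ", batch_cycles(mad_ops=4, sfu_ops=4))
```

Under these assumptions the m4x4 case comes out balanced (16 cycles on both unit types), which matches the "4 special operations for free" remark, while the last case is SFU-bound.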
#6
Posted 27 July 2009 - 10:17 AM
As far as I'm aware, as a general rule:
old hardware = texture lookups faster
new hardware = math operations faster
#7
Posted 28 July 2009 - 09:56 PM
Thanks a lot for the info, guys. This is really helping.