Hey guys, quick question. I know it depends on the type of
hardware/driver/etc., but is there any sort of general reference of how
much power each type of operation (such as a texture sample or a dot3
operation) takes up, or how much time it takes to execute on modern
hardware? It seems a bit farfetched, but any type of reference would be
godly to me.
Oh, and while I’m on the subject, I was wondering how fast texture
samples are compared to pure math operations (i.e., for storing a complex
lighting model in a lookup table, when is the speed trade-off worth
I don’t know of a location where you can find that information for
graphics chips, but maybe someone else does.
As for the second question, it depends a lot on the size/format of the
texture and the pattern of accesses. GPUs have a cache for texture
lookups, where they store recently-accessed areas of textures. If the
texels you need are already in the cache, lookups are very fast;
otherwise, they’re very slow. So, generally speaking, if pixels near
each other on-screen all look up the same texels, or texels near each
other in the texture, you’ll get better performance than if your texture
lookups are scattered. Also, if you have too many different textures
going into one shader, texture lookup speed will deteriorate due to
cache thrashing. Narrower formats (e.g. DXT compressed or monochrome
textures) will benefit more from the cache than wide ones (uncompressed
RGBA, or gawd forbid, floating point). Also, if a texture is small
enough to fit entirely in the cache, lookups to it will be very fast.
Finally, some (many?) cards have separate caches for main memory and
video memory, so you can actually improve performance sometimes by
judiciously placing some textures in main memory.
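To make the cache argument above concrete, here is a minimal sketch of a toy LRU texture cache in Python. It is not a model of any real GPU: the 4x4 texel tile size, the 64-tile capacity, the LRU policy, and all function names are assumptions chosen only for illustration. It compares coherent lookups, scattered lookups in a large texture, and scattered lookups in a texture small enough to fit entirely in the cache.

```python
from collections import OrderedDict
import random

def hit_rate(texel_coords, tile=4, cache_tiles=64):
    """Fraction of lookups served by a toy LRU cache of tile x tile texel blocks.

    tile and cache_tiles are made-up illustrative values, not real hardware figures.
    """
    cache = OrderedDict()
    hits = 0
    for u, v in texel_coords:
        key = (u // tile, v // tile)       # which cached block this texel lives in
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark block as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_tiles:
                cache.popitem(last=False)  # evict least recently used block
    return hits / len(texel_coords)

def coherent_lookups(size=64):
    """Neighbouring pixels read neighbouring texels (row-major sweep)."""
    return [(x, y) for y in range(size) for x in range(size)]

def scattered_lookups(count=4096, texture_size=1024, seed=0):
    """Every lookup lands on a random texel of a texture_size^2 texture."""
    rng = random.Random(seed)
    return [(rng.randrange(texture_size), rng.randrange(texture_size))
            for _ in range(count)]

if __name__ == "__main__":
    print("coherent:            ", hit_rate(coherent_lookups()))
    print("scattered, 1024x1024:", hit_rate(scattered_lookups()))
    print("scattered, 16x16:    ", hit_rate(scattered_lookups(texture_size=16)))
```

In this toy model the coherent sweep and the small texture hit the cache almost every time, while scattered lookups into the large texture almost never do. Real caches differ in size, layout, and replacement policy, so treat the numbers as a qualitative illustration only.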
Very good information to know, and that explains a lot.
Still looking for a table however. Would nVidia or ATI (although I am
working with nVidia cards) have this information or (better yet) papers
Your best bet is to use nVidia ShaderPerf.
Here’s an in-depth analysis of the NVIDIA G80 architecture, which still
represents their newer chips quite well.
Basically, they have shader clusters that work with vectors that are 16
elements wide. Each cluster has 16 scalar multiply-add units, so a
scalar multiply-add takes just one clock cycle for the whole vector.
Note that something like a dp4 instruction takes 4 cycles to compute
for 16 different pixels or vertices. Each cluster also has 4 special
function units (SFUs). These have two roles in one: interpolation to
compute input values (colors and texture coordinates), and computing
transcendental operations (anything not mul or add). Since there are
only 4 SFUs, it takes 4 clock cycles to process a 16-element vector.
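The arithmetic behind those cycle counts can be sketched in a few lines of Python. This only restates the figures from the post (16-wide vectors, 16 multiply-add units, 4 SFUs per cluster); the function names are made up for illustration, and the model ignores latency, scheduling, and everything else a real chip does.

```python
VECTOR_WIDTH = 16  # elements (pixels or vertices) processed together
MAD_UNITS = 16     # scalar multiply-add units per cluster
SFU_UNITS = 4      # special function units per cluster

def mad_cycles(scalar_mads_per_element):
    """Cycles to issue the given number of scalar multiply-adds for one vector."""
    return scalar_mads_per_element * VECTOR_WIDTH // MAD_UNITS

def sfu_cycles(special_ops_per_element):
    """Cycles to issue the given number of special-function ops for one vector."""
    return special_ops_per_element * VECTOR_WIDTH // SFU_UNITS

if __name__ == "__main__":
    print("scalar mad:", mad_cycles(1))   # one multiply-add per element
    print("dp4:       ", mad_cycles(4))   # four scalar multiply-adds per element
    print("one SFU op:", sfu_cycles(1))   # e.g. one transcendental per element
```

Under these assumptions a single transcendental costs as many issue cycles as a dp4, which is why the ratio between the two kinds of instruction matters more than either count alone.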
The actual cost really depends on your shader’s ratio between
multiply-add instructions and other instructions. For instance if you
have an m4x4 instruction, that gives you room for doing 4 special
operations in parallel, for free! The same principle is true for texture
Each cluster is further equipped with four texture address units, and
eight texture filter units. These run at lower clock frequency though.
But if you can keep the number of texture lookups low, they can execute
in parallel with the arithmetic operations.
As Reedbeta already mentioned though, texture performance also depends
on cache behaviour. Also the G80’s register file is relatively cramped
so this can influence performance in complex ways as well. And to top it
off, other chips behave somewhat differently too.
But anyway, try not to think in terms of clock cycles, but in terms of
the right instruction mixture to avoid bottlenecks. A balanced shader
has several texture lookups, several transcendental operations, and many
more multiplies and additions. Oh and don’t panic if one shader doesn’t
have the right mixture; the GPU will try to run several shaders
concurrently and maximize the use of each unit independently.
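As a rough way to reason about instruction mixture, here is a small Python sketch that estimates which unit a shader keeps busiest, assuming the units overlap perfectly. The unit counts come from the post above; the texture clock ratio of 0.5 is a placeholder assumption (the post only says the texture units run at a lower clock), and the function names are hypothetical. It is a back-of-the-envelope model, not a substitute for a tool like ShaderPerf.

```python
VECTOR_WIDTH = 16
# Units per cluster as described in the post; "tex" uses the 4 address units.
UNITS = {"mad": 16, "sfu": 4, "tex": 4}

def unit_cycles(ops_per_element, tex_clock_ratio=0.5):
    """Busy cycles per unit for one 16-wide vector.

    tex_clock_ratio is an assumed placeholder, not a measured hardware figure.
    """
    cycles = {}
    for unit, count in ops_per_element.items():
        c = count * VECTOR_WIDTH / UNITS[unit]
        if unit == "tex":
            c /= tex_clock_ratio  # slower clock means more shader cycles
        cycles[unit] = c
    return cycles

def bottleneck(ops_per_element, tex_clock_ratio=0.5):
    """Unit with the most busy cycles, assuming perfect overlap between units."""
    cycles = unit_cycles(ops_per_element, tex_clock_ratio)
    unit = max(cycles, key=cycles.get)
    return unit, cycles[unit]

if __name__ == "__main__":
    # m4x4 (16 multiply-adds) plus 4 special ops: both units busy for 16 cycles.
    print(unit_cycles({"mad": 16, "sfu": 4}))
    # A fifth special op tips the balance toward the SFUs.
    print(bottleneck({"mad": 16, "sfu": 5}))
    # One texture lookup hides behind the arithmetic in this toy model.
    print(bottleneck({"mad": 16, "sfu": 2, "tex": 1}))
```

The first case is the m4x4 example: the four special operations finish in the same 16 cycles as the multiply-adds, so they are effectively free as long as the hardware can overlap them.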
As far as I’m aware, as a general rule:
old hardware = texture lookups faster
new hardware = math operations faster
Thanks a lot for the info guys, this is really helping.