

clock cycles per operation: any reference?


6 replies to this topic

#1 starstutter

    Senior Member

  • Members
  • 1039 posts

Posted 26 July 2009 - 11:22 PM

Hey guys, quick question. I know it depends on the hardware, driver, etc., but is there any sort of general reference for how expensive each type of operation (such as a texture sample or a dot3 operation) is, or how much time it takes to execute on modern hardware? It seems a bit far-fetched, but any type of reference would be godly to me.

Thanks :)

Oh, and while I'm on the subject, I was wondering how fast texture samples are compared to pure math operations (i.e., when storing a complex lighting model in a lookup table, at what point is the speed trade-off worth it?).
(\__/)
(='.'=)
This is Bunny. Copy and paste bunny into
(")_(") your signature to help him gain world domination.
bunny also wants to fight spam: Click Here Bots!

#2 Reedbeta

    DevMaster Staff

  • Administrators
  • 5309 posts
  • Location: Santa Clara, CA

Posted 27 July 2009 - 01:32 AM

I don't know of a location where you can find that information for graphics chips, but maybe someone else does.

As for the second question, it depends a lot on the size/format of the texture and the pattern of accesses. GPUs have a cache for texture lookups, where they store recently-accessed areas of textures. If the texels you need are already in the cache, lookups are very fast; otherwise, they're very slow. So, generally speaking, if pixels near each other on-screen all look up the same texels, or texels near each other in the texture, you'll get better performance than if your texture lookups are scattered. Also, if you have too many different textures going into one shader, texture lookup speed will deteriorate due to cache thrashing. Narrower formats (e.g. DXT compressed or monochrome textures) will benefit more from the cache than wide ones (uncompressed RGBA, or gawd forbid, floating point). Also, if a texture is small enough to fit entirely in the cache, lookups to it will be very fast. Finally, some (many?) cards have separate caches for main memory and video memory, so you can actually improve performance sometimes by judiciously placing some textures in main memory.
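To make the access-pattern point concrete, here is a toy sketch in Python. It is not a model of any real GPU: the cache size, the 4x4 tile size, the LRU policy, and the names `hit_rate`, `SIZE`, `coherent`, and `scattered` are all assumptions chosen for illustration. It only shows how a small cache of texture tiles rewards neighbouring lookups and punishes scattered ones.

```python
from collections import OrderedDict
import random

def hit_rate(accesses, cache_tiles=64, tile=4):
    """Fraction of texel reads served by a toy LRU cache of tile x tile blocks.

    cache_tiles and tile are made-up numbers, not real hardware figures.
    """
    cache = OrderedDict()
    hits = 0
    for x, y in accesses:
        key = (x // tile, y // tile)      # which tile this texel lives in
        if key in cache:
            hits += 1
            cache.move_to_end(key)        # mark tile as most recently used
        else:
            cache[key] = True             # fetch the whole tile on a miss
            if len(cache) > cache_tiles:
                cache.popitem(last=False) # evict the least recently used tile
    return hits / len(accesses)

SIZE = 256  # hypothetical 256x256 texture

# Coherent: neighbouring pixels read neighbouring texels (row-by-row scan).
coherent = [(x, y) for y in range(SIZE) for x in range(SIZE)]

# Scattered: the same number of reads at pseudo-random texel positions.
rng = random.Random(0)
scattered = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in coherent]

print("coherent hit rate: ", hit_rate(coherent))
print("scattered hit rate:", hit_rate(scattered))
```

In this toy setup the coherent scan hits the cache on the large majority of reads, while the scattered pattern almost never does. Real hardware differs in every detail, so the only way to know for a given shader is to measure it.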
reedbeta.com - developer blog, OpenGL demos, and other projects

#3 starstutter

    Senior Member

  • Members
  • 1039 posts

Posted 27 July 2009 - 03:52 AM

Very good information to know, and that explains a lot.

Still looking for a table, however. Would nVidia or ATI (although I am working with nVidia cards) have this information or (better yet) papers about it?

#4 JarkkoL

    Senior Member

  • Members
  • 475 posts

Posted 27 July 2009 - 07:17 AM

Your best bet is to use nVidia ShaderPerf.

#5 Nick

    Senior Member

  • Members
  • 1227 posts
  • Location: Ottawa, Ontario, Canada

Posted 27 July 2009 - 07:36 AM

Here's an in-depth analysis of the NVIDIA G80 architecture, which still represents their newer chips quite well.

Basically, they have shader clusters that work with vectors that are 16 elements wide. Each cluster has 16 scalar multiply-add units, so these operations take just one clock cycle. Note that something like a dp4 instruction takes 4 cycles to compute for 16 different pixels or vertices. Each cluster also has 4 special function units (SFUs). These have two roles in one: interpolation to compute input values (colors and texture coordinates), and computing transcendental operations (anything not mul or add). Since there are only 4 SFUs, it takes 4 clock cycles to process a 16-element vector.
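The arithmetic above can be written down as a small Python sketch. The unit counts come straight from this post; the function names `mad_cycles` and `sfu_cycles` are made up for illustration, and the model only counts issue cycles for one 16-element vector. It ignores latency, dependencies, and scheduling, so treat it as back-of-the-envelope only.

```python
import math

LANES = 16      # elements per vector, as described in the post
MAD_UNITS = 16  # scalar multiply-add units per cluster, per the post
SFU_UNITS = 4   # special function units per cluster, per the post

def mad_cycles(scalar_mads_per_element):
    """Cycles for the MAD units to run an instruction over one 16-wide vector."""
    return math.ceil(scalar_mads_per_element * LANES / MAD_UNITS)

def sfu_cycles(special_ops_per_element):
    """Cycles for the SFUs to run special ops over one 16-wide vector."""
    return math.ceil(special_ops_per_element * LANES / SFU_UNITS)

print(mad_cycles(1))  # a single scalar mul/add/mad: 1 cycle
print(mad_cycles(4))  # dp4 = 4 scalar multiply-adds per element: 4 cycles
print(sfu_cycles(1))  # one transcendental op: 4 cycles
```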

The actual cost really depends on your shader's ratio between multiply-add instructions and other instructions. For instance, if you have an m4x4 instruction, that gives you room for doing 4 special operations in parallel, for free! The same principle is true for texture lookup:

Each cluster is further equipped with four texture address units and eight texture filter units. These run at a lower clock frequency, though. But if you can keep the number of texture lookups low, they can execute in parallel with the arithmetic operations.
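As a rough Python sketch of the "runs in parallel, so it's free" idea: if the MAD units, the SFUs, and the texture units all work at the same time, the cost of a block of shader code is set by whichever unit is busiest. The function name `cluster_cycles` and the `texture_cycles` parameter are hypothetical. No cycle figure for the texture units is given here, so that value has to be supplied by the caller. Perfect overlap is also an assumption; real shaders have dependencies and register pressure that this ignores.

```python
import math

LANES, MAD_UNITS, SFU_UNITS = 16, 16, 4  # figures taken from the post

def cluster_cycles(scalar_mads, special_ops, texture_cycles=0):
    """Estimated cycles for one 16-wide vector, assuming perfect overlap.

    scalar_mads    -- scalar multiply-adds per element
    special_ops    -- transcendental/interpolation ops per element
    texture_cycles -- caller-supplied guess for texture unit time (hypothetical)
    """
    mad = math.ceil(scalar_mads * LANES / MAD_UNITS)
    sfu = math.ceil(special_ops * LANES / SFU_UNITS)
    return max(mad, sfu, texture_cycles)  # the busiest unit sets the cost

# m4x4 = 4 dp4 = 16 scalar multiply-adds per element.
print(cluster_cycles(16, 0))  # 16: MAD-bound
print(cluster_cycles(16, 4))  # 16: four special ops hide behind the m4x4
print(cluster_cycles(16, 5))  # 20: a fifth special op makes the SFUs the bottleneck
```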

As Reedbeta already mentioned though, texture performance also depends on cache behaviour. Also, the G80's register file is relatively cramped, so this can influence performance in complex ways as well. And to top it off, other chips behave somewhat differently too.

But anyway, try not to think in terms of clock cycles, but in terms of the right instruction mixture to avoid bottlenecks. A balanced shader has several texture lookups, several transcendental operations, and many more multiplies and additions. Oh and don't panic if one shader doesn't have the right mixture; the GPU will try to run several shaders concurrently and maximize the use of each unit independently.

#6 Phlex

    Member

  • Members
  • 53 posts

Posted 27 July 2009 - 10:17 AM

As far as I'm aware, as a general rule:

old hardware = texture lookups faster
new hardware = math operations faster

#7 starstutter

    Senior Member

  • Members
  • 1039 posts

Posted 28 July 2009 - 09:56 PM

Thanks a lot for the info, guys. This is really helping.
