# Multi-pass or Multi-tap?

18 replies to this topic

### #1Steven Hansen

Member

• Members
• 86 posts

Posted 26 April 2005 - 09:59 PM

I'm writing a pixel shader 2.0 implementation of a pixel-averaging filter, which is separable. It can be arbitrarily broken up into numerous passes with 3 taps each, or fewer passes with up to 13 (max?) taps each.

Speed-wise, is it better to have fewer taps, or fewer passes? It seems that two passes with 7 samples each outperforms a single pass with 13 samples, but is that typical? Is the tradeoff hardware dependent? Where is the cutoff? What is the smartest approach for the best performance on the widest variety of hardware (considering ps2.0 must be present)?

Thanks.

### #2Nick

Senior Member

• Members
• 1227 posts

Posted 28 April 2005 - 02:04 PM

Pixel averaging. :huh:

Can't you just use the card's mipmap generation and/or bilinear filtering features?

### #3Steven Hansen

Member

• Members
• 86 posts

Posted 28 April 2005 - 04:08 PM

Nick, thank you for stabbing at it. :)

Originally I hoped that by casually mentioning a particular domain, solvable by both pixel shaders and the FFP, I would give a concrete picture of the motivation behind my intended topic. In this case, I don't need a solution to pixel averaging; I just want to know the cost tradeoffs between multi-pass and multi-tap... generally.

I've run a few tests locally, but the results are sketchy, and probably specific to hardware (X800 Pro in this case).

Posed another way: with an FFP solution in hand, is it worth the time, hassle, and complexity to develop a pixel shader 2.0 solution as well? Obviously yes if the algorithm itself can be improved (logarithmic vs. polynomial complexity, etc.). Sometimes, though, the difference is measured in taps vs. passes. So, more taps but fewer passes means what? Which is computationally less expensive on today's hardware? Where is the break-even point? Is there a rule of thumb? Is more taps better than more passes? Vice versa? Is there any official information on this anywhere?

Thanks.

### #4Nick

Senior Member

• Members
• 1227 posts

Posted 28 April 2005 - 07:11 PM

Sorry for taking it too literally. :wink:

Ok, modern graphics cards implement the fixed-function pipeline with shaders. So in theory, it's best to just write an optimized shader manually. Generally you should also not bother splitting it into multiple passes. The driver already takes care of that, should the number of instructions exceed the internal maximum.

Unfortunately, the driver can't do miracles and every card has a different configuration. So there's no definitive answer. Either way, you should rely on the driver to save yourself from a lot of work. :excl:

### #5Steven Hansen

Member

• Members
• 86 posts

Posted 28 April 2005 - 08:55 PM

Maybe an illustration will help. Remember that this just highlights the problem domain, not a particular something I want solved.

Gaussian Blur
Source: http://www-personal.engin.umd.umich.edu/~j...81/21_GBlur.pdf

As outlined in the source above, one approach to the Gaussian blur is to use Pascal's triangle for the weights. The samples are performed first horizontally, then vertically (which takes 3 fewer samples than doing it all at once). The horizontal/vertical weights for a 2-pass 3x3 approximation are: 1 2 1. The entire 3x3 filter result is as follows:

1 2 1
2 4 2
1 2 1

Applying this filter four times is exactly equivalent (minus roundoff errors) to a 9x9 filter. 9x9 weights: 1 8 28 56 70 56 28 8 1 (I don't really want to present the entire 9x9 filter here). Considering the 9x9 blur as the goal, there are functionally different ways to compute it.

1. a) Perform a horizontal 1 2 1 blur (3-tap, 1-pass). b) Perform a vertical 1 2 1 blur on the result. c) Repeat the process three more times.
Passes / Taps: 8 passes, 3 taps each, 24 taps total.
This approach may be performed on FFP hardware by manipulating vertex colors and texture coordinates, provided that the hardware supports three texture samples per pass.

2. a) Perform a 3x3 blur with 9 taps (apply the entire 3x3 filter). b) Repeat step a) three more times.
Passes / Taps: 4 passes, 9 taps each, 36 taps total.
This approach requires pixel shader 2.0 for the 9 taps per pass.
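The claimed equivalence between four applications of the 3-tap kernel and one 9-tap binomial kernel is easy to sanity-check numerically (a quick sketch; `convolve` here is a plain hand-rolled 1-D convolution, not any library function):

```python
def convolve(a, b):
    # Full discrete convolution of two 1-D integer sequences.
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Convolving [1, 2, 1] with itself three more times gives the 9-tap
# kernel: row 8 of Pascal's triangle.
kernel = [1, 2, 1]
wide = kernel
for _ in range(3):
    wide = convolve(wide, kernel)

print(wide)  # [1, 8, 28, 56, 70, 56, 28, 8, 1]
```

The same identity holds per axis, which is why repeated small separable passes can stand in for one wide filter (up to roundoff).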

Differences
There are other solutions, but let's start with these two. Note that approach 2 uses half as many passes, but 50% more taps than approach 1. Since I must support older hardware (NVidia MX440, etc) I would have already implemented approach 1. Simply looking at the mechanics of approach 2, I can't tell if it provides sufficient (any?) advantage over approach 1. :huh:

ATI's video shader sample uses approach 2 for Gaussian blur. Is that because it's faster, because it's easier to program, or because it highlights the power of 2.0? I just don't know! :sad:

Additionally, I wonder whether performing 16 taps in 4 passes (4 taps per pass) is better than 16 taps in 8 passes (2 taps per pass) - in the second case pixel shaders will be shorter, but run twice as often.

### #6Steven Hansen

Member

• Members
• 86 posts

Posted 28 April 2005 - 09:00 PM

Addendum to the lengthy previous response--- :)

I've written both of the solutions outlined above and tested them on my machine (x800 pro). I'm not confident that this represents a healthy sample of 2.0 hardware though. For the x800 pro, both of these run in about the same amount of time. I'm not certain what to infer from that. :wacko:

### #7Nick

Senior Member

• Members
• 1227 posts

Posted 29 April 2005 - 02:43 PM

Steven Hansen said:

I've written both of the solutions outlined above and tested them on my machine (x800 pro). I'm not confident that this represents a healthy sample of 2.0 hardware though.
A Radeon X800 is highly optimized for pixel shaders 2.0. The fixed-function pipeline is also fully implemented using shaders.

Quote

For the x800 pro, both of these run in about the same amount of time. I'm not certain what to infer from that.
Extra passes don't really cost anything extra. Rasterization and blending are fully pipelined. The only thing that matters is the number of texture sampling operations and arithmetic operations. The driver will optimize both of your implementations so they actually end up looking much alike.

### #8Steven Hansen

Member

• Members
• 86 posts

Posted 29 April 2005 - 04:23 PM

My question hasn't been answered. :mellow: Computationally, including on slower 2.0 hardware, is it better to have fewer passes with more taps, or more passes with fewer taps? ... And if it depends, what is the dependency, as precisely as possible?

Quote

Extra passes don't really cost anything extra
What? How is that possible unless you have a different understanding of passes than I do? Each pass draws a source (using the set vertex and pixel shader) to a destination - usually of exactly the same size. How can drawing the image over and over cost the same as just once?

Quote

in theory, it's best to just write an optimized shader manually. Generally you should also not bother splitting it in mutiple passes
The effect I outlined above requires multiple passes to work correctly. The result of one pass is the input to the next pass. Additionally, determining whether it is optimized or not depends on how expensive the pass vs tap ratio is.

Quote

The driver will optimize both your implementation so they actually look much alike.
We wouldn't expect a compiler to recognize that certain code is executing a linear sort vs a quicksort. We don't expect a compiler to change a linear sort to a quicksort behind our backs! :excl: Linear sort may even be more optimal for our data (mostly sorted already).

In exactly the same way, the approach I outlined above, as well as a slew of other effects, must be followed to the letter for correctness. Multi-pass algorithms build on the results of previous passes. Texture samples must be precisely accurate. The driver simply cannot optimize away previous passes or differing quantities of taps without altering the mathematical result!

Quote

A Radeon X800 is highly optimized for pixel shaders 2.0. The fixed-function pipeline is also fully implemented using shaders.
Yes, I know (good thing too :) ). These facts do not change or address my question though.

### #9JSoftware

Member

• Members
• 70 posts

Posted 30 April 2005 - 02:31 PM

As I see it, multipassing is evil! You have to shovel all the geometry through the pipeline once again. In the singlepass solution you have the same number of texture fetches as in the multipass solution, but you just don't have to send double the geometry through..

in my world that's pure logic :cool:
Peregrinus, expectavi pedos meos in cymbalis!

### #10Reedbeta

DevMaster Staff

• 5344 posts
• LocationSanta Clara, CA

Posted 30 April 2005 - 04:18 PM

JSoftware, ever heard of deferred shading? ;-)

Although that actually doesn't seem relevant here. Gaussian blur is an image algorithm, the source information is just the scene rendered to texture. There would be no need to put the geometry through more than once.
reedbeta.com - developer blog, OpenGL demos, and other projects

### #11Nick

Senior Member

• Members
• 1227 posts

Posted 30 April 2005 - 07:47 PM

Steven Hansen said:

What? How is that possible unless you have a different understanding of passes than I do?  Each pass draws a source (using the set vertex and pixel shader) to a destination - usually of exactly the same size.  How can drawing the image over and over cost the same as just once?
I assumed that there's a low number of polygons, so you wouldn't be vertex processing limited. Is this the case?

Drawing 'over and over' again is indeed not a significant problem. The graphics card can read back the pixel's color long beforehand (in parallel), run the shader, and blend the result in parallel again. There are some extra bandwidth requirements, but that's it.

Look at it this way: Suppose we have four big arrays, a[][], b[][], c[][] and d[][], and we would like to compute a = b + c + d for every element. We can do this in two ways:
for (each element i, j)
{
    a[i][j] = b[i][j] + c[i][j] + d[i][j];
}

or:

for (each element i, j)
{
    a[i][j] = b[i][j] + c[i][j];
}

for (each element i, j)
{
    a[i][j] = a[i][j] + d[i][j];
}

Now suppose we have a processor which can do one addition every clock cycle, and read two values and write one value per clock cycle. Then the first and second approach will execute in exactly the same time! The bottleneck is the additions.
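This throughput argument can be checked with a toy cost model in code (a sketch; the per-clock limits of one addition, two reads, and one write are the hypothetical numbers given above, not real hardware figures):

```python
import math

def cycles_per_element(reads, writes, adds):
    # Toy machine: 2 reads, 1 write, and 1 addition retire per clock.
    # Each resource works in parallel, so the cost per element is
    # whichever resource is the bottleneck.
    return max(math.ceil(reads / 2), writes, adds)

# Single pass, a = b + c + d: 3 reads, 1 write, 2 additions per element.
single_pass = cycles_per_element(reads=3, writes=1, adds=2)

# Two passes: a = b + c, then a = a + d (each 2 reads, 1 write, 1 add).
two_pass = cycles_per_element(2, 1, 1) + cycles_per_element(2, 1, 1)

print(single_pass, two_pass)  # 2 2 -- identical, as claimed
```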

Likewise, when vertex processing and memory bandwidth aren't an issue, modern graphics cards are shader limited. Especially long shaders (like yours) show this behaviour.

Quote

The effect I outlined above requires multiple passes to work correctly.  The result of one pass is the input to the next pass.  Additionally, determining whether it is optimized or not depends on how expensive the pass vs tap ratio is.
What I meant is that generally you shouldn't bother splitting it into extra passes. Obviously it always helps to keep the number of passes low, if possible. The driver will determine further limitations and split passes when required.

Quote

We wouldn't expect a compiler to recognize that certain code is executing a linear sort vs a quicksort.  We don't expect a compiler to change a linear sort to a quicksort behind our backs!  :excl:  Linear sort may even be more optimal for our data (mostly sorted already).
A GPU is not a CPU. There's still a very strict processing order and dynamic jumps in the shader are certainly not popular. There's also a strict stream of input and output data. So the driver can do a lot of analysis 'behind our back'. Every function call between BeginScene and EndScene is buffered and the graphics card might actually start a draw call more than two frames after we made the actual call! Of course the possibilities aren't infinite, but seriously, don't underestimate the job of the driver.

Quote

In exactly the same way, the approach I outlined above, as well as a slew of other effects, must be followed to the letter for correctness.  Multi-pass algorithms build on the results of previous passes.  Texture samples must be precisely accurate.  The driver simply cannot optimize away previous passes or differing quantities of taps without altering the mathematic result!
It probably won't eliminate passes, that's true. But as I explained above, it actually doesn't have to when you're not vertex processing or rasterization limited.

All I'm trying to say is: it's useless to try to optimize these sorts of things. The tap/pass ratio you're talking about simply doesn't determine the bottleneck. The operation you're trying to perform is arithmetic-intensive and you can't magically lower the workload. Minor but complicated effects will determine the final performance here, so all you can do is measure the performance in practice. Personally, I'd just pick the most general approach and stop wasting time...

### #12JSoftware

Member

• Members
• 70 posts

Posted 01 May 2005 - 07:35 PM

Reedbeta said:

JSoftware, ever heard of deferred shading? ;-)

Although that actually doesn't seem relevant here. Gaussian blur is an image algorithm, the source information is just the scene rendered to texture. There would be no need to put the geometry through more than once.

oh yeah! i've actually just started making a deferred shader engine :cool:
Peregrinus, expectavi pedos meos in cymbalis!

### #13Steven Hansen

Member

• Members
• 86 posts

Posted 02 May 2005 - 05:16 PM

:blink: Sadly, either I am having trouble communicating the situation, or I have simply addressed this (apparently) highly technical issue to the wrong forum. I wanted to post at DirectXDev, but it isn't really directx specific.

Further illustration: 4 vertices: 1 fullscreen quad. Source and destination are both 1600x1200 resolution.

First scenario: 4 passes, 1 tap each. 4 total reads per source texel. 4 writes per destination pixel/texel. Four intermediate render target textures of size 1600x1200. Algorithm: set source, draw to target 1 - set target 1 as source, draw to target 2 - set target 2 as source, draw to target 3, etc.

Second scenario: 2 passes, 3 taps each. 6 total reads per source texel. 2 writes per destination pixel/texel. Two intermediate render target textures of size 1600x1200 (yay! half the resources). Same algorithm.

Scenario one has 7,680,000 reads and 7,680,000 writes. Scenario two has 11,520,000 reads (50% higher) and 3,840,000 writes (half scenario 1). The driver has no choice but to make all of those reads, and all of those writes, or the solution will be incorrect.
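For concreteness, those read/write totals can be tabulated with a few lines of arithmetic (a sketch; the resolution and pass/tap counts are taken from the two scenarios above):

```python
WIDTH, HEIGHT = 1600, 1200
PIXELS = WIDTH * HEIGHT  # 1,920,000 texels touched per full-screen pass

def totals(passes, taps_per_pass):
    # Every pass samples the source 'taps_per_pass' times and writes
    # once for each destination pixel.
    reads = passes * taps_per_pass * PIXELS
    writes = passes * PIXELS
    return reads, writes

print(totals(passes=4, taps_per_pass=1))  # (7680000, 7680000)
print(totals(passes=2, taps_per_pass=3))  # (11520000, 3840000)
```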

I expect the process is nicely pipelined, and many reads/writes are done in parallel. That's good. However, it is not safe to say that the pipelining is so efficient that writes are free, or that writes and reads have the same cost. If that were so, then the entire cost could be summed from the number of reads and writes. From a previous post with two gaussian blur approximations:

Quote

Passes / Taps: 8 passes, 3 taps each, 24 taps total.
Passes / Taps: 4 passes, 9 taps each, 36 taps total.
If we sum reads and writes, then 24 reads, 8 writes (24+8 = 32) would be faster than 36 reads, 4 writes (36+4 = 40). If we think writes are free (pipelined), then 24 reads is faster than 36 reads. Regardless, my tests showed both approaches running at about the same speed.

So, the measurement is not linear in terms of reads vs writes. Now that (hopefully) this has been established (my original question), I'm sadly guessing that the general relative costs are unknown by the devmaster audience.

### #14Reedbeta

DevMaster Staff

• 5344 posts
• LocationSanta Clara, CA

Posted 02 May 2005 - 07:52 PM

The general relative costs are probably unknown to most anyone. It seems to me you could discover these empirically without too much trouble, by comparing the performance of various combinations of passes and taps. For example, the above data you posted suggests that writes cost about 3 times as much as reads, for your specific application. Also, keep in mind that this probably varies with the card used.

Also, there's no need for four render targets in your scenario 1. You don't need the contents of target 1 after you've used it to render target 2 - so just use two render targets, and alternate which is the source and which the destination.
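The two-target ping-pong pattern looks like this in outline (a sketch; `blur_pass` is a hypothetical stand-in for the real draw call, since only the buffer-swapping pattern matters here):

```python
def blur_pass(src, dst):
    # Stand-in for one render pass: writes a processed copy of src
    # into dst. Incrementing lets us count how many passes ran.
    for i in range(len(src)):
        dst[i] = src[i] + 1

def run_passes(image, passes):
    # Two render targets suffice for any number of chained passes:
    # after each pass, swap which buffer is source and which is
    # destination, since the old source contents are no longer needed.
    buf_a, buf_b = list(image), [0] * len(image)
    src, dst = buf_a, buf_b
    for _ in range(passes):
        blur_pass(src, dst)
        src, dst = dst, src  # ping-pong
    return src  # the buffer written by the final pass

print(run_passes([0, 0, 0], 4))  # [4, 4, 4]
```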
reedbeta.com - developer blog, OpenGL demos, and other projects

### #15Nick

Senior Member

• Members
• 1227 posts

Posted 02 May 2005 - 09:42 PM

Steven Hansen said:

:blink: Sadly, either I am having trouble communicating the situation, or I have simply addressed this (apparently) highly technical issue to the wrong forum.
You're communicating it just fine, and I've done a summer internship at NVIDIA, so I don't think I have problems understanding the technical aspects.

Like Reedbeta says, nobody really knows the definitive answer. You can't really predict these kinds of things very exactly. We can sort of model the bandwidth needs and the shader arithmetic cost and take a guess at what the bottleneck is. But the only way to get answers is to actually measure. And as I've said before, don't underestimate the driver. Compression can alleviate the bandwidth problem, and there might be some hidden processing unit that speeds up shader execution in certain cases.

So what I meant to say from the start, in a gentle way, is get over it. There is no magical tap/pass formula. If it's worth your time, try another approach and measure the results. You might even want to dynamically select one approach based on the result on each system individually.
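A minimal harness for that per-system selection might look like this (a sketch; the candidate callables, names, and run count are placeholders, not the actual blur code):

```python
import time

def pick_fastest(candidates, runs=50):
    # Time each candidate implementation over several runs and return
    # the name of the fastest one on this particular machine.
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(runs):
            fn()
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

# Dummy stand-ins for two blur strategies with different costs:
best = pick_fastest({
    "few_passes_many_taps": lambda: sum(range(20000)),
    "many_passes_few_taps": lambda: None,
})
print(best)
```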

Anyway, I can suggest to post your question on the Beyond3D forum. The people there know more about graphics chips than the designers. :lol: But don't be surprised if (after many pages of educated guessing) the conclusion will be the same...

### #16Steven Hansen

Member

• Members
• 86 posts

Posted 03 May 2005 - 07:44 AM

Sadly, it is WAY too much work to profile all these different approaches. And - due to the nature of my current job - I am going to run into this problem with steadily increasing frequency. ARGH!

The problem is that I don't have a nice baseline. Something like, "oh... shoot for something with fewer overall reads." So I get to implement like four or five possible approaches and then profile them all. Argh. :angry:

Perhaps most frustrating is that for the pixel averaging algorithm, I have an old FFP solution that is faster than the exact same algorithm written in vertex and pixel shaders :excl: ... but for some reason roundoff errors appear to be more significant (looks bad). So now I don't know how to justify starting any particular approach. It *could* be better, but it seems like without some knowledge about the relative costs, it could also be worse. :sad:

### #17auld

New Member

• Members
• 7 posts

Posted 27 October 2007 - 10:52 AM

Step back and you'll see the answer.

You may see faster performance today, but shaders are increasing in performance at an incredible rate. Therefore, tomorrow a shader solution will be faster. We also know it's already better looking. QED: shader solution.

2. MP v MT
We know your "pass" is a single quad. So frankly, any effort in optimising the number of passes is largely wasted (assuming it's full screen, so no clears etc. are used). However, we know considerable effort is put into optimising TAPs by card manufacturers (faster memory, buses, caches etc.). This is because they are precious. QED: choose the method that optimises TAPs.

In my experience (which is a lot; I used to manage OpenGL driver development), leaving anything to driver developers is naive.
--

### #18J22

Member

• Members
• 92 posts

Posted 27 October 2007 - 11:43 AM

auld said:

In my experience (which is a lot, I used to manage Opengl driver development) leaving anything to driver developers is naive.
So true. As a developer you have high-level knowledge of the algorithms used and can choose the best option that matches the hardware. It's unfortunate ATI/Nvidia don't want to reveal much relevant information that would help you with the choice. Graphics hardware is changing quite a lot as well, and a good algorithm for the previous generation of hardware may not be such a good choice for the next.

### #19Goz

Senior Member

• Members
• 575 posts

Posted 29 October 2007 - 04:14 PM

auld said:

In my experience (which is a lot, I used to manage Opengl driver development) leaving anything to driver developers is naive.

You just need to look at PowerVR to see that ;)
