Multi-pass or Multi-tap?
#1
Posted 26 April 2005 - 09:59 PM
Speed-wise, is it better to have fewer taps, or fewer passes? It seems that two passes with 7 samples each outperforms a single pass with 13 samples, but is that typical? Is the tradeoff hardware dependent? Where is the cutoff? What is the smartest approach for the best performance on the widest variety of hardware (considering ps2.0 must be present).
Thanks.
#2
Posted 28 April 2005 - 02:04 PM
Can't you just use the card's mipmap generation and/or bilinear filtering features?
#3
Posted 28 April 2005 - 04:08 PM
Originally I hoped that by casually mentioning a particular domain, solvable by both Pixel Shaders and FFP, I would establish a visual for the motivation of my intended topic. In this case, I don't need a solution to pixel averaging; I just want to know the costs for the tradeoffs between multi-pass and multi-tap... generally.
I've run a few tests locally, but the results are sketchy, and probably specific to hardware (X800 Pro in this case).
Posed another way, with FFP solution in hand, is it worth the time, hassle, and complexity to develop a pixel shader 2.0 solution as well? Obviously yes if the algorithm can be improved (logarithmic vs polynomial - etc). Sometimes though, the difference is measured in taps vs passes. So, more taps but fewer passes means what? Which is computationally less expensive for today's hardware? Where is the break-even point? Is there a rule-of-thumb? More taps is better than more passes? Vice-versa? Is there any official information on this anywhere?
Thanks.
#4
Posted 28 April 2005 - 07:11 PM
Ok, modern graphics cards implement the fixed-function pipeline with shaders. So in theory, it's best to just write an optimized shader manually. Generally you should also not bother splitting it in mutiple passes. The driver already takes care of that, should the number of instructions exceed the internal maximum.
Unfortunately, the driver can't do miracles and every card has a different configuration. So there's no definitive answer. Either way, you should rely on the driver to save yourself from a lot of work. :excl:
#5
Posted 28 April 2005 - 08:55 PM
Gaussian Blur
Source: http://www-personal.engin.umd.umich.edu/~j...81/21_GBlur.pdf
As outlined in Source, one approach to the gaussian blur is to use pascal's triangle for weights. The samples are performed first horizontally, then vertically (which is 3 samples less than doing all at once). The horizontal/vertical weights for a 2-pass 3x3 approximation are: 1 2 1. The entire 3x3 filter result is as follows:
1 2 1
2 4 2
1 2 1
Applying this filter four times is exactly equivalent (minus roundoff errors) to a 9x9 filter. 9x9 weights: 1 8 28 56 70 56 28 8 1 (I don't really want to present the entire 9x9 filter here). Considering the 9x9 blur as the goal, there are functionally different ways to compute it.
1. a) Perform a horizontal 1 2 1 blur (3-tap, 1-pass). b) Perform a vertical 1 2 1 blur on the result. c) Repeat the process three more times.
Passes / Taps: 8 passes, 3 taps each, 24 taps total.
This approach may be performed on FFP hardware by manipulating vertex colors and texture coordinates, provided that the hardware supports three texture samples per pass.
2. a) Perform a 3x3 blur with 9 taps (apply the entire 3x3 filter). b) Repeat step a) three more times.
Passes / Taps: 4 passes, 9 taps each, 36 taps total.
This approach requires pixel shader 2.0 for the 9 taps per pass.
Differences
There are other solutions, but let's start with these two. Note that approach 2 uses half as many passes, but 50% more taps than approach 1. Since I must support older hardware (NVidia MX440, etc) I would have already implemented approach 1. Simply looking at the mechanics of approach 2, I can't tell if it provides sufficient (any?) advantage over approach 1. :huh:
ATI's video shader sample uses approach 2 for gaussian blur. Is that because it's faster, or because it is easier to program, or because it highlights the power of 2.0? I just don't know! :sad:
Additionally, I wonder whether performing 16 taps in 4 passes (4 taps per pass) is better than 16 taps in 8 passes (2 taps per pass) - in the second case pixel shaders will be shorter, but run twice as often.
#6
Posted 28 April 2005 - 09:00 PM
I've written both of the solutions outlined above and tested them on my machine (x800 pro). I'm not confident that this represents a healthy sample of 2.0 hardware though. For the x800 pro, both of these run in about the same amount of time. I'm not certain what to infer from that. :wacko:
#7
Posted 29 April 2005 - 02:43 PM
Steven Hansen said:
Quote
#8
Posted 29 April 2005 - 04:23 PM
Quote
Quote
Quote
In exactly the same way, the approach I outlined above, as well as a slew of other effects, must be followed to the letter for correctness. Multi-pass algorithms build on the results of previous passes. Texture samples must be precisely accurate. The driver simply cannot optimize away previous passes or differing quantities of taps without altering the mathematic result!
Quote
#9
Posted 30 April 2005 - 02:31 PM
in my world that's pure logic :cool:
#10
Posted 30 April 2005 - 04:18 PM
Although that actually doesn't seem relevant here. Gaussian blur is an image algorithm, the source information is just the scene rendered to texture. There would be no need to put the geometry through more than once.
#11
Posted 30 April 2005 - 07:47 PM
Steven Hansen said:
Drawing 'over and over' again is indeed not a significant problem. The graphics card can read back the pixel's color long beforehand (in parallel), run the shader, and blend the result in parallel again. There are some extra bandwidth requirements, but that's it.
Look at it this way: Suppose we have four big arrays, a[][], b[][], c[][] and d[][], and we would like to compute a = b + c + d for every element. We can do this in two ways:
for(each element)
{
a = b + c + d;
}
or:
for(each element)
{
a = b + c;
}
for(each element)
{
a = a + d;
}
Now suppose we have a processor which can do one addition every clock cycle, and read two values and write one value per clock cycle. Then the first and second approach will execute in exactly the same time! The bottleneck is the additions.Likewise, when vertex processing and memory bandwidth isn't an issue, modern graphics cards are shader limited. Especially long shaders (like yours) show this behaviour).
Quote
Quote
Quote
All I'm trying to say is; it's useless to try to optimize these sort of things. The tap/pass ratio you're talking about simple doesn't determine the bottleneck. The operation you're trying to perform is arithmetic intensive and you can't magically lower the workload. Minor but complicated effects will determine the final performance here, so all you can do is measure the performance in practice. Personally, I'd just pick the most general approach and stop wasting time...
#12
Posted 01 May 2005 - 07:35 PM
Reedbeta said:
oh yeah! i've actually just started making a deferred shader engine :cool:
#13
Posted 02 May 2005 - 05:16 PM
Further illustration: 4 vertices: 1 fullscreen quad. Source and destination are both 1600x1200 resolution.
First scenario: 4 passes, 1 tap each. 4 total reads per source texel. 4 writes per destination pixel/texel. Four intermediate render target textures of size 1600x1200. Algorithm: set source, draw to target 1 - set target 1 as source, draw to target 2 - set target 2 as source, draw to target 3, etc.
Second scenario: 2 passes, 3 taps each. 6 total reads per source texel. 2 writes per destination pixel/texel. Two intermediate render target textures of size 1600x1200 (yay! half the resources). Same algorithm.
Scenario one has 7,680,000 reads and 7,680,000 writes. Scenario two has 11,520,000 reads (50% higher) and 3,840,000 writes (half scenario 1). The driver has no choice but to make all of those reads, and all of those writes, or the solution will be incorrect.
I expect the process is nicely pipelined, and many reads/writes are done in parallel. That's good. However, it is not safe to say that the pipelining is so efficient that writes are free, or that writes and reads have the same cost. If that were so, then the entire cost could be summed from the number of reads and writes. From a previous post with two gaussian blur approximations:
Quote
Passes / Taps: 4 passes, 9 taps each, 36 taps total.
So, the measurement is not linear in terms of reads vs writes. Now that (hopefully) this has been established (my original question), I'm sadly guessing that the general relative costs are unknown by the devmaster audience.
#14
Posted 02 May 2005 - 07:52 PM
Also, there's no need for four render targets in your scenario 1. You don't need the contents of target 1 after you've used it to render target 2 - so just use two render targets, and alternate which is the source and which the destination.
#15
Posted 02 May 2005 - 09:42 PM
Steven Hansen said:
Like Reedbeta sais, nobody really knows the definitive answer. You can't really predict these kind of things more exactly. We can sortof model the bandwidth need and the shader arithmetic cost and take a guess at what's the bottleneck. But the only thing giving the answers is to actually measure it. And as I've said before, don't underestimate the driver. Compression can alleviate the bandwidth problem and there might be some hidden processing unit that speeds up shader execution in certain cases.
So what I meant to say from the start, in a gentle way, is get over it. There is no magical tap/pass formula. If it's worth your time, try another approach and measure the results. You might even want to dynamically select one approach based on the result on each system individually.
Anyway, I can suggest to post your question on the Beyond3D forum. The people there know more about graphics chips than the designers. :lol: But don't be surprised if (after many pages of educated guessing) the conclusion will be the same...
#16
Posted 03 May 2005 - 07:44 AM
The problem is that I don't have a nice baseline. Something like, "oh... shoot for something with fewer overall reads." So I get to implement like four or five possible approaches and then profile them all. Argh. :angry:
Perhaps most frustrating is that for the pixel averaging algorithm, I have an old FFP solution that is faster than the exact same algorithm written in vertex and pixel shaders :excl: ... but for some reason roundoff errors appear to be more significant (looks bad). So now I don't know how to justify starting any particular approach. It *could* be better, but it seems like without some knowledge about the relative costs, it could also be worse. :sad:
#17
Posted 27 October 2007 - 10:52 AM
1. Old FFP v shaders
You may see faster performance today, but shaders are increasing in performance at an incredible rate. Therefore tomorrow, a shader solution will be faster. We also know its already better looking. QED - shader solution.
2. MP v MT
We know your "pass" is a single quad. So frankly any effort in optimisng the number of passes is largely wasted (assuming its full screen so no clears etc are used). However we know considerable effort is put into optimising TAPs by card manufacturers (faster memory, buses, caches etc). This is because they are precious. QED : chose the method that optimises TAPs.
In my experience (which is a lot, I used to manage Opengl driver development) leaving anything to driver developers is naive.
Friends help you move, real friends help you move bodies.
#18
Posted 27 October 2007 - 11:43 AM
auld said:
#19
Posted 29 October 2007 - 04:14 PM
auld said:
You just need to check Power VR to see that ;)
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users













