I am developing an application for work that deals with sensor
performance. I need to calculate the standard deviation of an entire
color buffer (1-channel 16-bit floating point frame buffer object) in
OpenGL. For testing purposes, I am simply reading the buffer back into
main memory and calculating the standard deviation on the CPU.
Of course, as you might expect, this is a killer on the FPS.
I have googled and googled and searched everywhere, but I cannot find any
references or leads to this. The only reference that was mentioned was
some article by Horn that talks about calculating the sum of an entire
buffer using shaders, but I could not find the article itself.
My first idea is to simply use the fixed-function pipeline and do
additive blending onto another 16-bit f.p. surface: draw vertical
lines on top of each other over and over again, with the texture
coordinates of each line referencing one column of the buffer, once for
each pixel in the x direction (I may skip pixels to just
approximate the s.d.), and then draw single points on top of each other
referencing the horizontal sums. This would give me the total, and
dividing by the pixel count gives the mean value. Then I would repeat
the process somehow using the squared differences from the mean. This
doesn’t seem terribly efficient, but it would be better
than reading it into main memory. Also, I don’t know how to do this
without having to ping-pong between textures when it comes to the second
pass using the mean.
Anybody have better (more efficient) ideas for this?
Thanks in advance,
A couple of ideas:
(1) use the automatic mipmap generation capabilities of the hardware to
generate the mean (you can read back the value of the topmost mip level
to get the mean)
(2) render the texture to a second texture using a shader that
calculates the squared difference from that mean at each pixel, and use
automatic mipmap generation again to generate the mean of those values
(this is the variance)
(3) then you can read back that value and take the square root to get
the standard deviation.
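Steps (1) to (3) can be checked on the CPU with a small sketch. This is only an illustration under assumptions, not the actual GPU code: the image is a square, power-of-two list of lists, each "mip level" is built by averaging 2x2 blocks, and the names mip_mean and std_via_mipmaps are made up for this example.

```python
import math

def mip_mean(level):
    """Average 2x2 blocks repeatedly, like a mip chain, until one value is left.

    Assumes a square image whose side is a power of two.
    """
    while len(level) > 1:
        half = len(level) // 2
        level = [
            [
                (level[2 * y][2 * x] + level[2 * y][2 * x + 1]
                 + level[2 * y + 1][2 * x] + level[2 * y + 1][2 * x + 1]) / 4.0
                for x in range(half)
            ]
            for y in range(half)
        ]
    return level[0][0]

def std_via_mipmaps(image):
    # Step (1): the top mip level is the mean of the whole image.
    mean = mip_mean(image)
    # Step (2): per-pixel squared difference from the mean (the "shader" pass),
    # then a second reduction to get the mean of those values (the variance).
    squared = [[(v - mean) ** 2 for v in row] for row in image]
    variance = mip_mean(squared)
    # Step (3): the square root of the variance is the standard deviation.
    return math.sqrt(variance)

if __name__ == "__main__":
    print(std_via_mipmaps([[1.0, 2.0], [3.0, 4.0]]))  # roughly 1.118
```

Note that this gives the population standard deviation (dividing by N, not N - 1), which is what averaging over every pixel naturally produces.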
BTW, all this falls under the category of general-purpose GPU
computation (a phrase you may want to google). People have done much
more complicated things with it, like performing simulations of cloth on
the GPU using texels to store the position of vertices of the cloth
surface, and shaders to update them at each timestep.
Thanks for the quick reply Reedbeta,
I am using ARB_texture_rectangle for the frame buffer object to
support arbitrary buffer sizes (non-powers of two), and also so the
coordinate referencing is easier (instead of 0.0 to 1.0, they are
referenced by ‘pixel’, i.e. 0 to 511 etc.). This object does not support
mip-mapping, BTW. Would generating mip-maps every frame be fast enough?
I have looked into this GPGPU stuff, but surprisingly I haven’t found any
info for such a simple task. There is a reference to a ‘summation’
method in the book GPU Gems 2, so I may just purchase a copy. I don’t
think the solution to this is trivial because of the parallel nature of
I think you haven’t found a reference for a summation method because
everyone does that with mipmaps :lol:
However, if you really must use ARB_texture_rectangle, you can still
write a pixel shader to do a 2x2 box filter on the image and resample it
to an image of half the size in each dimension. Then repeat until the
image is small enough. Of course you have to figure out how to handle
odd dimensions if you work with non-PO2 textures. However, I believe
this is the most efficient way to do the resampling, taking advantage of
the GPU parallelism to the greatest extent possible. And yes, you should
be able to attain a decent FPS on this with a bit of optimization. It’s
not a very costly operation compared to some of the things people like
to do, e.g. post-processing blur and bloom filters on every frame.
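Here is a rough CPU sketch of the repeated 2x2 reduction for non-power-of-two sizes. One possible way to handle odd dimensions (an assumption for this example, not the only option) is to sum rather than average, treat texels outside the image as zero, and divide by the true pixel count at the end. The function names are hypothetical.

```python
def reduce_sum_2x2(level):
    """One reduction pass: each output texel is the sum of a 2x2 input block.

    Texels that fall outside the image (odd width or height) count as zero.
    """
    h, w = len(level), len(level[0])

    def at(y, x):
        return level[y][x] if y < h and x < w else 0.0

    return [
        [
            at(2 * y, 2 * x) + at(2 * y, 2 * x + 1)
            + at(2 * y + 1, 2 * x) + at(2 * y + 1, 2 * x + 1)
            for x in range((w + 1) // 2)
        ]
        for y in range((h + 1) // 2)
    ]

def total_sum(image):
    # Keep halving (rounding up) until a single texel holds the total.
    level = image
    while len(level) > 1 or len(level[0]) > 1:
        level = reduce_sum_2x2(level)
    return level[0][0]

def mean_npot(image):
    # Divide by the real pixel count, so the zero padding does not bias the mean.
    return total_sum(image) / (len(image) * len(image[0]))

if __name__ == "__main__":
    image = [[float(x + 10 * y) for x in range(5)] for y in range(3)]
    print(mean_npot(image))
```

One caveat with summing instead of averaging: on a real 16-bit float target the running sums may exceed the format's range (the largest half-float value is 65504), so a wider intermediate format or a pre-scaled input might be needed.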
I just found ARB_texture_non_power_of_two,
It appears that this DOES support mip-mapping, the only difference
between this and ARB_texture_rectangle is the way coordinates are
I’ll give both methods a try to see which is faster. At first I thought
that mip-maps might be overkill (because I don’t need any of the
intermediate levels), but the whole chain adds only about 1/3 of the
original buffer size in overhead, because each level has a quarter of
the texels of the one before, so we’ll see.
Thanks for your help,
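For reference, the 1/3 figure comes from a geometric series. Assuming each mip level has exactly a quarter of the texels of the level above it (true when both dimensions halve cleanly), the extra storage relative to the base level is:

```latex
\[
\sum_{k=1}^{\infty} \left(\frac{1}{4}\right)^{k}
  = \frac{1/4}{1 - 1/4}
  = \frac{1}{3}
\]
```

A real chain stops at the 1x1 level, so the actual overhead is slightly under 1/3.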