Hi, I wrote a DX11 DirectCompute implementation of a
Buddhabrot/Nebulabrot fractal renderer. The submitted picture is a
Nebulabrot with max iteration values set to red: 10,000, green: 1,000,
blue: 100. I rendered the above image (original image was 1592×1028) at
around 14 fps (140,000 samples a second) for around 35 minutes.
On my HD5750, with the above max iteration values, I actually get frame
rates around 42 fps, but I didn’t change the default yielding behavior
of DXUT when the application becomes inactive, so the above render was
performed at 14 fps (doh!).
I wrote a CPU implementation (non-SIMD and single-threaded) as well, and
my DirectCompute implementation is around 4-6 times faster than the CPU
version. My CPU is an Intel Core 2 Quad Q6600 at 2.4 GHz (not
overclocked). I had earlier written a Mandelbrot DirectCompute
implementation, and that was 50+ times faster than the CPU. Since the
Buddhabrot is more complex than the Mandelbrot, I guess reduced
performance is to be expected. I’m guessing the extensive scattered
global memory writes of a Buddhabrot implementation may be slowing down
the DirectCompute version.
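To make the scattered-write pattern concrete, here is a minimal CPU-side
sketch in Python. It is not the DirectCompute source: the function name
`accumulate_sample`, the view window, and the flat histogram layout are
all assumptions made for illustration. One sample point c is iterated,
and if it escapes, every point of its orbit increments whichever
histogram cell it lands in, so consecutive writes can hit unrelated
memory locations.

```python
import math

def accumulate_sample(c, max_iter, hist, width, height,
                      x_min=-2.0, x_max=1.0, y_min=-1.5, y_max=1.5):
    """Trace one Buddhabrot sample c; return how many cells were written.

    hist is a flat list of width * height counters (hypothetical layout).
    """
    # Iterate z -> z*z + c and remember the orbit until it escapes.
    z = 0j
    orbit = []
    for _ in range(max_iter):
        z = z * z + c
        orbit.append(z)
        if abs(z) > 2.0:
            break
    else:
        # Never escaped within max_iter: this sample contributes nothing.
        return 0

    # Scatter: each orbit point increments the cell it falls into.
    writes = 0
    for p in orbit:
        px = math.floor((p.real - x_min) / (x_max - x_min) * width)
        py = math.floor((p.imag - y_min) / (y_max - y_min) * height)
        if 0 <= px < width and 0 <= py < height:
            hist[py * width + px] += 1
            writes += 1
    return writes
```

For a Nebulabrot, a function like this would be run once per colour
channel with that channel's iteration limit, each into its own histogram.
On the GPU, many threads do this at once into the same buffer, which is
where the uncoalesced global writes come from.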
For more details about my implementation (source & binary provided) see
my blog post:
Very cool! I’ve been meaning to spend some time checking out this
compute shader stuff…
Have you tried writing it in a good ol’ pixel shader for comparison?
Cool indeed! And subscribed to your blog :)
Hi everyone. Thanks for the comments!
Yeah, compute shaders seem pretty cool. Not everything, but lots of
cool stuff can be sped up with them. Personally, I want to try a GI
path tracer, fluid dynamics, and post effects in the future (those
seem to be what other people have had success with so far).
The Buddhabrot algorithm requires a lot of random scattered writes. The
above Nebulabrot has iteration maxes of 10,000, 1,000 and 100 for the
three channels, so in the worst case 9,999+999+99 scattered writes to
the output UAV buffer are going to occur in a single compute shader
thread, and there are 10,000 threads executing in parallel. (The total
thread count of 10,000 is not related to the iteration max. It's just a
coincidence that I have 100 thread groups, each with 100 threads, for
10,000 total.) I think it'd be hard and unnatural to implement this in a
pixel shader.
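The numbers in this thread can be checked with a few lines of Python.
This is just arithmetic on the figures quoted above; it assumes one
dispatch per frame and one sample per thread, which is consistent with
14 fps giving 140,000 samples a second but is not confirmed by the
source. All names are made up for the sketch.

```python
# Figures quoted in the thread.
THREAD_GROUPS = 100
THREADS_PER_GROUP = 100
CHANNEL_MAX_ITER = {"red": 10_000, "green": 1_000, "blue": 100}

# Assumption: each thread processes exactly one sample per dispatch.
threads_per_dispatch = THREAD_GROUPS * THREADS_PER_GROUP

def samples_per_second(fps):
    """Samples per second, assuming one dispatch per rendered frame."""
    return threads_per_dispatch * fps

# Worst case per thread: (max_iter - 1) writes for each colour channel.
worst_case_writes = sum(m - 1 for m in CHANNEL_MAX_ITER.values())

# Rough sample total for the ~35 minute render at ~14 fps.
total_samples = samples_per_second(14) * 35 * 60

print(threads_per_dispatch)     # 10000
print(samples_per_second(14))   # 140000
print(worst_case_writes)        # 11097
print(total_samples)            # 294000000
```

Under the same assumption, the 42 fps figure would correspond to
420,000 samples a second.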
Thanks for subscribing! :)