#TheNut - first a note: I think you wrote z-near where you meant z-pass.

Blurring the shadow volumes is another matter. You can get accurate penumbrae with penumbra wedges (at a high cost, though), or use a simple screen-space bilateral blur - both are better suited to PCs than to mobile devices, though.
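To make the screen-space bilateral blur idea a bit more concrete, here is a minimal CPU-side sketch in Python (a real version would be a fragment or compute shader). It blurs one row of a shadow mask and weights each neighbour by how close its depth is to the centre pixel, so the blur does not bleed across depth edges. The function name, the parameters (radius, sigma_s, sigma_d) and the 1D simplification are all my assumptions for illustration.

```python
import math

def bilateral_blur_1d(shadow, depth, radius=2, sigma_s=1.5, sigma_d=0.1):
    """Blur one row of a shadow mask, weighting neighbours by depth similarity."""
    n = len(shadow)
    out = []
    for i in range(n):
        total = 0.0
        weight_sum = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)  # clamp the tap to the row
            # Spatial Gaussian: nearer taps count more.
            w_space = math.exp(-(k * k) / (2.0 * sigma_s * sigma_s))
            # Depth Gaussian: taps at a very different depth count less.
            dz = depth[j] - depth[i]
            w_depth = math.exp(-(dz * dz) / (2.0 * sigma_d * sigma_d))
            w = w_space * w_depth
            total += w * shadow[j]
            weight_sum += w
        out.append(total / weight_sum)  # centre tap has weight 1, so never zero
    return out
```

In practice you would run this as two passes (horizontal, then vertical) over the shadow mask rendered from the stencil buffer.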

How big is the scene? What is the target platform?

I implemented Z-fail shadow volumes in a geometry shader some months ago (and had a thread about it here). It performed very well for complex scenes on a Radeon 6770 (also on a GTX 580 - but what doesn't perform well on that class of GPU?), and for moderate scenes even on notebook graphics (Mobility Radeon 5470). If you're targeting mobile devices, though, I doubt it will perform that well.
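For reference, the core of any shadow volume approach is finding the silhouette edges as seen from the light. The sketch below is a plain CPU version in Python, not my geometry shader code: it assumes a closed mesh with consistent counter-clockwise winding and a point light, and the helper names are just for illustration. An edge is on the silhouette when only one of its two triangles faces the light.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def silhouette_edges(vertices, triangles, light_pos):
    """Return directed edges (i, j) that border exactly one light-facing triangle."""
    edges = set()
    for tri in triangles:
        a, b, c = (vertices[i] for i in tri)
        normal = cross(sub(b, a), sub(c, a))
        if dot(normal, sub(light_pos, a)) <= 0.0:
            continue  # triangle faces away from the light
        for k in range(3):
            edge = (tri[k], tri[(k + 1) % 3])
            reverse = (edge[1], edge[0])
            if reverse in edges:
                edges.remove(reverse)  # shared by two lit triangles: interior edge
            else:
                edges.add(edge)
    return edges
```

The remaining edges are the ones you extrude away from the light to build the volume sides; for Z-fail you additionally need the front and back caps.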

There are also other ways of computing shadows. Ray tracing on the GPU (it IS realtime, but will most probably need more than 2 weeks of work)? Planar shadows (assuming you don't have too much geometry to cast shadows onto)? Shadow blobs + light maps (works well; light maps do an awesome job in static scenes)? Light maps + shadow maps for N dynamic actors/objects within a specific range (a shadow map atlas works well here: smaller objects get a smaller shadow map, larger objects a larger one)? Etc.
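Of these, planar shadows are probably the quickest to get running. A minimal sketch in Python of the usual projection matrix, assuming the receiver is a plane (a, b, c, d) with ax + by + cz + d = 0 and the light is given in homogeneous coordinates (w = 1 for a point light, w = 0 for a directional one); the function names are mine:

```python
def planar_shadow_matrix(plane, light):
    """4x4 matrix (row-major, column-vector convention) that flattens geometry onto the plane."""
    d = sum(p * l for p, l in zip(plane, light))
    # M = dot(plane, light) * I - light * plane^T
    return [[(d if i == j else 0.0) - light[i] * plane[j] for j in range(4)]
            for i in range(4)]

def project(m, point):
    """Apply the matrix to a 3D point and do the perspective divide."""
    v = (point[0], point[1], point[2], 1.0)
    x, y, z, w = (sum(m[i][j] * v[j] for j in range(4)) for i in range(4))
    return (x / w, y / w, z / w)
```

You then draw the shadow caster a second time with this matrix in front of its model matrix, in a dark colour, usually with the stencil buffer preventing double blending.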

If you give a few more details, I can offer some suggestions on how to implement it quickly, because 2 weeks isn't much time.

Ah sorry, when I came back I didn't realize that I had posted here, and the thread was no longer on the first page. Here is the sample that works on AMD graphics cards - http://wikisend.com/...57596/PT_01.rar. If you have NVidia: I have to run a benchmark or two on an NVidia GPU tomorrow (and thus actually make it run on NVidia), so I could post that version as well.

It is very simple and computes just one path at a time (for benchmarking purposes right now). The OpenCL code is generic and non-specific (and thus slower than specialized code like Aila's implementation). I have also thought of trying stackless traversal, as several people (including Dade from ompf, where I asked for more hints on a faster non-specific implementation) have told me it is faster than stack-based traversal (at least on current architectures -> Radeon 6xxx and up).

I'm working on 2 projects right now. The first one is a real-time path tracer (working with both OpenCL and CUDA). I don't have Sponza or Sibenik here on the notebook, nor a good GPU, so I can supply just a 512x384 image with 16 spp (it took a few seconds) on a Radeon 5470... here you go:

The second project is the game (on our own game engine) we're working on. Here are two images from an area lights test (again on a bad notebook GPU, giving quite bad fps, and without any antialiasing or hardware tessellation):

First of all the code should be:

for (int spp = 0; spp < 32; spp++) {
    Init_Info_Buffer; // reset path termination flags, among other things

    for (int depth = 0; depth < 3; depth++) {
        if (depth == 0) {
            Kernel_Gen_Cam_Rays;
            Kernel_Shoot_Rays;
        } else {
            Kernel_Gen_Random_Hemi_Rays; // diffuse rays
            Kernel_Shoot_Rays; // if a ray hits the sky, record its color; otherwise keep the path for the next depth
        }

        // Begin direct illumination for non-terminated paths.
        // E.g. discard camera rays that hit the sky, but still record the sky color.
        Kernel_Gen_Rays_To_Random_Light;
        Kernel_Shoot_Rays; // for non-terminated paths; discard rays that do not hit any light
    }

    Send_Pixels_To_Screen;
}


Basically, you want to shoot primary rays just once (plus the explicit light rays for them). Then you cast the 1st iteration of your path, generated from the results of the primary rays (plus its explicit step); then the 2nd iteration, generated from the 1st (plus its explicit step); and so on, until some condition is met (e.g. a limit on path length).
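Until I get to a proper OpenCL sample, here is the same control flow as a small runnable Python sketch. It is only a CPU toy with scalar radiance: the scene interface (camera_ray, intersect, light_ray, occluded, bounce_ray, sky) and ToyScene are hypothetical stand-ins for the kernels, not real code from my tracer. The point is the order of operations: camera rays once, then per depth an intersection, an explicit light ray, and a new bounce ray for the paths that are still alive.

```python
import random

def trace_paths(num_pixels, max_depth, scene, rng):
    """One sample per pixel; returns a scalar radiance per pixel."""
    radiance = [0.0] * num_pixels
    throughput = [1.0] * num_pixels
    # Primary rays are generated exactly once, before the bounce loop.
    rays = {i: scene.camera_ray(i) for i in range(num_pixels)}
    for depth in range(max_depth):
        next_rays = {}
        for i, ray in rays.items():
            hit = scene.intersect(ray)
            if hit is None:
                # Path escaped: record the sky colour and terminate it.
                radiance[i] += throughput[i] * scene.sky
                continue
            throughput[i] *= hit.albedo
            # Explicit step: one shadow ray towards a randomly chosen light.
            shadow_ray, light_value = scene.light_ray(hit, rng)
            if not scene.occluded(shadow_ray):
                radiance[i] += throughput[i] * light_value
            # The next path segment is generated from this hit.
            next_rays[i] = scene.bounce_ray(hit, rng)
        rays = next_rays
        if not rays:
            break  # every path has terminated
    return radiance

class Hit:
    def __init__(self, albedo):
        self.albedo = albedo

class ToyScene:
    """Hypothetical scene: odd pixels see a grey floor, everything else is sky."""
    sky = 1.0
    def camera_ray(self, pixel):
        return "floor" if pixel % 2 else "sky"
    def intersect(self, ray):
        return Hit(0.5) if ray == "floor" else None
    def light_ray(self, hit, rng):
        return "to_light", 2.0
    def occluded(self, shadow_ray):
        return False
    def bounce_ray(self, hit, rng):
        return "sky"

result = trace_paths(2, 3, ToyScene(), random.Random(0))
```

The outer spp loop then just calls this repeatedly and averages the results per pixel. On the GPU each inner step becomes one kernel launch over all live paths instead of a Python loop.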

I'll have a bit of time tomorrow afternoon (first I gotta go swimming though, it's important to stay in some kind of shape when you sit at a PC too much), and I'll try to quickly put together a sample path tracer in OpenCL that demonstrates the idea completely.

As for speed, you probably won't get much better than what a CPU gives you (I'd say that for complex scenes the best GPUs are just a few times faster than the best CPUs today). GPUs have huge "horsepower", but it is only brute force - better (= first of all bigger, and maybe a bit faster) caches would help a lot. Better drivers too (honestly they suck, they suck very hard - especially on NVidia).