1
109 Apr 29, 2013 at 18:34 opencl

Hi,

I’m trying to implement a path tracer, with and without direct illumination, in OpenCL and would like some advice.

Do I upload everything to OpenCL (KD-tree, triangles, materials and all) and let OpenCL do the tracing? Or do I split the work so that OpenCL does only the ray intersections? Do I set up the buffers so that OpenCL processes one whole screen at a time, or do I split the screen into smaller groups? Do I use OpenCL only to generate, for example, 512 rays and then let the CPU do the rest?

I have no clue how to proceed. I just need the logic on how this should work and I appreciate any help.
Thanks

#### 15 Replies

0
117 Apr 30, 2013 at 23:38

Okay, I could give a lot of tips here … so let’s start with a few of them…

1.) Break your ray tracer apart. You want separate OpenCL kernels for ray generation, ray traversal and image reconstruction. The reason is quite obvious: you want all your threads to perform only ray generation, only ray traversal or only reconstruction. Why? The answer is simple -> caching and shared memory access. This also gives you quite good modularity; you can switch between several traversal kernels and benchmark for the best one, etc.

2.) Your frame should progress like this: first update physics & the scene, then the rendering part looks like:
i. construct/re-construct or re-fit your acceleration structure (be it a KD-tree, BVH, QBVH, octree, or anything else)
ii. perform primary ray generation
iii. traverse the primary rays
iv. generate shadow rays for the last traversed rays
v. traverse the shadow rays (steps iv and v perform the so-called explicit step, sometimes called next-event estimation)
vi. if we don’t want/need to perform the next step of the path, go to step ix
vii. generate rays for the next step of the path
viii. traverse them and go to step iv
ix. compute the image from all the data gained in the previous steps

NOTE: you don’t need to compute all pixels at once; you can batch them (it might improve speed, but it might not, or you might even end up slower)
NOTE2: steps ii - ix can be repeated N times, where N is your samples-per-pixel count, of course averaging the resulting data accordingly

3.) Upload only what is needed, when it is needed. For ray generation you probably need just the camera description (primary rays); for secondary rays it’s a lot tougher, you need the primary rays and their results, plus the material properties of your primitives and the primitives themselves.
For ray traversal you need just the acceleration structure and the primitives (these can be in a different format than in other parts of the renderer, e.g. using Woop triangles instead of standard triangles for the ray-triangle intersection yields better performance).
For the reconstruction kernels you need the primitives and their material properties, along with ALL the rays and ray results.

Some code examples (from my OpenTracer): a primary ray generator kernel http://pastebin.com/gXFGB5sJ, an efficient traversal kernel (fetching nodes through a texture) http://pastebin.com/CxLD8384 and a very simple reconstruction kernel writing out only barycentric coordinates as color http://pastebin.com/APex5KG2

0
109 May 01, 2013 at 00:43

Thanks Vilem, I appreciate the help.

I see what you mean, makes sense. But I have a couple more questions…

ii. When generating the primary rays first, do we generate rays only for the number of pixels in the viewport, or do we generate, like, 10 million of them?

iii. Do we also traverse transparent and reflective materials? Some rays may take much longer to traverse if they hit a glass object that has a lot of internal reflections!

Thanks!

0
117 May 01, 2013 at 02:22

ii. Only for the number of pixels in the viewport. Why is that? Because we’re computing a single sample of the path. Of course you can re-use the primary rays for ALL the sample paths you compute for a given image … but this brings a few disadvantages when you need to compute specific effects (supersampling, DoF, etc.).

iii. You only traverse the given generated ray; if you hit something, you don’t care about the material, you just record the hit. In the next step of the path you generate a reflected/refracted ray at that place and traverse that one (again just looking for the closest single hit), etc. In pseudo code:

Array<Pixel> result = Array<Pixel>(width, height);

// NOTE: The size of each path step and each explicit step is exactly width*height,
// so in total we have width*height*spp*MAX_PATH_LENGTH*2 rays

Array<Ray> rays = Array<Ray>(width * height * spp * MAX_PATH_LENGTH * 2);
Array<RayResult> results = Array<RayResult>(width * height * spp * MAX_PATH_LENGTH * 2);

int spp_offset = width * height * MAX_PATH_LENGTH * 2;
int path_offset = width * height * 2;
int explicit_offset_from_path = width * height;

for(int i = 0; i < spp; i++)
{
    for(int j = 0; j < MAX_PATH_LENGTH; j++)
    {
        // Path rays of step j (for sample i) start here,
        // their explicit (shadow) rays right after them
        int tmp = i * spp_offset + j * path_offset;

        /** Ray generation **/
        if(j == 0)
        {
            // OpenCL
            // Generate primary rays in the standard way
            for(int x = 0; x < width * height; x++)
            {
                rays[tmp + x] = GeneratePrimaryRays(camera, x);
            }
            // \OpenCL
        }
        else
        {
            // OpenCL
            // Generate the next path step's rays from the previously computed path rays
            for(int x = 0; x < width * height; x++)
            {
                rays[tmp + x] = GeneratePathRays(rays[tmp + x - path_offset], results[tmp + x - path_offset]);
            }
            // \OpenCL
        }

        /** Ray traversal **/
        // OpenCL
        for(int x = 0; x < width * height; x++)
        {
            results[tmp + x] = FindNearestHit(scene, rays[tmp + x]);
        }
        // \OpenCL

        /** Explicit ray generation **/
        // OpenCL
        for(int x = 0; x < width * height; x++)
        {
            int e = tmp + explicit_offset_from_path + x;
            rays[e] = GenerateShadowRays(rays[e - explicit_offset_from_path], results[e - explicit_offset_from_path], scene);
        }
        // \OpenCL

        /** Explicit ray traversal **/
        // OpenCL
        for(int x = 0; x < width * height; x++)
        {
            int e = tmp + explicit_offset_from_path + x;
            results[e] = FindNearestHit(scene, rays[e]);
        }
        // \OpenCL
    }
}

/** Image reconstruction **/
// OpenCL
for(int i = 0; i < width * height; i++)
{
    // Reconstruct pixel i from all the rays/results belonging to its paths
    result[i] = ShadePixel(rays, results, i);
}
// \OpenCL

You should also do some kind of Russian roulette for path termination; that can e.g. set a terminated ray’s range to max = -1, min = 1 (and your traversal will then ignore invalid rays). You should still keep the memory for terminated paths (even though you won’t traverse them), because addressing memory laid out with this simple scheme (and doing the shading on top of it) is just a damn lot easier than doing some black magic instead.
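A minimal sketch of such a Russian roulette step (plain C, hypothetical function name; `xi` is a uniform random number in [0,1) supplied by the caller): a path survives with probability `survive_p`, and a surviving path’s throughput is divided by that probability so the estimate stays unbiased:

```c
#include <assert.h>

/* Decide whether a path survives Russian roulette.  If it survives,
 * boost its throughput by 1/p to compensate for the terminated paths.
 * Returns 1 if the path continues, 0 if it is terminated. */
static int russian_roulette(float survive_p, float xi, float *throughput)
{
    if (xi >= survive_p)
        return 0;               /* terminate: e.g. set the ray's max = -1, min = 1 */
    *throughput /= survive_p;   /* survivors carry the weight of the terminated */
    return 1;
}
```

Terminated paths keep their buffer slot as described above; only the traversal skips them.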

0
109 May 01, 2013 at 18:06

Now I understand perfectly; your pseudo code made it clear to me.

But I fail to understand the need to split the OpenCL calls into four groups. Why can’t it be done in one call?

0
117 May 02, 2013 at 00:35

Basically this is just an optimization. You could put everything into one huge OpenCL kernel, but that would result in poor performance (some of your threads would be generating rays, some traversing, some shading), plus you’d need quite a load of local variables (and local memory on the GPU is very limited; if you go over the limits it’ll be slow, damn slow).

0
109 May 02, 2013 at 01:21

ah ok, I see what you mean. But for example, if I clEnqueueWriteBuffer all my triangles only once, they will be kept in GPU memory until I free the buffer, right? So all my kernels can use it and it should be fast, right? Or do I have to free the buffers not used by the current kernel and re-create them again for another kernel that will use them?

I was under the impression that you can upload all your triangles, normals, etc. and that it won’t slow anything down because they are uploaded only once? I’m obviously wrong; from all the docs I’ve read I’m still not sure when GPU performance is affected.

0
117 May 03, 2013 at 01:07

You don’t have to re-upload the data to the GPU every time. You use cl_mem objects for the buffers on the GPU and you can re-use them, but be aware that you have to set the right access flags on them (e.g. triangles are read-only, rays are read-write, resulting pixels might be write-only, etc.).

0
109 May 03, 2013 at 03:27

ok, that’s what I did, thank you. I’ve managed to make it work as per your pseudo code (sort of), but I ran into problems. Your array example dimensioned as width * height * spp * MAX_PATH_LENGTH * 2 is killing my machine :) For 512x512 with 32 spp and a depth of 6, that’s over 100 million items, which is a huge amount of RAM for the rays alone (dir.xyzw and org.xyzw) on a 64-bit CPU using double precision (64 bits * 8 = 64 bytes per ray). Is it supposed to be like that?

So if I understand your pseudo code, all it does is generate rays, correct? Does it take into consideration reflective surfaces, refraction, bumps, etc.? Or is that done in the ShadePixel at the end? How does the ShadePixel work?

Thanks again for helping.

0
117 May 03, 2013 at 10:42

Reflection/refraction can be handled inside ray generation (although you then need the scene data inside that kernel, along with the previous ray information).

As for the memory - this is indeed a problem; there are 2 kinds of solutions:

1.) Batching - as I mentioned, it makes the addressing a bit messier. You compute at most, e.g., 256x256 pixels at once, breaking your viewport into these “tiles”. This works, but makes the addressing a lot harder.
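The extra addressing work can be sketched as follows (plain C, hypothetical helper name; tile size is whatever you batch at once, e.g. 256x256): every pixel of the full viewport has to be mapped to its tile and to its index inside that tile’s width*height-sized ray batch:

```c
#include <assert.h>

/* Map pixel (px, py) of the full viewport to its tile (tile_x, tile_y)
 * and return its local index inside that tile's ray batch,
 * assuming tiles of tile_w x tile_h pixels laid out row by row. */
static int pixel_to_tile_index(int px, int py, int tile_w, int tile_h,
                               int *tile_x, int *tile_y)
{
    *tile_x = px / tile_w;
    *tile_y = py / tile_h;
    /* row-major index inside the tile */
    return (py % tile_h) * tile_w + (px % tile_w);
}
```

Every buffer offset in the earlier pseudo code then has to be computed per tile instead of per viewport, which is where the mess comes from.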

2.) Compute only 1 spp at a time, incrementally adding the new ones. The problem here is that you need to clear your buffer and the total spp count every time you move either your camera or an object in the scene. You can, though, actually shade the whole path each time you compute a single spp within the frame (incrementally).

It works like this (very shortened pseudo-code):

resultImage.clear(BLACK_COLOR); // Note: this clear isn't actually needed - at i = 0 the running average below overwrites the buffer anyway!

for(int i = 0; i < spp; i++)
{
    ComputeWholePath();
    ShadePixels();
}

DisplayResult();

// Where the shade-pixel kernel does (assuming we just computed the i-th spp, indexing from zero):
float4 color = ResolvePath();

// Running average over the samples computed so far
result = (result * i + color) / (i + 1);

This approach works quite well (although your resulting pixel buffer then has to have read-write access, not just write-only).
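The running average in the shade-pixel step above can be demonstrated in scalar form (plain C, hypothetical function name): after the i-th sample, counting from zero, the buffer holds the mean of samples 0..i, which is why no separate clear pass is needed:

```c
#include <assert.h>
#include <math.h>

/* Incremental average: fold the i-th sample (i counted from zero)
 * into the running mean stored in `result`. */
static float accumulate_sample(float result, float color, int i)
{
    return (result * (float)i + color) / (float)(i + 1);
}
```

Because each step reads back the previous mean, this is also exactly why the pixel buffer needs read-write access.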

0
109 May 03, 2013 at 21:00

ok, I see what you mean. So what I have done so far is this…

for (int spp = 0; spp < 32; spp++) {

    Init_Info_Buffer; // reset path termination flags, among other things

    for (int depth = 0; depth < 3; depth++) {

        Kernel_Gen_Cam_Rays;
        Kernel_Shoot_Rays;

        // begin direct illumination for non-terminated paths;
        // discard camera rays to the sky, for example, but still record the sky color

        Kernel_Gen_Rays_To_Random_Light;
        Kernel_Shoot_Rays; // for non-terminated paths, discard rays not hitting any lights

        Kernel_Gen_Random_Hemi_Rays; // diffuse rays
        Kernel_Shoot_Rays; // if a ray hits the sky, record the color, else keep the path for the next depth

    }

    Send_Pixels_To_Screen;

}


I have it working that way on my GTX with 960 cores, but it’s not all that much faster than on a quad-core (8-thread) CPU, maybe 2x faster at most. It doesn’t do reflection or refraction, only diffuse; no textures, only material colors. I can only get maybe 1 FPS with a 512x512 Cornell scene with 2 spheres and 4K triangles.

I think I’m still doing this all wrong; 1 FPS for 4K triangles is more than bad. I have my KD-tree data uploaded to the GPU along with the vertices, normals, UVs and colors, all as READ_ONLY and uploaded only once. (I’m doing static scenes only right now, I don’t think I’m ready for animation yet!)

All my Kernel_ calls do have to re-upload the info buffer, but that buffer only has the ray (dir, org) information and a few other things, like whether to skip the ray, the accumulated color, and that’s about it.

0
117 May 03, 2013 at 23:12

First of all, the code should be:

for (int spp = 0; spp < 32; spp++) {

    Init_Info_Buffer; // reset path termination flags, among other things

    for (int depth = 0; depth < 3; depth++) {

        if (depth == 0)
        {
            Kernel_Gen_Cam_Rays;
            Kernel_Shoot_Rays;
        }
        else
        {
            Kernel_Gen_Random_Hemi_Rays; // diffuse rays
            Kernel_Shoot_Rays; // if a ray hits the sky, record the color, else keep the path for the next depth
        }

        // begin direct illumination for non-terminated paths;
        // discard camera rays to the sky, for example, but still record the sky color

        Kernel_Gen_Rays_To_Random_Light;
        Kernel_Shoot_Rays; // for non-terminated paths, discard rays not hitting any lights

    }

    Send_Pixels_To_Screen;

}


Basically you want to shoot the primary rays just once (and also the explicit rays for them). Then you cast the 1st step of your path, generated from the results of the primary rays (plus its explicit step), then the 2nd step of your path, generated from the 1st step (plus its explicit step), etc. etc. (until you meet some condition, e.g. the path length limit).

I’ll have a bit of time tomorrow afternoon (first I gotta go swimming though; it’s important to stay in some kind of shape when you’re sitting at the PC too much :D). I’ll try to find time to quickly put together a sample path tracer in OpenCL that demonstrates the whole idea.

As for the speed, you probably won’t get much better than a CPU (I’d say for complex scenes you get just a few times faster than the best CPUs of today, on the best GPUs). GPUs have huge “horsepower”, but only brute force; better (first of all bigger, and maybe a bit faster) caches would help a lot. Better drivers would too (honestly they suck, they suck very hard, especially on NVidia).

0
109 May 04, 2013 at 04:44

swimming? you are so lucky. Me I’m still in the snow right now.

oh yes! I was doing it that way, but I did it wrong in my message, sorry!

>sample pathtracer in OpenCL
Yes please, I would really like that, thank you.

I didn’t know the GPU isn’t that much faster than a CPU. So how do all those videos on YouTube about real-time path tracing get 10+ FPS with millions of triangles???

0
109 May 17, 2013 at 17:25

how did the swimming go, Vilem? hope you didn’t drown, haven’t heard from you for a while :lol: :lol: :lol:

0
117 May 18, 2013 at 00:06

Ah sorry, when I came back I didn’t realize I had posted here and the thread wasn’t on the first page anymore. Here is the sample, which works on AMD graphics cards - http://wikisend.com/…57596/PT_01.rar. If you have NVidia: I have to run a benchmark or two with it on an NVidia GPU tomorrow (and thus make it actually run on NVidia), then I can post that too.

It is very simple and computes just one path at a time (for benchmarking purposes right now). The OpenCL code is generic and non-specific (and thus slower, compared to specific code like Aila’s implementation). I’m also thinking of trying stack-less traversal, as several people (including Dade from ompf, where I asked for more hints on a faster non-specific implementation) have told me it is faster than stack-based (at least on current architectures -> Radeon 6xxx and up).

0
109 May 18, 2013 at 01:39

Thanks Vilem, and as you suspected, it isn’t running on NVidia; it says…

Initializing OpenGL
Error

I’m looking at your .cl code so I can understand how you do it. But what about the main loop, how does it call the kernel(s)? Will you provide the sample C code (or the pseudo code) for the loop (please :rolleyes: )? Thanks Vilem