CUDA distance fields

rouncer 103 Jan 02, 2013 at 05:49

This is my very first CUDA program, and it's just a hacked-up version of the bindlessTexture example.
It marches a distance field of a sphere and reports back the normal on a hit.
I'm only getting 15 fps when it should be doing 1000 fps, since it only runs 100 steps and then caps off, and I
have a GTX 690. I don't even know how to use more than one CUDA device at once, though, as it's two 680s. Do you
have to code for multiple GPUs?
I think I have to assign threads in a different way; actually, I'm not sure how the blocks, threads and grids even work. How would you
assign threads for a raytracer?

I figure when I'm doing this for real, I'm allowed to use the vector library that comes with D3D, am I not?

cudadistancefield.png

Here's the code for the kernel, even though this might not be the problem.

__global__ void
d_render(uchar4 *d_output, uint imageW, uint imageH, float lod)
{
    uint x = blockIdx.x * blockDim.x + threadIdx.x;
    uint y = blockIdx.y * blockDim.y + threadIdx.y;

    float u = x / (float) imageW;
    float v = y / (float) imageH;

    float3 orig = {0, 0, -100};
    float3 dir = {(u - 0.5f) * 2, (v - 0.5f) * 2, 1};

    dir = dir / sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);

    float3 rp = orig;
    bool hit = false;
    int step;
    for (step = 0; step < 100; step++)
    {
        if (hit == false)
        {
            float dist = sqrt(rp.x * rp.x + rp.y * rp.y + rp.z * rp.z) - 40;
            if (dist <= 1)
            {
                hit = true;
            }
            rp += dir * dist * 0.5f;
        }
    }

    float4 color;

    if (hit)
    {
        float3 nor = rp / sqrt(rp.x * rp.x + rp.y * rp.y + rp.z * rp.z);
        color.x = (nor.x + 1) / 2;
        color.y = (nor.y + 1) / 2;
        color.z = (nor.z + 1) / 2;
        color.w = 1;
    }
    else
    {
        color.x = 0;
        color.y = 0;
        color.z = 0;
        color.w = 1;
    }

    uint i = y * imageW + x;
    d_output[i] = to_uchar4(color * 255.0);
}
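
For readers following the last lines of the kernel, here is a small sketch in plain C++ of how a unit normal can be mapped to an 8-bit colour. The thread does not show `to_uchar4`, so the clamping and truncation below are assumptions, and `Rgba8`, `toByte` and `normalToColor` are illustrative names rather than CUDA types.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative stand-in for uchar4 (assumption, not the CUDA type).
struct Rgba8 { std::uint8_t r, g, b, a; };

// Clamp a channel to [0, 1], scale to [0, 255] and truncate.
inline std::uint8_t toByte(float c) {
    float clamped = std::min(std::max(c, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f);
}

// Map each normal component from [-1, 1] to [0, 1], as the kernel does
// with (nor.x + 1) / 2, then pack it with an opaque alpha.
inline Rgba8 normalToColor(float nx, float ny, float nz) {
    Rgba8 c;
    c.r = toByte((nx + 1.0f) * 0.5f);
    c.g = toByte((ny + 1.0f) * 0.5f);
    c.b = toByte((nz + 1.0f) * 0.5f);
    c.a = 255;
    return c;
}
```

A normal pointing straight at the camera, (0, 0, -1), comes out as mid-grey red and green with no blue under these assumptions.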

16 Replies

Albertone 101 Jan 02, 2013 at 08:38

Did you take a look at Inigo Quilez’s website? http://iquilezles.org/default.html
There’s some interesting material on distance field rendering there. In general, that site is a true goldmine!

rouncer 103 Jan 02, 2013 at 11:55

Thanks for that, but I was just wondering why it's going so slow. It's pretty hard to stuff up a single-sphere raymarcher, so it's not the raymarching; I've done something system-wise that's failing.

rouncer 103 Jan 02, 2013 at 12:24

When I changed the window block size from 16 to 64 I got it up to 40 fps. Why does that happen? I'm not sure if it's still as fast as it should be, or if I'm even taking advantage of the whole GTX 690. It went faster, but I got an error from it: it said it was an invalid configuration, and it was constantly pumping out debug output.

const dim3 windowSize(512, 512);
const dim3 windowBlockSize(64, 64, 1);
const dim3 windowGridSize(windowSize.x / windowBlockSize.x, windowSize.y / windowBlockSize.y);

Really confused, can someone help?
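
A note on the "invalid configuration" error, offered as a sketch rather than a diagnosis of this exact setup: a 64x64 block asks for 4096 threads, while GPUs of this generation allow at most 1024 threads per block, so the launch is rejected. A rejected launch does no work, which may be why the frame rate appeared to go up. The plain C++ helper below only does the arithmetic. `LaunchDims`, `makeLaunchDims`, `divUp` and `kMaxThreadsPerBlock` are illustrative names, not part of the CUDA API, and the 1024 limit is an assumption that should be confirmed with `cudaGetDeviceProperties` on the actual card.

```cpp
// Assumed per-block thread limit (1024 on compute capability 2.x and later);
// query cudaDeviceProp::maxThreadsPerBlock for the real value.
constexpr unsigned kMaxThreadsPerBlock = 1024;

struct LaunchDims {
    unsigned blockX, blockY;  // threads per block
    unsigned gridX, gridY;    // blocks per grid
    bool valid;               // false if the block shape cannot be launched
};

// Round up so image sizes that are not a multiple of the block size are covered.
inline unsigned divUp(unsigned a, unsigned b) { return (a + b - 1) / b; }

inline LaunchDims makeLaunchDims(unsigned imageW, unsigned imageH,
                                 unsigned blockX, unsigned blockY) {
    LaunchDims d = {blockX, blockY, 0, 0, false};
    if (blockX == 0 || blockY == 0) return d;                 // empty block
    if (blockX * blockY > kMaxThreadsPerBlock) return d;      // e.g. 64 x 64 = 4096
    d.gridX = divUp(imageW, blockX);
    d.gridY = divUp(imageH, blockY);
    d.valid = true;
    return d;
}
```

For a 512x512 window, a 16x16 block gives a 32x32 grid and is accepted, while 64x64 is flagged as invalid. Because `divUp` can overshoot the image, a kernel launched this way would need a guard such as `if (x >= imageW || y >= imageH) return;` before it writes.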

rouncer 103 Jan 02, 2013 at 14:04

I looked around for more info, and I read that I'm doing it exactly as I should, so why is it so slow? (2 fps at 1024x1024)

Stainless 151 Jan 02, 2013 at 15:36

just look at the work you are doing for each pixel.

100 square roots in the inner loop? per pixel?

since you are comparing against 1 anyway, the sqrt is redundant. 1 x 1 = 1; square root 1 = 1

remove the square root in the inner loop for a start

You have that -40 after the sqrt, which is rubbish. Think about it, you know better than that.

Were you tired when you wrote this? You don’t normally make silly mistakes like that. :P

Say the distance calculated is 30: the result of that calculation will be -10, which is less than 1, so you have a hit.

Also, when you have a hit, why do you continue doing the loop?

Have a coffee, man, and look at something else for an hour. Then come back to it and you will smack yourself in the face and fix it.

rouncer 103 Jan 03, 2013 at 06:53

I know the code is bad, it's just that I just proved that it should be going 100 times faster. I inserted the code in a fragment shader, launched D3D11 and got 200 fps instead of 2 fps at about 1000x1000 with the exact same code, square roots and all. So it's not the raymarching code that's the problem at this stage; I'm doing something wrong in CUDA.

Maybe I haven't set it up properly?

Stainless 151 Jan 03, 2013 at 09:55

float dist = sqrt((rp.x*rp.x)+(rp.y*rp.y)+(rp.z*rp.z))-40;

What you want to do is say, “I have a sphere of radius 1 at 0,0,40”

What that code does is say “If the sample point is closer to the camera than 40 units, draw a bit of a sphere”

float3 spherepos={0,0,40};
float3 range = spherepos-rp;

Then if you want to be technically correct, (or you use a radius other than 1)

if (length(range)<=radius)
{
      // hit
}

Your loop (in pseudo code) is

loop 100 times
    if I haven’t had a hit ... move sample point, check for hit
    if I have had a hit ... continue looping

What you should be doing is

loop 100 times
    move sample point, check for hit
    if hit ... render pixel and quit

Currently you are looping 100 times regardless

Also you are using the broken distance calculation to move the sample point. I can’t actually work out what that will do, but it’s not good.

You will get away with bad shader code on a nice machine because of the sheer power of modern graphics cards, but run the same code on a machine with an older card and man will it grind.
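
The "check for hit, then quit" loop above can be sketched in plain C++ so it can be checked on the CPU. This assumes the sphere in the original kernel is meant to have radius 40 and sit at the origin, in which case `length(p) - 40` is its signed distance. `Vec3`, `MarchResult`, `march` and the far limit of 1000 units are illustrative choices, not taken from the thread.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline float vecLength(Vec3 v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Signed distance from p to a sphere of the given radius centred at the origin.
inline float sphereDistance(Vec3 p, float radius) {
    return vecLength(p) - radius;
}

struct MarchResult {
    bool hit;   // true if the ray came within 1 unit of the surface
    int steps;  // loop iterations completed before returning
    Vec3 pos;   // sample point when the loop stopped
};

// March along dir (assumed normalized), leaving the loop on the first hit.
inline MarchResult march(Vec3 orig, Vec3 dir, float radius, int maxSteps) {
    const float farLimit = 1000.0f;  // assumed cut-off for rays that miss
    Vec3 rp = orig;
    for (int step = 0; step < maxSteps; ++step) {
        float dist = sphereDistance(rp, radius);
        if (dist <= 1.0f) return {true, step, rp};    // hit: stop marching
        if (dist > farLimit) return {false, step, rp}; // far away: give up
        rp.x += dir.x * dist * 0.5f;                   // same half-step as the kernel
        rp.y += dir.y * dist * 0.5f;
        rp.z += dir.z * dist * 0.5f;
    }
    return {false, maxSteps, rp};
}
```

With the camera at (0, 0, -100) looking down +z, the centre ray reaches the hit threshold after a handful of iterations instead of running all 100.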

Stainless 151 Jan 03, 2013 at 10:20

I don’t know if this will help or hinder :D , but it’s how I do it.

I use a couple of cheats though, I render a sky cube to a texture storing the position of the pixel instead of the colour.

This is a speed up to help with creating the ray.

And I use a 3d texture as my scene, so really it’s volume rendering rather than ray tracing, but it’s so similar you might be able to steal bits.

// Raycasting fragment program implementation
fragment_out fragment_main( vertex_fragment IN,
                            uniform sampler2D tex,
                            uniform sampler3D volume_tex,
                            uniform float stepsize
                          )

{
  fragment_out OUT;
  float2 texc = ((IN.Pos.xy / IN.Pos.w) + 1) / 2; // find the right place to lookup in the backside buffer
  float4 start = IN.TexCoord; // the start position of the ray is stored in the texturecoordinate
  float4 back_position  = tex2D(tex, texc);
  float3 dir = float3(0,0,0);
  dir.x = back_position.x - start.x;
  dir.y = back_position.y - start.y;
  dir.z = back_position.z - start.z;
  float len = length(dir.xyz); // the length from front to back is calculated and used to terminate the ray
  float3 norm_dir = normalize(dir);
  float delta = stepsize;
  float3 delta_dir = norm_dir * delta;
  float delta_dir_len = length(delta_dir);
  float3 vec = start;
  float4 col_acc = float4(0,0,0,0);
  float alpha_acc = 0;
  float length_acc = 0;
  float4 color_sample;
  float alpha_sample;

  for(int i = 0; i < 450; i++)
    {
      color_sample = tex3D(volume_tex,vec);
      alpha_sample = color_sample.a * stepsize;
      col_acc   += (1.0 - alpha_acc) * color_sample * alpha_sample * 3;
      alpha_acc += alpha_sample;
      vec += delta_dir;
      length_acc += delta_dir_len;
      if(length_acc >= len || alpha_acc > 1.0) break; // terminate if opacity > 1 or the ray is outside the volume
    }

  OUT.Color =  col_acc;
  return OUT;
}

rouncer 103 Jan 03, 2013 at 12:59

Thanks man, I know it could be written better, but it's going too slow even for what it is. I told you I wrote a pixel shader in D3D11 and it went 300 fps instead of 2 fps. SAME CODE. Why the different framerate?

Vilem_Otte 117 Jan 04, 2013 at 02:01

I didn’t have time to react here yet, but now…

First of all, there is a huge difference between compiling OpenCL (or CUDA) source and compiling shaders, especially when the compiler has to take atomics, loops, huge branches, etc. into account. When you write the same code in the *same* language (technically they are different) and compile it with 2 different compilers (each with a different level of optimization), you get 2 different speeds on a single machine.

Also, shader languages are more limited, and they do some heavy optimizations (sometimes even “destructive” ones) that just can’t be done in OpenCL/CUDA, because someone might need that feature (in shader languages you say that specific features are unavailable).

Next thing: if your code is the same and there is more than a 100 times difference, try some very simple code that just writes data to a texture (adding 2 textures together or so; in CUDA, add 2 buffers and show the resulting buffer (texture) on screen through D3D/GL, using CL/GL or CL/D3D interop, respectively CUDA/GL or CUDA/D3D interop). The timings should be similar (the shader should be a bit faster, but no more than about 5 times at most; you will probably get around 2 times). If it is still 100 times slower, something is going terribly wrong (bad workgroup sizes, etc.; e.g. you are using just 1/16th of the processing units available on the GPU).

rouncer wrote: "I think I have to assign threads in a different way; actually, I'm not sure how the blocks, threads and grids even work. How would you assign threads for a raytracer?"

I have tried 2 approaches recently.

1st (faster on my AMDs with KD trees, not yet tried with BVHs) - use a single, quite big CL kernel to create rays, traverse them through the KD tree and test them against triangles. Quite fast, I get dozen(s) of MRays/s.

2nd - smaller CL kernels to generate rays, traverse them and shade. A bit slower (5% or so), but those kernels look more human than 1 big beast (with heavy use of #defines and other magic to be the fastest).

I should also try persistent threads, which we work with in the lab (but I haven’t implemented them in my framework yet).

EDIT: I have slept a bit and found a little time, so here are a few hints on the code that might speed it up a bit (if you have already fixed the loop to be faster, these might increase performance a little more):

// First of all, processing chars and integers on GPU is not as effective as using float
// so it's recommended to pass float4* d_output instead of uchar4* - it will be faster and a lot

// Note that here you should probably be writing directly to a texture and not to a buffer (which has to be
// read back, if I remember correctly); this wastes a lot of time. You basically run the compute kernel,
// wait till it finishes, read back the data it generates, and then write it back to GPU memory.
// Correct me if I'm wrong here and you're actually working with a texture. I'm used to working with OpenCL,
// where buffers and textures are separate (and from the few CUDA examples I tried
// a long time ago, I still think it works the same way in CUDA).

// Second thing: passing the image dimensions as uints is also a waste, since you're just dividing by
// these values converted to floating point. Pass floats directly, e.g. invImageW
// where invImageW = 1.0f / imageW, and the same goes for imageH,
// and pass the integer image width separately.
__global__ void d_render(uchar4 *d_output, uint imageW, uint imageH, float lod)
{
    uint x = blockIdx.x * blockDim.x + threadIdx.x;
    uint y = blockIdx.y * blockDim.y + threadIdx.y;
    float u = x / (float) imageW;
    float v = y / (float) imageH;

    float3 orig = {0, 0, -100};
    float3 dir = {(u - 0.5f) * 2, (v - 0.5f) * 2, 1};

    // There is a normalize in CUDA; I can't say whether it's faster, but it is most likely friendlier to the compiler
    // than this, so use dir = normalize(dir); it won't be slower than this, and most likely it'll even be faster.
    dir = dir / sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);

    // I think Stainless said everything here: the loop is basically done wrong (in terms of performance).
    float3 rp = orig;
    bool hit = false;
    int step;
    for (step = 0; step < 100; step++)
    {
        if (hit == false)
        {
            float dist = sqrt(rp.x * rp.x + rp.y * rp.y + rp.z * rp.z) - 40;
            if (dist <= 1)
            {
                hit = true;
            }
            rp += dir * dist * 0.5f;
        }
    }

    // Again, use normalize. Also, don't write to color component by component; it probably won't be done in a single mov, but in multiple ones.
    float4 color;
    if (hit)
    {
        float3 nor = rp / sqrt(rp.x * rp.x + rp.y * rp.y + rp.z * rp.z);
        color.x = (nor.x + 1) / 2;
        color.y = (nor.y + 1) / 2;
        color.z = (nor.z + 1) / 2;
        color.w = 1;
    }
    else
    {
        color.x = 0;
        color.y = 0;
        color.z = 0;
        color.w = 1;
    }

    // It would be better to write directly to a texture here, but if you want it in a buffer, then as I mentioned, use float4.
    uint i = y * imageW + x;
    d_output[i] = to_uchar4(color * 255.0);
}
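
The hint about passing `invImageW` and using a normalize helper could look roughly like the plain C++ sketch below. `Float3`, `normalize3` and `pixelRayDir` are illustrative names, not CUDA or SDK functions, and `invImageH` is assumed to be precomputed on the host in the same way as `invImageW`.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// Scale v to unit length with one reciprocal instead of three divisions.
inline Float3 normalize3(Float3 v) {
    float invLen = 1.0f / std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x * invLen, v.y * invLen, v.z * invLen};
}

// Ray direction for pixel (x, y). invImageW = 1.0f / imageW and
// invImageH = 1.0f / imageH are computed once on the host, so the
// per-pixel work is a multiply rather than a convert-and-divide.
inline Float3 pixelRayDir(unsigned x, unsigned y,
                          float invImageW, float invImageH) {
    float u = x * invImageW;
    float v = y * invImageH;
    return normalize3({(u - 0.5f) * 2.0f, (v - 0.5f) * 2.0f, 1.0f});
}
```

For a 512x512 image the centre pixel (256, 256) maps to the direction (0, 0, 1), matching the original `(u-0.5f)*2` setup.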
rouncer 103 Jan 04, 2013 at 14:43

So are you saying CUDA is a piece of shit compared to a shader?

I don't see how D3D11 can compile that code, piece of cake, but CUDA, the piece of shit, can't compile it…

Vilem_Otte 117 Jan 04, 2013 at 16:30

Not necessarily - it’s good for bigger things. With shaders you’re still quite limited: you can read/write just to textures (or transform feedback, but using that is **** (add any ugly word that comes to your mind)), the operations are all approximated for speed, etc.

On the other hand, if you want to write a complex ray tracer it’s better to stay in the CUDA/OpenCL world (or use a compute shader). OpenCL can handle large kernels without any problems, you can go for precision (because it can use precise math instead of fast math; note that OpenCL uses precise math by default, not fast), you can do large branching without hurting performance too much (which is good here), etc.
Also, accessing memory in OpenCL/CUDA is more in your hands than in a shader, so you can do all the high-performance stuff (note that you actually *have to* if you want good performance; in shaders the compiler and driver take care of that). You can use atomics to synchronize work between threads in your workgroup, etc. So there is a TON of stuff that is better done with OpenCL/CUDA.

I wouldn’t count distance fields among them, though; they’re probably better off in a standard shader (note that you can achieve approximately the same speed in OpenCL, but most likely with a lot more coding).

EDIT: I’d also like to name one problem here: tile-based deferred shading. It’s quite common these days and sounds like a great idea. But using OpenCL here complicates a lot of things compared to standard deferred shading, for example shadow maps. I know there are solutions (bind all available shadow maps, use shadow map atlases, etc.), but it’s still overhead, so you get slightly faster lighting but can lose somewhere else.

rouncer 103 Jan 05, 2013 at 06:46

All I can ask for is a simple raytracer demo I can compile and run that goes full speed. Can you help with that, pleeeeeeaaaasssee? I'm totally STUCK!!!!

I can't even go 10 steps without it slowing to 20 fps. This is not GPU speed, it's more like CPU speed.

Vilem_Otte 117 Jan 05, 2013 at 11:00

I don’t have that much time to write an example (and it’d still be in OpenCL, because I’m not that good a CUDA user), but here you go with one more complex example: http://users.softlab.ece.ntua.gr/~ttsiod/cudarenderer-BVH.html

If you can implement this, you will have a pretty robust and good ray tracer, even for game scenes.

rouncer 103 Jan 05, 2013 at 15:09

Look man, it's simple: if it can't do 100 steps, the whole thing is crippled down to CPU speed. Have I even got my video card working properly?

I was looking forward to using CUDA. I'm sick and tired of copy-and-paste D3D11; this could have opened a whole new world to me, but 2 fps? Look man, it's not the code.

HEY! I've just got a whole new idea: what if I wrap the bejeezus out of D3D11, like a game engine? It would save all the typing, and that's all CUDA would have given me.

Stainless 151 Jan 06, 2013 at 09:32

CUDA is not a magic spell. You can’t just install it, say “compile” three times, sacrifice a chicken and then get a 100 fps raytracer.

You run crap code in CUDA and you get crap performance, just like writing crap code in D3D gives you crap performance.

CUDA has a problem with threads, and it also has cache issues, but if you code with its limits in mind you can get good performance.

Read that example Vilem showed you, download the app and try it. If that example runs slowly, then you know that you have a problem with your machine and CUDA. At that point you should give up and attack D3D.

If it runs fast, then you know that it’s your code, and you need to study the example to find out why.