OpenCL vs Cuda question
#1
Posted 04 January 2012 - 02:42 AM
I know next to nothing about OpenCL and Cuda, but do OpenCL run on Nvidia cards? And do Cuda runs on ATI cards? Is OpenCL for Mac only? Are both just as good? Will one die off (like the old Beta tapes did vs VHS)? Is there something better, or easier? Sorry for all the questions!
I want to run some code in parallel. I have a 4 core Intel CPU with 8 threads and my code works perfect in 8 threads, but I want more speed, so I thought to use the GPU. I read many papers on OpenCL and stuff, but it seems so hard!! I'm not a beginner in programming (did it since the Commodor64 when I was 16), and still I can't get a grasp on the OpenCL stuff. I know the concept, but I don't know where to start, which dependencies I need, DLL, drivers and whatever. I know I have to write some OpenCL code and put that as text for the OpenCL to compile, but that's about all I know. I don't know how the thing works. Any one can please help me on this?
Thanks,
Al
#2
Posted 04 January 2012 - 05:06 AM
Alienizer said:
- Yes, OpenCL is a standard, like OpenGL. All the major players work together with the Khronos group to produce a standard set of APIs.
Alienizer said:
- No, CUDA is a proprietary GPGPU API, toolset, and driver developed and maintained by nVidia for nVidia GPUs only. AMD, as well as Intel, have chosen to adopt OpenCL instead (the right choice IMO). Nevertheless, OpenCL requires two things. You must download the appropriate OpenCL drivers and framework for your video card. You can find them on your video manufacturers website. Take a look at the table in this wiki article. Should help you find what you're looking for.
Alienizer said:
- No, although it was Apple who first started the OpenCL standard. OpenCL works on Windows, Mac, and Linux. It is also currently being brought to the web via WebCL.
Alienizer said:
- The general consensus is that nVidia produces a better toolset to work with. Much like how some prefer to work with their Cg shaders instead of HLSL or GLSL. I personally prefer not to get intertwined with proprietary technology and opt for the open standards instead. From my experience, OpenCL is fairly straightforward once you get to know the API, although it can be a bit daunting at first. I learned OpenCL from the AMD APP SDK, which had some documentation equivalent to reading assembly code
Alienizer said:
- Not in the foreseeable future, but GPGPU is mostly reserved for academia. It's not a widely spoken topic right now, although that might change once WebCL comes around. You can do interesting things with GPGPU, but you can do equally visually interesting things using the normal programmable graphics pipeline. It all depends on what you want to do.
#3
Posted 04 January 2012 - 07:12 PM
Is it possible to pre-compile the kernel and use the "binary?" as oppose to always use the string format of it?
If I were to make a game for everyone to play, do I have to give them some re-distributable, or do they simply have to update their video driver for the latest and all should work (on Nvidia and ATI)?
#4
Posted 04 January 2012 - 09:38 PM
Alienizer said:
- As I mentioned earlier, GPGPU is mostly used by the academic or scientific community. These are communities that lack people skills
Alienizer said:
- Yes. In OpenCL the function clGetProgramInfo(...) can retrieve the program binary after you compile the kernel with a call to clCreateProgramWithSource(...) and clBuildProgram(...). You should note that the binary returned is device specific. It may not work on other hardware. Unless there's a compelling reason to risk that, it's better to just distribute the kernel source and have the target hardware compile for best optimization and compatibility. If you want to protect your kernel source, just encrypt it or compile it into your exe as a resource.
Alienizer said:
- No, but your users must have OpenCL drivers installed along with a supported video card. Both AMD and nVidia bundle OpenCL drivers with their standard driver package, so you should be fine.
#5
Posted 04 January 2012 - 10:48 PM
TheNut, on 04 January 2012 - 09:38 PM, said:
I see! Thanks.
TheNut, on 04 January 2012 - 09:38 PM, said:
Perfect. I just have to check if OpenCL is installed. Is there an API or something to test if it is installed?
I use clGetDeviceInfo to get the LocalWorkSize as oppose to use 256 as in the demo. Is that the right way to do that?
In the demo, they have an array with 11 million cl_float. It works fine, I changed it to 50 million, and it works, 60 as well, but 70 miilion doesn't. How do we know how large of an array it can handle? Is that based on the total RAM on the video card? The exmaple, has 3 arrays, each 11 miilion of cl_float. That's about 88MB and my card has 2GB. Now, with 70 million, it's 560MB, why isn't it taking it?
Thanks again for your help TheNut
#6
Posted 04 January 2012 - 11:55 PM
Alienizer said:
I haven't seen the demo, so I'm not sure what they're doing. Using the device info function provides useful hints about what your maximums are, so yes, it is a good idea to fetch data from it. The work values you use to execute a kernel however vary. It's not like you should use the same value every time for every problem. It depends how you divide up your tasks and how you would like to perform them. I'm not sure why you're having problems with increasing memory usage, but you might want to check the clCreateBuffer(...) function and see if your host/system memory usage fits in with what you're allocating.
#7
Posted 05 January 2012 - 12:01 AM
TheNut, on 04 January 2012 - 11:55 PM, said:
I haven't seen the demo, so I'm not sure what they're doing. Using the device info function provides useful hints about what your maximums are, so yes, it is a good idea to fetch data from it. The work values you use to execute a kernel however vary. It's not like you should use the same value every time for every problem. It depends how you divide up your tasks and how you would like to perform them. I'm not sure why you're having problems with increasing memory usage, but you might want to check the clCreateBuffer(...) function and see if your host/system memory usage fits in with what you're allocating.
oh ok, I get it, thanks.
One more thing, having 8 host threads on the CPU, can we (or is there a way) to run kernels on the GPU in this fasion? Or do we have to run the main app in a single CPU thread? I know OpenCL isn't thread safe, but I don't see how to implement this in a thread safe way! Is there a ready sate to test for?
#8
Posted 05 January 2012 - 01:00 AM
#9
Posted 05 January 2012 - 01:58 AM
Reedbeta, on 05 January 2012 - 01:00 AM, said:
I see what you mean. Make sense! Thanks Reedbeta.
But, does that mean that I have to write the whole game in OpenCL (of course without the UI and stuff)? For example, a chess game. It uses recursion, which OpenCL doesn't do. So the way I understand it, I have to make a bunch of arrays of different things the game uses, piece positions, score, color, kind etc. and run different kernels depending on the stage it's at?
So for the chess game example, I set up an array with the initial board position, run the kernel to compute the score, then for depth 2, I run a loop of all valid moves, make a move, then for that move I run the kernel to get the score, then go on to the next valid move until done, then repeat for depth 3?
In other words, on the host I would...
for depth = 1 to 5 {
for each valid move {
make the move
run kernel to get score of current board position
undo last move
}
}
Or I am seeing the world upside down?
#10
Posted 05 January 2012 - 03:12 AM
#11
Posted 05 January 2012 - 03:42 AM
Reedbeta, on 05 January 2012 - 03:12 AM, said:
ok thanks Reedbeta
#12
Posted 05 January 2012 - 04:50 AM
#13
Posted 05 January 2012 - 04:35 PM
TheNut, on 05 January 2012 - 04:50 AM, said:
I know what you mean! I've been up all night trying to figure out how to make this work. Your explanation is most valuable, I can see my problem now.
TheNut, on 05 January 2012 - 04:50 AM, said:
I'm running a GTX-280 which has 240 processing cores. a host loop of 50 and 10 million array items (doing a dot on two float4), it takes my CPU 7 secs to do, on the GPU it takes 0.6 sec.
As for the Chess, I was not successful. I was going about it the wrong way and I want to do it your way. But, I;m not sure if I understand it. What do you mean by "9 compute units"? and 2304? My GTX has 240 processing cores, yours has over 2400?
Maybe I'm missing the technical aspect of OpenCL/GPU. I thought it runs a kernel on arrays of numbers, in my case, 240 items at a time. Is that wrong?
#14
Posted 05 January 2012 - 05:34 PM
#15
Posted 05 January 2012 - 06:53 PM
In hardware terms, the number of stream processors are what counts. I'm not a hardware expert, but a GPU is divided into 3 components. You have your compute units (similar to cores in CPU terms), each compute unit has a number of stream processors (similar to SIMD units), and each stream processor has a number of processing elements (which is like your ALU). My Radeon is advertised with 720 stream processing units, which break down into 9 compute units, 16 stream processors per compute unit, and 5 processing elements per stream processor (you have to lookup some of this info). It's not necessary to break it down all the way, but it's interesting to know. Now since each processing element can perform both a mult and an add in a single shot, that equates to 1440 flops per hertz (720 x 2). At 700Mhz, that means my card pushes 700 x 1440 ~= 1 terra flops. If you take the number of arithmetic operations in your kernel and divide by that value, you can get a _minimum_ estimate of how much time your kernel requires. There is of course a ton of other factors, but this would be a best-case scenario.
#16
Posted 05 January 2012 - 10:53 PM
So in order to program something, do we have to take into consideration the (in your case) 9 units, 16 streams and 5 elements? What I don't get is, how to split the work into those! Thanks again TheNut for helping me out.
BTW, I like your website, and your Nutty wallpaper (it's my background now) and Companion Cube Wallpaper are awesome! Did you render those with your own program?
#17
Posted 06 January 2012 - 01:31 AM
Ad arch - F.e. Radeon's smallest units are stream processing units formed in SIMD cores (computing units), they're overlooked by Ultra-threaded processor connected through local data share on every SIMD core to SIMD core (Ultra threaded processor has Instruction cache that feeds instructions to the SIMD cores). They're also connected through L1 cache to L2 cache (and global data share between SIMD cores), L2 is then connected to memory controller.
AMD stream processing units in SIMD cores are different than NVidia ones. AMD has stream processing units in groups of 6, where 4 are the same (ALU-like unit with set of FP/INT operations), 5th has different set of FP/INT arithmetic operations and the last one looks after them (This group is called stream processor). NVidia has stream processing units in groups of 9, where 8 are the same (ALU-like with one set of FP/INT arithmetic operations) and the last one looks after them. NVidia stream processing units are also a bit more powerful (in term of computation) than AMDs.
So for performance what counts are SIMD cores = compute units. Every SIMD core is similar to standard x86-core with SIMD unit (where ALUs in SIMD unit would do the same function as stream processors do in SIMD core in GPU) - so, my HD6770 here has 10 compute units - so it could be compared (in case of behaviour) to 10-core CPU.
Ad 'The Nut counts' (Note: Sorry for repeating some of info you provided - trying to keep it clear)
If his gpu has 720 stream processing units, broke into 9 compute units (to make a joke on Nutty address - I have 10 now
__kernel void main(__global float *in_a, __global float *in_b, __global float *in_c, __global float *out_a, const int count)
{
const int id = get_global_id(0);
if(id < count)
{
out_b[id] = in_a[id] * in_b[id] + in_c[id];
}
}
This will actually perform mainly (not only) the MAD instruction (multiply-and-add) in single hertz, but here is the probem, not every instruction does perform two operations at time, actually most perform just single one. The code without any two-op instructions would have at peak just half the TFlops count (means some 0.5 TFlops). And you're still not counting with memory (accessing VRAM isn't free), waiting for data for RAM (that is even more expensive), and also that the compiled code is not nearly as effective as single instruction! (and Flops goes very quickly down).
Even though, if you count the resulting (G)Flops for some algorithm you will probably get the higher counts per core than CPU has (well as long as it won't be SIMD uneffective). Also we're mostly comparing some 10-SIMD core GPU with 2-core or 4-core CPU (what about to compare it to new 16-core Opteron? It might be interesting - though GPU still fetches data from VRAM faster than CPU from RAM (depends on RAM speed, I had older 1066 DDR2 before and now I use 2133 DDR3 (good Santa Claus
Huh, hope that there isn't too much informations at once. Also I'm not GPU architect neither GPU expert, it is just kind of my hobby, so...
If you don't know how to speed up application, go "roarrrrrr!", hit the compiler with the club and use -O3 :D
#18
Posted 06 January 2012 - 02:26 AM
Vilem Otte, on 06 January 2012 - 01:31 AM, said:
Make sense!
Vilem Otte, on 06 January 2012 - 01:31 AM, said:
AMD stream processing units in SIMD cores are different than NVidia ones. AMD has stream processing units in groups of 6, where 4 are the same (ALU-like unit with set of FP/INT operations), 5th has different set of FP/INT arithmetic operations and the last one looks after them (This group is called stream processor). NVidia has stream processing units in groups of 9, where 8 are the same (ALU-like with one set of FP/INT arithmetic operations) and the last one looks after them. NVidia stream processing units are also a bit more powerful (in term of computation) than AMDs.
So for performance what counts are SIMD cores = compute units. Every SIMD core is similar to standard x86-core with SIMD unit (where ALUs in SIMD unit would do the same function as stream processors do in SIMD core in GPU) - so, my HD6770 here has 10 compute units - so it could be compared (in case of behaviour) to 10-core CPU.
So, if I code for SSE on my Intel i7-3960X (6 cores 12 thread, 3.6GHZ, DDR3 good Santa Claus
Vilem Otte, on 06 January 2012 - 01:31 AM, said:
If his gpu has 720 stream processing units, broke into 9 compute units (to make a joke on Nutty address - I have 10 now
__kernel void main(__global float *in_a, __global float *in_b, __global float *in_c, __global float *out_a, const int count)
{
const int id = get_global_id(0);
if(id < count)
{
out_b[id] = in_a[id] * in_b[id] + in_c[id];
}
}
This will actually perform mainly (not only) the MAD instruction (multiply-and-add) in single hertz, but here is the probem, not every instruction does perform two operations at time, actually most perform just single one. The code without any two-op instructions would have at peak just half the TFlops count (means some 0.5 TFlops). And you're still not counting with memory (accessing VRAM isn't free), waiting for data for RAM (that is even more expensive), and also that the compiled code is not nearly as effective as single instruction! (and Flops goes very quickly down).
I've noticed that. I ran a code similar to yours and it was real fast. But when I use more complex operation, it slows down alot. One code I tested on double4 cross/dot/sqrt and such were not that much faster than my CPU!
Vilem Otte, on 06 January 2012 - 01:31 AM, said:
Huh, hope that there isn't too much informations at once. Also I'm not GPU architect neither GPU expert, it is just kind of my hobby, so...
Great article, never too much info for me, the more info the more I learn, and this info is sure valuable, thank you!
#19
Posted 06 January 2012 - 05:35 AM
Glad you enjoy my wallpaper
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












