Jump to content


OpenCL vs Cuda question


  • You cannot reply to this topic
18 replies to this topic

#1 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 04 January 2012 - 02:42 AM

Hi,

I know next to nothing about OpenCL and Cuda, but do OpenCL run on Nvidia cards? And do Cuda runs on ATI cards? Is OpenCL for Mac only? Are both just as good? Will one die off (like the old Beta tapes did vs VHS)? Is there something better, or easier? Sorry for all the questions!

I want to run some code in parallel. I have a 4 core Intel CPU with 8 threads and my code works perfect in 8 threads, but I want more speed, so I thought to use the GPU. I read many papers on OpenCL and stuff, but it seems so hard!! I'm not a beginner in programming (did it since the Commodor64 when I was 16), and still I can't get a grasp on the OpenCL stuff. I know the concept, but I don't know where to start, which dependencies I need, DLL, drivers and whatever. I know I have to write some OpenCL code and put that as text for the OpenCL to compile, but that's about all I know. I don't know how the thing works. Any one can please help me on this?

Thanks,
Al

#2 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 04 January 2012 - 05:06 AM

Alienizer said:

do OpenCL run on Nvidia cards

- Yes, OpenCL is a standard, like OpenGL. All the major players work together with the Khronos group to produce a standard set of APIs.

Alienizer said:

do Cuda runs on ATI cards

- No, CUDA is a proprietary GPGPU API, toolset, and driver developed and maintained by nVidia for nVidia GPUs only. AMD, as well as Intel, have chosen to adopt OpenCL instead (the right choice IMO). Nevertheless, OpenCL requires two things. You must download the appropriate OpenCL drivers and framework for your video card. You can find them on your video manufacturers website. Take a look at the table in this wiki article. Should help you find what you're looking for.

Alienizer said:

Is OpenCL for Mac only?

- No, although it was Apple who first started the OpenCL standard. OpenCL works on Windows, Mac, and Linux. It is also currently being brought to the web via WebCL.

Alienizer said:

Are both just as good?

- The general consensus is that nVidia produces a better toolset to work with. Much like how some prefer to work with their Cg shaders instead of HLSL or GLSL. I personally prefer not to get intertwined with proprietary technology and opt for the open standards instead. From my experience, OpenCL is fairly straightforward once you get to know the API, although it can be a bit daunting at first. I learned OpenCL from the AMD APP SDK, which had some documentation equivalent to reading assembly code :) I remember there was missing documentation, especially related to memory management. If you are developing on an AMD card with their SDK, just be cautious about what you read.

Alienizer said:

Will one die off?

- Not in the foreseeable future, but GPGPU is mostly reserved for academia. It's not a widely spoken topic right now, although that might change once WebCL comes around. You can do interesting things with GPGPU, but you can do equally visually interesting things using the normal programmable graphics pipeline. It all depends on what you want to do.
http://www.nutty.ca - Being a nut has its advantages.

#3 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 04 January 2012 - 07:12 PM

Thanks TheNut, I appreciate the explanation. I downloaded the NVIDIA GPU Computing SDK and got the AddVector demo working. It's easier than I thought!! Why do they make it seem so complicated!?

Is it possible to pre-compile the kernel and use the "binary?" as oppose to always use the string format of it?

If I were to make a game for everyone to play, do I have to give them some re-distributable, or do they simply have to update their video driver for the latest and all should work (on Nvidia and ATI)?

#4 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 04 January 2012 - 09:38 PM

Alienizer said:

Why do they make it seem so complicated!?

- As I mentioned earlier, GPGPU is mostly used by the academic or scientific community. These are communities that lack people skills :D nVidia does a very good job bringing complicated topics down to Earth for most people to understand, hence their strong developer relations. As GPGPU evolves and becomes more mainstream, you'll likely see things get better for it.

Alienizer said:

Is it possible to pre-compile the kernel and use the "binary?" as oppose to always use the string format of it?

- Yes. In OpenCL the function clGetProgramInfo(...) can retrieve the program binary after you compile the kernel with a call to clCreateProgramWithSource(...) and clBuildProgram(...). You should note that the binary returned is device specific. It may not work on other hardware. Unless there's a compelling reason to risk that, it's better to just distribute the kernel source and have the target hardware compile for best optimization and compatibility. If you want to protect your kernel source, just encrypt it or compile it into your exe as a resource.

Alienizer said:

do I have to give them some re-distributable

- No, but your users must have OpenCL drivers installed along with a supported video card. Both AMD and nVidia bundle OpenCL drivers with their standard driver package, so you should be fine.
http://www.nutty.ca - Being a nut has its advantages.

#5 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 04 January 2012 - 10:48 PM

View PostTheNut, on 04 January 2012 - 09:38 PM, said:

Yes. In OpenCL the function clGetProgramInfo(...) can retrieve the program binary after you compile the kernel with a call to clCreateProgramWithSource(...) and clBuildProgram(...). You should note that the binary returned is device specific. It may not work on other hardware. Unless there's a compelling reason to risk that, it's better to just distribute the kernel source and have the target hardware compile for best optimization and compatibility. If you want to protect your kernel source, just encrypt it or compile it into your exe as a resource.

I see! Thanks.

View PostTheNut, on 04 January 2012 - 09:38 PM, said:

No, but your users must have OpenCL drivers installed along with a supported video card. Both AMD and nVidia bundle OpenCL drivers with their standard driver package, so you should be fine.

Perfect. I just have to check if OpenCL is installed. Is there an API or something to test if it is installed?

I use clGetDeviceInfo to get the LocalWorkSize as oppose to use 256 as in the demo. Is that the right way to do that?

In the demo, they have an array with 11 million cl_float. It works fine, I changed it to 50 million, and it works, 60 as well, but 70 miilion doesn't. How do we know how large of an array it can handle? Is that based on the total RAM on the video card? The exmaple, has 3 arrays, each 11 miilion of cl_float. That's about 88MB and my card has 2GB. Now, with 70 million, it's 560MB, why isn't it taking it?

Thanks again for your help TheNut

#6 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 04 January 2012 - 11:55 PM

Alienizer said:

Is there an API or something to test if it is installed?
clGetDeviceInfo(...) will try to get you an OpenCL device that matches your device type, which can be a way to test for OpenCL support. A proper way would be to query for the available OpenCL platforms, in which you call clGetPlatformIDs(...) and clGetPlatformInfo(...). Take a look at the OpenCL spec. It covers everything you need.

I haven't seen the demo, so I'm not sure what they're doing. Using the device info function provides useful hints about what your maximums are, so yes, it is a good idea to fetch data from it. The work values you use to execute a kernel however vary. It's not like you should use the same value every time for every problem. It depends how you divide up your tasks and how you would like to perform them. I'm not sure why you're having problems with increasing memory usage, but you might want to check the clCreateBuffer(...) function and see if your host/system memory usage fits in with what you're allocating.
http://www.nutty.ca - Being a nut has its advantages.

#7 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 12:01 AM

View PostTheNut, on 04 January 2012 - 11:55 PM, said:

clGetDeviceInfo(...) will try to get you an OpenCL device that matches your device type, which can be a way to test for OpenCL support. A proper way would be to query for the available OpenCL platforms, in which you call clGetPlatformIDs(...) and clGetPlatformInfo(...). Take a look at the OpenCL spec. It covers everything you need.

I haven't seen the demo, so I'm not sure what they're doing. Using the device info function provides useful hints about what your maximums are, so yes, it is a good idea to fetch data from it. The work values you use to execute a kernel however vary. It's not like you should use the same value every time for every problem. It depends how you divide up your tasks and how you would like to perform them. I'm not sure why you're having problems with increasing memory usage, but you might want to check the clCreateBuffer(...) function and see if your host/system memory usage fits in with what you're allocating.

oh ok, I get it, thanks.

One more thing, having 8 host threads on the CPU, can we (or is there a way) to run kernels on the GPU in this fasion? Or do we have to run the main app in a single CPU thread? I know OpenCL isn't thread safe, but I don't see how to implement this in a thread safe way! Is there a ready sate to test for?

#8 Reedbeta

    DevMaster Staff

  • Administrators
  • 4974 posts
  • LocationBellevue, WA

Posted 05 January 2012 - 01:00 AM

You probably don't need more than one CPU thread talking to OpenCL. In this kind of app, typically all the CPU does is management and housekeeping, getting everything set up and then telling the GPU what to do. It shouldn't be CPU intensive.
reedbeta.com - developer blog, OpenGL demos, and other projects

#9 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 01:58 AM

View PostReedbeta, on 05 January 2012 - 01:00 AM, said:

You probably don't need more than one CPU thread talking to OpenCL. In this kind of app, typically all the CPU does is management and housekeeping, getting everything set up and then telling the GPU what to do. It shouldn't be CPU intensive.

I see what you mean. Make sense! Thanks Reedbeta.

But, does that mean that I have to write the whole game in OpenCL (of course without the UI and stuff)? For example, a chess game. It uses recursion, which OpenCL doesn't do. So the way I understand it, I have to make a bunch of arrays of different things the game uses, piece positions, score, color, kind etc. and run different kernels depending on the stage it's at?

So for the chess game example, I set up an array with the initial board position, run the kernel to compute the score, then for depth 2, I run a loop of all valid moves, make a move, then for that move I run the kernel to get the score, then go on to the next valid move until done, then repeat for depth 3?

In other words, on the host I would...

for depth = 1 to 5 {
  for each valid move {
	make the move
	run kernel to get score of current board position
	undo last move
  }
}

Or I am seeing the world upside down?

#10 Reedbeta

    DevMaster Staff

  • Administrators
  • 4974 posts
  • LocationBellevue, WA

Posted 05 January 2012 - 03:12 AM

Something like that sounds right. I don't know enough about it to say whether that's the best way of iterating over a tree on the GPU - I know the GPGPU people have developed data structures and algorithms that are specialized for the GPU, and are sometimes very different-looking than the traditional CPU data structures and algorithms. But in general, yes, it's common to need to queue up a bunch of jobs of various sizes and accessing various buffers in order to get something done. This is true for graphics as well as for GPGPU use cases. :)
reedbeta.com - developer blog, OpenGL demos, and other projects

#11 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 03:42 AM

View PostReedbeta, on 05 January 2012 - 03:12 AM, said:

Something like that sounds right. I don't know enough about it to say whether that's the best way of iterating over a tree on the GPU - I know the GPGPU people have developed data structures and algorithms that are specialized for the GPU, and are sometimes very different-looking than the traditional CPU data structures and algorithms. But in general, yes, it's common to need to queue up a bunch of jobs of various sizes and accessing various buffers in order to get something done. This is true for graphics as well as for GPGPU use cases. :)

ok thanks Reedbeta

#12 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 05 January 2012 - 04:50 AM

Try to avoid using loops in a kernel. The more independent you make your work items, the better you will leverage the GPU. Chess is an interesting problem because there are many interesting ways to solve this using the GPGPU. One possible solution would be to have several work groups running their own chess games, with each group focusing on the results of moving one piece. So work group one would analyze all the possibilities of moving a knight to the bottom left. Each work item would be responsible for checking the repercussions of that action or probabilities leading towards a successful checkmate. Work item one for example would check to see if the king would become exposed by that movement. Work item 2 could check if the queen would be put in danger, etc. On my particular hardware, I have 9 compute units, each capable of running 256 work items in parallel. That's about 2304 games calculated per iteration. You can swift through some solutions pretty quickly at that rate. Far better than GNU Chess I'm sure, which I tire of waiting for every time ;)
http://www.nutty.ca - Being a nut has its advantages.

#13 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 04:35 PM

View PostTheNut, on 05 January 2012 - 04:50 AM, said:

Try to avoid using loops in a kernel. The more independent you make your work items, the better you will leverage the GPU. Chess is an interesting problem because there are many interesting ways to solve this using the GPGPU. One possible solution would be to have several work groups running their own chess games, with each group focusing on the results of moving one piece. So work group one would analyze all the possibilities of moving a knight to the bottom left. Each work item would be responsible for checking the repercussions of that action or probabilities leading towards a successful checkmate. Work item one for example would check to see if the king would become exposed by that movement. Work item 2 could check if the queen would be put in danger, etc. On my particular hardware, I have 9 compute units, each capable of running 256 work items in parallel. That's about 2304 games calculated per iteration. You can swift through some solutions pretty quickly at that rate. Far better than GNU Chess I'm sure, which I tire of waiting for every time ;)

I know what you mean! I've been up all night trying to figure out how to make this work. Your explanation is most valuable, I can see my problem now.

View PostTheNut, on 05 January 2012 - 04:50 AM, said:

I have 9 compute units, each capable of running 256 work items in parallel. That's about 2304 games calculated per iteration.

I'm running a GTX-280 which has 240 processing cores. a host loop of 50 and 10 million array items (doing a dot on two float4), it takes my CPU 7 secs to do, on the GPU it takes 0.6 sec.

As for the Chess, I was not successful. I was going about it the wrong way and I want to do it your way. But, I;m not sure if I understand it. What do you mean by "9 compute units"? and 2304? My GTX has 240 processing cores, yours has over 2400?

Maybe I'm missing the technical aspect of OpenCL/GPU. I thought it runs a kernel on arrays of numbers, in my case, 240 items at a time. Is that wrong?

#14 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 05:34 PM

ok, I check CL_DEVICE_MAX_COMPUTE_UNITS and on my GTX-280 I have 30. What does this means vs 240 processing cores???

#15 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 05 January 2012 - 06:53 PM

Sorry, it was late when I wrote that and I didn't mean to come off implying the relationship between work items and hardware execution. Rather, I wanted to illustrate that 9 compute units each have a work group with 256 work items. When all 9 compute units complete (however long that takes), then roughly 2304 games will have been calculated. I say roughly because it's not a synchronous operation (unless you force it), so some games may compute faster than others.

In hardware terms, the number of stream processors are what counts. I'm not a hardware expert, but a GPU is divided into 3 components. You have your compute units (similar to cores in CPU terms), each compute unit has a number of stream processors (similar to SIMD units), and each stream processor has a number of processing elements (which is like your ALU). My Radeon is advertised with 720 stream processing units, which break down into 9 compute units, 16 stream processors per compute unit, and 5 processing elements per stream processor (you have to lookup some of this info). It's not necessary to break it down all the way, but it's interesting to know. Now since each processing element can perform both a mult and an add in a single shot, that equates to 1440 flops per hertz (720 x 2). At 700Mhz, that means my card pushes 700 x 1440 ~= 1 terra flops. If you take the number of arithmetic operations in your kernel and divide by that value, you can get a _minimum_ estimate of how much time your kernel requires. There is of course a ton of other factors, but this would be a best-case scenario.
http://www.nutty.ca - Being a nut has its advantages.

#16 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 05 January 2012 - 10:53 PM

oh wow, ok! That's the best explanation I've seen.

So in order to program something, do we have to take into consideration the (in your case) 9 units, 16 streams and 5 elements? What I don't get is, how to split the work into those! Thanks again TheNut for helping me out.

BTW, I like your website, and your Nutty wallpaper (it's my background now) and Companion Cube Wallpaper are awesome! Did you render those with your own program?

#17 Vilem Otte

    Valued Member

  • Members
  • PipPipPip
  • 215 posts

Posted 06 January 2012 - 01:31 AM

You won't split work in those, you will enqueue the all work to-be-done to GPU and it will split itself (with help of driver) the work between compute units.

Ad arch - F.e. Radeon's smallest units are stream processing units formed in SIMD cores (computing units), they're overlooked by Ultra-threaded processor connected through local data share on every SIMD core to SIMD core (Ultra threaded processor has Instruction cache that feeds instructions to the SIMD cores). They're also connected through L1 cache to L2 cache (and global data share between SIMD cores), L2 is then connected to memory controller.

AMD stream processing units in SIMD cores are different than NVidia ones. AMD has stream processing units in groups of 6, where 4 are the same (ALU-like unit with set of FP/INT operations), 5th has different set of FP/INT arithmetic operations and the last one looks after them (This group is called stream processor). NVidia has stream processing units in groups of 9, where 8 are the same (ALU-like with one set of FP/INT arithmetic operations) and the last one looks after them. NVidia stream processing units are also a bit more powerful (in term of computation) than AMDs.

So for performance what counts are SIMD cores = compute units. Every SIMD core is similar to standard x86-core with SIMD unit (where ALUs in SIMD unit would do the same function as stream processors do in SIMD core in GPU) - so, my HD6770 here has 10 compute units - so it could be compared (in case of behaviour) to 10-core CPU.

Ad 'The Nut counts' (Note: Sorry for repeating some of info you provided - trying to keep it clear)
If his gpu has 720 stream processing units, broke into 9 compute units (to make a joke on Nutty address - I have 10 now :P) - then every compute unit of his has 80 stream processing units (e.g. 16 stream processors). Why just 5 counts, instead of 6 I've mentioned? The "police" (the one looking after them) doesn't count in these, because it doesn't perform any computation. That's what he said. Now he did something what is called marketing trick - thats the computation of flops - you can do 2 float ops in one herz, so at 700 MHz, you can do ~ 1TFlops ... but this is ideal peak performance (only counted on paper), you won't ever get near this. To get near this, you would have to write just this code for GPU:

__kernel void main(__global float *in_a, __global float *in_b, __global float *in_c, __global float *out_a, const int count)
{
	 const int id = get_global_id(0);
	 if(id < count)
	 {
          out_b[id] = in_a[id] * in_b[id] + in_c[id];
	 }
}

This will actually perform mainly (not only) the MAD instruction (multiply-and-add) in single hertz, but here is the probem, not every instruction does perform two operations at time, actually most perform just single one. The code without any two-op instructions would have at peak just half the TFlops count (means some 0.5 TFlops). And you're still not counting with memory (accessing VRAM isn't free), waiting for data for RAM (that is even more expensive), and also that the compiled code is not nearly as effective as single instruction! (and Flops goes very quickly down).

Even though, if you count the resulting (G)Flops for some algorithm you will probably get the higher counts per core than CPU has (well as long as it won't be SIMD uneffective). Also we're mostly comparing some 10-SIMD core GPU with 2-core or 4-core CPU (what about to compare it to new 16-core Opteron? It might be interesting - though GPU still fetches data from VRAM faster than CPU from RAM (depends on RAM speed, I had older 1066 DDR2 before and now I use 2133 DDR3 (good Santa Claus :D) - the difference can be seen when doing some memory-op heavy computations).

Huh, hope that there isn't too much informations at once. Also I'm not GPU architect neither GPU expert, it is just kind of my hobby, so...
My blog about game development (and not just game development) - http://gameprogramme...y.blogspot.com/

If you don't know how to speed up application, go "roarrrrrr!", hit the compiler with the club and use -O3 :D

#18 Alienizer

    Member

  • Members
  • PipPipPip
  • 266 posts

Posted 06 January 2012 - 02:26 AM

View PostVilem Otte, on 06 January 2012 - 01:31 AM, said:

You won't split work in those, you will enqueue the all work to-be-done to GPU and it will split itself (with help of driver) the work between compute units.

Make sense!

View PostVilem Otte, on 06 January 2012 - 01:31 AM, said:

Ad arch - F.e. Radeon's smallest units are stream processing units formed in SIMD cores (computing units), they're overlooked by Ultra-threaded processor connected through local data share on every SIMD core to SIMD core (Ultra threaded processor has Instruction cache that feeds instructions to the SIMD cores). They're also connected through L1 cache to L2 cache (and global data share between SIMD cores), L2 is then connected to memory controller.

AMD stream processing units in SIMD cores are different than NVidia ones. AMD has stream processing units in groups of 6, where 4 are the same (ALU-like unit with set of FP/INT operations), 5th has different set of FP/INT arithmetic operations and the last one looks after them (This group is called stream processor). NVidia has stream processing units in groups of 9, where 8 are the same (ALU-like with one set of FP/INT arithmetic operations) and the last one looks after them. NVidia stream processing units are also a bit more powerful (in term of computation) than AMDs.

So for performance what counts are SIMD cores = compute units. Every SIMD core is similar to standard x86-core with SIMD unit (where ALUs in SIMD unit would do the same function as stream processors do in SIMD core in GPU) - so, my HD6770 here has 10 compute units - so it could be compared (in case of behaviour) to 10-core CPU.

So, if I code for SSE on my Intel i7-3960X (6 cores 12 thread, 3.6GHZ, DDR3 good Santa Claus :D), would that be faster/same/slower than my GTX-280 (processor clock runs at 1.3GHZ - 240 shader cores)?


View PostVilem Otte, on 06 January 2012 - 01:31 AM, said:

Ad 'The Nut counts' (Note: Sorry for repeating some of info you provided - trying to keep it clear)
If his gpu has 720 stream processing units, broke into 9 compute units (to make a joke on Nutty address - I have 10 now :P) - then every compute unit of his has 80 stream processing units (e.g. 16 stream processors). Why just 5 counts, instead of 6 I've mentioned? The "police" (the one looking after them) doesn't count in these, because it doesn't perform any computation. That's what he said. Now he did something what is called marketing trick - thats the computation of flops - you can do 2 float ops in one herz, so at 700 MHz, you can do ~ 1TFlops ... but this is ideal peak performance (only counted on paper), you won't ever get near this. To get near this, you would have to write just this code for GPU:

__kernel void main(__global float *in_a, __global float *in_b, __global float *in_c, __global float *out_a, const int count)
{
	 const int id = get_global_id(0);
	 if(id < count)
	 {
		  out_b[id] = in_a[id] * in_b[id] + in_c[id];
	 }
}

This will actually perform mainly (not only) the MAD instruction (multiply-and-add) in single hertz, but here is the probem, not every instruction does perform two operations at time, actually most perform just single one. The code without any two-op instructions would have at peak just half the TFlops count (means some 0.5 TFlops). And you're still not counting with memory (accessing VRAM isn't free), waiting for data for RAM (that is even more expensive), and also that the compiled code is not nearly as effective as single instruction! (and Flops goes very quickly down).

I've noticed that. I ran a code similar to yours and it was real fast. But when I use more complex operation, it slows down alot. One code I tested on double4 cross/dot/sqrt and such were not that much faster than my CPU!

View PostVilem Otte, on 06 January 2012 - 01:31 AM, said:

Even though, if you count the resulting (G)Flops for some algorithm you will probably get the higher counts per core than CPU has (well as long as it won't be SIMD uneffective). Also we're mostly comparing some 10-SIMD core GPU with 2-core or 4-core CPU (what about to compare it to new 16-core Opteron? It might be interesting - though GPU still fetches data from VRAM faster than CPU from RAM (depends on RAM speed, I had older 1066 DDR2 before and now I use 2133 DDR3 (good Santa Claus :D) - the difference can be seen when doing some memory-op heavy computations).

Huh, hope that there isn't too much informations at once. Also I'm not GPU architect neither GPU expert, it is just kind of my hobby, so...


Great article, never too much info for me, the more info the more I learn, and this info is sure valuable, thank you!

#19 TheNut

    Senior Member

  • Moderators
  • 1472 posts
  • LocationThornhill, ON

Posted 06 January 2012 - 05:35 AM

Your video card is rated at 933 GFLOPS and your CPU was benched at ~120 GLOPS (I had to google for the figures). Your video card should run laps around your CPU (naturally).

Glad you enjoy my wallpaper ;) The logo and companion cube are 3d renders of mine. The former is actually a frame taken from an intro sequence I put in my video games. The fractal flames are based on my "Infinity" engine, which randomly generates fractals rendered by the CPU. I would actually like to get around to adding GPU support. Fractron 9000 renders their flames using the GPU and it is insanely fast.
http://www.nutty.ca - Being a nut has its advantages.





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users