I think your crash is memory related. I see you're using a 4-component float vector (float4), which means you should have allocated [SIZE * 4 * sizeof(float)] bytes in both your device and host memory. Since you're only using the first component of each vector while leaving the others zeroed, I'm willing to believe you didn't supply a large enough memory block. Once the global ID passes [SIZE / 4], the kernel reads or writes past the end of the buffer and screams bloody murder.
Do you mind posting your C++ OpenCL host code? It could provide some clues.
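To make the sizing argument concrete, here is a minimal C++ sketch of the arithmetic. The helper names are hypothetical (they are not part of OpenCL), and I'm assuming a float4 is stored as four packed floats.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed for a buffer of `count` float4 elements.
// Assumes a float4 is four packed floats, i.e. 4 * sizeof(float) bytes each.
constexpr std::size_t float4BufferBytes(std::size_t count)
{
    return count * 4 * sizeof(float);
}

// Hypothetical helper: how many whole float4 elements fit in `bytes` bytes.
// Indexing a float4 buffer at or beyond this count runs off the end.
constexpr std::size_t float4ElementsThatFit(std::size_t bytes)
{
    return bytes / (4 * sizeof(float));
}
```

If the buffer was sized as SIZE * sizeof(float) but the kernel indexes it as float4, then float4ElementsThatFit(SIZE * sizeof(float)) gives SIZE / 4, which is exactly where I'd expect the out-of-bounds access to start.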
Also, for efficiency reasons you should maximize the use of vectors instead of computing the trig on individual floats. For example, let's say you want to calculate 1024 sin and cos values. This is how I would set it up.
// Allocate host memory
unsigned int size = 1024;
float *hostRead = new float[size];
float *hostWriteSin = new float[size];
float *hostWriteCos = new float[size];
// Initialize the read array
for (unsigned int i = 0; i < size; ++i)
    hostRead[i] = (i / (float)size) * Math::PI;
// Create device memory and copy values from host memory
cl_mem memRead = clCreateBuffer(clContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(float), hostRead, &result);
cl_mem memWriteSin = clCreateBuffer(clContext, CL_MEM_WRITE_ONLY, size * sizeof(float), 0, &result);
cl_mem memWriteCos = clCreateBuffer(clContext, CL_MEM_WRITE_ONLY, size * sizeof(float), 0, &result);
// Run kernel, copy back device memory to host memory, etc..
You know you have 1024 values, but since each float4 holds four of them, you can reduce the work to 1024 / 4 = 256 work-items (which maps well onto SIMD hardware). So, your kernel should look like this:
__kernel void vector_gpu(__global const float4 *src,
                         __global float4 *dst,
                         __global float4 *dst2,
                         const int num)
{
    // Where "num" == 1024 / 4 = 256. Only 256 executions.
    const int globalId = get_global_id(0);

    // Skip any work-items past the end of the buffers.
    if (globalId < num)
        // sincos returns the sine and stores the cosine through its second argument.
        dst[globalId] = sincos(src[globalId], &dst2[globalId]);
}
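If you want to check the results you read back, a plain C++ stand-in for the kernel can help. This is only a sketch: sincosVec4Reference is a hypothetical name, it runs on the CPU with std::sin and std::cos, and it treats each float4 as four consecutive floats, so work-item globalId covers elements 4 * globalId through 4 * globalId + 3.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU stand-in for the float4 kernel (hypothetical helper, not OpenCL API).
// `num` is the number of float4 elements, e.g. 256 for 1024 floats.
// All three vectors must hold at least num * 4 floats.
void sincosVec4Reference(const std::vector<float>& src,
                         std::vector<float>& dstSin,
                         std::vector<float>& dstCos,
                         std::size_t num)
{
    // Outer loop plays the role of the work-items (get_global_id(0)).
    for (std::size_t globalId = 0; globalId < num; ++globalId) {
        // Inner loop plays the role of the four float4 components.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const std::size_t i = globalId * 4 + lane;
            dstSin[i] = std::sin(src[i]);
            dstCos[i] = std::cos(src[i]);
        }
    }
}
```

Comparing hostWriteSin and hostWriteCos against this with a small tolerance should work, since the device's sincos may not match the host math library bit for bit.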