OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

OpenCL on GPU programming


Hello, I'm new to Intel GPUs and I'm trying to do some OpenCL programming on Intel Processor Graphics.

1. Is there a wavefront concept on Intel GPUs? What is the proper work-group size?

On AMD GPUs, code is actually executed in groups of 64 threads (a wavefront); on Nvidia GPUs, this number is 32 (a warp). On Intel GPUs, is this number the number of EUs in one subslice multiplied by 7? I use the OpenCL template in Visual Studio, and the local work-group size is passed as NULL. I don't know whether this influences performance:

clEnqueueNDRangeKernel(ocl->commandQueue, ocl->kernel, 2, NULL, globalWorkSize, NULL, 0, NULL, NULL);

2. If I run a kernel many times, will the cache contain the data, just as in C programming?

For example, if I run the following code, then the first time clEnqueueNDRangeKernel starts the kernel, the data will be brought from memory into the cache. If I then run the kernel a second time with the same data, can it reuse the data already in the cache, so it doesn't need to fetch the data from memory again, just like usual C/C++ programming? Or, alternatively, will clEnqueueNDRangeKernel empty the cache and need to reload the data?

for (int i = 0; i < 100; i++) {
    clEnqueueNDRangeKernel(ocl->commandQueue, ocl->kernel, 2, NULL, globalWorkSize, NULL, 0, NULL, NULL);
}

__kernel void Add(__global uint* pA, __global uint* pB, __global uint* pC) {
    uint id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}

3. When using clCreateBuffer, what's the difference between the flags "CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR" and "CL_MEM_READ_ONLY"? I think they end up using the same memory, so should they have the same speed?

I have only found "The Compute Architecture of Intel® Processor Graphics Gen7.5 and Gen8.0". If there is a "GPU OpenCL programming guide", please let me know.



1. There is no concept of a wavefront on Intel GPUs. What we have is 7 hardware threads per EU, each capable of executing 8, 16, or 32 work items (typically 16, since kernels are mostly compiled SIMD16). There is no single "proper" work-group size: depending on how complex your kernel is, whether it uses local memory, and so on, the best work-group size will really depend on your kernel. Good work-group sizes to try are 64, 128, or 256 (but sometimes 16 and 32 are worth a try). The Code Builder tool that comes as part of Intel(R) INDE lets you run experiments to determine the optimal work-group size. If you leave it NULL, the runtime will select what it thinks the optimal size is, but you are better off experimenting and figuring out the right value yourself.

2. There is no guarantee that the cache will contain the data: you are better off not relying on that.

3. On Intel Integrated Graphics you should always use "CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR". In addition, you should make sure that your host pointer is aligned on a 4096-byte (page) boundary and that your buffer size is a multiple of 64 bytes (one cache line). Take a look at the following example and see how the buffers are created there. Check the references section of that article as well: you will find plenty of useful information. At the very minimum you should use _aligned_malloc when allocating memory and use CL_MEM_USE_HOST_PTR when creating buffers with that memory (this ensures that you are NOT copying memory to the GPU; the memory is shared between host and device):

uint* pArray = (uint*)_aligned_malloc(((arraySize * sizeof(uint)) / 64 + 1) * 64, 4096);  // pointer aligned to 4096 bytes, size rounded up to a multiple of 64

db = clCreateBuffer(contextHdl, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, ((arraySize * sizeof(uint)) / 64 + 1) * 64, pArray, &ciErrNum);  // same rounded size as the allocation


Hello Robert Ioffe,

I am using clGetEventProfilingInfo with CL_PROFILING_COMMAND_START/END to profile a program. Are the clocks on the CPU and GPU different? For example, when I profile the CPU queue, start = 491685923329796 and end = 491686168593611, but when I profile the GPU queue, start = 1101049110880 and end = 1101588889840. The raw values are not even in the same order of magnitude. Is that right?



The important thing is the difference between end and start, which is in nanoseconds; when you perform that calculation, you will see that the differences are of the same order of magnitude. See the following for the use of these counters:



I know the elapsed time is "end - start". I just wonder: if I launch the CPU and GPU at the same time, how can I guarantee that they start running at the same time, or close to it? If they used the same clock, this would be easy: I could check whether the start times are relatively close. But since the clocks are different, how can I tell? And if the CPU and GPU are on the same die, why are the clocks different?


The only thing you can do to guarantee that a kernel in a CPU queue and a kernel in a GPU queue start close together is to make them both depend on the same user event and then fire that event, or make one kernel depend on the other via an event. I would not rely on absolute start times, since the only reliable quantity is the difference between the start and end of the same kernel. (CPU and GPU device support is implemented by different teams.)