Solved: kernel “vector + vector” returns the right result only if the vector's length is a multiple of 64

Xin_Q_Intel · ‎08-24-2015

I'm new to OpenCL, and I'm trying to run a “vector + vector” kernel. I get the right result only if the vector's length is a multiple of 64. For example, I get the output below when I set the length to 16.

No protocol specified
platform 1: vendor 'Intel(R) Corporation'
device 0: 'Intel(R) HD Graphics'
0 + 16 = 0
1 + 15 = 0
2 + 14 = 0
3 + 13 = 0
4 + 12 = 0
5 + 11 = 0
6 + 10 = 0
7 + 9 = 0
8 + 8 = 0
9 + 7 = 0
10 + 6 = 0
11 + 5 = 0
12 + 4 = 0
13 + 3 = 0
14 + 2 = 0
15 + 1 = 0

You can find the code on this website: http://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/

Environment:

CentOS 7.1
i7 4790
OpenCL 1.2
SDK: Intel SDK 2015 Production16.4.2.1 from Intel Media Server Studio Community version.

Robert_I_Intel · ‎08-25-2015

Dear Xin,

The code in question has a couple of defects:

1. It does not check whether the return code ret indicates success. If it did, your program would terminate at line 84 (the call to clEnqueueNDRangeKernel), because your global size (16) is smaller than your local size (64). OpenCL 1.2 requires the global size to be evenly divisible by the local size, which is why only multiples of 64 work.

2. If you correct the program by setting local_item_size to 16, 8, 4, 2 or 1, it will run correctly for a length of 16.

3. Alternatively, you could pass NULL (0) instead of &local_item_size and let the runtime pick the local size for you.
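To make the divisibility rule concrete, here is a minimal sketch in plain C with no OpenCL calls. The helper names sizes_are_valid, pick_local_size and round_up_global are hypothetical and are not part of the OpenCL API or of the code in question. They only illustrate the arithmetic behind points 1 and 2, assuming a one-dimensional range:

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL 1.2 rule for clEnqueueNDRangeKernel: when a local size is given,
 * the global size must be evenly divisible by it.
 * Hypothetical helper, returns 1 if the pair is acceptable, else 0. */
static int sizes_are_valid(size_t global, size_t local)
{
    return local > 0 && global % local == 0;
}

/* Hypothetical helper: largest local size not exceeding max_local
 * that divides global evenly. */
static size_t pick_local_size(size_t global, size_t max_local)
{
    assert(global > 0 && max_local > 0);
    size_t local = global < max_local ? global : max_local;
    while (global % local != 0)
        --local; /* stops at 1 at the latest, since 1 divides any size */
    return local;
}

/* Hypothetical helper: round global up to the next multiple of local. */
static size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    return (global + local - 1) / local * local;
}
```

For a length of 16 and a preferred local size of 64, pick_local_size(16, 64) gives 16, which matches the first value suggested in point 2. If you instead pad the global size with round_up_global(16, 64), the kernel would presumably need a bounds check on the work-item index so that the extra work-items do not touch memory past the end of the buffers.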

Anyway, the code in question will not perform very well on Intel(R) Processor Graphics. Please see the following article for a better example:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

Xin_Q_Intel · ‎08-25-2015

Dear Robert,

Thanks for your help, it performs correctly now.

BTW, I want to do some matrix computations using OpenCL. Do you know of a BLAS library that runs well on our Intel(R) Processor Graphics? I have tried AMD's clBLAS, but the performance is quite bad.

Robert_I_Intel · ‎08-26-2015

Dear Xin,

We recently published a sample on how to do SGEMM on Intel Processor Graphics: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics. Unfortunately, we don't have a full-blown BLAS library optimized for it yet.