
Kernel “vector + vector” returns the right result only if the vector's length is a multiple of 64

Xin_Q_Intel
Employee

I'm new to OpenCL and I'm trying to run a “vector + vector” kernel. I get the right result only if the vector's length is a multiple of 64. For example, I get the output below when I set the length to 16:

No protocol specified
platform 1: vendor 'Intel(R) Corporation'
 device 0: 'Intel(R) HD Graphics'
0 + 16 = 0
1 + 15 = 0
2 + 14 = 0
3 + 13 = 0
4 + 12 = 0
5 + 11 = 0
6 + 10 = 0
7 + 9 = 0
8 + 8 = 0
9 + 7 = 0
10 + 6 = 0
11 + 5 = 0
12 + 4 = 0
13 + 3 = 0
14 + 2 = 0
15 + 1 = 0

 

The code comes from this website: http://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/

Environment:

  • CentOS 7.1
  • i7 4790
  • OpenCL 1.2
  • SDK: Intel SDK 2015 Production 16.4.2.1, from the Intel Media Server Studio Community version.
Robert_I_Intel
Employee

Dear Xin,

The code in question does not check its return codes, which hides the real problem. Details and two possible fixes:

1. It never checks whether the return code ret is CL_SUCCESS. If it did, your program would terminate at line 84 (the call to clEnqueueNDRangeKernel), because your global size (16) is smaller than your local size (64). OpenCL 1.2 requires the global size to be evenly divisible by the local size; otherwise the call fails with CL_INVALID_WORK_GROUP_SIZE. That is why only lengths that are multiples of 64 worked.

2. If you set local_item_size to 16, 8, 4, 2 or 1, the program will run correctly for a length of 16.

3. Alternatively, you can pass NULL (0) instead of &local_item_size and let the runtime pick the local size for you.

Anyway, the code in question will not perform very well on Intel(R) Processor Graphics. Please see the following article for a better example:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

 

Xin_Q_Intel
Employee

Dear Robert, 

Thanks for your help, it performs correctly now.

BTW, I want to do some matrix computations using OpenCL. Do you know of a BLAS library that runs well on our Intel(R) Processor Graphics? I have tried AMD's clBLAS, but the performance is quite bad.

Robert_I_Intel
Employee

Dear Xin,

We recently published a sample showing how to do SGEMM on Intel Processor Graphics: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics. Unfortunately, we don't yet have a full-blown BLAS library optimized for it.
