I'm new to OpenCL and am trying to run a "vector + vector" kernel. I get the correct result only when the vector length is a multiple of 64. For example, with the length set to 16 I get the output below.
No protocol specified
platform 1: vendor 'Intel(R) Corporation'
device 0: 'Intel(R) HD Graphics'
0 + 16 = 0
1 + 15 = 0
2 + 14 = 0
3 + 13 = 0
4 + 12 = 0
5 + 11 = 0
6 + 10 = 0
7 + 9 = 0
8 + 8 = 0
9 + 7 = 0
10 + 6 = 0
11 + 5 = 0
12 + 4 = 0
13 + 3 = 0
14 + 2 = 0
15 + 1 = 0
The code is from this website: http://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/
Environment:
- CentOS 7.1
- i7 4790
- OpenCL 1.2
- SDK: Intel SDK 2015 Production16.4.2.1 from Intel Media Server Studio Community version.
Dear Xin,
The code in question has a couple of defects:
1. It does not check whether the return code ret indicates success (CL_SUCCESS). If it did, your program would terminate at line 84, at the call to clEnqueueNDRangeKernel, because your global size (16) is not a multiple of your local size (64).
2. To correct the program, set local_item_size to a value that evenly divides the global size: for a length of 16, that is 16, 8, 4, 2 or 1. The program will then run correctly.
3. Alternatively, you could pass NULL instead of &local_item_size and let the runtime pick the local size for you.
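The fixes above could be sketched roughly as follows. This is a minimal illustration in plain C, not code from the tutorial: the helper names pick_local_size and round_up_global_size are hypothetical. It assumes OpenCL 1.2 rules, where the global size must be a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the largest local size that does not exceed
 * `preferred` and evenly divides `global_size`. The caller should also keep
 * `preferred` within the device limit (CL_DEVICE_MAX_WORK_GROUP_SIZE). */
size_t pick_local_size(size_t global_size, size_t preferred)
{
    assert(global_size > 0 && preferred > 0);
    size_t local = preferred < global_size ? preferred : global_size;
    while (global_size % local != 0) {
        --local; /* reaches 1 at the latest, and 1 divides everything */
    }
    return local;
}

/* Hypothetical alternative: keep the local size and pad the global size up
 * to the next multiple of it. The kernel must then skip the extra work-items,
 * e.g. with `if (get_global_id(0) < n)`, which needs the real length n passed
 * as an additional kernel argument (the tutorial kernel does not have one). */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

With a length of 16 and a preferred local size of 64, pick_local_size returns 16, which matches the fix in point 2. Whichever value is used, the return code of clEnqueueNDRangeKernel should still be compared against CL_SUCCESS; a mismatched pair of sizes is reported there as CL_INVALID_WORK_GROUP_SIZE.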
Anyway, the code in question will not perform very well on Intel(R) Processor Graphics. Please see the following article for a better example:
Dear Robert,
Thanks for your help, it performs correctly now.
BTW, I want to do some matrix computations using OpenCL. Do you know of a BLAS library that runs well on Intel(R) Processor Graphics? I have tried AMD's clBLAS, but the performance is quite bad.
Dear Xin,
We recently published a sample on how to do SGEMM on Intel Processor Graphics: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics. Unfortunately, we don't have a full-blown BLAS library optimized for it yet.
