- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi,

I'm new to OpenCL and I have implemented a program to compute the

dot product. The program works as expected if I use a GPU and it

returns a wrong result if I use a CPU with more than one work-item

in a work-group. I was able to find the reason for the problem

using only two work-items per work-group and one work-group

per NDrange. I have two work-items before and after the reduction

operation if I use a GPU and only one work-item after the

reduction operation if I use a CPU so that the partial sum of the

work-group will not be stored. The program uses libOpenCL.so.1 from

opencl-1.2-sdk-6.3.0.1904, opencl_runtime_16.1.1_x64_sles_6.4.0.25,

and the OpenCL driver from CUDA-8.0. Does somebody know why I have

only one work-item after the reduction operation? Is something

wrong with my kernel (most likely) or have I detected a problem with

the Intel OpenCL implementation for CPUs (very unlikely)?

loki introduction 230 gcc dot_prod_OpenCL_orig.c errorCodes.c -lOpenCL

loki introduction 231 a.out

Try to find first GPU on available platforms.

...

******** Using platform 1 ********

Use device Quadro K2200.

before reduction: local_id = 0

before reduction: local_id = 1

after reduction: local_id = 0

after reduction: local_id = 1

sum = 6.000000e+01

loki introduction 232 gcc dot_prod_OpenCL.c errorCodes.c -lOpenCL

loki introduction 233 a.out

Try to find first CPU on available platforms.

******** Using platform 0 ********

Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

before reduction: local_id = 0

before reduction: local_id = 1

after reduction: local_id = 1

sum = 2.265776e-316

loki introduction 234 strace a.out |& grep ocl

open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so", O_RDONLY|O_CLOEXEC) = 5

open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/__ocl_svml_l9.so", O_RDONLY|O_CLOEXEC) = 3

loki introduction 235

dot_prod_OpenCL.h

-----------------

#define VECTOR_SIZE 10

#define WORK_ITEMS_PER_WORK_GROUP 2 /* power of two required */

#define WORK_GROUPS_PER_NDRANGE 1

dotProdKernel.cl

----------------

#if defined (cl_khr_fp64) || defined (cl_amd_fp64)

#include "dot_prod_OpenCL.h"

__kernel void dotProdKernel (__global const double * restrict a,

__global const double * restrict b,

__global double * restrict partial_sum)

{

/* Use local memory to store each work-items running sum. */

__local double cache[WORK_ITEMS_PER_WORK_GROUP];

double temp = 0.0;

int cacheIdx = get_local_id (0);

for (int tid = get_global_id (0);

tid < VECTOR_SIZE;

tid += get_global_size (0))

{

temp += a[tid] * b[tid];

}

cache[cacheIdx] = temp;

/* Ensure that all work-items have completed, before you add up the

* partial sums of each work-item to the sum of the work-group

*/

barrier (CLK_LOCAL_MEM_FENCE);

/* Each work-item will add two values and store the result back to

* "cache". We need "log_2 (WORK_ITEMS_PER_WORK_GROUP)" steps to

* reduce all partial values to one work-group value.

* WORK_ITEMS_PER_WORK_GROUP must be a power of two for this

* reduction.

*/

printf ("before reduction: local_id = %u\n", get_local_id (0));

for (int i = get_local_size (0) / 2; i > 0; i /= 2)

{

if (cacheIdx < i)

{

cache[cacheIdx] += cache[cacheIdx + i];

barrier (CLK_LOCAL_MEM_FENCE);

}

}

printf ("after reduction: local_id = %u\n", get_local_id (0));

/* store the partial sum of this work-group */

if (cacheIdx == 0)

{

partial_sum[get_group_id (0)] = cache[0];

}

}

#else

#error "Double precision floating point not supported."

#endif

Thank you very much for any help in advance.

Kind regards

Siegmar

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Reductions can be tricky to implement. For a simple starting point you could make each work item do more multiplies (for example, 4, 8, or 16 of them) then calculate a partial sum for each work item without a barrier.

Alternately, this implementation looks somewhat close to your algorithm: http://www.openclblog.com/2012/11/opencl-and-dot-product.html. ; (Though you may need to limit max local size to something smaller than CL_MAX_WORK_GROUP_SIZE to get this to work on the CPU. I Tried with 256 and results passed the correctness check at the end.)

As you're developing your own algorithm, the Code Builder tools can help with debugging and with checking the outputs.

For a peek ahead at "advanced" ways to more deeply optimize the dot product algorithm, take a look at https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics. ;

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Jeffrey,

thank you very much for your answer. Unfortunately it doesn't address my problem. My reduction works as expected if I use the OpenCL implementation from CUDA-8.0, because I don't lose work-items in the reduction phase. However, I lose work-item 0 in the reduction phase if I use the Intel implementation for my CPU so that the result of the work-group will not be stored in array "partial_sum". For my test case I use only 2 work-items in 1 work-group so that I'm far away from CL_MAX_WORK_GROUP_SIZE. I've added statements to print the values after the reduction.

int cacheIdx = get_local_id (0);

...

printf ("after reduction: local_id = %u\n", get_local_id (0));

for (int i = 0; i < WORK_ITEMS_PER_WORK_GROUP; ++i)

{

printf ("cache[%d] = %e\n", i, cache*);
}
/* store the partial sum of this work-group */
if (cacheIdx == 0)
{
partial_sum[get_group_id (0)] = cache[0];
}*

loki introduction 141 icc dot_prod_OpenCL.c errorCodes.c -lOpenCL

loki introduction 142 a.out

Found 2 platform(s).

Try to find first CPU on available platforms.

******** Using platform 0 ********

Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

before reduction: local_id = 0

before reduction: local_id = 1

after reduction: local_id = 1

cache[0] = 6.000000e+01

cache[1] = 3.000000e+01

sum = 1.151084e-316

loki introduction 143

"cache[0]" holds the correct value, but "local_id 0" isn't available so that the value will not be stored in "partial_sum[get_group_id (0)]" and therefore gets lost. I'm working with SuSE Linux Enterprise Server and don't have Code Builder, because "mono" isn't available.

loki introduction 144 CodeBuilder

/usr/bin/CodeBuilder: line 9: exec: mono: not found

loki introduction 145

Do you have any suggestions how I can nevertheless find out why work-item 0 isn't available after the reduction operation? I've added my files so that you can possibly find out if you have the same problem on your machine.

Thank you very much for any further help in advance.

Kind regards

Siegmar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I think this section...

for (int i = get_local_size (0) / 2; i > 0; i /= 2) { if (cacheIdx < i) { cache[cacheIdx] += cache[cacheIdx + i]; barrier (CLK_LOCAL_MEM_FENCE); } }

... uses a barrier within divergent control flow, which is disallowed. Search the spec for: "If work_group_barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the work_group_barrier."

Do you get the same incorrect behavior if the barrier is outside of the if block?

for (int i = get_local_size (0) / 2; i > 0; i /= 2) { if (cacheIdx < i) { cache[cacheIdx] += cache[cacheIdx + i]; } barrier (CLK_LOCAL_MEM_FENCE); }

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Ben,

thank you very much for your answer. It solves my problem and I should have seen your solution myself.

Kind regards and thank you once more

Siegmar

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page