dot product kernel doesn't work on CPUs

Siegmar_G_ · ‎01-03-2017

Hi,

I'm new to OpenCL and I have implemented a program to compute the
dot product. The program works as expected if I use a GPU and it
returns a wrong result if I use a CPU with more than one work-item
in a work-group. I was able to find the reason for the problem
using only two work-items per work-group and one work-group
per NDrange. I have two work-items before and after the reduction
operation if I use a GPU and only one work-item after the
reduction operation if I use a CPU so that the partial sum of the
work-group will not be stored. The program uses libOpenCL.so.1 from
opencl-1.2-sdk-6.3.0.1904, opencl_runtime_16.1.1_x64_sles_6.4.0.25,
and the OpenCL driver from CUDA-8.0. Does somebody know why I have
only one work-item after the reduction operation? Is something
wrong with my kernel (most likely) or have I detected a problem with
the Intel OpenCL implementation for CPUs (very unlikely)?

loki introduction 230 gcc dot_prod_OpenCL_orig.c errorCodes.c -lOpenCL
loki introduction 231 a.out

Try to find first GPU on available platforms.
...
******** Using platform 1 ********
Use device Quadro K2200.

before reduction: local_id = 0
before reduction: local_id = 1
after reduction: local_id = 0
after reduction: local_id = 1
sum = 6.000000e+01

loki introduction 232 gcc dot_prod_OpenCL.c errorCodes.c -lOpenCL
loki introduction 233 a.out

Try to find first CPU on available platforms.
******** Using platform 0 ********
Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

before reduction: local_id = 0
before reduction: local_id = 1
after reduction: local_id = 1
sum = 2.265776e-316

loki introduction 234 strace a.out |& grep ocl
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so", O_RDONLY|O_CLOEXEC) = 5
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/__ocl_svml_l9.so", O_RDONLY|O_CLOEXEC) = 3
loki introduction 235

dot_prod_OpenCL.h
-----------------

#define   VECTOR_SIZE          10
#define WORK_ITEMS_PER_WORK_GROUP 2   /* power of two   required   */
#define WORK_GROUPS_PER_NDRANGE   1

dotProdKernel.cl
----------------

#if defined (cl_khr_fp64) || defined (cl_amd_fp64)
#include "dot_prod_OpenCL.h"

__kernel void dotProdKernel (__global const double * restrict a,
                   __global const double * restrict b,
                   __global double * restrict partial_sum)
{
    /* Use local memory to store each work-items running sum.       */
    __local double cache[WORK_ITEMS_PER_WORK_GROUP];

double temp = 0.0;
int cacheIdx = get_local_id (0);

    for (int tid = get_global_id (0);
   tid < VECTOR_SIZE;
   tid += get_global_size (0))
    {
      temp += a[tid] * b[tid];
    }
    cache[cacheIdx] = temp;

    /* Ensure that all work-items have completed, before you add up the
     * partial sums of each work-item to the sum of the work-group
     */
    barrier (CLK_LOCAL_MEM_FENCE);

    /* Each work-item will add two values and store the result back to
     * "cache". We need "log_2 (WORK_ITEMS_PER_WORK_GROUP)" steps to
     * reduce all partial values to one work-group value.
     * WORK_ITEMS_PER_WORK_GROUP must be a power of two for this
     * reduction.
     */
    printf ("before reduction: local_id = %u\n", get_local_id (0));
    for (int i = get_local_size (0) / 2; i > 0; i /= 2)
    {
      if (cacheIdx < i)
      {
   cache[cacheIdx] += cache[cacheIdx + i];
   barrier (CLK_LOCAL_MEM_FENCE);
      }
    }
    printf ("after reduction: local_id = %u\n", get_local_id (0));
    /* store the partial sum of this work-group               */
    if (cacheIdx == 0)
    {
      partial_sum[get_group_id (0)] = cache[0];
    }
}
#else
#error "Double precision floating point not supported."
#endif

Thank you very much for any help in advance.

Kind regards

Siegmar

Jeffrey_M_Intel1 · ‎01-03-2017

Reductions can be tricky to implement. For a simple starting point you could make each work item do more multiplies (for example, 4, 8, or 16 of them) then calculate a partial sum for each work item without a barrier.

Alternately, this implementation looks somewhat close to your algorithm: http://www.openclblog.com/2012/11/opencl-and-dot-product.html. ; (Though you may need to limit max local size to something smaller than CL_MAX_WORK_GROUP_SIZE to get this to work on the CPU. I Tried with 256 and results passed the correctness check at the end.)

As you're developing your own algorithm, the Code Builder tools can help with debugging and with checking the outputs.

For a peek ahead at "advanced" ways to more deeply optimize the dot product algorithm, take a look at https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics. ;

Siegmar_G_ · ‎01-04-2017

Hi Jeffrey,

thank you very much for your answer. Unfortunately it doesn't address my problem. My reduction works as expected if I use the OpenCL implementation from CUDA-8.0, because I don't lose work-items in the reduction phase. However, I lose work-item 0 in the reduction phase if I use the Intel implementation for my CPU so that the result of the work-group will not be stored in array "partial_sum". For my test case I use only 2 work-items in 1 work-group so that I'm far away from CL_MAX_WORK_GROUP_SIZE. I've added statements to print the values after the reduction.

int cacheIdx = get_local_id (0);

...

    printf ("after reduction: local_id = %u\n", get_local_id (0));
    for (int i = 0; i < WORK_ITEMS_PER_WORK_GROUP; ++i)
    {
      printf ("cache[%d] = %e\n", i, cache);
    }
    /* store the partial sum of this work-group               */
    if (cacheIdx == 0)
    {
      partial_sum[get_group_id (0)] = cache[0];
    }

loki introduction 141 icc dot_prod_OpenCL.c errorCodes.c -lOpenCL
loki introduction 142 a.out

Found 2 platform(s).
Try to find first CPU on available platforms.
******** Using platform 0 ********
Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

before reduction: local_id = 0
before reduction: local_id = 1
after reduction: local_id = 1
cache[0] = 6.000000e+01
cache[1] = 3.000000e+01
sum = 1.151084e-316
loki introduction 143

"cache[0]" holds the correct value, but "local_id 0" isn't available so that the value will not be stored in "partial_sum[get_group_id (0)]" and therefore gets lost. I'm working with SuSE Linux Enterprise Server and don't have Code Builder, because "mono" isn't available.

loki introduction 144 CodeBuilder
/usr/bin/CodeBuilder: line 9: exec: mono: not found
loki introduction 145

Do you have any suggestions how I can nevertheless find out why work-item 0 isn't available after the reduction operation? I've added my files so that you can possibly find out if you have the same problem on your machine.

Thank you very much for any further help in advance.

Kind regards

Siegmar

Ben_A_Intel · ‎01-04-2017

I think this section...

    for (int i = get_local_size (0) / 2; i > 0; i /= 2)
    {
      if (cacheIdx < i)
      {
        cache[cacheIdx] += cache[cacheIdx + i];
        barrier (CLK_LOCAL_MEM_FENCE);
      }
    }

... uses a barrier within divergent control flow, which is disallowed. Search the spec for: "If work_group_barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the work_group_barrier."

Do you get the same incorrect behavior if the barrier is outside of the if block?

    for (int i = get_local_size (0) / 2; i > 0; i /= 2)
    {
      if (cacheIdx < i)
      {
        cache[cacheIdx] += cache[cacheIdx + i];
      }
      barrier (CLK_LOCAL_MEM_FENCE);
    }

Siegmar_G_ · ‎01-04-2017

Hi Ben,

thank you very much for your answer. It solves my problem and I should have seen your solution myself.

Kind regards and thank you once more

Siegmar