Solved: Bug in device_selector class: CPU vs GPU

Viet-Duc · ‎06-10-2021

Hi,

I would like to report a bug in the device selector which leads to wrong numerical results.

The attached code was migrated from CUDA using DPCT. Take a look right below the start of the main function:

int main (int argc, char *argv[])
{
#ifdef DPCT
    dpct::device_ext &dev_ct1 = dpct::get_current_device();
    sycl::queue &q_ct1 = dev_ct1.default_queue();
#else
    sycl::queue q_ct1(sycl::default_selector{});
#endif

The former is inserted by DPCT, while the latter is the standard device selection in DPC++.

I've simply wrapped them in a macro.

The former gives consistent results on both CPU and iGPU.

$ dpcpp -DDPCT main.cpp 
$ SYCL_PI_TRACE=1 SYCL_DEVICE_FILTER=opencl:cpu ./a.out

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so
SYCL_PI_TRACE[all]: Selected device ->
SYCL_PI_TRACE[all]:   platform: Intel(R) OpenCL
SYCL_PI_TRACE[all]:   device: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
TBB Warning: The number of workers is currently limited to 11. The request for 31 workers is ignored. Further requests for more workers will be silently ignored until the limit changes.

<SYCL implementation>
Problem size: 512 x 512 
Workgroup size: 128 
Time = 0.000477, delt = 4.768372e-04, iter = 92522, res = 9.999751e-04

The default_selector, however, gives wrong results, but only on the CPU:

$ dpcpp main.cpp
$ SYCL_PI_TRACE=1 SYCL_DEVICE_FILTER=opencl:cpu ./a.out

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so
SYCL_PI_TRACE[all]: Selected device ->
SYCL_PI_TRACE[all]:   platform: Intel(R) OpenCL
SYCL_PI_TRACE[all]:   device: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz

<SYCL implementation>
Problem size: 512 x 512 
Workgroup size: 128 
Time = 0.000477, delt = 4.768372e-04, iter = 1, res = 0.000000e+00

Please also note that the TBB warning is absent in the latter case. This leads me to believe that there might be a problem with thread initialization. These results were obtained on Intel DevCloud.

This is an important issue for me. As useful as the dpct helper functions are, they break compatibility with the CUDA backend. I would appreciate it if you could provide some insight into this issue.

Thanks.

Alina_S_Intel · ‎06-17-2021

This is expected behavior. I have reduced your reproducer to a smaller one. Please note that there is an important difference between a sycl::queue and the default_queue() provided by DPCT. The DPCT default_queue() is an in-order queue, so it does not need the wait() call here, whereas a sycl::queue is out-of-order by default, so you need to call wait() on the submit() to make sure the kernel has completed before you call memcpy.

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define NUM 64
#define BLOCK_SIZE 4
#define Real double
#define xLength 1.0f

int main (int argc, char *argv[])
{
#ifdef DPCT
    dpct::device_ext &dev_ct1 = dpct::get_current_device();
    sycl::queue &q_ct1 = dev_ct1.default_queue();
#else
    sycl::queue q_ct1(sycl::default_selector{});
#endif

    int size = (NUM + 2) * (NUM + 2);

    Real* F;
    F = (Real *) calloc (size, sizeof(Real));
    for (int i = 0; i < size; ++i) {
        F[i] = 0.0f;
    }

    ///////////////////////////////////////////
    Real* F_d;
    F_d = sycl::malloc_device<double>(size, q_ct1);
    q_ct1.memcpy(F_d, F, size * sizeof(Real)).wait();

    //////////////////////////////////////
    q_ct1.submit([&](sycl::handler &cgh) {
        cgh.parallel_for(sycl::range<1>(size),
            [=](sycl::item<1> item_ct1) {
                F_d[item_ct1] = 1;
            });
    }).wait(); // wait is needed here

    q_ct1.memcpy(F, F_d, size * sizeof(Real)).wait();

    Real R_F = 0;
    for (int i = 0; i < size; ++i) {
        R_F += F[i];
    }
    printf("F Sum = %e\n", R_F);

    sycl::free(F_d, q_ct1);
    free(F);

    return 0;
}

For more information about in-order queues, please refer to https://software.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/optimization-and-programming-guide/extensions/queue-order-properties.html

We will no longer respond to this thread.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks,

VidyalathaB_Intel · ‎06-11-2021

Hi,

Thanks for reaching out to us.

Could you please try using cpu_selector as the device type instead of the default selector?

Meanwhile, we will look into this issue internally and get back to you soon.

Regards,

Vidya.

Viet-Duc · ‎06-15-2021

Hi Vidya,

I apologize for the somewhat late response. I've tried using cpu_selector{} directly as you suggested.

Unfortunately, the same issue was observed: the first time step finished immediately without performing any iterations.

Other approaches I have tried to no avail:

- Changing the thread settings, i.e. DPCPP_CPU_NUM_CUS, DPCPP_CPU_AFFINITY

- Forcing stringent precision via -fp-model=precise

- Running in a single thread, i.e. DPCPP_CPU_NUM_CUS=1

This issue is hard to pin down since it happens with some codes and not others.

For instance, the heartwall CUDA implementation from the Rodinia benchmark suite has a similar problem on the CPU device, although I was able to get the correct result with a specific number of threads and affinity setting.

Maybe it is a thread-safety issue?

Alina_S_Intel · ‎06-17-2021

Since this is not a bug but expected behavior, we will no longer respond to this thread.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Viet-Duc · ‎06-17-2021

I understand the issue now. Thanks for your time.