Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler, Intel® DPC++ Compatibility Tool, and GDB*

SYCL memset segfault when using "float3" data type

SteveP1
Beginner

For certain sizes, the memset fails, probably because the memory is not aligned as SYCL expects:
color_t* _d_p = (color_t*) sycl::malloc_device(size*sizeof(color_t), dpct::get_default_queue());
dpct::get_default_queue().memset(_d_p, 0, size*sizeof(color_t)).wait();

color_t is a struct of 3 floats with sizeof(color_t) = 12; I believe it is equivalent to sycl::mfloat3 (I get the same issue with that type).
For example, size=20000*20000 does not work, while size=36942*20000 works.
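(For scale, assuming sizeof(color_t) = 12 as above: the failing case is 20000*20000*12 bytes ≈ 4.8 GB, while the working case is 36942*20000*12 bytes ≈ 8.9 GB, so this does not look like a plain out-of-memory condition.)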

Here is the error I get:

#0  0x00007fffc55265b0 in __memset_sse2_unaligned_erms () from /lib64/libc.so.6
#1  0x00007fffb8851eb0 in set_zero ()
#2  0x0000000000000001 in ?? ()
#3  0x00007fffbd5a0e40 in Intel::OpenCL::DeviceBackend::Kernel::RunGroup(void const*, unsigned long const*, void*) const ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libOclCpuBackEnd.so.2022.15.12.0
#4  0x00007fffbd5a148a in non-virtual thunk to Intel::OpenCL::DeviceBackend::Kernel::RunGroup(void const*, unsigned long const*, void*) const ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libOclCpuBackEnd.so.2022.15.12.0
#5  0x00007fffc1c926d7 in Intel::OpenCL::CPUDevice::NDRange::ExecuteIteration(unsigned long, unsigned long, unsigned long, void*) ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libcpu_device.so.2022.15.12.0
#6  0x00007fffc1f1d6b9 in tbb::detail::d1::start_for<Intel::OpenCL::TaskExecutor::BlockedRangeByDefaultTBB1d<Intel::OpenCL::TaskExecutor::NoProportionalSplit>, TaskLoopBody1D<Intel::OpenCL::TaskExecutor::BlockedRangeByDefaultTBB1d<Intel::OpenCL::TaskExecutor::NoProportionalSplit> >, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libtask_executor.so.2022.15.12.0
#7  0x00007fffc6b424ee in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (this=0x7fffc171f280, t=0x7fff88187400, waiter=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:322
#8  0x00007fffc6b249c8 in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (this=0x7ffceb3b1b00, t=<optimized out>, waiter=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:458
#9  tbb::detail::r1::arena::process (this=0x7ffceb3b1b00, tls=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:137
#10 0x00007fffc6b240c6 in tbb::detail::r1::market::process (this=0x7ffceb3b1b00, j=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/market.cpp:599
#11 0x00007fffc6b2945c in tbb::detail::r1::rml::private_worker::run (this=0x7ffceb3b1b00)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:271
#12 0x00007fffc6b293e6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffceb3b1b00)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#13 0x00007fffc5d59a1a in start_thread () from /lib64/libpthread.so.0
#14 0x00007fffc557ed0f in clone () from /lib64/libc.so.6

The system memset (without SYCL) does not segfault.

I wanted to use the same float3 data type that I used in my CUDA code, but apparently SYCL stores sycl::float3 in 16 bytes instead of 12.
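
For reference, a minimal sketch (my own compile-time check; color_t here stands in for the plain struct of 3 floats described above) of the layouts involved:

#include <sycl/sycl.hpp>

struct color_t { float x, y, z; };  // plain struct of 3 floats, 12 bytes
static_assert(sizeof(color_t) == 12, "tightly packed");
static_assert(sizeof(sycl::mfloat3) == 12, "marray<float, 3> is tightly packed");
static_assert(sizeof(sycl::float3) == 16, "vec<float, 3> is padded to the size of a 4-element vector");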


Also, I need to use these arrays with MKL cblas functions, which expect float* as input. I have noticed weird behaviour when casting sycl::float3* to float* for those cblas calls; that is why I am now trying to roll back to a 12-byte data type. Do you have any suggestions on how to handle float3 (and also char3) correctly when using SYCL together with the MKL cblas functions? One idea I have is sketched below.
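
One workaround I am considering (a sketch only, not something I have validated; it assumes the pointers are host-accessible, e.g. allocated with malloc_shared, since the cblas entry points run on the host, and dot_float3 is a hypothetical helper name): dot each of the three components separately with a stride of 4 floats, so the padding lane of each sycl::float3 is skipped:

#include <mkl.h>  // cblas_sdot, MKL_INT

// Hypothetical helper: p and g each point at n sycl::float3 elements viewed as
// raw floats, i.e. 4 floats per element (x, y, z, padding).
float dot_float3(const float* p, const float* g, MKL_INT n) {
    float res = 0.f;
    for (int c = 0; c < 3; ++c)                    // x, y, z components
        res += cblas_sdot(n, p + c, 4, g + c, 4);  // incx = incy = 4 skips the padding lane
    return res;
}

With a 12-byte type (mfloat3 or a plain struct of 3 floats), the simple cast with N = 3*size would still work.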

11 Replies
NoorjahanSk_Intel
Moderator

Hi,

 

Thanks for posting in Intel communities.

 

Could you please provide us with sample reproducer code (DPC++, CUDA, oneMKL BLAS) and steps, if any, so that we can try it from our end?

Also, please provide OS details.

 

Thanks & Regards,

Noorjahan.

SteveP1
Beginner

To reproduce the memset issue, the following code segfaults on my system:

#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>
#include <cstdio>

int main(){
  unsigned int dimx = 17002;
  unsigned int dimy = 17002;
  unsigned int size = dimx*dimy;

  sycl::float3* _d_p = (sycl::float3*) sycl::malloc_device(size*sizeof(sycl::float3), dpct::get_default_queue());
  dpct::get_default_queue().memset(_d_p, 0, size*sizeof(sycl::float3)).wait();
  sycl::free(_d_p, dpct::get_default_queue());

  printf("done\n");
  return 0;
}
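
For reference, I build and run this with something like the following (assuming the oneAPI environment is sourced and the dpct helper headers are on the include path; exact flags may differ):

icpx -fsycl sycl_test.cpp -o sycl_test
./sycl_test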

In this case, if you replace float3 with mfloat3, it won't segfault.

If you use dimx=dimy=25000, it will segfault with mfloat3 but not with float3.

The error stack trace is the following:

Thread 141 "sycl_test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec33fc700 (LWP 5471)]
0x00007fffc57ba5b0 in __memset_sse2_unaligned_erms () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffc57ba5b0 in __memset_sse2_unaligned_erms () from /lib64/libc.so.6
#1  0x00007fffb8d4beb0 in set_zero ()
#2  0x0000000000000001 in ?? ()
#3  0x00007fffbda9ae40 in Intel::OpenCL::DeviceBackend::Kernel::RunGroup(void const*, unsigned long const*, void*) const ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libOclCpuBackEnd.so.2022.15.12.0
#4  0x00007fffbda9b48a in non-virtual thunk to Intel::OpenCL::DeviceBackend::Kernel::RunGroup(void const*, unsigned long const*, void*) const ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libOclCpuBackEnd.so.2022.15.12.0
#5  0x00007fffc1f266d7 in Intel::OpenCL::CPUDevice::NDRange::ExecuteIteration(unsigned long, unsigned long, unsigned long, void*) ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libcpu_device.so.2022.15.12.0
#6  0x00007fffc21b16b9 in tbb::detail::d1::start_for<Intel::OpenCL::TaskExecutor::BlockedRangeByDefaultTBB1d<Intel::OpenCL::TaskExecutor::NoProportionalSplit>, TaskLoopBody1D<Intel::OpenCL::TaskExecutor::BlockedRangeByDefaultTBB1d<Intel::OpenCL::TaskExecutor::NoProportionalSplit> >, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) ()
   from /home/sci/spetruzza/intel/oneapi/compiler/2023.0.0/linux/lib/x64/libtask_executor.so.2022.15.12.0
#7  0x00007fffc6b5045b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (this=0x7fffc1c17b80, t=0x7fffb025ba00, waiter=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:322
#8  0x00007fffc6b39ee9 in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (this=<optimized out>, t=<optimized out>, waiter=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:458
#9  tbb::detail::r1::arena::process (this=0x7ffc9b3b17b0, tls=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:137
#10 0x00007fffc6b39826 in tbb::detail::r1::market::process (this=0x7ffc9b3b17b0, j=...)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/market.cpp:599
#11 0x00007fffc6b3cb42 in tbb::detail::r1::rml::private_worker::run (this=0x7ffc9b3b17b0)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:271
#12 0x00007fffc6b3cad6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffc9b3b17b0)
    at /localdisk/ci/runner001/intel-innersource/001/_work/libraries.threading.infrastructure.onetbb-ci/libraries.threading.infrastructure.onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#13 0x00007fffc5feda1a in start_thread () from /lib64/libpthread.so.0
#14 0x00007fffc5812d0f in clone () from /lib64/libc.so.6

I am running on Linux openSUSE 15.3 (Linux version 5.3.18-59.34-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux))) with an Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz and more than 200 GB of RAM.

I will also try to reproduce the cblas issue and post some code, but maybe that is secondary at this point. Please let me know.

SteveP1
Beginner

Regarding the cblas issue, the following code reproduces the problem:

 

#include <iostream>
#include <cstdio>
#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>
#include "oneapi/mkl.hpp"
#include <mkl.h>  // cblas_sdot
#include "common.h"

int main(){

	unsigned int dimx = 20;
	unsigned int dimy = 20;
	unsigned int size = dimx*dimy;
		
	sycl::float3* _d_p = (sycl::float3*) sycl::malloc_device(size*sizeof(sycl::float3), dpct::get_default_queue());
	dpct::get_default_queue().memset(_d_p, 0, size*sizeof(sycl::float3)).wait();
	sycl::float3* _d_g = (sycl::float3*) sycl::malloc_device(size*sizeof(sycl::float3), dpct::get_default_queue());
	dpct::get_default_queue().memset(_d_g, 0, size*sizeof(sycl::float3)).wait();

	for(int i=0; i<size;i++){
		_d_p[i].x() = 1;
		_d_p[i].y() = 1;
		_d_p[i].z() = 1;
		_d_g[i].x() = 1;
		_d_g[i].y() = 1;
		_d_g[i].z() = 1;
	}	

	unsigned int N =size*3;

	float res = cblas_sdot(N, (float*)_d_p, 1., (float*)_d_g, 1.);
	printf("float3 result %f for %d elems\n", res, N);

	sycl::free(_d_p, dpct::get_default_queue());
	sycl::free(_d_g, dpct::get_default_queue());

	//mfloat3

	sycl::mfloat3* _d_p2 = (sycl::mfloat3*) sycl::malloc_device(size*sizeof(sycl::mfloat3), dpct::get_default_queue());
	dpct::get_default_queue().memset(_d_p2, 0, size*sizeof(sycl::mfloat3)).wait();
	sycl::mfloat3* _d_g2 = (sycl::mfloat3*) sycl::malloc_device(size*sizeof(sycl::mfloat3), dpct::get_default_queue());
	dpct::get_default_queue().memset(_d_p2, 0, size*sizeof(sycl::mfloat3)).wait();

	for(int i=0; i<size;i++){
		_d_p2[i][0] = 1;
		_d_p2[i][1] = 1;
		_d_p2[i][2] = 1;
		_d_g2[i][0] = 1;
		_d_g2[i][1] = 1;
		_d_g2[i][2] = 1;
	}	

	res = cblas_sdot(N, (float*)_d_p, 1., (float*)_d_g, 1.);
	printf("mfloat3 result %f for %d elems\n", res, N);

	sycl::free(_d_p2, dpct::get_default_queue());
	sycl::free(_d_g2, dpct::get_default_queue());

        return 0;
}

 

The output of this program is:

 

float3 result 900.000000 for 1200 elems
mfloat3 result 1200.000000 for 1200 elems

 

I believe the reason is that sizeof(sycl::float3) > sizeof(sycl::mfloat3): the MKL cblas functions only accept float* as input, so with float3 (16 bytes instead of 12) the results come out incorrect.

If I use N=size*4 with float3, the result of this test seems correct (the fourth, padding float of each element was zeroed by the memset, so it contributes nothing to the dot product), but I am not sure this is the solution I should adopt.

 

NoorjahanSk_Intel
Moderator

Hi,

 

Thanks for providing the details.

 

We did not observe any segmentation fault while running the memset code that you provided.

Please find the below screenshot for more details:

[screenshot: NoorjahanSk_Intel_0-1678102504663.png]

 

Regarding the cblas code, we are observing a segmentation fault at our end.

Please find the below screenshot for more details:

[screenshot: NoorjahanSk_Intel_1-1678102520657.png]

>>res = cblas_sdot(N, (float*)_d_p, 1., (float*)_d_g, 1.);

Could you please let us know why you have passed _d_p and _d_g as inputs?

Could you please provide the proper reproducer so that we can try it from our end?

 

Thanks & Regards,

Noorjahan.

 

 

SteveP1
Beginner

Hi,

 

Regarding the memset segfault, what should I attribute it to? Something wrong with the machine I am using (the OS, or an unsupported CPU)?

 

Regarding the cblas issue, there was a copy-paste mistake on my side: the second part should use _d_p2 and _d_g2, since the other two variables are deallocated when the first experiment ends.

If you use:

res = cblas_sdot(N, (float*)_d_p2, 1., (float*)_d_g2, 1.);

You should be able to reproduce what I am talking about.

 

Steve

SteveP1
Beginner

The same memset issue occurs with malloc_shared in place of malloc_device on this machine.
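
That is, the reproducer above still segfaults with the allocation changed to:

sycl::float3* _d_p = (sycl::float3*) sycl::malloc_shared(size*sizeof(sycl::float3), dpct::get_default_queue());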

NoorjahanSk_Intel
Moderator

Hi,


We have tried it on Ubuntu 20 with an Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz and did not observe any issues.


We have reported this issue to the concerned development team. They are looking into it.


Thanks & Regards,

Noorjahan.


NoorjahanSk_Intel
Moderator

Hi,


Thanks for your patience.


The memset issue you raised is fixed in the Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 release. We request that you try this version and let us know if you have any issues.

Regarding your cblas issue, we are working on it and will get back to you soon.


Thanks & Regards,

Noorjahan.


SteveP1
Beginner

Hi,

I can confirm that the memset issue seems to be resolved with the latest toolkit, 2023.1.

Regarding the cblas calls, I have been in touch with another developer at Intel, and it seems an issue was resolved in the automatic conversion of cuBLAS calls to oneMKL SYCL BLAS by the dpct tool.

For example, a simple sdot call would now be converted to something like this:

float *res_temp_ptr_ct4 = sycl::malloc_shared<float>(1, dpct::get_default_queue());

oneapi::mkl::blas::column_major::dot(*dpct::get_current_device().get_saved_queue(), N, (float *)_d_r, 1,(float *)_d_p, 1, res_temp_ptr_ct4).wait();

instead of the cblas_sdot that I was using, so I don't know yet whether the problem I had would be solved by this new conversion.
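
For completeness, I assume the converted code then reads the result back on the host along these lines (res_temp_ptr_ct4 is shared USM, so it is host-accessible after the wait()):

float res = *res_temp_ptr_ct4;
sycl::free(res_temp_ptr_ct4, dpct::get_default_queue());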

FYI, I am no longer using float3 in my code, so I would have to roll back to it to try this again.

 

Anyway, this automatic conversion only seems to work when pointing dpct at a CUDA 11.5 toolkit (--cuda-include-path=/path/to/cuda); it does not work with CUDA 11.8 or 12.1.

NoorjahanSk_Intel
Moderator

Hi,


>>I can confirm that the memset issue seems to be resolved with the latest toolkit, 2023.1.


Thanks for the confirmation.


>>it seems an issue was resolved in the automatic conversion of cuBLAS calls to oneMKL SYCL BLAS by the dpct tool


Could you please confirm whether your issue got resolved? If yes, can we go ahead and close this thread?


Thanks & Regards,

Noorjahan.


NoorjahanSk_Intel
Moderator

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


 Thanks & Regards,

Noorjahan.

