Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler, Intel® DPC++ Compatibility Tool, and GDB*

DPC++ program takes more time compared to C?

Srikanth1911
Beginner
974 Views

Hello sir, I'm a beginner in Data Parallel C++ programming. I tried a simple matrix multiplication program using DPC++ and measured its execution time. It took around 0.2 s on my Linux system and around 0.075 s in the Intel DevCloud terminal, but the same program in C took around 0.0015 s.

My question is: why does DPC++ take more time compared to C?

9 Replies
Srikanth1911
Beginner
967 Views

Here I'm attaching my files.

HemanthCH_Intel
Moderator
923 Views

Hi,

 

Thanks for posting in Intel Communities.

 

When you use malloc_device, you need to use memcpy to copy the data from the host to the device (this step is not performed in your code).
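
For reference, here is a minimal sketch of that pattern, where N, a_host, b_host, and c_host are placeholders for your own matrix size and host arrays:

#include <sycl/sycl.hpp>

// Allocate device memory for the matrices.
sycl::queue q;
float *a_dev = sycl::malloc_device<float>(N * N, q);
float *b_dev = sycl::malloc_device<float>(N * N, q);
float *c_dev = sycl::malloc_device<float>(N * N, q);

// Copy the inputs from host to device (the missing step).
q.memcpy(a_dev, a_host, N * N * sizeof(float));
q.memcpy(b_dev, b_host, N * N * sizeof(float));
q.wait();

// ... launch the matrix multiplication kernel on a_dev, b_dev, c_dev ...

// Copy the result back to the host and release the device memory.
q.memcpy(c_host, c_dev, N * N * sizeof(float)).wait();
sycl::free(a_dev, q);
sycl::free(b_dev, q);
sycl::free(c_dev, q);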

 

Could you please provide the following details so we can investigate your issue further?

  1. The C program which you are comparing with the DPC++ one.
  2. The steps to compile the C program.
  3. The DPC++ version you are using.
  4. The log file produced by the below command:
SYCL_PI_TRACE=1 ./a.out

 

Thanks & Regards,

Hemanth

 

Srikanth1911
Beginner
901 Views

Hello sir,

Here I'm attaching the required files you asked for. I made the correction you suggested, and it improved a little: the program now takes 0.14 s.

 

Thank you 

Srikanth K

 

HemanthCH_Intel
Moderator
863 Views

Hi,


In the above zip file, we can't find the updated DPC++ code. Could you please provide the updated DPC++ code?


Thanks & Regards,

Hemanth


HemanthCH_Intel
Moderator
842 Views

Hi,


We haven't heard back from you. Could you please provide the above-requested information?


Thanks & Regards,

Hemanth.


Srikanth1911
Beginner
827 Views

Hello sir, sorry for the late reply. Here I'm attaching the required file; in this program I'm using malloc_shared.
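
For reference, the pattern I'm using now is roughly the following (N stands in for my matrix size; with shared USM, the runtime migrates memory between host and device automatically, so no explicit memcpy is needed):

#include <sycl/sycl.hpp>

sycl::queue q;
float *a = sycl::malloc_shared<float>(N * N, q);
float *b = sycl::malloc_shared<float>(N * N, q);
float *c = sycl::malloc_shared<float>(N * N, q);

// ... initialize a and b on the host ...

// Naive matrix multiplication: one work-item per output element.
q.parallel_for(sycl::range<2>(N, N), [=](sycl::id<2> idx) {
    size_t i = idx[0], j = idx[1];
    float sum = 0.0f;
    for (size_t k = 0; k < N; ++k)
        sum += a[i * N + k] * b[k * N + j];
    c[i * N + j] = sum;
}).wait();

sycl::free(a, q);
sycl::free(b, q);
sycl::free(c, q);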

HemanthCH_Intel
Moderator
779 Views

Hi,


We are working on this internally and will get back to you soon.


Thanks & Regards,

Hemanth


yzh_intel
Employee
725 Views

Hi,


I see that your workload is very small; for such a small workload, it's generally not recommended to offload to the GPU via SYCL.

As you can imagine, offloading incurs a certain amount of overhead; this includes data transfer between the host and the device, among other things.

For a small workload, that overhead may be too large to be offset by the benefit of offloading.

I encourage you to increase your matrix size to see the impact; a rough sketch of such an experiment follows.
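
In this sketch, matmul() is a hypothetical helper that stands in for your SYCL matrix multiplication of an n x n matrix on queue q; the loop simply times it over growing sizes:

#include <chrono>
#include <iostream>

// Time the multiplication for doubling matrix sizes.
for (size_t n = 64; n <= 4096; n *= 2) {
    auto t0 = std::chrono::steady_clock::now();
    matmul(q, n);  // hypothetical helper: runs your kernel and waits
    auto t1 = std::chrono::steady_clock::now();
    std::cout << n << ": "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}

As n grows, the fixed offload overhead becomes a smaller fraction of the total time.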


For your particular case, can you try the following option? This option removes the JIT (just-in-time) compilation overhead by compiling ahead of time:

-fsycl-targets=spir64_gen -Xs "-device ???". Here, "???" is your device type; it could be "gen11", "gen12LP", etc.

More information can be found in the "Use AOT for Intel Graphics (Intel GPU)" section of the ahead-of-time compilation documentation:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html
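
For example, assuming your source file is named matmul.cpp and your GPU is Gen12LP (both placeholders to adjust for your system), the compile line would look like:

dpcpp -fsycl-targets=spir64_gen -Xs "-device gen12LP" matmul.cpp -o matmul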


By using this command, I'm able to achieve similar timing between the SYCL code and the native C implementation, although more detailed profiling data shows that the kernel time alone is much faster.

On an 11th Gen Intel(R) Core(TM) i7-1185G7, I get the following results.

Native C implementation: [timing output attached as a screenshot]

With the option above, the SYCL code outputs: [timing output attached as a screenshot]

And more detailed profiling data: [profiling output attached as a screenshot]
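
If you want to reproduce the kernel-only timing yourself, here is a minimal sketch using SYCL event profiling; the trivial kernel is just a placeholder for your multiplication, and the queue must be created with the profiling property:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Profiling must be enabled when the queue is created.
    sycl::queue q{sycl::gpu_selector_v,
                  sycl::property::queue::enable_profiling{}};

    const size_t N = 1024;
    float *c = sycl::malloc_shared<float>(N * N, q);

    // Placeholder kernel; substitute your matrix multiplication.
    sycl::event e = q.parallel_for(sycl::range<2>(N, N),
                                   [=](sycl::id<2> idx) {
        c[idx[0] * N + idx[1]] = 0.0f;
    });
    e.wait();

    // Device timestamps, in nanoseconds.
    auto start = e.get_profiling_info<
        sycl::info::event_profiling::command_start>();
    auto end = e.get_profiling_info<
        sycl::info::event_profiling::command_end>();
    std::cout << "Kernel time: " << (end - start) * 1e-9 << " s\n";

    sycl::free(c, q);
}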


The following command and link may help you figure out which Intel Graphics processor you're using.

command: sycl-ls

link: https://www.intel.com/content/www/us/en/developer/articles/guide/intel-graphics-developers-guides.html


Thanks,


yzh_intel
Employee
686 Views

Hi @Srikanth1911, have your questions been answered?

