Hello sir, I am a beginner in Data Parallel C++ (DPC++) programming. I tried a simple matrix multiplication program using DPC++ and measured its execution time. It took around 0.2 s on my Linux system and around 0.075 s on the Intel DevCloud terminal, but the same program in C took only about 0.0015 s.
My question is: why does DPC++ take more time compared to C?
Hi,
Thanks for posting in Intel Communities.
When you use malloc_device, you need to use memcpy to copy the data from the host to the device (which is not done in your code).
Could you please provide the following details to investigate more on your issue?
- The C program you are comparing with the DPC++ program
- The steps to compile the C program
- The DPC++ version you are using
- The log file produced by running your program with the following command:
SYCL_PI_TRACE=1 ./a.out
Thanks & Regards,
Hemanth
Hello sir,
Here I am attaching the files you asked for, and I made the correction you suggested. The time improved a little; it now takes 0.14 s.
Thank you
Srikanth K
Hi,
We could not find the updated DPC++ code in the above zip file. Could you please provide the updated DPC++ code?
Thanks & Regards,
Hemanth
Hi,
We haven't heard back from you. Could you please provide the above-requested information?
Thanks & Regards,
Hemanth.
Hello sir, sorry for the late reply. Here I am attaching the required file; in this program I am using malloc_shared.
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
I see that your workload is very small; for such a small workload, it is generally not recommended to offload to the GPU via SYCL.
As you can imagine, offloading incurs a certain amount of overhead, including the data transfer between host and device, among other costs.
For a small workload, that overhead may be too large to be offset by the benefit of offloading.
I encourage you to increase your matrix size to see the impact.
For your particular case, can you try the following option? It removes the JIT compilation overhead:
-fsycl-targets=spir64_gen -Xs "-device ???". Here, "???" is your device type; it could be "gen11", "gen12LP", etc.
More information can be found in the section "Use AOT for Intel Graphics (Intel GPU)".
Using this option, I am able to achieve similar timing between the SYCL code and the native C implementation, although more detailed profiling data shows that the kernel time alone is much faster.
On an 11th Gen Intel(R) Core(TM) i7-1185G7, I get the following results.
Native C implementation, SYCL output with the option above, and more detailed profiling data: see the attached screenshots.
The following command may help you figure out which Intel Graphics processor you're using:
sycl-ls
Thanks,
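Putting the flags above into a full ahead-of-time (AOT) compile line might look like the following. The source filename matmul.cpp and the device name gen12LP are examples only; substitute your own file and the device reported by sycl-ls:

```shell
# AOT-compile the SYCL code for a specific Intel GPU to avoid JIT overhead.
# "gen12LP" is an example device; pick yours based on sycl-ls output.
dpcpp -fsycl-targets=spir64_gen -Xs "-device gen12LP" matmul.cpp -o matmul
./matmul
```

Because the kernel binary is generated at compile time, the first kernel launch no longer pays the JIT compilation cost, which dominates the runtime of very short programs like this one.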