Hello sir, I am a beginner in Data Parallel C++ (DPC++) programming. I tried a simple matrix multiplication program using DPC++ and measured its execution time. It took around 0.2 s on my Linux system and around 0.075 s on the Intel DevCloud terminal, but the same program in C took only about 0.0015 s.
My question is: why does DPC++ take more time compared to C?
Hi,
Thanks for posting in Intel Communities.
When you use malloc_device, you need to use memcpy to copy the data from the host to the device (which is not done in your code).
Could you please provide the following details to investigate more on your issue?
- The C program you are comparing with the DPC++ program
- The steps to compile the C program
- The DPC++ version you are using
- The log file produced by running your program with the following command:
SYCL_PI_TRACE=1 ./a.out
Thanks & Regards,
Hemanth
Hello sir,
Here I am attaching the files you asked for, and I made the correction you suggested. The time improved a little; it now takes 0.14 s.
Thank you
Srikanth K
Hi,
We could not find the updated DPC++ code in the above zip file. Could you please provide the updated DPC++ code?
Thanks & Regards,
Hemanth
Hi,
We haven't heard back from you. Could you please provide the above-requested information?
Thanks & Regards,
Hemanth.
Hello sir, sorry for the late reply. Here I am attaching the required file; in this program I am using malloc_shared.
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
I see that your workload is very small; for such a small workload, it is generally not recommended to offload to the GPU via SYCL.
As you can imagine, offloading incurs a certain amount of overhead, including the data transfer between host and device, among other costs.
For a small workload, that overhead may be too large to be offset by the benefit of offloading.
I encourage you to increase your matrix size to see the impact.
For your particular case, can you try the following option? It removes the JIT compilation overhead:
-fsycl-targets=spir64_gen -Xs "-device ???". Here, "???" is your device type; it could be "gen11", "gen12LP", etc.
More information can be found in the section "Use AOT for Intel Graphics (Intel GPU)".
Using this option, I am able to achieve similar timing between the SYCL code and the native C implementation, although more detailed profiling data shows that the kernel time alone is much faster.
On an 11th Gen Intel(R) Core(TM) i7-1185G7, I get the following results.
Native C implementation, SYCL output with the option above, and more detailed profiling data: see the attached screenshots.
The following command may help you figure out which Intel Graphics processor you're using:
sycl-ls
Thanks,
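Putting the flags above into a full ahead-of-time (AOT) compile line might look like the following. The source filename matmul.cpp and the device name gen12LP are examples only; substitute your own file and the device reported by sycl-ls:

```shell
# AOT-compile the SYCL code for a specific Intel GPU to avoid JIT overhead.
# "gen12LP" is an example device; pick yours based on sycl-ls output.
dpcpp -fsycl-targets=spir64_gen -Xs "-device gen12LP" matmul.cpp -o matmul
./matmul
```

Because the kernel binary is generated at compile time, the first kernel launch no longer pays the JIT compilation cost, which dominates the runtime of very short programs like this one.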