Hello sir, I'm a beginner in Data Parallel C++ programming. I tried a simple matrix multiplication program using DPC++ and measured its run time. It took around 0.2 s on my Linux system and around 0.075 s in the Intel DevCloud terminal, but the same program in C took around 0.0015 s.
My question is: why does DPC++ take more time compared to C?
Thanks for posting in Intel Communities.
When you use malloc_device, the allocation lives in device memory, so you need an explicit memcpy to copy the data from the host to the device (which is not performed in your code).
Could you please provide the following details so that we can investigate your issue further?
- The C program you are comparing against the DPC++ version
- The steps used to compile the C program
- The DPC++ compiler version you are using
- The log file generated by using the below command:
Thanks & Regards,
I see that your workload is very small; for such a small workload, it is generally not recommended to offload to the GPU via SYCL.
As you can imagine, offloading incurs a certain amount of overhead, including data transfer between host and device, among other costs.
For a small workload, that overhead may be too large for the benefit of offloading to pay off.
I encourage you to increase your matrix size to see the impact.
For your particular case, can you try the following option? It removes the JIT (just-in-time) compilation overhead by compiling the kernel ahead of time:
-fsycl-targets=spir64_gen -Xs "-device ???". Here, "???" is your device type; it could be "gen11", "gen12LP", etc.
More information can be found in the section "Use AOT for Intel Graphics (Intel GPU)".
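For reference, a full compile line might look like the following; the compiler driver (icpx), the source file name, and the "gen12lp" device (an 11th Gen / Tiger Lake iGPU) are assumptions you would substitute for your own setup:

```
icpx -fsycl -fsycl-targets=spir64_gen -Xs "-device gen12lp" matmul.cpp -o matmul
```

With this, the device binary is generated at build time, so the first kernel launch no longer pays the JIT compilation cost.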
By using this command, I'm able to achieve similar timing between the SYCL code and the native C implementation,
although more detailed profiling data shows that the kernel time alone is much faster.
On a 11th Gen Intel(R) Core(TM) i7-1185G7, I get the following results.
Native C implementation:
With the option above, the SYCL code outputs:
And more detailed profiling data:
The following command and link may help you figure out which Intel Graphics processor you're using.