I tried the SYCL GPU sort code from this URL:
and built it under the oneAPI 2020.2 release, but the results show a large performance gap between the oneAPI DPC++ compiler and OpenCL 1.2.
Test hardware: i7 11700K, sorting a 512x512 array.
OpenCL 1.2 takes 4-5 ms to sort this array, but oneAPI SYCL takes 6-7 ms. That's almost 40% overhead...
I strongly suspect that the oneAPI DPC++ compiler has a performance issue, because different software stacks for the same GPU should NOT show such a big performance gap.
The source code is at the link above, and it is an official Intel sample.
Can anybody help explain why this happens, and how to make SYCL perform as well as OpenCL 1.2?
Thanks for reaching out to us.
>>build under oneapi 2020.2 release
Could you please try the latest version of the oneAPI (2021.3.0) DPC++ compiler and check whether the issue still persists?
Below is the link to download the latest version of the oneAPI Base Toolkit (the DPC++ compiler is included in the Base Toolkit):
Please state whether this is the first (or only) sort time, or the time for the second and later sorts. Note that the first run includes JIT compilation and resource allocation (including GPU memory allocation).
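One way to report the two numbers separately is a small timing harness like the sketch below. It is only an illustration under stated assumptions: `time_sort`, `SortTiming`, and `make_input` are hypothetical names, not part of the sample or of oneAPI, and `sort_fn` stands in for whichever OpenCL or SYCL sort is being measured. For a GPU sort, `sort_fn` must block until the device work has finished, otherwise only the submission time is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Timing result: the first (cold) run is reported separately from the warm runs.
struct SortTiming {
    double first_ms;     // first call; on a GPU this would include JIT and allocation
    double warm_avg_ms;  // average over all remaining calls
};

// Calls sort_fn `runs` times (runs >= 2), each time on a fresh copy of `input`.
// sort_fn is a placeholder for the sort under test; it must not return
// before the sort has actually completed.
template <typename SortFn>
SortTiming time_sort(const std::vector<int>& input, int runs, SortFn sort_fn) {
    assert(runs >= 2);
    using clock = std::chrono::steady_clock;
    SortTiming t{0.0, 0.0};
    for (int i = 0; i < runs; ++i) {
        std::vector<int> data = input;  // copy made outside the timed region
        const auto start = clock::now();
        sort_fn(data);
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (i == 0) {
            t.first_ms = ms;
        } else {
            t.warm_avg_ms += ms;
        }
    }
    t.warm_avg_ms /= (runs - 1);
    return t;
}

// Hypothetical helper: n pseudo-random integers, e.g. n = 512 * 512.
std::vector<int> make_input(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1 << 20);
    std::vector<int> v(n);
    for (auto& x : v) x = dist(gen);
    return v;
}
```

Calling `time_sort(make_input(512 * 512, 42u), 10, sort_fn)` once for the OpenCL path and once for the SYCL path, and posting both `first_ms` and `warm_avg_ms`, would show whether the gap comes from one-time setup or from every sort.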
I finally had some time to test the issue with the latest oneAPI 2021.3 toolkit.
The performance result is the same: DPC++ is still very slow compared to OpenCL 1.2.
After digging deeper into the issue, I think it is a memory copy issue:
1.) With OpenCL 1.2 on the Intel i915 driver, zero-copy between the CPU and the iGPU is implemented, so there is no performance penalty.
2.) In the DPC++ stack, the SYCL buffer syntax somehow does not trigger a zero-copy buffer. Even though the OpenCL and SYCL code is almost equivalent in functionality, the DPC++/SYCL stack suffers from memory movement between the CPU and the GPU. I don't know the real reason yet, since I lack deeper knowledge of the stack.
3.) If I switch from SYCL buffers to DPC++ USM, performance is much better, but it still can't match the OpenCL 1.2 i915 stack.
Please help with this issue, because it is critical for the oneAPI stack: if there is a huge performance gap between OpenCL 1.2 and oneAPI, developers lose the motivation to migrate to this new API stack, and sorting is important for almost everything.
BTW, the reason I am looking for a oneAPI-based GPU sorting solution here is simply that none is available in oneAPI. There is no CUDA Thrust-like framework for oneAPI yet, a CUB migration is still a dream, and TBB doesn't support sorting on the GPU.
I would really appreciate it if anybody could shed some light on how to sort with oneAPI on the GPU (maybe there is already a decent solution somewhere).
Could you please provide us with a sample reproducer for both the OpenCL and SYCL (USM and buffer models) versions, along with the steps you followed to obtain the results, so that we can work on it from our end?
Also, please provide the following details:
2. Hardware details
3. Are you using the OpenCL runtime or Level Zero as the backend?
You can also use the sorting algorithms from oneDPL. Please refer to the link below for more details.