Hi. In my experiments, oneAPI SYCL kernels always run slower on the CPU device than equivalent ISPC code.
For example, this code:
```
const size_t resolution = 2048;
size_t vector_size = resolution * resolution;

void oneapi_slopemap(queue& q, buffer<float, 1>& b_data, buffer<float, 1>& b_result)
{
    float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    range<2> num_items2D { resolution, resolution };
    q.submit([&](handler& h) {
        auto data = b_data.get_access<access::mode::read>(h);
        auto result = b_result.get_access<access::mode::write>(h);
        h.parallel_for(num_items2D, [=](auto i)
        {
            int x = i[1];
            int y = i[0];
            int x0 = sycl::max<int>(x - 1, 0);
            int x1 = sycl::min<int>(x + 1, resolution - 1);
            int y0 = sycl::max<int>(y - 1, 0);
            int y1 = sycl::min<int>(y + 1, resolution - 1);
            float h7 = data[y0 * resolution + x0];
            float h8 = data[y0 * resolution + x];
            float h9 = data[y0 * resolution + x1];
            float h4 = data[y * resolution + x0];
            float h6 = data[y * resolution + x1];
            float h1 = data[y1 * resolution + x0];
            float h2 = data[y1 * resolution + x];
            float h3 = data[y1 * resolution + x1];
            float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
            float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
            float slope = slope_coeff * sycl::sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
            slope = sycl::atan(slope) * 57.2957795130823f;
            result[y * resolution + x] = slope;
        });
    });
    q.wait();
}
```
runs 1.9 times slower than this ISPC code:
```
export void slopemap(uniform float data[], uniform float result[], uniform int resolution, uniform int y)
{
    uniform float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    uniform int y0 = max(y - 1, 0);
    uniform int y1 = min(y + 1, resolution - 1);
    foreach (x = 0 ... resolution)
    {
        int x0 = max(x - 1, 0);
        int x1 = min(x + 1, resolution - 1);
        float h7 = data[y0 * resolution + x0];
        float h8 = data[y0 * resolution + x];
        float h9 = data[y0 * resolution + x1];
        float h4 = data[y * resolution + x0];
        float h6 = data[y * resolution + x1];
        float h1 = data[y1 * resolution + x0];
        float h2 = data[y1 * resolution + x];
        float h3 = data[y1 * resolution + x1];
        float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
        float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
        float slope = slope_coeff * sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
        slope = atan(slope) * 57.2957795130823f;
        result[y * resolution + x] = slope;
    }
}
```
which is called this way:
```
float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
concurrency::parallel_for(0, resolution, [&](auto y)
{
    ispc::slopemap(data.data(), result.data(), resolution, y);
});
```
I use the Windows version of oneAPI, integrated with Visual Studio 2022. This is the compilation command:
```
/O2 /fsycl-targets=spir64_x86_64 /fsycl-early-optimizations /Zi /D "NDEBUG" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /WX- /MD /std:c++20 /EHsc /W3 /nologo /Fo"x64\Release\"
```
And this is the ISPC compilation command:
```
ispc -O2 "%(Filename).ispc" -o "$(IntDir)%(Filename).obj" -h "$(ProjectDir)%(Filename).h" --target=sse4,avx2,avx512skx-x8 --opt=fast-math
```
Can the performance of the CPU backend be improved so that it comes closer to ISPC performance? Thank you.
Hi,
Can you provide us with the complete code, including the main function and how you measure the performance?
Thanks
Sure!
I've put all the SYCL code and benchmark code into a single file. I've also included the VS project and solution in case that helps.
ISPC compile command:
```
ispc -O2 ispc_slopemap.ispc -o ispc_slopemap.obj -h ispc_slopemap.h --target=avx2 --opt=fast-math
```
(I usually compile ISPC with the Custom Build Tool in Visual Studio for multiple targets such as sse42, avx2 and avx512. When I switched to the DPC++ compiler, however, I got link errors for some reason (there are no such errors with the Intel C++ compiler) until I kept only a single target and added the .obj file to the project manually. If your CPU supports the AVX-512 instruction set, you can change the compile command accordingly.)
Thanks for providing the details. We will test it on our side.
Thanks.
Just a thought on this topic. You compile your SYCL code for a 64-bit target, so I assume the kernel uses 64-bit arithmetic for pointers and indices. ISPC uses 32-bit addressing by default. Since your kernel performs a lot of reads, this might explain the difference in performance.
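To make the index-width point concrete, here is a minimal plain C++ sketch (no SYCL required). It only shows where 64-bit arithmetic enters the posted kernel: `resolution` is a `size_t`, so `y * resolution + x` is promoted to `size_t` even though `x` and `y` are `int`. The helper names `index_as_posted` and `index_32bit` are hypothetical and not from the thread. Whether keeping the index in 32 bits changes the code generated by the SYCL CPU backend has not been verified here, and the accessor subscript may widen the index again.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>

// Same declaration as in the original post: size_t is 64 bits wide on x64.
constexpr std::size_t resolution = 2048;

// Hypothetical helper: index computed exactly as the posted kernel does.
// Mixing int with size_t promotes the whole expression to size_t.
inline auto index_as_posted(int x, int y) { return y * resolution + x; }

// Hypothetical helper: the same index kept in 32-bit arithmetic.
// Only valid while resolution * resolution fits in an int.
inline auto index_32bit(int x, int y) {
    static_assert(resolution * resolution <=
                  static_cast<std::size_t>(std::numeric_limits<int>::max()),
                  "index would overflow a 32-bit int");
    assert(x >= 0 && y >= 0);
    constexpr int res = static_cast<int>(resolution);
    return y * res + x;
}

// The two helpers return different types for the same inputs.
static_assert(std::is_same_v<decltype(index_as_posted(0, 0)), std::size_t>);
static_assert(std::is_same_v<decltype(index_32bit(0, 0)), int>);
```

Both helpers return the same numeric value for valid coordinates; only the width of the intermediate arithmetic differs.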
Thank you very much, Spooner.
To check your assumption, I compiled the ISPC file with the `--addressing=64` option, and the numbers did change.
These are the benchmark results:
```
Method             Min    Med    Avg
ISPC 32-bit addr   1.109  1.282  1.297
ISPC 64-bit addr   1.622  1.826  1.840
oneAPI@OpenCL      2.334  2.678  2.690
```
Even so, OpenCL is still noticeably slower than ISPC.
Is there a way to make OpenCL use 32-bit addressing?
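The benchmark code itself was shared only as an attachment, so the following is a sketch of how Min/Med/Avg columns like the ones above could be produced, not the harness actually used. The names `TimingStats`, `summarize` and `benchmark` are hypothetical, and milliseconds are assumed as the unit.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical result type: minimum, median and mean of the samples, in ms.
struct TimingStats { double min_ms, med_ms, avg_ms; };

// Reduce a list of per-run timings to min / median / average.
inline TimingStats summarize(std::vector<double> samples_ms) {
    assert(!samples_ms.empty());
    std::sort(samples_ms.begin(), samples_ms.end());
    const std::size_t n = samples_ms.size();
    // For an even sample count, take the mean of the two middle values.
    const double med = (n % 2 == 1)
        ? samples_ms[n / 2]
        : 0.5 * (samples_ms[n / 2 - 1] + samples_ms[n / 2]);
    const double avg =
        std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0) /
        static_cast<double>(n);
    return {samples_ms.front(), med, avg};
}

// Time `run` a fixed number of times with a monotonic clock.
template <class F>
TimingStats benchmark(F&& run, int repetitions) {
    assert(repetitions > 0);
    std::vector<double> samples_ms;
    samples_ms.reserve(static_cast<std::size_t>(repetitions));
    for (int i = 0; i < repetitions; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        run();
        const auto t1 = std::chrono::steady_clock::now();
        samples_ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return summarize(std::move(samples_ms));
}
```

Reporting the minimum next to the median and average helps separate the kernel's best-case cost from run-to-run noise, which matters when the compared times differ by only a few tenths of a millisecond.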
Hi,
Can you provide the ISPC code and its compilation command separately? I can't get it to compile with the code and command you provided.
Thanks.
Sure!
This is the command line to compile ispc_normalmap.ispc:
```
ispc ispc_normalmap.ispc -o ispc_normalmap.obj -h ispc_normalmap.h -O2 --opt=fast-math
```
(This assumes that the directory containing ispc.exe is in your PATH variable, if you use Windows.)
The command generates a header file and an .obj file, which you need to add to your Visual Studio project manually (since DPC++ doesn't support the Custom Build Tool for some reason, unlike the Intel C++ Compiler), for example by dragging and dropping the file onto your project name in Solution Explorer.
ispc_slopemap(ispc_data.data(), sycl_result.data());
ispc_slopemap(ispc_data.data(), ispc_result.data());
Thank you for the details. We can build the ISPC code now and are investigating the issue. We will let you know if there is any update.
Thanks.
Hi,
I'd like to let you know that we will improve the performance of SYCL kernels in future compiler releases.
In the meantime, we have a workaround for you: modify the test code to avoid using "max/min" on the x indices by padding the "data" buffer with 2 extra elements in dimension x. The code is attached as sycl_performance_padding.cpp.
I tested the code on my side; the performance is improved and is better than ISPC.
```
Method            Min    Med    Avg
ISPC              1.539  1.728  2.560
oneAPI@OpenCL     2.409  8.910  8.501
syclwithpadding   1.180  1.531  1.797
```
Can you test the code on your side?
Thanks,
Chen
Hello,
Thank you for the update and for all your efforts. It is exciting news that the performance of SYCL kernels will be improved!
Yes, padding helps, of course. It makes the code much easier to vectorize, because the result is a plain stencil pattern that can be vectorized without gather instructions.
However, padding the data isn't always possible. There are also many other algorithms where gather and scatter can't be eliminated (and where ISPC is still ~70% faster than the OpenCL CPU backend of oneAPI).
I also wanted to ask: is your code correct? Shouldn't `resolution` be `resolution + 2` in the following lines within the kernel? Our data rows are now 2 elements wider than `resolution`.
```
float h7 = data[y0 * resolution + x0];
float h8 = data[y0 * resolution + x];
float h9 = data[y0 * resolution + x1];
float h4 = data[y * resolution + x0];
float h6 = data[y * resolution + x1];
float h1 = data[y1 * resolution + x0];
float h2 = data[y1 * resolution + x];
float h3 = data[y1 * resolution + x1];
```
Nonetheless, here are results from my computer:
```
Method Min Med Avg
ISPC 1.273 1.528 1.567
oneAPI@OpenCL 2.331 2.707 2.737
ISPC [with padding] 0.911 1.167 1.232
oneAPI@OpenCL [with padding] 2.072 2.518 2.586
```
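For reference, the padded indexing discussed above can be sketched in plain C++ (no SYCL or ISPC required). This is an illustration under stated assumptions, not the attached sycl_performance_padding.cpp: the helpers `slope_deg`, `pad_rows`, `slopemap_clamped` and `slopemap_padded` are hypothetical names, and the border is assumed to replicate the edge value so that the padded version matches the clamped one. Note that the padded rows are indexed with a stride of `res + 2`, while the result keeps a stride of `res`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Slope in degrees from the eight neighbours, same formula as in the thread.
inline float slope_deg(float h1, float h2, float h3, float h4, float h6,
                       float h7, float h8, float h9, float k) {
    const float gx = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
    const float gy = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
    return std::atan(k * std::sqrt(gx * gx + gy * gy)) * 57.2957795130823f;
}

// Copy a res x res image into rows of width res + 2, replicating the edge
// value into the one-element border on each side (an assumption).
inline std::vector<float> pad_rows(const std::vector<float>& src, std::size_t res) {
    assert(res > 0 && src.size() == res * res);
    const std::size_t stride = res + 2;
    std::vector<float> dst(res * stride);
    for (std::size_t y = 0; y < res; ++y) {
        dst[y * stride] = src[y * res];
        std::copy(src.begin() + y * res, src.begin() + (y + 1) * res,
                  dst.begin() + y * stride + 1);
        dst[y * stride + stride - 1] = src[y * res + res - 1];
    }
    return dst;
}

// Reference version: x indices are clamped, as in the original kernel.
inline std::vector<float> slopemap_clamped(const std::vector<float>& d, std::size_t res) {
    assert(res > 0 && d.size() == res * res);
    const float k = 0.1f * 2800.0f / (5000.0f / static_cast<float>(res));
    std::vector<float> out(res * res);
    for (std::size_t y = 0; y < res; ++y) {
        const std::size_t y0 = y > 0 ? y - 1 : 0, y1 = std::min(y + 1, res - 1);
        for (std::size_t x = 0; x < res; ++x) {
            const std::size_t x0 = x > 0 ? x - 1 : 0, x1 = std::min(x + 1, res - 1);
            out[y * res + x] = slope_deg(
                d[y1 * res + x0], d[y1 * res + x], d[y1 * res + x1],
                d[y * res + x0], d[y * res + x1],
                d[y0 * res + x0], d[y0 * res + x], d[y0 * res + x1], k);
        }
    }
    return out;
}

// Padded version: no min/max on x, and the input row stride is res + 2.
inline std::vector<float> slopemap_padded(const std::vector<float>& p, std::size_t res) {
    const std::size_t stride = res + 2;
    assert(res > 0 && p.size() == res * stride);
    const float k = 0.1f * 2800.0f / (5000.0f / static_cast<float>(res));
    std::vector<float> out(res * res);
    for (std::size_t y = 0; y < res; ++y) {
        const std::size_t y0 = y > 0 ? y - 1 : 0, y1 = std::min(y + 1, res - 1);
        for (std::size_t x = 1; x <= res; ++x) {  // x indexes the padded row
            const std::size_t x0 = x - 1, x1 = x + 1;
            out[y * res + x - 1] = slope_deg(
                p[y1 * stride + x0], p[y1 * stride + x], p[y1 * stride + x1],
                p[y * stride + x0], p[y * stride + x1],
                p[y0 * stride + x0], p[y0 * stride + x], p[y0 * stride + x1], k);
        }
    }
    return out;
}
```

With the border replicated this way, the inner loop reads three contiguous runs per row, which is the plain stencil pattern that can be vectorized without gathers.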
Yes, you are right, `resolution` should be `resolution + 2` in the kernel.
But your results show no significant performance improvement. The SYCL performance with padding is still worse than the ISPC performance, right?
Can you please tell us the HW details, the version of compiler and the compilation command?
Yes, ISPC is still noticeably faster.
After a reboot, the execution times of both the ISPC code and the oneAPI kernel improved somewhat, to `0.785 ms` and `1.764 ms` respectively (minimum time, versions with padding).
I use Windows 11, Visual Studio 17.8.3, oneAPI 2024.1 and the Intel(R) oneAPI DPC++ Compiler 2024 shipped with it.
I compile the SYCL code using VS and a VC project; the compile options are `/O2 /Zi /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /WX- /MD /std:c++20 /EHsc /W3 /nologo /Fo"x64\Release\"`
Hardware: Intel Core i7-11800H @ 2.30GHz, 32 GB RAM.
Thank you.
Can you provide us with the ISPC code with padding?
Sure, here it is:
```
export void normalmap(uniform float data[], uniform float result[], uniform int resolution, uniform size_t y)
{
    uniform float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    uniform size_t y0 = y > 0 ? y - 1 : 0;
    uniform size_t y1 = y < resolution - 1 ? y + 1 : resolution - 1;
    foreach (x = 1 ... resolution - 1)
    {
        size_t x0 = x - 1;
        size_t x1 = x + 1;
        float h7 = data[y0 * resolution + x0];
        float h8 = data[y0 * resolution + x];
        float h9 = data[y0 * resolution + x1];
        float h4 = data[y * resolution + x0];
        float h6 = data[y * resolution + x1];
        float h1 = data[y1 * resolution + x0];
        float h2 = data[y1 * resolution + x];
        float h3 = data[y1 * resolution + x1];
        float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
        float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
        float slope = slope_coeff * sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
        slope = atan(slope) * 57.2957795130823f;
        result[y * resolution + x - 1] = slope;
    }
}
```
The calling code is the same as before, but the `resolution` argument must now be the new, padded width, i.e. `resolution + 2`:
```
void ispc_slopemap(float* data, float* result)
{
    concurrency::parallel_for(0, static_cast<int>(resolution), [&](int y)
    {
        ispc::normalmap(data, result, resolution + 2, y);
    });
}
```
Hi,
Could you test the SYCL code without the /Zi flag, and then tell us the performance of the SYCL code with padding when compiled without this flag?
Sorry for the delay in my reply.
I tried compiling without the /Zi flag. Wow! Execution became MUCH faster! I didn't expect such an effect.
With padding, it is now even faster than ISPC:
```
ISPC [with padding] 0.911 1.167 1.232
oneAPI@OpenCL [with padding] 0.909 1.034 1.068
```
(If I recompile the ISPC code with AVX-512, which my CPU also supports, it is still a bit faster than OpenCL: 0.891 0.996 1.059, but the difference isn't significant.)
In the example without padding, SYCL is still a bit slower than ISPC, but that is fine too:
```
ISPC 1.305 1.669 1.731
oneAPI@OpenCL 1.495 1.799 1.863
```
Thank you very much for your assistance!
Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.