Solved: Poor CPU OpenCL backend performance of SYCL kernels compared to ISPC

blinkfrog · ‎11-05-2023

Hi. In my experiments, oneAPI SYCL kernels always run slower on the CPU device than the equivalent ISPC code.

For example, this code:

const size_t resolution = 2048;
size_t vector_size = resolution * resolution;

void oneapi_slopemap(queue& q, buffer<float, 1>& b_data, buffer<float, 1>& b_result)
{
	float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
	range<2> num_items2D { resolution, resolution };
	q.submit([&](handler& h) {
		auto data = b_data.get_access<access::mode::read>(h);
		auto result = b_result.get_access<access::mode::write>(h);
		h.parallel_for(num_items2D, [=](auto i)
		{
			int x = i[1];
			int y = i[0];
			int x0 = sycl::max<int>(x - 1, 0);
			int x1 = sycl::min<int>(x + 1, resolution - 1);
			int y0 = sycl::max<int>(y - 1, 0);
			int y1 = sycl::min<int>(y + 1, resolution - 1);
			float h7 = data[y0 * resolution + x0];
			float h8 = data[y0 * resolution + x];
			float h9 = data[y0 * resolution + x1];
			float h4 = data[y * resolution + x0];
			float h6 = data[y * resolution + x1];
			float h1 = data[y1 * resolution + x0];
			float h2 = data[y1 * resolution + x];
			float h3 = data[y1 * resolution + x1];
			float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
			float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
			float slope = slope_coeff * sycl::sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
			slope = sycl::atan(slope) * 57.2957795130823f;
			result[y * resolution + x] = slope;
		});
	});
	q.wait();
}

runs 1.9 times slower than this ISPC code:

export void slopemap(uniform float data[], uniform float result[], uniform int resolution, uniform int y)
{
    uniform float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    uniform int y0 = max(y - 1, 0);
    uniform int y1 = min(y + 1, resolution - 1);
    foreach (x = 0 ... resolution)
    {
        int x0 = max(x - 1, 0);
        int x1 = min(x + 1, resolution - 1);
        float h7 = data[y0 * resolution + x0];
        float h8 = data[y0 * resolution + x];
        float h9 = data[y0 * resolution + x1];
        float h4 = data[y * resolution + x0];
        float h6 = data[y * resolution + x1];
        float h1 = data[y1 * resolution + x0];
        float h2 = data[y1 * resolution + x];
        float h3 = data[y1 * resolution + x1];
        float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
        float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
        float slope = slope_coeff * sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
        slope = atan(slope) * 57.2957795130823f;
        result[y * resolution + x] = slope;
    }
}

which is called this way:

float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
concurrency::parallel_for(0, resolution, [&](auto y)
{
    ispc::slopemap(data.data(), result.data(), resolution, y);
});

I use the Windows version of oneAPI, integrated with Visual Studio 2022. This is the compilation command used:

/O2 /fsycl-targets=spir64_x86_64 /fsycl-early-optimizations /Zi /D "NDEBUG" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /WX- /MD /std:c++20 /EHsc /W3 /nologo /Fo"x64\Release\"

And this is the ISPC compilation command:

ispc -O2 "%(Filename).ispc" -o "$(IntDir)%(Filename).obj" -h "$(ProjectDir)%(Filename).h" --target=sse4,avx2,avx512skx-x8 --opt=fast-math

Can the performance of the CPU backend be improved to come closer to ISPC performance? Thank you.

cw_intel · ‎11-06-2023

Hi,

Can you provide us with the complete code, including the main function and how you measure the performance?

Thanks

blinkfrog · ‎11-07-2023

Sure!

I've put all the SYCL code and benchmark code into a single file. I've also included the VS project and solution in case this helps.

ISPC compile command:

ispc -O2 ispc_slopemap.ispc -o ispc_slopemap.obj -h ispc_slopemap.h --target=avx2 --opt=fast-math

(Usually I compile ISPC using the Custom Build Tool in Visual Studio for multiple targets such as sse42, avx2 and avx512. However, when I switched to the DPC++ compiler, I got link errors for some reason (there are no such errors with the Intel C++ compiler) until I left only a single target and added the .obj file to the project manually. If your CPU supports the avx512 instruction set, you can change the compile command accordingly.)

cw_intel · ‎11-07-2023

Thanks for providing the details. We will test it on our side.

Thanks.

Spooner · ‎11-08-2023

Hey,

Just a thought on this topic. You compile your SYCL code for 64-bit, so I assume the kernel uses 64-bit arithmetic for pointers. ISPC uses 32-bit pointer arithmetic by default. As you have a lot of reads, this might explain the difference in performance.
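
To make the addressing argument concrete, here is a rough sketch in plain C++ (not code from the thread). It assumes 256-bit AVX2 registers and the 2048 x 2048 float grid from the original post; `lanes_per_register` and `max_byte_offset` are hypothetical helper names. It only shows that a register holds twice as many 32-bit indices as 64-bit ones, and that every offset in this benchmark fits in 32 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Number of index lanes that fit in one SIMD register of the given width.
constexpr std::size_t lanes_per_register(std::size_t register_bits, std::size_t index_bytes) {
    return register_bits / (index_bytes * 8);
}

// Largest byte offset touched when indexing a square grid of floats.
constexpr std::uint64_t max_byte_offset(std::uint64_t resolution) {
    return (resolution * resolution - 1) * sizeof(float);
}

constexpr std::size_t avx2_bits = 256;  // assumption: AVX2 register width
constexpr std::size_t lanes32 = lanes_per_register(avx2_bits, sizeof(std::int32_t));
constexpr std::size_t lanes64 = lanes_per_register(avx2_bits, sizeof(std::int64_t));

// For the 2048 x 2048 grid, every byte offset fits in a signed 32-bit integer.
constexpr bool fits_in_32_bits =
    max_byte_offset(2048) <=
    static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());

static_assert(lanes32 == 2 * lanes64, "32-bit indices pack twice as densely");
```

Whether the compiler actually exploits the denser packing depends on the code it generates, so this is an argument about capacity, not a performance prediction.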

blinkfrog · ‎11-08-2023

Thank you very much Spooner,

To check your assumption, I compiled the ISPC file with the `--addressing=64` parameter, and the numbers did change.

So, these are the benchmark results:

Method              Min     Med     Avg
ISPC 32-bit addr    1.109   1.282   1.297
ISPC 64-bit addr    1.622   1.826   1.840
oneAPI@OpenCL       2.334   2.678   2.690
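
The benchmark file attached to this thread is not reproduced here. A harness that produces Min/Med/Avg columns like the ones above might look roughly like this sketch; `Stats`, `summarize` and `benchmark` are hypothetical names, and reporting milliseconds is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct Stats { double min, med, avg; };

// Reduce a non-empty list of timings to minimum, median and mean.
inline Stats summarize(std::vector<double> t) {
    std::sort(t.begin(), t.end());
    const std::size_t n = t.size();
    const double med = (n % 2 == 1) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
    const double avg = std::accumulate(t.begin(), t.end(), 0.0) / static_cast<double>(n);
    return {t.front(), med, avg};
}

// Time `runs` calls of `fn` (runs >= 1), in milliseconds, after one untimed warm-up call.
template <typename Fn>
Stats benchmark(Fn&& fn, std::size_t runs) {
    fn();  // warm-up: keeps one-time setup costs out of the timings
    std::vector<double> ms;
    ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return summarize(std::move(ms));
}
```

For a SYCL kernel, the callable passed to `benchmark` would need to include the `q.wait()` call so that the full execution is timed.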

Even in this case, OpenCL is still noticeably slower than ISPC.

However, is there a way to make OpenCL use 32-bit addressing?

cw_intel · ‎11-12-2023

Hi,

Can you provide the ISPC code separately, along with its compilation command? With the code and compilation command you provided, I can't compile it correctly.

Thanks.

blinkfrog · ‎11-13-2023

Sure!

This is the command line to compile ispc_normalmap.ispc:

ispc ispc_normalmap.ispc -o ispc_normalmap.obj -h ispc_normalmap.h -O2 --opt=fast-math

(it is assumed that the directory with ispc.exe is in your PATH variable, if you use Windows)

It generates a header file and an .obj file, which you need to add to your project in Visual Studio manually (since DPC++ doesn't support the Custom Build Tool for some reason, unlike the Intel C++ Compiler), for example by dragging and dropping the file onto your project name in Solution Explorer.

(There is also an error in the original cpp file that I provided. Line #102 is this:

ispc_slopemap(ispc_data.data(), sycl_result.data());

but should be this:

ispc_slopemap(ispc_data.data(), ispc_result.data());

Same for line #94.

But this error doesn't affect performance; it only affects whether the results are available if you want to compare them during debugging.

There is also an error in the algorithm itself, which is irrelevant to performance and to the test. Also, the names of files and functions can be confusing (the files are `normalmap`, while the functions are `slopemap`); this is because I originally measured the performance of calculating normal maps but then decided to switch to slope maps.)

cw_intel · ‎11-14-2023

Thank you for the details. We can build the ISPC code now and are investigating the issue. We will let you know if there is any update.

Thanks.

cw_intel · ‎05-28-2024

Hi,

I'd like to let you know that we will improve the performance of SYCL kernels in future compilers.

In the meantime, we have a workaround for you: modify the test code to avoid using "max/min" on the x indices by padding the "data" buffer with 2 extra elements in the x dimension. The code is attached as sycl_performance_padding.cpp.

I tested the code on my side; the performance is improved, and it's better than ISPC.

Method             Min     Med     Avg
ISPC               1.539   1.728   2.560
oneAPI@OpenCL      2.409   8.910   8.501
syclwithpadding    1.180   1.531   1.797

Can you test the code on your side?

Thanks,

Chen

blinkfrog · ‎05-29-2024

Hello,

Thank you for the update and for all your efforts. It is exciting news that the performance of SYCL kernels will be improved!

Yes, padding helps, of course. It makes the code much easier to vectorize, resulting in a plain stencil pattern that can be vectorized without gather instructions.
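
As a rough illustration of why padding removes the clamping, here is a minimal sketch in plain C++. It is not the attached sycl_performance_padding.cpp: the helper names are hypothetical, only the single-row horizontal difference is shown instead of the full 3x3 stencil, and replicating the edge values into the padding columns is an assumption made so that the results match the clamped version.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy a width x height grid (width >= 1) into rows of (width + 2) floats,
// replicating the first and last value of each row into the extra columns.
inline std::vector<float> pad_rows(const std::vector<float>& src,
                                   std::size_t width, std::size_t height) {
    const std::size_t stride = width + 2;
    std::vector<float> dst(stride * height);
    for (std::size_t y = 0; y < height; ++y) {
        const float* in = &src[y * width];
        float* out = &dst[y * stride];
        out[0] = in[0];                      // left padding column
        std::copy(in, in + width, out + 1);  // original row, shifted by one
        out[stride - 1] = in[width - 1];     // right padding column
    }
    return dst;
}

// Horizontal difference with clamped x indices (the original formulation).
inline float diff_x_clamped(const std::vector<float>& d, int w, int x, int y) {
    const int x0 = std::max(x - 1, 0);
    const int x1 = std::min(x + 1, w - 1);
    return d[y * w + x1] - d[y * w + x0];
}

// Same difference on the padded grid: neighbours sit at fixed offsets,
// and the row stride is w + 2 rather than w.
inline float diff_x_padded(const std::vector<float>& p, int w, int x, int y) {
    const int stride = w + 2;
    const int c = y * stride + x + 1;  // +1 skips the left padding column
    return p[c + 1] - p[c - 1];
}
```

With fixed offsets, consecutive x values read consecutive addresses, which is what lets a vectorizer use plain contiguous loads.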

However, padding the data isn't always possible. Also, there are many other algorithms where gather and scatter can't be eliminated (and where ISPC is still ~70% faster than the OpenCL CPU backend of oneAPI).

Also, I wanted to ask: is your code correct? Shouldn't `resolution` be `resolution + 2` in the following lines within the kernel? Our data rows are now 2 elements wider than `resolution`.
```
float h7 = data[y0 * resolution + x0];
float h8 = data[y0 * resolution + x];
float h9 = data[y0 * resolution + x1];
float h4 = data[y * resolution + x0];
float h6 = data[y * resolution + x1];
float h1 = data[y1 * resolution + x0];
float h2 = data[y1 * resolution + x];
float h3 = data[y1 * resolution + x1];
```

Nonetheless, here are results from my computer:

```

Method                          Min     Med     Avg
ISPC                            1.273   1.528   1.567
oneAPI@OpenCL                   2.331   2.707   2.737
ISPC [with padding]             0.911   1.167   1.232
oneAPI@OpenCL [with padding]    2.072   2.518   2.586

```

cw_intel · ‎05-29-2024

Yes, you are right, `resolution` should be `resolution + 2` in the kernel.

But from your results, there is no significant performance improvement. The SYCL performance with padding is still worse than the ISPC performance, right?

Can you please tell us the hardware details, the compiler version, and the compilation command?

blinkfrog · ‎05-29-2024

Yes, ISPC is still noticeably faster.

After a reboot, the execution times of both the ISPC code and the oneAPI kernel code have improved somewhat, to `0.785 ms` and `1.764 ms` (minimum time) respectively (versions with padding).

I use Windows 11, Visual Studio 17.8.3, oneAPI 2024.1 and the Intel(R) oneAPI DPC++ Compiler 2024 shipped with it.

I compile the SYCL code using VS and a VC project; the compile options are `/O2 /Zi /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /WX- /MD /std:c++20 /EHsc /W3 /nologo /Fo"x64\Release\"`

Hardware: Intel Core i7-11800H @ 2.30GHz, 32 GB RAM.

Thank you.

cw_intel · ‎05-29-2024

Can you provide us with the ISPC code with padding?

blinkfrog · ‎05-29-2024

Sure, here it is:

export void normalmap(uniform float data[], uniform float result[], uniform int resolution, uniform size_t y)
{
    uniform float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    uniform size_t y0 = y > 0 ? y - 1 : 0;
    uniform size_t y1 = y < resolution - 1 ? y + 1 : resolution - 1;	
    foreach (x = 1 ... resolution - 1)
    {
        size_t x0 = x - 1;
        size_t x1 = x + 1;
        float h7 = data[y0 * resolution + x0];
        float h8 = data[y0 * resolution + x];
        float h9 = data[y0 * resolution + x1];
        float h4 = data[y * resolution + x0];
        float h6 = data[y * resolution + x1];
        float h1 = data[y1 * resolution + x0];
        float h2 = data[y1 * resolution + x];
        float h3 = data[y1 * resolution + x1];
        float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
        float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
        float slope = slope_coeff * sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
        slope = atan(slope) * 57.2957795130823f;
        result[y * resolution + x - 1] = slope;
    }
}

The calling code is the same as before, but the `resolution` parameter should be the new, padded width, i.e. `resolution + 2`:

void ispc_slopemap(float* data, float* result)
{
	concurrency::parallel_for(0, static_cast<int>(resolution), [&](int y)
	{
		ispc::normalmap(data, result, resolution + 2, y);
	});
}

cw_intel · ‎06-12-2024

Hi,

Could you test the SYCL code without the /Zi flag, and then tell us the performance of the SYCL code with padding when built without this flag?

blinkfrog · ‎06-17-2024

Sorry for the delay in my reply.

I've tried compiling it without the /Zi flag. Wow! Execution became MUCH faster! I didn't expect such an effect.

With padding, it is now even faster than ISPC:

```

Method                          Min     Med     Avg
ISPC [with padding]             0.911   1.167   1.232
oneAPI@OpenCL [with padding]    0.909   1.034   1.068

```

(If I recompile the ISPC code with AVX512, which my CPU also supports, it is still a bit faster than OpenCL: 0.891 0.996 1.059, but the difference isn't significant.)

In the example without padding, it is still a bit slower than ISPC, but this is fine too:

```

Method           Min     Med     Avg
ISPC             1.305   1.669   1.731
oneAPI@OpenCL    1.495   1.799   1.863

```

Thank you very much for your assistance!

cw_intel · ‎06-19-2024

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.