
Poor CPU OpenCL backend performance of SYCL kernels comparing to ISPC

blinkfrog
Novice

Hi. In my experiments, oneAPI SYCL kernels always run slower on the CPU device than the equivalent ISPC code.

For example, this code:

const size_t resolution = 2048;
size_t vector_size = resolution * resolution;

void oneapi_slopemap(queue& q, buffer<float, 1>& b_data, buffer<float, 1>& b_result)
{
	float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
	range<2> num_items2D { resolution, resolution };
	q.submit([&](handler& h) {
		auto data = b_data.get_access<access::mode::read>(h);
		auto result = b_result.get_access<access::mode::write>(h);
		h.parallel_for(num_items2D, [=](auto i)
		{
			int x = i[1];
			int y = i[0];
			int x0 = sycl::max<int>(x - 1, 0);
			int x1 = sycl::min<int>(x + 1, resolution - 1);
			int y0 = sycl::max<int>(y - 1, 0);
			int y1 = sycl::min<int>(y + 1, resolution - 1);
			float h7 = data[y0 * resolution + x0];
			float h8 = data[y0 * resolution + x];
			float h9 = data[y0 * resolution + x1];
			float h4 = data[y * resolution + x0];
			float h6 = data[y * resolution + x1];
			float h1 = data[y1 * resolution + x0];
			float h2 = data[y1 * resolution + x];
			float h3 = data[y1 * resolution + x1];
			float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
			float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
			float slope = slope_coeff * sycl::sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
			slope = sycl::atan(slope) * 57.2957795130823f;
			result[y * resolution + x] = slope;
		});
	});
	q.wait();
}

runs 1.9 times slower than this ISPC code:

export void slopemap(uniform float data[], uniform float result[], uniform int resolution, uniform int y)
{
    uniform float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    uniform int y0 = max(y - 1, 0);
    uniform int y1 = min(y + 1, resolution - 1);
    foreach (x = 0 ... resolution)
    {
        int x0 = max(x - 1, 0);
        int x1 = min(x + 1, resolution - 1);
        float h7 = data[y0 * resolution + x0];
        float h8 = data[y0 * resolution + x];
        float h9 = data[y0 * resolution + x1];
        float h4 = data[y * resolution + x0];
        float h6 = data[y * resolution + x1];
        float h1 = data[y1 * resolution + x0];
        float h2 = data[y1 * resolution + x];
        float h3 = data[y1 * resolution + x1];
        float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
        float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
        float slope = slope_coeff * sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
        slope = atan(slope) * 57.2957795130823f;
        result[y * resolution + x] = slope;
    }
}

which is called this way:

float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
concurrency::parallel_for(0, resolution, [&](auto y)
{
    ispc::slopemap(data.data(), result.data(), resolution, y);
});

I use the Windows version of oneAPI, integrated with Visual Studio 2022. This is the compilation command used:

/O2 /fsycl-targets=spir64_x86_64 /fsycl-early-optimizations /Zi /D "NDEBUG" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /WX- /MD /std:c++20 /EHsc /W3 /nologo /Fo"x64\Release\"

And this is the ISPC compilation command:

ispc -O2 "%(Filename).ispc" -o "$(IntDir)%(Filename).obj" -h "$(ProjectDir)%(Filename).h" --target=sse4,avx2,avx512skx-x8 --opt=fast-math

Can the performance of the CPU backend be improved to get closer to ISPC performance? Thank you.

cw_intel
Moderator

Hi,

Can you provide the complete code, including the main function and how you measure the performance?


Thanks


blinkfrog
Novice

Sure!

 

I've put all the SYCL code and the benchmark code into a single file. I've also included the VS project and solution in case that helps.
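
The attached benchmark file is not reproduced in this thread. As a rough sketch only, this is one way minimum, median, and average timings for a kernel might be collected. The helper names, the use of `std::chrono::steady_clock`, and the millisecond unit are all assumptions for illustration, not necessarily what the attached code does.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical result type: minimum, median, and average of the samples.
struct TimingStats { double min, med, avg; };

// Hypothetical helper: summarize a non-empty list of timings.
inline TimingStats summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // Median: middle element, or the mean of the two middle elements.
    const double med = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    const double avg =
        std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);
    return {samples.front(), med, avg};
}

// Hypothetical helper: call fn() `repeats` times and time each call
// in milliseconds (assumed unit).
template <typename Fn>
TimingStats benchmark(Fn&& fn, int repeats) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return summarize(std::move(samples));
}
```

For the SYCL version, the timed callable would need to include the `q.wait()` call, as `oneapi_slopemap` does, so that each sample covers kernel completion and not just submission.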

 

ISPC compile command:

ispc -O2 ispc_slopemap.ispc -o ispc_slopemap.obj -h ispc_slopemap.h --target=avx2 --opt=fast-math

 

(I usually compile ISPC code with a Custom Build Tool step in Visual Studio for multiple targets such as sse42, avx2 and avx512. When I switched to the DPC++ compiler, however, I got link errors for some reason (there are no such errors with the Intel C++ Compiler), until I left only a single target and added the .obj file to the project manually. If your CPU supports the AVX-512 instruction set, you can change the compile command accordingly.)

cw_intel
Moderator

Thanks for providing the details. We will test it on our side.


Thanks.


Spooner
Beginner
Hey,

Just a thought on this topic. You compile your SYCL code for 64-bit. I assume the kernel will use 64-bit arithmetic for pointers in this case. ISPC uses 32-bit pointer arithmetic by default. As you have a lot of reads, this might explain the difference in performance.
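
The width of the index arithmetic can be checked at the source level. In the SYCL kernel from the first post, `resolution` is a `size_t`, so an expression such as `y0 * resolution + x0` mixes `int` with `size_t` and is evaluated as `size_t` under the usual C++ arithmetic conversions. The sketch below only demonstrates that promotion. The helper name `flat_index` is hypothetical, a 64-bit target is assumed, and it says nothing about what code the CPU backend actually generates or how much the wider arithmetic costs.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Same declaration as in the SYCL example from the first post.
const std::size_t resolution = 2048;

// Hypothetical helper mirroring the kernel's index expression.
// Mixing int with size_t converts the int operands to size_t, so the
// multiply and the add are carried out at size_t width.
inline auto flat_index(int y, int x) {
    assert(y >= 0 && x >= 0);  // negative values would wrap when converted
    return y * resolution + x;
}

// Compile-time checks. The second one is an assumption about the target:
// a 64-bit platform where size_t is 8 bytes and int is 4 bytes.
static_assert(std::is_same_v<decltype(flat_index(0, 0)), std::size_t>,
              "index expression is promoted to size_t");
static_assert(sizeof(std::size_t) == 8 && sizeof(int) == 4,
              "assumes a 64-bit target with 32-bit int");
```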
blinkfrog
Novice

Thank you very much Spooner,

 

To check your assumption, I compiled the ISPC file with the `--addressing=64` parameter, and the numbers did change.

 

So, these are the benchmark results:

Method              Min     Med     Avg
ISPC 32-bit addr    1.109   1.282   1.297
ISPC 64-bit addr    1.622   1.826   1.840
oneAPI@OpenCL       2.334   2.678   2.690

Even in this case, OpenCL is still noticeably slower than ISPC.
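
To put numbers on "noticeably slower", the ratios can be worked out from the averages in the table. This is a small sketch with a hypothetical helper, `slowdown`. The constants are copied from the Avg column, in whatever unit the benchmark reports.

```cpp
#include <cassert>

// Hypothetical helper: how many times slower `time` is than `baseline`.
inline double slowdown(double time, double baseline) {
    assert(baseline > 0.0);
    return time / baseline;
}

// Average timings copied from the table (unit as reported by the benchmark).
constexpr double ispc32_avg = 1.297;  // ISPC, 32-bit addressing
constexpr double ispc64_avg = 1.840;  // ISPC, 64-bit addressing
constexpr double opencl_avg = 2.690;  // oneAPI on the OpenCL CPU device
```

With these values, `slowdown(opencl_avg, ispc32_avg)` is about 2.07, `slowdown(opencl_avg, ispc64_avg)` is about 1.46, and `slowdown(ispc64_avg, ispc32_avg)` is about 1.42. So 64-bit addressing accounts for part of the gap, but not all of it.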

 

However, is there a way to make OpenCL use 32-bit addressing?
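
Whether the CPU backend has an addressing switch is not answered in this thread. One source-level experiment, offered as an untested assumption and not as a fix, is to keep all index arithmetic in `int` by passing the resolution as an `int` instead of a `size_t`. The scalar reference below is a hypothetical helper in plain C++, not SYCL. It shows that form of the stencil and can also be used to check kernel output.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference of the slope kernel. Every index is computed
// in int, which is safe only while resolution * resolution fits in an int
// (true for 2048).
inline void slopemap_reference(const std::vector<float>& data,
                               std::vector<float>& result, int resolution) {
    assert(resolution > 0);
    const std::size_t n = static_cast<std::size_t>(resolution) * resolution;
    assert(data.size() == n && result.size() == n);
    const float slope_coeff = 0.1f * 2800.0f / (5000.0f / resolution);
    for (int y = 0; y < resolution; ++y) {
        // Clamp neighbor rows at the image border.
        const int y0 = std::max(y - 1, 0);
        const int y1 = std::min(y + 1, resolution - 1);
        for (int x = 0; x < resolution; ++x) {
            // Clamp neighbor columns at the image border.
            const int x0 = std::max(x - 1, 0);
            const int x1 = std::min(x + 1, resolution - 1);
            const float h7 = data[y0 * resolution + x0];
            const float h8 = data[y0 * resolution + x];
            const float h9 = data[y0 * resolution + x1];
            const float h4 = data[y * resolution + x0];
            const float h6 = data[y * resolution + x1];
            const float h1 = data[y1 * resolution + x0];
            const float h2 = data[y1 * resolution + x];
            const float h3 = data[y1 * resolution + x1];
            const float gradient_x = h9 + 3.0f * (h6 - h4) + h3 - h7 - h1;
            const float gradient_y = h7 + 3.0f * (h8 - h2) + h9 - h1 - h3;
            float slope = slope_coeff *
                std::sqrt(gradient_x * gradient_x + gradient_y * gradient_y);
            slope = std::atan(slope) * 57.2957795130823f;  // radians to degrees
            result[y * resolution + x] = slope;
        }
    }
}
```

In the SYCL kernel, the analogous change would be to capture an `int` copy of `resolution` and use it in the index expressions. Whether that changes the generated code or the timings would have to be measured.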

cw_intel
Moderator

Hi,

Can you provide the ISPC code and its compilation command separately? I can't compile it correctly with the code and compilation command you provided.

Thanks.


blinkfrog
Novice

Sure!

 This is the command line to compile ispc_normalmap.ispc:

 

ispc ispc_normalmap.ispc -o ispc_normalmap.obj -h ispc_normalmap.h -O2 --opt=fast-math

 

(This assumes that the directory containing ispc.exe is in your PATH variable, if you use Windows.)

It generates a header file and an .obj file, which you need to add to your Visual Studio project manually (since DPC++ doesn't support Custom Build Tool for some reason, unlike the Intel C++ Compiler), for example by dragging and dropping the file onto your project name in Solution Explorer.

 

(There is also an error in the original cpp file that I provided. Line #102 is this:

 

ispc_slopemap(ispc_data.data(), sycl_result.data());

 

but should be this:

 

ispc_slopemap(ispc_data.data(), ispc_result.data());

 

Same for line #94.
But this error doesn't affect performance; it only affects whether the results are available if you want to compare them while debugging.

There is also an error in the algorithm itself, which is irrelevant to the performance and to the test. Also, the names of the files and functions can be confusing (the files are `normalmap`, while the functions are `slopemap`); this is because in the very first test I measured the performance of calculating normal maps, but then decided to switch to slope maps.)
cw_intel
Moderator

Thank you for the details. We can build the ISPC code now and are investigating the issue. We will let you know if there is any update.


Thanks.

