OpenCL* for CPU

OpenCL achieves 800% CPU utilization

Biao_W_
Beginner
670 Views

Hi all, 

I am curious about the CPU implementation of OpenCL for Intel processors.

I ran a small set of benchmarks from clpeak on an i7-4770S (4 cores, Hyper-Threading enabled) under Linux.

top shows CPU utilization reaching almost 800%, meaning all CPU resources are utilized.

However, when I run the clpeak benchmarks individually, top shows a maximum of 400%.

Running the benchmarks consecutively seems to benefit from the OpenCL runtime.

Does that mean that when a single workload is issued to the OpenCL CPU runtime, it uses only some of the cores rather than all of them?

Also, does the OpenCL CPU runtime use SIMD to execute consecutive work-items?

Thanks in advance!

Best,

Biao

5 Replies
Robert_I_Intel
Employee

Biao,

The OpenCL CPU runtime is very efficient at using all available CPU resources: all cores and all threads will be utilized. In addition, kernels are compiled with AVX2 instructions in mind (a 256-bit form of SIMD). To achieve comparable performance in regular C/C++ code, you will need to use 256-bit data types and intrinsics in addition to regular multithreading.
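
To make "256-bit data types and intrinsics" concrete, here is a minimal sketch in plain C that adds two float arrays eight elements at a time. It is only an illustration, not code from this thread: the names `add_floats`, `add_avx2` and `add_scalar` are hypothetical, and the `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, so this assumes an x86 build with one of those compilers. The multithreading part is left out.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar fallback: one element per iteration. */
static void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* 256-bit path: __m256 holds eight floats.
   The target attribute (GCC/Clang extension) enables AVX2 code generation
   for this function only, so no global -mavx2 flag is assumed. */
__attribute__((target("avx2")))
static void add_avx2(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned load of 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                        /* leftover elements */
        out[i] = a[i] + b[i];
}

/* Hypothetical entry point: pick the 256-bit path only if the CPU has it. */
void add_floats(const float *a, const float *b, float *out, size_t n) {
    if (__builtin_cpu_supports("avx2"))
        add_avx2(a, b, out, n);
    else
        add_scalar(a, b, out, n);
}
```

Calling `add_floats(a, b, out, n)` gives the same result on either path; only the number of elements processed per instruction differs.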

Biao_W_
Beginner

Hi Robert, 

Thanks for your reply.

In clinfo I see that the number of compute units for this CPU is 8. Does that mean each logical core (with Hyper-Threading enabled) is one compute unit?

What if the CPU does not support AVX2? Will the Intel compiler try to compile the kernel with the latest SIMD extension that a lower-end CPU supports?

Finally, I have one kernel (SAO) from an HEVC decoder. It is offloaded to the CPU as an OpenCL device. I also have a hand-coded AVX2 SIMD implementation of this kernel. The OpenCL SAO kernel performs poorly on the CPU compared to the hand-coded SIMD version. Is there any way to measure the efficiency of Intel OpenCL compiled SIMD instructions? See the attached picture.

Robert_I_Intel
Employee

Hi Biao,

Yes, each logical core is one compute unit. I will ask our product engineers about the last couple of questions. Typically, an OpenCL kernel on the CPU device should not perform better than hand-coded AVX2 code; however, I am not sure what the typical overhead for using OpenCL should be. In your diagram, which kernels are hand-coded and which ones are OpenCL kernels?
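
The numbers in this thread fit together as simple arithmetic: 4 physical cores with 2 hardware threads each give 8 logical cores, which matches the 8 compute units reported by clinfo, and top in its default mode counts each fully busy logical core as 100%, so 800% is the ceiling. A small sketch of that arithmetic follows. The helper names are hypothetical, and `online_logical_cores` assumes a Linux/glibc system where `sysconf(_SC_NPROCESSORS_ONLN)` is available.

```c
#include <unistd.h>

/* Logical cores = physical cores x hardware threads per core. */
long logical_cores(long physical_cores, long threads_per_core) {
    return physical_cores * threads_per_core;
}

/* Highest %CPU that top (default mode) can show for one process:
   100% per fully busy logical core. */
long top_ceiling_percent(long n_logical) {
    return n_logical * 100;
}

/* Logical cores currently online according to the OS
   (assumption: _SC_NPROCESSORS_ONLN is supported, as on Linux/glibc). */
long online_logical_cores(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

For the i7-4770S discussed here, `top_ceiling_percent(logical_cores(4, 2))` evaluates to 800, so a reading of 400% corresponds to the equivalent of four fully busy logical cores.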

Biao_W_
Beginner

Hi Robert, 

Actually, there are two kernels in the figure, but let us focus on the SAO kernel only.

The black bar surrounded by a green circle is the performance of the hand-coded SAO kernel, while the blue bars surrounded by the other green circle are the performance of the OpenCL CPU implementation. The latter has two parts because I run the kernel on two subsets of the input data.

 

Robert_I_Intel
Employee

Hi Biao,

Just got a response from our product engineer:

What if the CPU does not support AVX2? Will the Intel compiler try to compile the kernel with the latest SIMD extension that a lower-end CPU supports?

Yes. The OpenCL CPU runtime automatically detects the supported vector extensions and optimizes for them. Supported CPUs can be found in the release notes.
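
The general idea of picking a vector extension at run time can be sketched in C as follows. This is only an illustration of the concept and makes no claim about how the Intel runtime is implemented: `widest_simd` is a hypothetical helper, and `__builtin_cpu_supports` is a GCC/Clang built-in for x86 targets.

```c
/* Return the name of the widest x86 vector extension the current CPU
   reports, checking from newest to oldest.
   Assumption: GCC or Clang on an x86 target. */
const char *widest_simd(void) {
    if (__builtin_cpu_supports("avx512f")) return "AVX-512F";
    if (__builtin_cpu_supports("avx2"))    return "AVX2";
    if (__builtin_cpu_supports("avx"))     return "AVX";
    if (__builtin_cpu_supports("sse4.2"))  return "SSE4.2";
    return "SSE2 or older";
}
```

A dispatcher would call this once at startup and then select the code path built for that extension, falling back to a narrower one on older CPUs.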

 

Is there any way to measure the efficiency of Intel OpenCL compiled SIMD instructions?

 

If you want to understand how efficiently OpenCL uses the hardware, you can use a profiler (we support Intel VTune). If the question is about the quality of the JIT code produced by the OpenCL compiler, you can analyze the JIT output and LLVM IR using the tools provided by the OpenCL SDK. If the question is about why the hand-coded application outperforms the OpenCL app, then I suggest using both (profiler + SDK tools) for the analysis.
