Intel OpenCL Performance comparison to OpenMP on CPU using simple codes

zhaopeng · ‎11-16-2013

Hi,

I'd like to evaluate parallel computing tools for the CPU and choose one; OpenMP and OpenCL are the leading candidates. I wrote a very simple example to compare their performance. The test was run on openSUSE 12.3 with an Intel Core i5-2500K (overclocked to 4 GHz), using the Intel® SDK for OpenCL* Applications XE 2013 R2.

The example just does some meaningless computation on two arrays of random numbers.

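OpenMP code:

[cpp]
#include <cmath>
#include <cstdlib>
#include <vector>

int main() {
    const size_t dim = 1024 * 1024 * 32;
    std::vector<float> a(dim);
    std::vector<float> b(dim);
    std::vector<float> c(dim);

    // Fill both inputs with random non-negative values.
    for (size_t i = 0; i < dim; i++) {
        a[i] = rand();
        b[i] = rand();
    }

    // The benchmarked region: one independent computation per element.
    #pragma omp parallel for
    for (size_t i = 0; i < dim; i++) {
        c[i] = sqrt(sqrt(a[i]) * sqrt(b[i]));
    }
    return 0;
}
[/cpp]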

OpenCL kernel code:

__kernel void testKernel(__global float* a, __global float* b, __global float* c)
{
    size_t i = get_global_id(0);
    c[i] = sqrt(sqrt((double)a[i]) * sqrt((double)b[i]));
}

The code is so simple that I cannot see any room left for optimization.

=============

First, I compiled the code with GCC 4.7. The timings are as follows.

OpenMP code: 0.166845 s

OpenCL code: 0.0827546 s

This matched my expectation that the OpenCL code would be faster.
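A minimal sketch of how timings like these might be collected, assuming a wall-clock measurement around the loop only. The helper names `timeSqrtLoop` and `bestOf` are hypothetical and not taken from the original benchmark, whose timing method is not shown in the post.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper (not from the original post): runs the benchmark loop
// once and returns the elapsed wall-clock time in seconds.
double timeSqrtLoop(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c) {
    assert(a.size() == c.size() && b.size() == c.size());
    const long n = static_cast<long>(c.size());

    const auto start = std::chrono::steady_clock::now();
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        c[i] = std::sqrt(std::sqrt(a[i]) * std::sqrt(b[i]));
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}

// Repeats the measurement and keeps the fastest run, so one-off costs such as
// creating the worker threads do not dominate the reported number.
double bestOf(int repeats,
              const std::vector<float>& a,
              const std::vector<float>& b,
              std::vector<float>& c) {
    assert(repeats > 0);
    double best = timeSqrtLoop(a, b, c);
    for (int r = 1; r < repeats; r++) {
        const double t = timeSqrtLoop(a, b, c);
        if (t < best) best = t;
    }
    return best;
}
```

Calling `bestOf(5, a, b, c)` on the arrays from the post would report the fastest of five runs. Taking the minimum is a design choice made for this sketch; a single run or an average would give different numbers.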

========

Next, I used the Intel C++ Compiler 14.0.1.

OpenMP code: 0.082298 s

OpenCL code: 0.117268 s

It is interesting that the Intel OpenMP implementation is nearly twice as fast as GCC's, but the OpenCL code is slower even though the same OpenCL implementation is used. Does anyone have an idea why this happens? Does it mean that the Intel OpenCL implementation needs more optimization compared to OpenMP?

By the way, are there any comprehensive benchmarks comparing OpenMP and OpenCL across different compilers and implementations?

Thanks!

Arik_N_Intel · ‎11-17-2013

Thanks for sharing this issue!

Can you please run a basic VTune analysis on the two builds of the OpenCL code?

Thanks,

Arik

zhaopeng · ‎11-18-2013

Hi,

Thanks for your advice.

The previous results were produced by benchmarking OpenMP and OpenCL one after the other in a single run. When I benchmark them separately, the performance of the OpenMP and OpenCL code is almost identical. In my opinion, they should not affect each other when executed sequentially. Any ideas about what is going on?

I am new to VTune, and the analysis results are attached. I would appreciate it if you could point out anything informative in the results. Thanks!

Dmitry_K_Intel · ‎11-18-2013

I assume you ran OpenMP first and then OpenCL? In that order, OpenCL will run slower.

Both OpenMP and OpenCL with default settings assume full machine ownership. After OpenMP grabs the machine it holds it slightly longer than strictly required because it waits for the next task to be scheduled. By the way, OpenCL does the same. So the best approach is not to mix OpenMP and OpenCL in the same process.

zhaopeng · ‎11-18-2013

Thanks for the information!

Dmitry_K_Intel wrote:
After OpenMP grabs the machine it holds it slightly longer than strictly required because it waits for the next task to be scheduled.

Can you explain what exactly the next task is? Is it any C/C++ code, or another OpenMP or OpenCL call?

Dmitry_K_Intel wrote:
So the best approach is not to mix OpenMP and OpenCL in the same process.

Usually, we want to use OpenMP for simple loops on the CPU and OpenCL for complex, time-consuming tasks on the CPU or GPU. Is there any way to reduce the overhead of the ownership switch?

Thanks a lot!

Dmitry_K_Intel · ‎11-18-2013

When the user uses the OpenMP "parallel" construct, a set of additional threads (worker threads) is created. Each of those threads can be in at least 3 states:

1. executing user code, for example inside a "parallel for" section
2. waiting for the next parallel region, for example the next "parallel for"
3. sleeping

If a worker thread enters the "sleep" state and the user issues the next "parallel" section, it takes some time for the worker thread to rejoin execution. OpenMP implementations therefore try to avoid putting threads to sleep too early: each worker spins in a loop for some time, waiting for work, before it enters the sleep state. Different OpenMP vendors provide different sets of knobs for tuning this behavior. Please look here (http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/hh_goto.htm#GUID-BD9B39A7-5885-4C6C-A047-93F22EB85740.htm, "Thread Sleep" Time section) for Intel-specific information.
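As a rough sketch of what such knobs look like, the function below requests a short spin time through environment variables. `OMP_WAIT_POLICY` is the standard OpenMP variable and `KMP_BLOCKTIME` (in milliseconds) is specific to the Intel runtime; the function name `requestPassiveWait` is hypothetical. It is an assumption that the runtime reads these variables after the call: some runtimes read their environment when the library is loaded, so exporting the variables in the shell before launching the program is the safer route.

```cpp
#include <cassert>
#include <stdlib.h>  // setenv and getenv (setenv is POSIX, not ISO C++)
#include <string>

// Hypothetical helper: asks the OpenMP runtime to put idle workers to sleep
// quickly instead of spinning. Intended to be called before the first
// parallel region; whether it takes effect that late depends on the runtime.
bool requestPassiveWait(int blockTimeMs) {
    assert(blockTimeMs >= 0);
    const std::string value = std::to_string(blockTimeMs);

    // Intel-specific: how long a worker spins before it goes to sleep.
    bool ok = setenv("KMP_BLOCKTIME", value.c_str(), 1) == 0;
    // Standard OpenMP: prefer sleeping over spinning while waiting.
    ok = (setenv("OMP_WAIT_POLICY", "passive", 1) == 0) && ok;
    return ok;
}
```

The equivalent from a shell would be `KMP_BLOCKTIME=0 ./benchmark`, where `./benchmark` stands for whatever the test binary is called.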

Now, what happens when the user runs OpenCL immediately after OpenMP?

OpenCL creates its own set of worker threads to process parallel tasks while the OpenMP worker threads are still spinning. OpenMP and OpenCL therefore compete for the same hardware resources, and the OpenCL threads get less time to do their job.

What can be done:

1. If you need to run OpenMP and OpenCL alternately rather than at the same time, you need to somehow force the OpenMP threads to stop polling so that the OpenCL threads can work, and vice versa. This is not a trivial task.

2. If you need to run OpenMP and OpenCL in parallel at the same time, you need to split the CPU threads into 2 sets and run OpenMP on one set and OpenCL on the other. This is also a non-trivial task.
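To make the second option a little more concrete, here is a sketch of only the bookkeeping part: dividing the logical CPUs into two disjoint sets. The names `CpuSplit`, `splitCpus` and `toCpuList` are hypothetical. Actually binding each runtime to its set is the hard part and is not shown; on the OpenMP side it might go through an affinity setting such as `GOMP_CPU_AFFINITY` (GCC) or an explicit proclist in `KMP_AFFINITY` (Intel), and on the OpenCL side possibly through device fission, but none of that is confirmed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Two disjoint sets of logical CPU ids, one per runtime.
struct CpuSplit {
    std::vector<int> openmpCpus;
    std::vector<int> openclCpus;
};

// Gives the first cpusForOpenMP logical CPUs to OpenMP and the rest to OpenCL.
CpuSplit splitCpus(int totalCpus, int cpusForOpenMP) {
    assert(totalCpus > 0);
    assert(cpusForOpenMP >= 0 && cpusForOpenMP <= totalCpus);
    CpuSplit split;
    for (int cpu = 0; cpu < totalCpus; cpu++) {
        if (cpu < cpusForOpenMP) {
            split.openmpCpus.push_back(cpu);
        } else {
            split.openclCpus.push_back(cpu);
        }
    }
    return split;
}

// Formats a set as a comma-separated list such as "0,1", the general shape
// that CPU lists take in affinity settings.
std::string toCpuList(const std::vector<int>& cpus) {
    std::string list;
    for (std::size_t k = 0; k < cpus.size(); k++) {
        if (k > 0) list += ",";
        list += std::to_string(cpus[k]);
    }
    return list;
}
```

On a four-core machine like the one in the first post, `splitCpus(4, 2)` would assign CPUs 0 and 1 to OpenMP and CPUs 2 and 3 to OpenCL.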

zhaopeng · ‎11-19-2013

Great explanation!

I have checked the OpenMP thread block time, and the default value is 0.2 seconds. If I set the value to zero, the OpenCL code performs even slightly better than the OpenMP code.

How about making OpenMP and OpenCL work together without performance loss by adjusting the OpenMP thread block time? When OpenCL code is executed immediately after OpenMP, the block time would be set to zero; otherwise, the default value would be used.
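One way this idea might be expressed in code is a small scope guard that lowers the block time for a stretch of the program and restores the previous value afterwards. This is only a sketch: `BlockTimeGuard` is a hypothetical class, and it takes the getter and setter as parameters so that it does not depend on any particular runtime. With the Intel runtime, the natural candidates would be `kmp_get_blocktime` and `kmp_set_blocktime` from its `omp.h`. Whether a change takes effect for workers that are already spinning is an assumption that would need to be verified.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: sets a temporary block time (in milliseconds) on
// construction and restores the previous value on destruction.
class BlockTimeGuard {
public:
    BlockTimeGuard(std::function<int()> getBlockTime,
                   std::function<void(int)> setBlockTime,
                   int temporaryMs)
        : set_(std::move(setBlockTime)), saved_(getBlockTime()) {
        assert(temporaryMs >= 0);
        set_(temporaryMs);
    }

    ~BlockTimeGuard() { set_(saved_); }

    // Copying would restore the saved value twice, so it is disabled.
    BlockTimeGuard(const BlockTimeGuard&) = delete;
    BlockTimeGuard& operator=(const BlockTimeGuard&) = delete;

private:
    std::function<void(int)> set_;  // restores the value in the destructor
    int saved_;                     // block time that was active before
};
```

The guard would wrap the OpenMP region that is directly followed by OpenCL work, for example `BlockTimeGuard guard(kmp_get_blocktime, kmp_set_blocktime, 0);` at the top of that scope when building against the Intel runtime.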

Thanks!

Dmitry_K_Intel · ‎11-19-2013

zhaopeng wrote:
How about making OpenMP and OpenCL work together without performance loss by adjusting the OpenMP thread block time? When OpenCL code is executed immediately after OpenMP, the block time would be set to zero; otherwise, the default value would be used.

Thank you. We will definitely consider this.