Does OpenCL on an Intel CPU device use the CPU and the HD Graphics GPU at the same time?

Chi-wai_L_ · 05-29-2014

Hello,

I benchmarked my OpenCL application on an i7-4771 machine and a Xeon E5-2643 machine, and the result shocked me: the E5-2643 took much longer to complete the test than the i7-4771. My guess is that OpenCL uses both the CPU and the HD Graphics GPU on the i7-4771, but can only use the CPU on the E5-2643, since it has no integrated HD Graphics.

I am not sure my guess is correct, so I am seeking answers here.

Thanks a lot!!

Dmitry_K_Intel · 05-29-2014

Intel OpenCL can use both the CPU and GPU devices at the same time if the application creates a shared CPU-GPU context. But it is the responsibility of the application to decide which tasks will be executed on which device.

Chi-wai_L_ · 06-03-2014

Dmitry,

Thanks for your reply.

So the i7 CPU and the i7 HD Graphics GPU must be used explicitly; the two compute units won't be shared implicitly, right?

Comparing the specs of the i7-4771 and the E5-2643, the E5-2643 is more powerful in terms of parallel computation. Is there any reason the E5-2643 would spend more time on the same task than the i7-4771?

Does the E5-2643 fully support OpenCL?

Thanks again!

Jan_Hardenbergh · 07-08-2014

I'm trying to figure out the same problem or a similar problem. I think it depends on whether it is an E5-2643 or an E5-2643 v2. The processor string will tell you.

When you run OpenCL on the E5-2643, do you get both a CPU platform and a GPU platform?

Chi-wai_L_ · 07-08-2014

Hello Jan,

When I run OpenCL, I don't get any GPU platform on the E5-2643.

After running a few tests on the i7-4771, Xeon E3-1275, Xeon E5-2643, and Xeon X3470, I am quite sure that Intel HD Graphics is the main cause of the performance difference. Results from my benchmark application on the i7-4771 and E3-1275 (which have embedded Intel HD Graphics 4600 and P3000, respectively) are much better than those on the E5-2643 and X3470 (which have no Intel HD Graphics).

Dmitry_K_Intel · 07-08-2014

Hi Jan and Chi-wai,

Sorry, I missed your questions last month. Intel exposes a single OpenCL 1.2 platform for all supported devices. You can choose the right device by calling clGetDeviceIDs() with the appropriate device_type parameter (CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU or CL_DEVICE_TYPE_ACCELERATOR). Alternatively, you can use CL_DEVICE_TYPE_ALL to get the list of all supported devices and then query each device individually using clGetDeviceInfo().

After you choose the right device, you should create a context containing either that device only or the list of devices you want to cooperate. Note: the Intel OpenCL platform supports device cooperation only if the devices share the same context. By device cooperation I mean sharing buffers/images and tracking cl_event dependencies.

As for which device is better, it depends heavily on the algorithm used and the amount of computation to be performed. This is why Intel provides multiple devices with different capabilities: choose the device that best fits your needs, or split your computations between devices to benefit from all of them.

Raghupathi_M_Intel · 07-10-2014

Alexey V. wrote:

However, when I run the corresponding C++ code (see the third section below), the calculation takes about 130 s. The same C++ code run on a Xeon E5-2630 takes 120 s.

OpenCL runs on 8 threads of the i7, so we expect OpenCL to be about 8 times faster than the C++ code. However, this test shows that the speedup is considerably larger: about 24. Where does this boost come from? Could it come from the Intel GPU even though it was not initialized? I don't have OpenCL installed on the Xeon machine, so I am not sure how this test would work on a CPU with no embedded GPU.

Alexey,

Judging from your output, there is clearly no GPU device on the Intel platform, so that is ruled out. Moreover, how are you measuring the time? Did you turn on vectorization when you built your C++ code? Your OpenCL code is probably getting its speedup not just from the multiple threads but also from vectorization.
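
On the timing question, here is a minimal sketch of one way to compare processor time with elapsed time. The names workload(), measure() and Timing are hypothetical and introduced only for illustration. The point is that std::clock() reports processor time, which on many systems is summed over all threads, while std::chrono::steady_clock reports elapsed wall-clock time, so the two can disagree for multi-threaded code.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical stand-in for the code being measured; it repeats the
// arithmetic of the posted kernel for a single element.
float workload(int steps)
{
    assert(steps > 0);
    float c = 3.0f;
    for (int i = 1; i < steps; i++)
        c = static_cast<float>(static_cast<int>(c * c) % 10000);
    return c;
}

struct Timing
{
    double cpu_seconds;   // processor time reported by std::clock()
    double wall_seconds;  // elapsed time reported by std::chrono::steady_clock
    float  result;        // returned so the compiler cannot discard the work
};

// Runs the workload once and records both clocks around it.
Timing measure(int steps)
{
    const std::clock_t cpu_begin = std::clock();
    const auto wall_begin = std::chrono::steady_clock::now();

    const float result = workload(steps);

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_begin) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_begin).count();
    t.result = result;
    return t;
}
```

Calling measure(1000000) and printing both fields shows whether the two clocks agree; for a single-threaded loop like this one they usually should be close.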

Thanks,
Raghu

Alexey_V_ · 07-11-2014

Hi Dmitry,

Could you please also help with a possibly similar question? I ran a simple test that performs 1e6 multiplications for each element of a vector with 1e4 elements (see the kernel below). The host program uses the AMD OpenCL APP SDK libraries and initializes my Intel i7-3635QM as a CPU device (see the output of clGetDeviceInfo(...,CL_DEVICE_TYPE,...) below the kernel). This OpenCL calculation, with 1e10 multiplications, takes about 5 s to complete.

However, when I run the corresponding C++ code (see the third section below), the calculation takes about 130 s. The same C++ code run on a Xeon E5-2630 takes 120 s.

OpenCL runs on 8 threads of the i7, so we expect OpenCL to be about 8 times faster than the C++ code. However, this test shows that the speedup is considerably larger: about 24. Where does this boost come from? Could it come from the Intel GPU even though it was not initialized? I don't have OpenCL installed on the Xeon machine, so I am not sure how this test would work on a CPU with no embedded GPU.

Thank you,

Alexey

-------

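__kernel void test(__global const float *a,__global float *b)
{
	// One work-item per vector element.
	int gid = get_global_id(0);
	float c;
	c = a[gid];
	// Serial dependency chain: each step uses the previous value of c.
	for (int i = 1; i < 1000000; i ++)
	{
		c = (int) (c * c) % 10000;
	}
	b[gid] = c;
}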

-------

(output)

number of platforms: 2
Platform #0
Name = 'Intel(R) OpenCL'
Vendor = 'Intel(R) Corporation'
Version = 'OpenCL 1.2 '
Profile = 'FULL_PROFILE'
Run platform #0. No GPU device available.
CL_DEVICE_NAME = Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
CL_DEVICE_TYPE = 2

Simulation time: 5.3 s

----

C++ code:

#include <iostream>
#include <time.h>
using namespace std;
int main(int argc, char** argv)
{
    double time_spent;
    clock_t begin;
    begin = clock();
    cout << "start..." << endl;
    const int ARRAY_SIZE = 1e4;
    float a[ARRAY_SIZE], b[ARRAY_SIZE];
    for (int gid = 0; gid < ARRAY_SIZE; gid++)
    {
        a[gid] = (float)gid+2;
        float c;
        c = a[gid];
        for (int i = 1; i< 2e5; i ++)
        {
            c = (int) (c * c) % 10000;
        }
        b[gid] = c;
    }
    time_spent = (double)(clock() - begin) / CLOCKS_PER_SEC;
    cout << "Simulation time: " << time_spent << " s" << endl;
    return 0;
}

Alexey_V_ · 07-11-2014

Raghu,

Thank you for your comment. I am not familiar with the concept of vectorization. Is this the right place to learn about it? What is an easy way to turn on vectorization when I build the code?

Alexey

P.S. I updated my previous post to show the entire C++ code, so you can see how I measure the time.

Dmitry_K_Intel · 07-12-2014

Hi Alexey.

Raghu is right: the second source of speedup is vectorization. You can achieve the same with C/C++ code either by manual vectorization using compiler intrinsics or by using compiler pragmas.
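
As a rough illustration of the pragma route, the sketch below reorders the posted loop so that the innermost loop runs over independent elements, which is the shape a vectorizer can work with, and marks it with the OpenMP `#pragma omp simd` directive. The function names run_scalar() and run_simd() are hypothetical. GCC and Clang honor the directive with -fopenmp or -fopenmp-simd and ignore it otherwise; whether the loop is actually vectorized, and how much it helps, depends on the compiler and target, so treat this as a sketch rather than a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward version: each element runs its whole dependency chain
// before the next element starts, as in the posted C++ loop.
void run_scalar(const std::vector<float>& a, std::vector<float>& b, int steps)
{
    assert(a.size() == b.size());
    for (std::size_t gid = 0; gid < a.size(); gid++)
    {
        float c = a[gid];
        for (int i = 1; i < steps; i++)
            c = static_cast<float>(static_cast<int>(c * c) % 10000);
        b[gid] = c;
    }
}

// Interchanged version: the step loop is outside and the element loop is
// inside, so neighbouring iterations of the inner loop are independent.
void run_simd(const std::vector<float>& a, std::vector<float>& b, int steps)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    const float* in = a.data();
    float* out = b.data();

    for (std::size_t gid = 0; gid < n; gid++)
        out[gid] = in[gid];

    for (int i = 1; i < steps; i++)
    {
        // Ask the compiler to vectorize across elements.
        #pragma omp simd
        for (std::size_t gid = 0; gid < n; gid++)
            out[gid] = static_cast<float>(static_cast<int>(out[gid] * out[gid]) % 10000);
    }
}
```

Both functions perform the same arithmetic on every element, so their outputs can be compared directly to check the reordering before timing it.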

The third source of speedup is language limitations. The OpenCL C language imposes a set of restrictions that do not exist in regular C/C++, in order to guarantee data parallelism and make all non-data-parallel accesses as explicit as possible. As a result, the OpenCL C compiler can perform additional code optimizations that are impossible for a regular C/C++ compiler. You can achieve the same optimizations with a regular C/C++ compiler by using additional compiler pragmas to tell the compiler about your assumptions.
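
One common assumption of this kind, chosen here only as an illustration and not necessarily the one meant above, is that the input and output arrays do not overlap. The sketch below states it with the `__restrict` qualifier, which is a compiler extension accepted by GCC, Clang and MSVC rather than standard C++. The function name square_mod() is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// __restrict promises the compiler that `in` and `out` never refer to the
// same memory, so it may reorder or combine loads and stores in the loop.
// Breaking that promise at a call site is undefined behavior.
void square_mod(const float* __restrict in, float* __restrict out, std::size_t n)
{
    assert(in != out);
    for (std::size_t i = 0; i < n; i++)
        out[i] = static_cast<float>(static_cast<int>(in[i] * in[i]) % 10000);
}
```

The qualifier does not change the result for valid calls; it only widens what the optimizer is allowed to assume.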

You can also use other pragma-based technologies, such as OpenMP, to achieve the same effect. I would also recommend using the Intel C/C++ compiler, as it supports all these optimizations either by default or at your explicit request, using command-line options or pragmas.
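
A minimal sketch of the OpenMP route for the multi-threading part, assuming a compiler with OpenMP enabled (for example -fopenmp with GCC or Clang). The function name run_parallel() is hypothetical. Each element is independent of the others, so the outer loop can be divided among threads much as an OpenCL CPU device divides work-items. Without the OpenMP flag the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distributes the independent per-element chains over the available threads.
void run_parallel(const std::vector<float>& a, std::vector<float>& b, int steps)
{
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    const float* in = a.data();
    float* out = b.data();

    // A signed loop index keeps the loop acceptable to older OpenMP versions.
    #pragma omp parallel for
    for (long gid = 0; gid < n; gid++)
    {
        float c = in[gid];
        for (int i = 1; i < steps; i++)
            c = static_cast<float>(static_cast<int>(c * c) % 10000);
        out[gid] = c;
    }
}
```

When timing a version like this, elapsed wall-clock time is the figure to compare against the OpenCL run, since processor time may be accumulated across all threads.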

Alexey_V_ · 07-14-2014

Dmitry, Raghu,

Thank you for your explanations. It is all clear now.

Alexey