
Profiling in OpenCL and Intel HD Graphics vs. dedicated graphics

shashwat_D_
Beginner

Hi all!
I'm new to OpenCL and want to compare the performance gain between C code and OpenCL kernels.
Could someone explain which of these two methods is the better/correct way to profile OpenCL code when comparing its performance against a C reference implementation?

1. Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime() function):

ret |= clFinish(command_queue); //Empty the queue

    getTime(&begin);
    ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL);    //Profiling Disabled.
    ret |= clFinish(command_queue);
    getTime(&end);
    g_NDRangePureExecTimeSec = elapsed_time(&begin, &end);        //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)

2. Using event profiling:

    ret = clEnqueueMarker(command_queue, &evt1);    //Empty the Queue
    ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
    ret |= clWaitForEvents(1, &evt1);
    ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
    ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
    g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE);    //nSec to Sec
    ret |= clReleaseEvent(evt1);

Furthermore, I'm not using a dedicated graphics card; I'm running the following kernel on Intel HD 4600 integrated graphics:

__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int N,
                          __constant float *W)
{
    __private int i=get_global_id(0); 
    __private int j=get_global_id(1); 
    __private int k;
    __private float a;
    __private int image_offset = N*pitch +N;
    __private int curr_pix = j*pitch + i +image_offset;
    
    // apply filter
    a  = ip_img[curr_pix-8] * W[0 ];    
    a += ip_img[curr_pix-7] * W[1 ];    
    a += ip_img[curr_pix-6] * W[2 ];    
    a += ip_img[curr_pix-5] * W[3 ];    
    a += ip_img[curr_pix-4] * W[4 ];    
    a += ip_img[curr_pix-3] * W[5 ];    
    a += ip_img[curr_pix-2] * W[6 ];    
    a += ip_img[curr_pix-1] * W[7 ];    
    a += ip_img[curr_pix-0] * W[8 ];    
    a += ip_img[curr_pix+1] * W[9 ];    
    a += ip_img[curr_pix+2] * W[10];    
    a += ip_img[curr_pix+3] * W[11];    
    a += ip_img[curr_pix+4] * W[12];    
    a += ip_img[curr_pix+5] * W[13];    
    a += ip_img[curr_pix+6] * W[14];    
    a += ip_img[curr_pix+7] * W[15];    
    a += ip_img[curr_pix+8] * W[16];
    // write output
    op_img[curr_pix] = (float)a;
}

I use similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized, vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've seen people claim gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:

 1. What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the dedicated card outperform Intel HD Graphics?
 2. Can I map the warp and thread concepts from CUDA to Intel HD Graphics (i.e., the number of threads executing in parallel)?

Help is appreciated.

1 Solution
Robert_I_Intel
Employee

Hi Shashwat,

For performance measurement, the second method, via event profiling, is preferred. I am not sure where the cubed version of CLOCK_PER_CYCLE comes from: event timestamps are reported in nanoseconds, so converting to seconds is a single division by 10^9. See my note on profiling here: https://software.intel.com/en-us/forums/topic/326724 for the profiling events supported.
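As a rough illustration (not your exact code: context, device, kernel, and global_ws are placeholders for objects you already have), event profiling looks something like this. Note the queue must be created with CL_QUEUE_PROFILING_ENABLE or the timestamps will not be recorded:

    cl_int err;
    // Profiling info is only recorded if the queue enables it.
    cl_command_queue queue = clCreateCommandQueue(context, device,
                                                  CL_QUEUE_PROFILING_ENABLE, &err);

    cl_event evt;
    err  = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws,
                                  NULL, 0, NULL, &evt);
    err |= clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;                     // reported in nanoseconds
    err |= clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                   sizeof(start), &start, NULL);
    err |= clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                   sizeof(end), &end, NULL);
    double seconds = (double)(end - start) * 1e-9;   // ns -> s: one division
    clReleaseEvent(evt);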

If you want to use the first method, you need to do the following: warm up the GPU with ~30 or so invocations of the same kernel, then time the wall clock over about 100 kernel launches. See the following examples of how this is done: https://software.intel.com/en-us/INDE-OpenCL-Sobel and https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels
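In outline, the approach looks like this (a sketch reusing the getTime()/elapsed_time() helpers from your method 1; error handling omitted):

    for (int i = 0; i < 30; i++)    // warm-up: let frequency and caches settle
        ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL);
    ret |= clFinish(command_queue);

    getTime(&begin);
    for (int i = 0; i < 100; i++)   // time a batch, not a single launch
        ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL);
    ret |= clFinish(command_queue); // wait for all 100 launches to complete
    getTime(&end);
    g_NDRangePureExecTimeSec = elapsed_time(&begin, &end) / 100.0;  // per-kernel average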

Your gain-magnitude question is really complex: it really depends on how badly your single-threaded CPU code is written :) See the following article debunking the 100x GPU-vs-CPU claim: http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf When comparing against the CPU, keep two things in mind: parallelization (are you using all CPU cores?) and vectorization (are you using the full vector-processing firepower of each core?).
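For example, a fair CPU baseline for your row filter would use both. The following is an illustrative sketch, not your code (filter_rows_cpu is a hypothetical name; compile with OpenMP enabled, e.g. -fopenmp, and let the compiler vectorize the inner loop):

    /* Same 17-tap row filter as the kernel below, threaded and vectorizable. */
    void filter_rows_cpu(const float *ip, float *op, int width, int height,
                         int pitch, int N, const float *W)
    {
        int image_offset = N * pitch + N;
        #pragma omp parallel for                 /* parallelization: all cores */
        for (int j = 0; j < height; j++) {
            for (int i = 0; i < width; i++) {
                int curr = j * pitch + i + image_offset;
                float a = 0.0f;
                #pragma omp simd reduction(+:a)  /* vectorization: SIMD lanes */
                for (int k = -8; k <= 8; k++)
                    a += ip[curr + k] * W[k + 8];
                op[curr] = a;
            }
        }
    }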

Now, to the comparison of Intel integrated graphics with NVidia or AMD cards. You really need to consider whether you are comparing apples to apples: processors with Intel integrated graphics come in multiple SKUs, with 10, 20, and 40 EUs for the fourth-generation processors and 12, 24, and 48 EUs for the fifth-generation processors (and I can safely say there will be more EUs in the future :)). The graphics in these processors is designed to compete with the entry-level graphics cards from NVidia and AMD released at the same time. If you compare the OpenCL performance of Intel integrated graphics with comparable hardware from NVidia and AMD, I believe we are on par or better performance-wise, and much more cost-effective and power-effective. You can read more about the Compute Architecture of Intel Processor Graphics here: 5th gen, 4th gen. Note that you almost never transfer data to the GPU in the integrated-graphics case, so you save time on data transfers.

You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EUs). Each EU runs 7 hardware threads, and each thread is capable of executing SIMD8, SIMD16, or SIMD32 instructions, with each SIMD lane corresponding to one work item (WI) in OpenCL speak. SIMD16 is typical for simple kernels like the one you are trying to optimize, so we are talking about 20*7*16 = 2240 work items executing in parallel. Keep in mind that each work item can also process vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them; see the sketch below. I hope this also helps you compare with NVidia's offerings.
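A float4 variant of your row filter might look like the following. This is a sketch only (filter_rows4 is an illustrative name): it assumes width is a multiple of 4 and would be launched with global_ws[0] = width/4; vload4/vstore4 handle the unaligned accesses the stencil needs.

    __kernel void filter_rows4(__global const float *ip_img,
                               __global float *op_img,
                               int width, int height, int pitch, int N,
                               __constant float *W)
    {
        int i = get_global_id(0) * 4;        // each work item computes 4 pixels
        int j = get_global_id(1);
        int curr_pix = j * pitch + i + N * pitch + N;

        float4 a = (float4)(0.0f);
        for (int k = -8; k <= 8; k++)        // same 17 taps as the scalar kernel
            a += vload4(0, ip_img + curr_pix + k) * W[k + 8];

        vstore4(a, 0, op_img + curr_pix);
    }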

 

 
