Manish_K_
Beginner
113 Views

Memcpy performance using opencl kernel

Hi,

I have written a simple memcpy kernel, shown below, and I am analyzing its performance on the GPU using VTune.

__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy: one pixel per work-item */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));

    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;

    uint4 pixel4 = read_imageui(YIn, smp, coord_src);

    write_imageui(YOut, coord_src, pixel4);
}

I observe the following statistics for the execution units:

    EU Array

    Active    Stalled    Idle
    24.6%     18.1%      57.2%

Also, the number of computing threads started is 24,525,023, which is quite high. I don't know how to reduce the number of threads started here and thereby improve performance.

I can't work out how to improve its performance. I have gone through this article on optimizations: https://int2-software.intel.com/en-us/articles/optimizing-simple-opencl-kernels. All the optimizations described there apply to buffers, where 16 elements can be read from memory in one go. In my case I am using texture memory reads through the image APIs, and I don't know how to increase performance with them.
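One generic way to lower the number of threads started is to have each work-item copy several pixels instead of one. The sketch below only does the host-side arithmetic for that idea. The helper names and the `px_per_item` parameter are hypothetical, not part of the posted code, and whether fewer, heavier work-items actually help on this GPU would have to be measured in VTune.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division rounded up, so a partial group of pixels at the
 * right edge of a row still gets a work-item. */
size_t div_round_up(size_t n, size_t d)
{
    assert(d > 0);
    return (n + d - 1) / d;
}

/* Work-items launched per frame when each work-item copies
 * px_per_item consecutive pixels along x (hypothetical parameter). */
size_t work_items_per_frame(size_t width, size_t height, size_t px_per_item)
{
    return div_round_up(width, px_per_item) * height;
}
```

For a 1920x1080 frame, `work_items_per_frame(1920, 1080, 1)` gives 2,073,600 work-items, and `work_items_per_frame(1920, 1080, 4)` gives 518,400. The kernel would then need a small loop over its pixels, and the first global size dimension would shrink to `div_round_up(width, 4)`.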

3 Replies
Manish_K_
Beginner
113 Views

Hi Robert Loffe,

Could you please help here?

Thanks!

Robert_I_Intel
Employee
113 Views

Hi Manish,

Sorry for the late reply: I was out of the office yesterday. Could you please let me know the following:

1. What is your OS?

2. What is the processor you are working on?

3. Which version of the graphics driver do you use?

4. What is the global size for your kernel?

5. What, if any, is your local size?

6. Which version of VTune are you using?

Also, if it is possible to attach the full source code of your sample, please do so.

Manish_K_
Beginner
113 Views

Hi Robert,

I am using the configuration and code below:

1. What is your OS?  Windows 7 64 bit

2. What is the processor you are working on?  i7 4770 CPU @3.4 GHz

3. Which version of the graphics driver do you use? 10.18.14.4222

4. What is the global size for your kernel? 1920x1080

5. What, if any, is your local size? I have set it to NULL

6. What is the version of the Vtune that you are using? Vtune 2015 update 2

Here are the parts of the code which I am using to run this kernel:

/* Setting the params */

    img_fmt.image_channel_order = CL_R;
    img_fmt.image_channel_data_type = CL_UNSIGNED_INT8;

    p->origin[0] = 0;  // x offset into the image
    p->origin[1] = 0;  // y offset into the image
    p->origin[2] = 0;  // z offset; must be 0 for a 2D image

    p->region[0] = width;
    p->region[1] = height;
    p->region[2] = 1;  // depth; for a 2D image this is set to 1

    p->global_ws[0][0] = width;
    p->global_ws[0][1] = height;

  /* Ingest frame to global memory on the device (non-blocking write) */
  ret = clEnqueueWriteImage(p->command_queue, p->memobj_in_luma, CL_FALSE, p->origin,
                             p->region, 0, 0, (void *) in, 0, NULL, &evt[0]);

  /* Enqueue the kernel; it waits on the write event, local size is NULL */
  ret = clEnqueueNDRangeKernel(p->command_queue, p->deint_kernel_y, 2, 0,
                               p->global_ws[0], NULL, 1, &evt[0], &evt[3]);

  /* Read back from device (non-blocking: out is valid only once the read has completed) */
  ret = clEnqueueReadImage(p->command_queue, p->memobj_out_luma, CL_FALSE,
                           p->origin, p->region, 0, 0, (void *)(out), 0, 0, 0);

/* KERNEL CODE */

__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy: one pixel per work-item */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));

    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;

    uint4 pixel4 = read_imageui(YIn, smp, coord_src);

    write_imageui(YOut, coord_src, pixel4);
}
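
The local size is passed as NULL above, so the runtime picks the work-group shape. If an explicit local size is tried instead, OpenCL 1.x requires every global dimension to be a multiple of the matching local dimension, and 1080 is not a multiple of 16. Below is a minimal sketch of the usual padding step. The helper names are hypothetical, and the 16x16 shape used in the note after it is an arbitrary value for illustration, not a recommendation for this GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of `multiple`. */
size_t round_up_to_multiple(size_t value, size_t multiple)
{
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Fill global_ws so that each dimension is a multiple of the
 * corresponding entry of local_ws (both are 2-element arrays). */
void padded_global_size(size_t width, size_t height,
                        const size_t local_ws[2], size_t global_ws[2])
{
    global_ws[0] = round_up_to_multiple(width, local_ws[0]);
    global_ws[1] = round_up_to_multiple(height, local_ws[1]);
}
```

With a 16x16 local size, a 1920x1080 frame gets a padded global size of 1920x1088. The kernel then needs a bounds check against `get_image_width` and `get_image_height` before calling `write_imageui`, because the work-items in the extra rows fall outside the image.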