Hi,
I have written a simple memcpy kernel and am analyzing its performance on the GPU using VTune. The kernel is shown below:
__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;
    uint4 pixel4 = read_imageui(YIn, smp, coord_src);
    write_imageui(YOut, coord_src, pixel4);
}
I observe the below stats for Execution units:
EU Array
Active Stalled Idle
24.6% 18.1% 57.2%
Also, the number of computing threads started is 24,525,023, which seems quite high. I don't know how to reduce the number of threads started here in order to improve performance.
I can't work out how to improve the kernel's performance. I have gone through this article on optimizations: https://int2-software.intel.com/en-us/articles/optimizing-simple-opencl-kernels. All the optimizations in that article apply to buffers, where 16 elements can be read from memory in one go. Since I am using texture memory reads through the image APIs, I don't know how to increase the performance.
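To get a rough feel for where a thread count like this comes from, the arithmetic can be sketched in C. This is only an illustration under stated assumptions: the 1920x1080 frame size used in the note below, the SIMD width, and the `pixels_per_item` parameter are hypothetical inputs, and the function names are made up for this sketch. None of these values are reported by VTune.

```c
#include <stddef.h>

/* Work-items launched by one NDRange covering a width x height image when
 * each work-item copies pixels_per_item horizontally adjacent pixels.
 * pixels_per_item is a hypothetical tuning parameter; 1 matches the
 * one-pixel-per-work-item kernel above. Must be non-zero. */
size_t work_items_per_launch(size_t width, size_t height,
                             size_t pixels_per_item)
{
    /* Round up so a partial group of pixels at the right edge is covered. */
    size_t items_x = (width + pixels_per_item - 1) / pixels_per_item;
    return items_x * height;
}

/* Rough number of hardware threads for one launch, ASSUMING the compiler
 * packs simd_width work-items into each thread (for example 8, 16 or 32).
 * The real width is chosen by the compiler and is not known here. */
size_t hw_threads_per_launch(size_t work_items, size_t simd_width)
{
    return (work_items + simd_width - 1) / simd_width;
}
```

For example, `work_items_per_launch(1920, 1080, 1)` gives 2,073,600 work-items, and with an assumed SIMD width of 16 that is 129,600 threads per launch. With four pixels per work-item the same frame needs 518,400 work-items and 32,400 threads. Whether fewer, heavier threads actually run faster on this hardware would still have to be measured.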
Hi Robert Loffe,
Could you please help here?
Thanks!
Hi Manish,
Sorry for the late reply: I was out of the office yesterday. Could you please let me know the following:
1. What is your OS?
2. What is the processor you are working on?
3. Which version of the graphics driver do you use?
4. What is the global size for your kernel?
5. What, if any, is your local size?
6. Which version of VTune are you using?
Also, if it is possible to attach the full source code of your sample, please do so.
Hi Robert,
I am using the configuration and code below:
1. What is your OS? Windows 7 64 bit
2. What is the processor you are working on? i7 4770 CPU @3.4 GHz
3. Which version of the graphics driver do you use? 10.18.14.4222
4. What is the global size for your kernel? 1920x1080
5. What, if any, is your local size? I have set it to NULL
6. Which version of VTune are you using? VTune 2015 Update 2
Here are the parts of the code which I am using to run this kernel:
/*Setting the params */
img_fmt.image_channel_order = CL_R;
img_fmt.image_channel_data_type = CL_UNSIGNED_INT8;
p->origin[0] = 0; // x offset of the region (start at the left edge)
p->origin[1] = 0; // y offset of the region (start at the top edge)
p->origin[2] = 0; // z offset; must be 0 for a 2D image
p->region[0] = width;
p->region[1] = height;
p->region[2] = 1; // For a 2D image this is set to 1
p->global_ws[0][0] = width;
p->global_ws[0][1] = (height);
/* ingest frame to global memory on the device: */
ret = clEnqueueWriteImage(p->command_queue, p->memobj_in_luma, CL_FALSE, p->origin, \
p->region,0,0,(void *) in, 0, NULL, &evt[0]);
/*enqueue the kernel*/
ret = clEnqueueNDRangeKernel(p->command_queue, p->deint_kernel_y, 2, 0, \
p->global_ws[0], NULL, 1, &evt[0], &evt[3]);
/* Read back from device*/
ret = clEnqueueReadImage(p->command_queue, p->memobj_out_luma, CL_FALSE, \
p->origin,p->region,0, 0,(void *)(out),0, 0, 0);
/* KERNEL CODE */
__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;
    uint4 pixel4 = read_imageui(YIn, smp, coord_src);
    write_imageui(YOut, coord_src, pixel4);
}
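Because the local size is passed as NULL above, the runtime chooses the work-group size. If an explicit local size were tried instead, OpenCL 1.x requires each global dimension to be a multiple of the corresponding local dimension, so the global size would have to be padded, and the kernel would then need a bounds check before `write_imageui`. A minimal sketch of the padding step, assuming a hypothetical helper named `round_up_global` that is not part of the code above:

```c
#include <stddef.h>

/* Round a global work size up to the next multiple of the local work size.
 * local must be non-zero. Hypothetical helper, for illustration only. */
size_t round_up_global(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

With an assumed 16x16 work-group, a width of 1920 stays at 1920 while a height of 1080 is padded to 1088, so work-items in the last 8 rows would have to return early. Whether any explicit local size beats the runtime's default for this kernel is something only a measurement can show.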