Hi,
Do we have any texture memory on the Intel HD 4600 or on other Iris series GPUs?
If yes, what is the size of this memory, and how can I analyse this in VTune to make sure I am fully utilizing the texture memory?
Regards,
Manish
Hi Manish,
Since your question is related to VTune, I would recommend posting it in the VTune forum to get a quicker answer.
Thanks,
-Surbhi
Hi Manish,
We don't have dedicated texture memory in integrated graphics; what we have is the L3 cache. Its size on HD 4600 is 256 KB (https://software.intel.com/sites/default/files/managed/f3/13/Compute_Architecture_of_Intel_Processor_Graphics_Gen7dot5_Aug2014.pdf).
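If it helps, here is a minimal host-side sketch (my own illustration, not from any Intel sample) that asks the OpenCL runtime what it reports for the GPU cache and for image/texture support. Note that CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is the generic OpenCL cache query and may not map one-to-one onto the L3 described in that document.

/* Minimal sketch: query cache size and image (texture) support for the first GPU device.
 * Error handling is omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong cache_size = 0;
    cl_bool images = CL_FALSE;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                    sizeof(cache_size), &cache_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT,
                    sizeof(images), &images, NULL);

    printf("Global memory cache size: %llu bytes\n", (unsigned long long)cache_size);
    printf("Image (texture) support : %s\n", images ? "yes" : "no");
    return 0;
}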
In VTune, use the Computing Task Purpose / Source Computing Task (GPU) grouping and look at the Sampler numbers. Take a look at this article first: https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-performance-analysis-on-intel-hd-graphics
And here is a good video from Julia Fedorova, the creator of GPU analysis in Vtune: https://www.youtube.com/watch?v=cIChYU014u8
Maybe the following diagram can help (it is for a newer generation of chips, but the idea is similar):
Hi Robert,
Thanks for the information.
I watched Julia's video and learned a few things from it.
However, I am not able to grasp the concept shown at 8:14 in the video.
She explains changing the memory access pattern to get better performance.
How many pixels were being processed per work item before the change, and how many are processed after it?
Could you please explain it in more detail?
Regards,
Manish
Manish,
I think the concept Julia is trying to explain is covered best in my videos on simple OpenCL kernel optimizations: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels . The relevant sample zip files are at the bottom of that article (Sobel.zip and Modulate.zip).
The samples start by processing one byte (uchar) per work item, then progress to processing sixteen bytes (uchar16) per work item, and finally to processing a 16 by 16 byte tile per work item by looping 16 times over the uchar16 processing.
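Roughly, that progression looks like the sketch below. This is a simplified illustration using a plain per-byte scale operation (in the spirit of the Modulate sample), not the actual sample code.

/* v1: one uchar per work item -- global size = width * height */
__kernel void modulate_v1(__global const uchar *in, __global uchar *out, float factor)
{
    int i = get_global_id(0);
    out[i] = convert_uchar_sat(convert_float(in[i]) * factor);
}

/* v2: sixteen uchars per work item -- 16x fewer work items */
__kernel void modulate_v2(__global const uchar *in, __global uchar *out, float factor)
{
    int i = get_global_id(0);
    uchar16 v = vload16(i, in);
    vstore16(convert_uchar16_sat(convert_float16(v) * factor), i, out);
}

/* v3: a 16 x 16 byte tile per work item -- loop over 16 rows of uchar16.
 * Assumes rowPitch (bytes per image row) is a multiple of 16. */
__kernel void modulate_v3(__global const uchar *in, __global uchar *out,
                          float factor, int rowPitch)
{
    int x = get_global_id(0);          /* tile column, 16 bytes wide */
    int y = get_global_id(1) * 16;     /* first row of this tile     */
    for (int r = 0; r < 16; r++) {
        int idx = ((y + r) * rowPitch) / 16 + x;   /* vload16 offset in 16-byte units */
        uchar16 v = vload16(idx, in);
        vstore16(convert_uchar16_sat(convert_float16(v) * factor), idx, out);
    }
}

With v3, the global NDRange shrinks to (width / 16, height / 16) work items.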
Two basic ideas here:
1) Decrease the number of work items to a manageable amount. Typically, 16 work items pack into a hardware thread and there are 7 threads per execution unit (EU); so if you have, say, 20 EUs in your Intel GPU, your global size should be roughly 8-12 times the available hardware resources, i.e. ~8-12 * 16 wi/thread * 7 threads/EU * 20 EUs (see the worked example after this list). Keep in mind that a kernel could also be compiled SIMD8 (heavy use of private memory) or SIMD32 (really short kernels), so the optimal global size will adjust accordingly.
2) Our architecture works best when each work item reads ~16 bytes at a time, so the best data types to use are uchar16, uint4, int4, float4 - you get the idea. Sometimes it is worth processing several vector elements in a loop, e.g. 4, 8, or 16 of them: how big the loop should be, and whether it is necessary at all, depends on how much computation you do between reading and writing. You may become compute bound, in which case you should switch to shorter vectors and smaller loops, or even avoid loops altogether.
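As a back-of-the-envelope illustration of idea 1), here is the arithmetic from the example above spelled out (the 20-EU part and the 10x oversubscription factor are just the illustrative numbers from this post, not values queried from real hardware):

/* Illustrative global-size estimate for a SIMD16 kernel on a hypothetical 20-EU GPU. */
#include <stdio.h>

int main(void)
{
    int eus              = 20;   /* execution units in the example GPU        */
    int threads_per_eu   = 7;    /* hardware threads per EU                   */
    int wi_per_thread    = 16;   /* work items per thread for a SIMD16 kernel */
    int oversubscription = 10;   /* ~8-12x recommended above                  */

    int global_size = oversubscription * wi_per_thread * threads_per_eu * eus;
    printf("Suggested global size: ~%d work items\n", global_size);   /* 22400 */
    return 0;
}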
Manish,
Correct! But remember, since each work item processes a tile of 16 by 16 pixels, the actual size of your image is 3584 by 1600. Again, these are rules of thumb: the best global size depends on your kernel, on how much data you process in one work item, and on the size of your input data.
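To spell out that arithmetic: with a 16 x 16 pixel tile per work item, a 3584 x 1600 image corresponds to a global NDRange of 3584 / 16 x 1600 / 16 = 224 x 100 work items.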
Manish,
In my Sobel example, each work item processes a 16 by 16 pixel tile at the end of all the optimizations - that is not a maximum, it is simply what I found performant. If you process only 16 bytes of data, you might be under-utilizing the work item: it really depends on the size of your input data and on the amount of computation you are doing on those 16 bytes.
To test the Sobel filter, I used the following parameters: 10 gpu intel 2048 2048 show_CL
You can run the sample under VTune yourself; I don't have VTune traces at hand.
Robert,
Could you please provide the .ppm file you are using?
I cannot find a download link for it.
Thanks.
Thanks, Robert, for all your support so far.
I was able to run your code with a .ppm file. Could you please tell me the maximum fps number you are getting with
your kernel "Sobel_v6_uchar16_to_float16_vload_16_unroll"? I am getting something like 428644 for an image size of 1080x720. Could you please let me know whether this is right?
Manish,
With 1080 by 720 things start failing for some reason. Try 1024 by 1024:
Test Image Size: 1024 x 1024
Iterations, each test: 30

Sobel                                             Time/NDrng    StdDev-Time   Speedup   Estimated-BW   StdDev-BW   Pixel-TPut
Name                                              [millisecs]       [%]                   [GB/sec]        [%]      [Gpix/sec]
-----------------------------------------------   -----------   -----------   -------   ------------   ---------   ----------
Sobel_v1_uchar                                       0.768           6.8         1.0        1.370          6.3        1.365
Sobel_v2_uchar16                                     0.280           4.9         2.7        3.747          4.8        3.738
Sobel_v3_uchar16_to_float16                          0.192          43.5         4.0        5.953         20.5        5.466
Sobel_v4_uchar16_to_float16_vload                    0.187          19.7         4.1        5.775         14.8        5.616
Sobel_v5_uchar16_to_float16_vload_16                 0.125           5.1         6.1        8.390          4.9        8.370
Sobel_v6_uchar16_to_float16_vload_16_unroll          0.138           8.6         5.6        7.663          7.1        7.618
Sobel_v7_uchar16_to_float16_vload_16_unroll_mad      0.155          10.9         4.9        6.821          9.6        6.753
-------done--------------------------------------------------------
Thanks Robert.
Could you please let me know which GPU series or model you used? I am using an HD 4600.
Also, I have given 1048x1048 and 2048x2048 files to this filter, and the code crashes.
However, I can see the smoothaveragetime numbers in debug mode for each kernel.
Hi Manish,
The code should work fine on HD 4600 graphics. Try giving it power-of-2 sizes, e.g. 256x256, 512x512, 1024x1024, etc. Start with small sizes first.
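A quick way to sanity-check the sizes you pass in (an illustrative helper, not part of the Sobel sample):

/* Returns 1 if n is a power of two: such values have exactly one bit set. */
#include <stdio.h>

static int is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

int main(void)
{
    printf("1048: %d\n", is_power_of_two(1048));   /* 0 -> not a power of two       */
    printf("1024: %d\n", is_power_of_two(1024));   /* 1 -> safe size for the sample */
    return 0;
}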