Hi,
Do we have any texture memory on the Intel HD 4600 or on other Iris series GPUs?
If yes, what is the size of this memory, and how can I analyse this in VTune to make sure I am fully utilizing the texture memory?
Regards,
Manish
Hi Manish,
Since your question is related to VTune, I would recommend posting it in the VTune forum to get a quicker answer.
Thanks,
-Surbhi
Hi Manish,
We don't have dedicated texture memory in integrated graphics; what we have is the L3 cache. Its size on HD 4600 is 256 KB (https://software.intel.com/sites/default/files/managed/f3/13/Compute_Architecture_of_Intel_Processor_Graphics_Gen7dot5_Aug2014.pdf).
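If it helps, here is a minimal host-side sketch (my own illustration, not from any Intel sample) that asks the OpenCL runtime what it reports for the GPU cache and for image/texture support. Note that CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is the generic OpenCL cache query and may not map one-to-one onto the L3 described in that document.

/* Minimal sketch: query cache size and image (texture) support for the first GPU device.
 * Error handling is omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong cache_size = 0;
    cl_bool images = CL_FALSE;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                    sizeof(cache_size), &cache_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT,
                    sizeof(images), &images, NULL);

    printf("Global memory cache size: %llu bytes\n", (unsigned long long)cache_size);
    printf("Image (texture) support : %s\n", images ? "yes" : "no");
    return 0;
}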
In VTune, use the Computing Task Purpose / Source Computing Task (GPU) grouping and look at the Sampler numbers. Take a look at this article first: https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-performance-analysis-on-intel-hd-graphics
And here is a good video from Julia Fedorova, the creator of GPU analysis in Vtune: https://www.youtube.com/watch?v=cIChYU014u8
Maybe the following diagram can help (it is for a newer generation of chips, but the idea is similar):
Hi Robert,
Thanks for the information.
I watched Julia's video and learned a few things from it.
However, I am not able to grasp the concept shown at 8:14 in the video.
She explains changing the memory access pattern to get better performance.
How many pixels were being processed per work item before the change, and how many are processed after it?
Could you please explain it in more detail?
Regards,
Manish
Manish,
I think the concept Julia is trying to explain is covered best in my videos on simple OpenCL kernel optimizations: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels . The relevant sample zip files are at the bottom of that article (Sobel.zip and Modulate.zip).
The samples start by processing one byte (uchar) per work item, then progress to processing sixteen bytes (uchar16) per work item, and finally to processing a 16 by 16 byte tile per work item by looping 16 times over the uchar16 processing.
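Roughly, that progression looks like the sketch below. This is a simplified illustration using a plain per-byte scale operation (in the spirit of the Modulate sample), not the actual sample code.

/* v1: one uchar per work item -- global size = width * height */
__kernel void modulate_v1(__global const uchar *in, __global uchar *out, float factor)
{
    int i = get_global_id(0);
    out[i] = convert_uchar_sat(convert_float(in[i]) * factor);
}

/* v2: sixteen uchars per work item -- 16x fewer work items */
__kernel void modulate_v2(__global const uchar *in, __global uchar *out, float factor)
{
    int i = get_global_id(0);
    uchar16 v = vload16(i, in);
    vstore16(convert_uchar16_sat(convert_float16(v) * factor), i, out);
}

/* v3: a 16 x 16 byte tile per work item -- loop over 16 rows of uchar16.
 * Assumes rowPitch (bytes per image row) is a multiple of 16. */
__kernel void modulate_v3(__global const uchar *in, __global uchar *out,
                          float factor, int rowPitch)
{
    int x = get_global_id(0);          /* tile column, 16 bytes wide */
    int y = get_global_id(1) * 16;     /* first row of this tile     */
    for (int r = 0; r < 16; r++) {
        int idx = ((y + r) * rowPitch) / 16 + x;   /* vload16 offset in 16-byte units */
        uchar16 v = vload16(idx, in);
        vstore16(convert_uchar16_sat(convert_float16(v) * factor), idx, out);
    }
}

With v3, the global NDRange shrinks to (width / 16, height / 16) work items.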
Two basic ideas here:
1) Decrease the number of work items to a manageable amount. Typically, 16 work items pack into a hardware thread and there are 7 threads per execution unit (EU); so if you have, say, 20 EUs in your Intel GPU, your global size should be roughly 8-12 times the available hardware resources, i.e. ~8-12 * 16 wi/thread * 7 threads/EU * 20 EUs (see the worked example after this list). Keep in mind that a kernel could also be compiled SIMD8 (heavy use of private memory) or SIMD32 (really short kernels), so the optimal global size will adjust accordingly.
2) Our architecture works best when each work item reads ~16 bytes at a time, so the best data types to use are uchar16, uint4, int4, float4 - you get the idea. Sometimes it is worth processing several vector elements in a loop, e.g. 4, 8, or 16 of them: how big the loop should be, and whether it is necessary at all, depends on how much computation you do between reading and writing. You may become compute bound, in which case you should switch to shorter vectors and smaller loops, or even avoid loops altogether.
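As a back-of-the-envelope illustration of idea 1), here is the arithmetic from the example above spelled out (the 20-EU part and the 10x oversubscription factor are just the illustrative numbers from this post, not values queried from real hardware):

/* Illustrative global-size estimate for a SIMD16 kernel on a hypothetical 20-EU GPU. */
#include <stdio.h>

int main(void)
{
    int eus              = 20;   /* execution units in the example GPU        */
    int threads_per_eu   = 7;    /* hardware threads per EU                   */
    int wi_per_thread    = 16;   /* work items per thread for a SIMD16 kernel */
    int oversubscription = 10;   /* ~8-12x recommended above                  */

    int global_size = oversubscription * wi_per_thread * threads_per_eu * eus;
    printf("Suggested global size: ~%d work items\n", global_size);   /* 22400 */
    return 0;
}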
Manish,
Correct! But remember, since each work item processes a tile of 16 by 16 pixels, the actual size of your image is 3584 by 1600. Again, these are rules of thumb: the best global size depends on your kernel, on how much data you process in one work item, and on the size of your input data.
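To spell out that arithmetic: with a 16 x 16 pixel tile per work item, a 3584 x 1600 image corresponds to a global NDRange of 3584 / 16 x 1600 / 16 = 224 x 100 work items.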
Manish,
In my Sobel example, each work item processes a 16 by 16 pixel tile at the end of all the optimizations - that is not a maximum, it is simply what I found performant. If you process only 16 bytes of data, you might be under-utilizing the work item: it really depends on the size of your input data and on the amount of computation you are doing on those 16 bytes.
To test the Sobel filter, I used the following parameters: 10 gpu intel 2048 2048 show_CL
You can run the sample under VTune yourself; I don't have VTune traces at hand.
Robert,
Could you please provide the .ppm file you are using?
I cannot find a download link for it.
Thanks.
Thanks, Robert, for all your support so far.
I was able to run your code with a .ppm file. Could you please tell me the maximum fps number you are getting with
your kernel "Sobel_v6_uchar16_to_float16_vload_16_unroll"? I am getting something like 428644 for an image size of 1080x720. Could you please let me know whether this is right?
Manish,
With 1080 by 720 things start failing for some reason. Try 1024 by 1024:
Test Image Size: 1024 x 1024
Iterations, each test: 30

Sobel                                             Time/NDrng    StdDev-Time   Speedup   Estimated-BW   StdDev-BW   Pixel-TPut
Name                                              [millisecs]       [%]                   [GB/sec]        [%]      [Gpix/sec]
-----------------------------------------------   -----------   -----------   -------   ------------   ---------   ----------
Sobel_v1_uchar                                       0.768           6.8         1.0        1.370          6.3        1.365
Sobel_v2_uchar16                                     0.280           4.9         2.7        3.747          4.8        3.738
Sobel_v3_uchar16_to_float16                          0.192          43.5         4.0        5.953         20.5        5.466
Sobel_v4_uchar16_to_float16_vload                    0.187          19.7         4.1        5.775         14.8        5.616
Sobel_v5_uchar16_to_float16_vload_16                 0.125           5.1         6.1        8.390          4.9        8.370
Sobel_v6_uchar16_to_float16_vload_16_unroll          0.138           8.6         5.6        7.663          7.1        7.618
Sobel_v7_uchar16_to_float16_vload_16_unroll_mad      0.155          10.9         4.9        6.821          9.6        6.753
-------done--------------------------------------------------------
Thanks Robert.
Could you please let me know which GPU series or model you used? I am using an HD 4600.
Also, I have given 1048x1048 and 2048x2048 files to this filter, and the code crashes.
However, I can see the smoothaveragetime numbers in debug mode for each kernel.
Hi Manish,
The code should work fine on HD 4600 graphics. Try giving it power-of-2 sizes, e.g. 256x256, 512x512, 1024x1024, etc. Start with small sizes first.
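A quick way to sanity-check the sizes you pass in (an illustrative helper, not part of the Sobel sample):

/* Returns 1 if n is a power of two: such values have exactly one bit set. */
#include <stdio.h>

static int is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

int main(void)
{
    printf("1048: %d\n", is_power_of_two(1048));   /* 0 -> not a power of two       */
    printf("1024: %d\n", is_power_of_two(1024));   /* 1 -> safe size for the sample */
    return 0;
}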