I'm experimenting with the sample applications found here: https://software.intel.com/en-us/articles/optimizing-simple-opencl-kernels. In particular, I'm looking at the simple modulate example. I'm seeing results that differ from the tutorial, and since I'm still very green I'm hoping to treat this as a learning exercise; even just having a discussion should help me flesh out my understanding.
First, here is a snapshot of the VTune Analyzer's output from just running the Modulate sample (for simplicity's sake, I just uploaded it to my local web server as it's a large image): http://www.ben-rush.net/sample.PNG.
The first thing I notice is that the fastest implementation is Modulate_v2_uchar16. In fact, it beats the version that converts to float16 and unrolls the loop by almost a factor of 2. That isn't what I expected, given that the tutorial indicates the vload16/vstore16 calls should be optimized for reading and writing across cache lines, and I'd also expect loop unrolling to improve performance. Something is amiss.
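For reference, here is roughly what I understand the first three variants to look like. This is my own paraphrase for discussion, not the actual sample source from the article; the "_to_float16" suffix only appears from v4 onward, so the real v1-v3 may not convert to float at all, and the exact signatures and arithmetic may differ. The point is the access pattern:

```c
// Paraphrased sketches of the kernel variants (names follow the article,
// bodies are my approximation, not the sample source).

// v1: one pixel per work-item.
__kernel void Modulate_v1_uchar(__global const uchar* src,
                                __global uchar* dst,
                                float factor)
{
    int i = get_global_id(0);
    dst[i] = convert_uchar_sat(convert_float(src[i]) * factor);
}

// v2: sixteen pixels per work-item via a uchar16 pointer.
__kernel void Modulate_v2_uchar16(__global const uchar16* src,
                                  __global uchar16* dst,
                                  float factor)
{
    int i = get_global_id(0);
    dst[i] = convert_uchar16_sat(convert_float16(src[i]) * factor);
}

// v3: same width, but explicit vload16/vstore16 from a uchar pointer.
__kernel void Modulate_v3_uchar16_vload(__global const uchar* src,
                                        __global uchar* dst,
                                        float factor)
{
    int i = get_global_id(0);
    float16 f = convert_float16(vload16(i, src)) * factor;
    vstore16(convert_uchar16_sat(f), i, dst);
}
```

Going by the names, v4 through v6 then add the explicit conversion to float16, process multiple uchar16 blocks per work-item, and unroll the loop.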
The second thing I notice is that the GPU Compute Threads Dispatch metric is very high for the Modulate_v1_uchar kernel but low for the rest. For reference, this is the order in which the kernels were executed (I sorted the table at the top by GPU time, so it may look confusing):
Modulate_v1_uchar
Modulate_v2_uchar16
Modulate_v3_uchar16_vload
Modulate_v4_uchar16_vload_to_float16
Modulate_v5_uchar16_vload_to_float16_16
Modulate_v6_uchar16_vload_to_float16_16_unroll
So that explains why Modulate_v1_uchar is slow, but I can't see why the rest would be almost as slow. As I start developing real-world applications, I'd like to know how to look at the VTune output and understand why my code is performing slowly. In this particular instance, looking at these examples, I cannot figure out why Modulate_v6_uchar16_vload_to_float16_16_unroll is almost as slow as Modulate_v1_uchar.
It also concerns me that so many EUs appear stalled and/or idle in all of these examples.
I guess I just don't know how to diagnose why things are behaving the way they are, and would love some insight from an expert.
I'm running on an Intel Core i7-6700K @ 4 GHz (Skylake).
Ok, Ben, the kernels I presented were written and optimized for a Haswell GT3e processor. You have a Skylake GT2 part, which lacks the EDRAM available on GT3e. That means that instead of 2K by 2K images, which fit nicely in EDRAM's 128 MB, you need images that will fit in the LLC, of which you have 8 MB. Since 2K by 2K is 4 MB for the input and 4 MB for the output, you won't fit in the LLC, so your bandwidth goes down: you end up accessing data from regular DRAM. You could try 1K by 1K (1024 x 1024) or even 1.5K by 1.5K (1536 x 1536) and see what kind of results you get. It would also be interesting to click on the Architecture Diagram tab in VTune.
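To make the working-set arithmetic concrete, here is a quick back-of-the-envelope calculation (assuming single-channel uchar images, i.e. 1 byte per pixel, which is an assumption on my part):

```c
/* Back-of-the-envelope working-set sizes for the modulate sample,
 * assuming 1 byte per pixel (single-channel uchar images). */
#include <stdio.h>

int main(void)
{
    const int sides[] = { 2048, 1536, 1024 };
    const double llc_mb = 8.0;  /* LLC on the i7-6700K (Skylake GT2) */

    for (int i = 0; i < 3; ++i) {
        double per_image_mb = (double)sides[i] * sides[i] / (1024.0 * 1024.0);
        double total_mb     = 2.0 * per_image_mb;  /* input + output image */
        printf("%4d x %-4d : %.2f MB per image, %.2f MB total -> %s the 8 MB LLC\n",
               sides[i], sides[i], per_image_mb, total_mb,
               total_mb < llc_mb ? "fits in" : "does not fit in");
    }
    return 0;
}
```

At 2048 x 2048 the two images alone occupy the entire 8 MB LLC, so the kernels end up streaming from DRAM; at 1024 x 1024 or 1536 x 1536 they fit with room to spare.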
On applicable optimizations:
1. Since Haswell, both the hardware and the software have improved. Skylake has two FPUs per EU that are also capable of integer operations, so the conversion to float16, which makes sense on Haswell, doesn't make sense on Skylake. Loop unrolling has since been implemented in the OpenCL compiler, so that no longer makes sense either. The vloads were added to work around a compiler issue, which has since been fixed.
2. As for the raw performance of v5 and v6, I think what happens is that you are trying to read even more data from DRAM, which is already clogged even for the v2 kernel (see the sketch just below).
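For context, this is roughly the shape I would expect v5/v6 to have, going by the kernel names. It is a sketch, not the actual sample source, and the number of uchar16 blocks per work-item is an assumption:

```c
/* Sketch of a v5/v6-style kernel: each work-item loads several uchar16
 * blocks, converts them to float16, modulates, and stores them back.
 * v6 would additionally unroll this loop. The per-work-item footprint
 * is much larger than in v2. (Approximation, not the sample source.) */
__kernel void Modulate_v5_like(__global const uchar* src,
                               __global uchar* dst,
                               float factor)
{
    const int blocks_per_item = 16;                 /* assumed block count */
    int base = get_global_id(0) * blocks_per_item;  /* offset in uchar16 units */

    for (int k = 0; k < blocks_per_item; ++k) {
        float16 f = convert_float16(vload16(base + k, src)) * factor;
        vstore16(convert_uchar16_sat(f), base + k, dst);
    }
}
```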
Please take a look at the following for more architectural details:
I don't have a Skylake machine myself yet, but I will try to borrow one and experiment a bit. Let me know how things go.
Your intuition was spot on. The modifications you suggested above increased performance across the board on the Skylake processor I'm using. Thanks for your help.