I have succeeded in profiling an application which is being ported to KNL by
amplxe-cl -collect hotspots
and copying the result files to my Windows laptop for "open result." It was a hassle to get this far, and I haven't made much progress in finding out about more detailed options: amplxe-cl tells me advanced-hotspots is not supported.
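For reference, the workflow looks roughly like this (the result-directory name and paths are illustrative; `-result-dir` and the `--` launch syntax are standard amplxe-cl usage):

```shell
# On the KNL node: collect hotspots into an explicit result directory
amplxe-cl -collect hotspots -result-dir r000hs -- ./my_app

# Then copy the result directory to the laptop for "open result" in the GUI
scp -r knl-node:~/r000hs .
```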
amplxe-gui fails, apparently due to the latency of communication over the Internet (even early in the morning, when dslreports says I have 14 ms latency). It doesn't even succeed in configuring a project; it takes at least 20 seconds to open a new window or even to echo my typed characters. So I see why the command line seems to be in general use for MIC.
The principal developer's question is why the application shows satisfactory performance on a 262x262x262 grid but not on a 64x64x32 grid. The vectorized C loops are blocked into aligned blocks of type double, length 8 (whole cache lines), one loop per assignment. Judging by the asm view there is a lot of loop fusion, though none is reported in the optrpt. One of the heaviest loops reports all aligned, unmasked instructions with an estimated speedup of 9.4. In the asm view there are occasional consecutive dependent instructions from different for loops in the source, with code from hundreds of source lines fused into the single large time-consuming loop. There are hot spots which appear to take longer when the data set is reduced by a factor of 20, but they still spend their time in the one big aligned loop.
I don't know whether KNL has any loop-stream-detector-like effects, but evidently the hot loop is far too big for that, as it spans addresses 0x402079 to 0x4049e7. psxe2016 and 2017 appear to have similar performance.
I suppose the threads may work too close together (not on the same cache line, but close). Not surprisingly, the small grid doesn't gain significant performance past 32 threads, while the large grid works well at 64 cores, 1 thread per core.
The developers don't appear to have concern about the memory models, perhaps because they are satisfied with performance of their larger data sets.
Answering the VTune part of the question: starting with VTune 2016 Update 4, KNL is supported out of the box with both user-level (like hotspots) and driver-based (like advanced-hotspots) analysis types. Running the GUI directly on KNL will be slow, so the way we recommend is to use the VTune command line on the compute node and work with the result on a host machine. BTW, you can use the GUI on the host machine to construct the command line for the KNL target: choose the "local" node under "Arbitrary Targets" on the "Analysis Target" tab of the "Configure Project" dialog and choose the CPU "Intel Processor code named Knights Landing", as shown in the picture below:
Then you can go to the Analysis Type tab, choose a predefined analysis or create your own custom event collection, and generate the command line using the "Command Line..." button.
So it makes sense to use general exploration and/or memory access analysis to see the code execution efficiency.
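On the command line, those analysis types would be invoked roughly as follows (result-directory names are illustrative; check that the analysis-type names match your VTune version):

```shell
# Microarchitecture-level exploration
amplxe-cl -collect general-exploration -result-dir r001ge -- ./my_app

# Memory access analysis (bandwidth and latency)
amplxe-cl -collect memory-access -result-dir r002ma -- ./my_app
```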
You can also consider using Intel Advisor, which will show you vector-code efficiency. If you take the 2017 version, it will definitely be KNL-aware.
Thanks & Regards, Dmitry
Thanks for the recommendation of VTune 2016.4. That has not yet been installed on the development platform, but I will ask for it.
I will also try whether my own host machine (KNC only) is able to generate command-line options for running on the remote KNL.
I could not persuade Advisor 2017 to run, so I thought maybe it doesn't support the command line. In any case, the issue in question doesn't appear to be one that Advisor deals with. Until I got the VTune hotspots result I thought it might be a question of remainder loops, but VTune implies that it is not (given that the data are pre-blocked into aligned cache lines).
Since the newer VTune Amplifier XE 2017 is already available, it would be good to install that version (though we are releasing 2017 Update 1 early next week, so if you wait several days you can grab the newest release).
Thanks & Regards, Dmitry
I've been using 2017 VTune for some time now and it works well on KNL, except for the exceptionally long time to finalize. I haven't yet configured the system to perform the finalization on one of my other Xeon hosts; that is something I should get around to.
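One workaround, assuming your VTune version supports the `-no-auto-finalize` option and the `-finalize` action (names worth verifying against your installed release), is to skip finalization on KNL and run it on the Xeon host against the copied result:

```shell
# On KNL: collect without finalizing the result
amplxe-cl -collect hotspots -no-auto-finalize -result-dir r003hs -- ./my_app

# On the Xeon host, after copying r003hs over:
amplxe-cl -finalize -result-dir r003hs -search-dir /path/to/binaries
```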
Recommendation for Intel regarding KNL finalization:
Please consider adding a feature to VTune such that a command-line option can specify that MPI be used to perform finalization on a different rank. In other words, create a two-rank application: rank 0 = the KNL app/VTune GUI, rank 1 = the finalization host. Or, conversely, flip that around: rank 0 = the VTune GUI on, say, a Xeon, and rank 1 = KNL. The second arrangement would permit faster exploration of post-finalization results.