I already posted on this subject at https://software.intel.com/en-us/forums/intel-c-compiler/topic/702712 so I will try not to duplicate it here.
I have a KNL job which appears to be affected by spills, because the Intel C++ compiler fuses loops so aggressively that it runs into severe register pressure. It runs roughly 50% faster with a numactl setting for fast memory than in the default cache mode, but VTune still shows hotspots associated with seemingly avoidable spills (VTune screenshot attached in that post).
I thought that I might inhibit the compiler from fusing loops by inserting #pragma nofusion, but this has no effect (at least, -S produces an identical asm file). -O2 is far worse than -O3. From past experience with AVX/AVX2, I thought the compiler might be capable of limiting fusion to the point where it doesn't thrash hardware resources. On earlier architectures, the number of available fill buffers was an issue, but it may not be on KNL. Still, the compiler doesn't seem to worry about exhausting the 32 registers at the stage where it performs fusion. Since 6-8 stored array data streams per loop was optimum on past machines, I might have guessed that AVX512 could handle around 20 (but not the hundreds in one loop that aggressive fusion tries for). I suppose some of the spills may replace explicit array assignments, but eliminating a program-defined store and reload seems useless if the compiler replaces it with a spill. Only 10 of the data streams are long enough for prefetch to come into play, but I don't know whether the compiler keys in on that. opt-prefetch appears to make no difference to the generated code.
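For readers unfamiliar with the pragma, here is a minimal sketch of where #pragma nofusion is placed. The function name update and the four arrays are hypothetical and are not taken from the application discussed here; the preprocessor guard is only there so that other compilers skip the Intel-specific pragma. As noted above, in my case the pragma made no difference to the generated asm, so treat this as an illustration of placement, not a fix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel: two adjacent element-wise loops over the same range,
// which a fusing compiler may merge into a single loop.
// Assumes b, c and d have at least as many elements as a.
void update(std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c, const std::vector<float>& d)
{
    const std::size_t n = a.size();

#if defined(__INTEL_COMPILER)
#pragma nofusion  // ask the Intel compiler not to fuse this loop with its neighbours
#endif
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 2.0f * b[i] + 1.0f;

#if defined(__INTEL_COMPILER)
#pragma nofusion
#endif
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * d[i];
}
```

Comparing the -S output with and without the pragma, as described above, is one way to see whether the compiler honoured the request.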
In order to compile on Windows, I must add the option -Qgcc-dialect:490. I found no full documentation for it. The source has both gcc-style macros and C++-14 style pragmas, so maybe Intel is to be complimented for supporting such a combination. The developers used -ansi_alias, but that doesn't appear to have any effect, even on Windows (with -Qstd=c99). Nor does the option -fp:fast=2, which they also used, appear to have any effect.
I can't use the VTune GUI mode; it seems the communication at the remote site between the KNL and their head login node is even more erratic than that between their ISP and mine. It seemed better over 4G than over wired internet. So maybe the VTune GUI is only possible for a terminal directly connected to a KNL. I didn't find out how to run Advisor from the command line.
>>So maybe the VTune GUI is only possible for a terminal directly connected into a KNL.
True, but finalizing the hardware-generated events is very time consuming. I've been told, but haven't tried, that you can run VTune on the KNL without finalizing, then use the resulting VTune files with an instance of VTune running on a different machine (e.g. a Xeon host) to finalize and explore the results.
I have to match the sampling rate to the run time, tar up the project and move it to a local host to analyze. The scp sessions can easily stall if too much data is collected (and that gets expensive on 4G).
>> (and that gets expensive on 4G).
Apparently you are using a cell phone hotspot for remote access. Do you have access to two remote systems? (KNL, Xeon)
IOW copy the data between two locally connected remote systems (not using your 4G bandwidth).
The KNL is behind a Xeon node facing the outside world. I don't know what that local connection is.
When I'm out of wired internet territory (doesn't that include much of the USA midwest and south?), I use a mobile hotspot connection but that doesn't seem a great disadvantage compared with wire/fiber internet in respect of keyboard response and VTune project uploads.
Having heard that 5G introduction is a year away, the internet providers are probably reluctant to bury more fiber. I don't think they're losing money on metered 4G usage.
If you are willing to experiment you might try a spin-off of what I use in my office. Ignore the fact that I am not using 4G hotspot and am locally connected.
The "usual" workstation I use is a Windows 7 Pro x64 system on a Core i7-2600K...
... with a 39" SEIKI 4K monitor (3840x2160)
My KNL system runs CentOS 7.2 (as does another Xeon E5-2620v2 with dual KNC).
Those two systems have rather limited displays: a single 1920x1080 monitor and an analog monitor.
While the directly connected monitors are suitable for command line use, I find the display space too limiting for program development (Eclipse) and for use in viewing VTune results.
My solution to this is to install Xming on the Windows system, then PuTTY into the Linux boxes, giving each a 4K window. As to how much 4G usage this will cause for you, I cannot say. Xming may have some tuning knobs that you can use to reduce the screen update frequency (much like VNC has, which is an alternative means).