We have an application that's currently running great in native mode on the KNC platform.
We now have a KNL system for R&D and have recompiled our native KNC application for the KNL platform. When testing this unmodified codebase, we're noticing a 2x performance degradation on KNL. KNL is set up in Quadrant cluster mode and Cache memory mode.
Our application is not memory-bandwidth hungry and we've tested many different OMP_NUM_THREADS configurations to no avail.
The main loop of the application uses OpenMP with a single critical section at the end. However, this runs very fast natively on the KNC platform.
(Intel Compiler - icpc)
KNC Compiler flags = -O3 -std=c++11 -openmp -mmic
KNL Compiler flags = -O3 -std=c++11 -qopenmp -xMIC-AVX512 -fma -align -finline-functions
We've run the standard tests and we know we can do a better job vectorizing loops but we were expecting better performance out of the box with an application that is already running great on KNC.
What could be causing this? Is it a pure vectorization issue?
It may be a vectorization issue, though it is difficult to give any definite answer without looking at the code (maybe it is publicly available?). Did you try to check the compiler vectorization report or use Intel Advisor?
To rule out any platform configuration issues you can use micperf tool from Intel Xeon Phi Processor Software Package (https://software.intel.com/en-us/articles/xeon-phi-software).
Thanks for the reply. Yes, we've been using the Intel Advisor tool, though its suggestions are sometimes hit or miss.
We ran micperf against one node and everything looks tip-top in terms of performance. It's pretty impressive.
As I stated previously, our application (proprietary) runs very well under KNC. Unfortunately, we cannot share the codebase.
Debugging performance problems is a balance of opportunism and systematic analysis.
As a quick "opportunistic" check: reboot the node in Flat memory mode and rerun the unmodified binary; if the behaviour changes significantly, the problem is related to how MCDRAM is being used.
If the Flat mode test is not useful, then you need to start gathering data for systematic analysis. Useful data typically includes:
I am still concerned that the "we've tested many different OMP_NUM_THREADS configurations" may not have achieved what you need it to.
In particular, you very likely need to be using KMP_HW_SUBSET instead of, or as well as, OMP_NUM_THREADS. (Documentation of KMP_HW_SUBSET [complete with calling it KMP_HW_SUBSETS throughout :-(] is at https://software.intel.com/en-us/node/694293)
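For example, KMP_HW_SUBSET lets you pin the runtime to a specific shape of the machine; a sketch, assuming a hypothetical 64-core KNL (the syntax is <cores>c,<threads-per-core>t):

```shell
# Run with one hardware thread per core:
export KMP_HW_SUBSET=64c,1t
export OMP_NUM_THREADS=64    # should match the subset (64 cores x 1 thread)

# Or two threads per core:
export KMP_HW_SUBSET=64c,2t
export OMP_NUM_THREADS=128   # 64 cores x 2 threads
```

Without KMP_HW_SUBSET, lowering OMP_NUM_THREADS alone may leave threads packed onto a subset of cores rather than spread one-per-core, which is usually not what you want to measure.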
Also, it may be worth checking out my rant on plotting scaling results :-)
Bear in mind that on KNC you needed two threads/core to achieve the maximum issue rate, whereas on KNL that is no longer true, so running one or two threads/core rather than four is more likely to perform well on KNL. Also remember that the replicated entity on KNL is a tile of two cores sharing an L2 cache, so locality effects can span up to eight threads.
For OpenMP codes, it is important to monitor OpenMP overheads (that typically indicate load imbalance).
VTune's (relatively new) OpenMP analyses now show these as load imbalance, attribute them to parallel regions, and show you what performance you could achieve if you could fix the problem. https://software.intel.com/en-us/node/544172 should help.
OS - CentOS Linux release 7.3.1611 (Core)
Kernel - 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Intel Xeon Phi Software Package is installed (version - 1.5.0)
After a lot of digging I have narrowed in on what seems to be the problem. My application has one line that multiplies 24 floating-point numbers. From what I have seen, the magnitude of the numbers can dramatically hurt the runtime. One of my numbers has a much lower magnitude than the rest. When I exclude this number from my calculation I get about a 30x speedup. I am playing with the different compiler options but none of them seem to help.
Any insight would be very helpful. Thanks.
Is this number stored as a subnormal (denormal)?
Jim and Sergey,
Turns out I was chasing my tail with that last post. I didn't realize at the time that the speed up from removing the value came from downstream. I am going back to the drawing board and will report back with what I find.
Thanks, I am learning a lot from everybody's suggestions.
In my experience the "-O3" flag hurts performance on KNL systems. Use "-Os" instead and see how it performs. The bottleneck in the KNL micro-architecture is in instruction decoding (or so Agner Fog's manual claims, and it seems to be true), so you want to minimize the size of the binary in terms of instructions rather than keeping things in registers at the cost of more instructions (a larger binary).
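A quick way to compare the two builds (hypothetical source file name; assumes icpc is on the path):

```shell
# Build the same source at both optimization levels:
icpc -O3 -std=c++11 -qopenmp -xMIC-AVX512 app.cpp -o app_O3
icpc -Os -std=c++11 -qopenmp -xMIC-AVX512 app.cpp -o app_Os

# Compare code size (the "text" column) to see how much smaller -Os is:
size app_O3 app_Os
```

Then time both binaries on a representative workload; size alone doesn't prove anything, but a large text-size gap combined with a decode-bound profile would support the hypothesis.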
As for the affinity business, try logging into a node and running htop. See how well things are distributed and what the pattern of threads ramping up and closing down looks like. For fun and giggles, try running your program as "perf stat -d <Your program>" and post the stats. That usually helps diagnosis.
Just general suggestions. This might or might not help your situation.
Update: if you try "-Os" and you don't care about the IEEE floating-point standard (and a few other technicalities), make sure to add the "-ffast-math" flag for better performance, at the cost of ignoring the IEEE standard and possibly a lot of nightmares. I can't remember whether "-O3" turns this flag on automatically (this is true for either GCC or ICC, I can't remember which).
>>It is not a good recommendation to believe in somebody's claims without verifying in a set of real tests outcomes of using -O3 and -Os options.
I agree. At no point did I claim Agner made suggestions about optimization flags; he did NOT. He did, however, comment that the decoding stage of the KNL micro-architecture is the bottleneck. Reading that gave me a reason, perhaps the reason, why compiling with -Os instead of -O3 had been making my performance slightly, but noticeably, better.
>>With option -Os processing completed by ~9% faster then with option -O3. So, it is faster but Not by 2x!... Please do your own verifications if interested.
Again, no such claim was made. I did not claim that -Os would improve performance by 2x, or that it would improve performance at all, just that testing it is worth the time given the potential for a slight but noticeable gain.
Having said all that, thank you for giving us real, tangible numbers for both the speed and the binary size. I appreciate your work, which in this instance happens to validate my views.