KNC to KNL - 2x Slower Performance - Same Code

Eugene_G_ — Tue, 28 Feb 2017 21:53:29 GMT

We have an application that's currently running great in native mode on the KNC platform.

We now have a KNL system for R&D and have recompiled our native KNC application for the KNL platform. When testing this unmodified codebase, we're seeing a 2x performance degradation on KNL. The KNL is set up in Quadrant cluster mode and Cache memory mode.

Our application is not memory-bandwidth hungry, and we've tested many different OMP_NUM_THREADS configurations to no avail.
The main loop of the application uses OpenMP with a single critical section at the end. However, this runs very fast natively on the KNC platform.

(Intel Compiler - icpc)
KNC Compiler flags = -O3 -std=c++11 -openmp -mmic
KNL Compiler flags = -O3 -std=c++11 -qopenmp -xMIC-AVX512 -fma -align -finline-functions

We've run the standard tests and we know we can do a better job of vectorizing loops, but we were expecting better performance out of the box from an application that already runs great on KNC.

What could be causing this? Is it a pure vectorization issue?

Jan_Z_Intel — Wed, 01 Mar 2017 16:42:55 GMT

It may be a vectorization issue, though it is difficult to give a definite answer without looking at the code (maybe it is publicly available?). Did you check the compiler vectorization report or use Intel Advisor?

To rule out any platform configuration issues, you can use the micperf tool from the Intel Xeon Phi Processor Software Package (https://software.intel.com/en-us/articles/xeon-phi-software).

Eugene_G_ — Wed, 01 Mar 2017 17:05:29 GMT

Thanks for the reply. Yes, we've been using the Intel Advisor tool, and its suggestions are sometimes hit or miss.
We ran micperf against one node and everything looks tip-top in terms of performance. It's pretty impressive.

As I stated previously, our application (proprietary) runs very well under KNC. Unfortunately, we cannot share the codebase.

McCalpinJohn — Wed, 01 Mar 2017 17:29:05 GMT

Debugging performance problems is a balance of opportunism and systematic analysis.

As a quick "opportunistic" check:

If the code fits into the GDDR5 memory on KNC, then it should fit into MCDRAM in "Flat" mode on KNL. Testing your code on a system booted in Flat-Quadrant mode would eliminate uncertainties relating to using the MCDRAM as cache.

If the Flat mode test is not useful, then you need to start gathering data for systematic analyses. Useful data typically includes:

Parallel scaling for each code using 1 thread per core, 2 threads per core, 3 threads per core, 4 threads per core.
Whole-program performance counter measurements where available.
- VTune is the easiest way to get these analyses, but "perf stat" may be available.
- Repeat these for each core and thread count on each platform & compare the scaling.
Sampling-based runtime profile comparisons.
- Historically this has been done with "gprof", but VTune provides a much more integrated approach.
- For OpenMP codes, it is important to monitor OpenMP overheads (that typically indicate load imbalance).
Sampling-based performance-counter profile comparisons.
- VTune is the preferred approach here.

James_C_Intel2 — Wed, 01 Mar 2017 17:31:02 GMT

I am still concerned that "we've tested many different OMP_NUM_THREADS configurations" may not have achieved what you need.

In particular, you very likely need to be using KMP_HW_SUBSET instead of, or as well as, OMP_NUM_THREADS. (Documentation of KMP_HW_SUBSET [complete with calling it KMP_HW_SUBSETS throughout :-(] is at https://software.intel.com/en-us/node/694293)

Also, it may be worth checking out my rant on plotting scaling results :-)

Bear in mind that on KNC you needed two threads/core to achieve the maximum issue rate, whereas on KNL that is no longer true, so running one or two threads/core rather than four is more likely to perform well on KNL. Also remember that the replicated entity is a tile of two cores sharing an L2 cache, so locality can affect up to eight threads.

James_C_Intel2 — Wed, 01 Mar 2017 17:34:56 GMT

>>For OpenMP codes, it is important to monitor OpenMP overheads (that typically indicate load imbalance).

VTune's (relatively new) OpenMP analyses now show these as load imbalance, attribute them to parallel regions, and show you what performance you could achieve if you fixed the problem. https://software.intel.com/en-us/node/544172 should help.

Eugene_G_ — Wed, 01 Mar 2017 17:51:43 GMT

Thank you for the suggestions. I will go investigate and report back with my findings.

Jan_Z_Intel — Thu, 02 Mar 2017 10:09:32 GMT

Please also share the details of your system software (the OS with its exact kernel version, and whether the Intel Xeon Phi Processor Software Package is installed).

Eugene_G_ — Thu, 02 Mar 2017 16:33:18 GMT

OS - CentOS Linux release 7.3.1611 (Core)

Kernel - 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Intel Xeon Phi Software Package is installed (version - 1.5.0)

Jan_Z_Intel — Thu, 02 Mar 2017 20:04:25 GMT

It looks perfect. I'm afraid that was the last 'fast & easy' check before diving into the more systematic approach suggested by John and Jim.

SergeyKostrov — Fri, 03 Mar 2017 17:43:00 GMT

>>...Our application is not memory bandwidth hungry and we've tested many different OMP_NUM_THREADS configurations...

1. Analyze how your OpenMP threads are pinned to cores / processors.

2. Execute the cpuinfo utility. This is how the Cache sharing part of the report looks for an Intel(R) Xeon Phi(TM) 7210:

...
Processor name : Intel(R) Xeon Phi(TM) 7210
Packages (sockets) : 1
Cores : 64
Processors (CPUs) : 256
Cores per package : 64
Threads per core : 4
...
===== Cache sharing =====
Cache Size Processors
L1 32 KB (0,64,128,192)(1,65,129,193)(2,66,130,194)(3,67,131,195)(4,68,132,196)(5,69,133,197)(6,70,134,198)(7,71,135,199)(8,72,136,200)(9,73,137,201)(10,74,138,202)(11,75,139,203)(12,76,140,204)(13,77,141,205)(14,78,142,206)(15,79,143,207)(16,80,144,208)(17,81,145,209)(18,82,146,210)(19,83,147,211)(20,84,148,212)(21,85,149,213)(22,86,150,214)(23,87,151,215)(24,88,152,216)(25,89,153,217)(26,90,154,218)(27,91,155,219)(28,92,156,220)(29,93,157,221)(30,94,158,222)(31,95,159,223)(32,96,160,224)(33,97,161,225)(34,98,162,226)(35,99,163,227)(36,100,164,228)(37,101,165,229)(38,102,166,230)(39,103,167,231)(40,104,168,232)(41,105,169,233)(42,106,170,234)(43,107,171,235)(44,108,172,236)(45,109,173,237)(46,110,174,238)(47,111,175,239)(48,112,176,240)(49,113,177,241)(50,114,178,242)(51,115,179,243)(52,116,180,244)(53,117,181,245)(54,118,182,246)(55,119,183,247)(56,120,184,248)(57,121,185,249)(58,122,186,250)(59,123,187,251)(60,124,188,252)(61,125,189,253)(62,126,190,254)(63,127,191,255)
L2 1 MB (0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)(4,5,68,69,132,133,196,197)(6,7,70,71,134,135,198,199)(8,9,72,73,136,137,200,201)(10,11,74,75,138,139,202,203)(12,13,76,77,140,141,204,205)(14,15,78,79,142,143,206,207)(16,17,80,81,144,145,208,209)(18,19,82,83,146,147,210,211)(20,21,84,85,148,149,212,213)(22,23,86,87,150,151,214,215)(24,25,88,89,152,153,216,217)(26,27,90,91,154,155,218,219)(28,29,92,93,156,157,220,221)(30,31,94,95,158,159,222,223)(32,33,96,97,160,161,224,225)(34,35,98,99,162,163,226,227)(36,37,100,101,164,165,228,229)(38,39,102,103,166,167,230,231)(40,41,104,105,168,169,232,233)(42,43,106,107,170,171,234,235)(44,45,108,109,172,173,236,237)(46,47,110,111,174,175,238,239)(48,49,112,113,176,177,240,241)(50,51,114,115,178,179,242,243)(52,53,116,117,180,181,244,245)(54,55,118,119,182,183,246,247)(56,57,120,121,184,185,248,249)(58,59,122,123,186,187,250,251)(60,61,124,125,188,189,252,253)(62,63,126,127,190,191,254,255)
...

3. Best performance is achieved when KMP_AFFINITY is set to scatter or balanced and OMP_NUM_THREADS is set to 64. I've marked processor numbers to demonstrate it:

...
===== Cache sharing =====
Cache Size Processors
L1 32 KB (**0**,64,128,192)(**1**,65,129,193)(**2**,66,130,194)(**3**,67,131,195)(**4**,68,132,196)(**5**,69,133,197)...
...
L2 1 MB (**0,1**,64,65,128,129,192,193)(**2,3**,66,67,130,131,194,195)(**4,5**,68,69,132,133,196,197)...
...

Eugene_G_ — Tue, 07 Mar 2017 00:14:41 GMT

After a lot of digging I have narrowed in on what seems to be the problem. My application has one line that multiplies 24 floating-point numbers. From what I have seen, the magnitude of the numbers can dramatically hurt the runtime. One of my numbers has a much lower magnitude than the rest. When I exclude this number from my calculation I get about a 30x speedup. I am playing with the different compiler options but none of them seem to help.

Any insight would be very helpful. Thanks.

jimdempseyatthecove — Tue, 07 Mar 2017 18:45:00 GMT

Is this number stored as a subnormal (denormal)?

See:

https://software.intel.com/en-us/forums/intel-c-compiler/topic/611390
https://software.intel.com/pt-br/node/680305

Edit:

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/705927

Jim Dempsey

SergeyKostrov — Wed, 08 Mar 2017 17:10:58 GMT

It looks like some FPU exceptions are affecting your processing.

>>...From what I have seen it seems like the magnitude of the numbers can dramatically hurt the runtime. One of my numbers has a much lower magnitude than the rest. When I exclude this number from my calculation I get about a 30x speed up...

Q1: Could you post a couple of FP numbers ( a good one and a bad one ) to demonstrate their ranges?

Q2: Did you try to turn on the 'Flush Denormal Results to Zero' ( -ftz ) compiler option?

Eugene_G_ — Wed, 08 Mar 2017 17:27:07 GMT

Jim and Sergey,

Turns out I was chasing my tail with that last post. I didn't realize at the time that the speedup from removing the value came from downstream code. I am going back to the drawing board and will report back with what I find.

Thanks, I am learning a lot from everybody's suggestions.

Chronus_Taizen — Sun, 12 Mar 2017 07:24:00 GMT

In my experience the "-O3" flag hurts performance on KNL systems. Use "-Os" instead and see how it performs. The bottleneck in the KNL micro-architecture is instruction decoding (or so Agner Fog's manual claims, and it seems to be true), so you want to minimize the size of the binary in terms of instructions rather than keep things in registers at the cost of more instructions (and a larger binary).

As for the affinity business, try logging into a node and running htop. See how well things are distributed and what the pattern of threads ramping up and closing down looks like. For fun and giggles, try running your program as "perf stat -d <Your program>" and post the stats from that. That usually helps diagnosis.

Just general suggestions. This might or might not help your situation.

Update: if you try "-Os", and you don't care about the IEEE standard for floating-point arithmetic and a few other technical things, make sure to add the "-ffast-math" flag for better performance, at the cost of ignoring the IEEE standard and possibly a lot of nightmares. I can't remember if "-O3" turns on this flag automatically (this is true for either GCC or ICC, I can't remember which).

Cheers.

SergeyKostrov — Wed, 15 Mar 2017 17:22:56 GMT

>>...or so Agner Fog's manual claims and it seems to be true..

It is not a good recommendation to believe somebody's claims without verifying, in a set of real tests, the outcomes of using the -O3 and -Os options. Also, I didn't have any performance issues with the -O3 option on a KNL system, but I will verify whether the -Os option improves performance.

SergeyKostrov — Wed, 15 Mar 2017 18:54:48 GMT

Here are the results of a very simple verification for matrix multiplication using MKL cblas_sgemm and the Classic Matrix Multiplication algorithm ( CMMA / transposed based ):

[ Test with -O3 option ]

[guest@... WorkTest]$ icpc -O3 -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test13.c -o test13.out
[guest@... WorkTest]$
[guest@... WorkTest]$ ./test13.out
Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]
Number of OpenMP threads: 64
MKL - Completed in: 6.6331376 seconds
CMMA - Completed in: 99.3659613 seconds
[guest@... WorkTest]$
[guest@... WorkTest]$ ls -l
total 232
-rw-r--r-- 1 guest guest 10812 Mar 15 11:21 test13.c
-rwxrwxr-x 1 guest guest 210979 Mar 15 11:21 test13.out

[ Test with -Os option ]

[guest@... WorkTest]$ icpc -Os -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test13.c -o test13.out
[guest@... WorkTest]$
[guest@... WorkTest]$ ./test13.out
Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]
Number of OpenMP threads: 64
MKL - Completed in: 6.6278768 seconds
CMMA - Completed in: 90.3714654 seconds
[guest@... WorkTest]$
[guest@... WorkTest]$ ls -l
total 224
-rw-r--r-- 1 guest guest 10812 Mar 15 11:21 test13.c
-rwxrwxr-x 1 guest guest 202685 Mar 15 11:27 test13.out

[ Conclusion ]

With option -Os the CMMA processing completed ~9% faster than with option -O3, while the MKL time was essentially unchanged. So it is faster, but not by 2x! Please do your own verifications if interested.

SergeyKostrov — Wed, 15 Mar 2017 19:05:00 GMT

>>...so you want to minimize the size of the binary...

[ Test with -O3 option - Binary Size ]
...
-rwxrwxr-x 1 guest guest 210979 Mar 15 11:21 test13.out
...

[ Test with -Os option - Binary Size ]
...
-rwxrwxr-x 1 guest guest 202685 Mar 15 11:27 test13.out
...

With option -Os the binary size is only ~3.9% smaller and, as I've already mentioned, processing was completed ~9% faster.

Chronus_Taizen — Thu, 16 Mar 2017 04:25:17 GMT

>>It is not a good recommendation to believe in somebody's claims without verifying in a set of real tests outcomes of using -O3 and -Os options.

I agree. At no point did I claim that Agner made suggestions about optimization flags; he did NOT. He did, however, comment that the decoding part of the KNL micro-architecture is the bottleneck. Reading that gave me a reason, perhaps the reason, why compiling with -Os instead of -O3 had been making my performance slightly, but noticeably, better.

>>With option -Os processing completed by ~9% faster than with option -O3. So, it is faster but Not by 2x!... Please do your own verifications if interested.

Again, no such claim was made. I did not claim that -Os would improve performance by 2x, or that it would improve performance at all, just that testing it is worth the time for the potential slight but noticeable gain.

Having said all that, thank you for giving us real, tangible numbers for both the speed and the binary size. I appreciate your work, which in this instance happens to validate my views.