Software Archive

Is there any way to link against an older version of MPSS

Fiona_R_
Beginner

Hi Folks, 

I'm seeing some very strange performance with a computational chemistry code (CP2K) that I've previously benchmarked extensively on the Xeon Phi. I'm running in native mode for these tests. 

Since our cards were upgraded to MPSS version 3.4.2, my mixed-mode MPI/OpenMP version of the code has been running 10-20 times slower than it used to. 

I've compared identical code compiled with the same compiler version (14.0.2) on two different Xeon Phi cards. One card is running MPSS 3.3.2 the other is running 3.4.2. The code built under 3.4.2 is always much slower. The binary built under 3.3.2 runs in the expected time on both cards. The only thing that's different is the version of MPSS being used on the cards. 

So my question is: is there any way to link against a different version of MPSS from the one running on the Xeon Phi card? I'd really like to be able to work out where (i.e. which library or object file) the slowdown is coming from. 

I've copied the entire /usr/linux-k1om-4.7 tree over from the machine running 3.3.2 (resolving all links) and then tried adding the appropriate paths to LIBRARY_PATH, LD_LIBRARY_PATH, MIC_LIBRARY_PATH, etc., but that doesn't work. It seems that x86_64-k1om-linux-ld gets its search list from somewhere other than my environment. 
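
For reference, this is the sort of thing I mean by the search list - a rough sketch only, assuming the standard GNU binutils cross toolchain; the paths and object names below are just examples rather than my actual setup:

    # Print the search directories baked into the k1om cross-linker
    # (GNU ld lists them as SEARCH_DIR entries when run with --verbose):
    x86_64-k1om-linux-ld --verbose | grep SEARCH_DIR

    # Or pass --verbose through the compiler driver at link time to see
    # exactly which directories and library files the final link uses:
    ifort -mmic -Wl,--verbose dummy.o -o dummy.x 2>&1 | grep -i k1om

    # Directories given with -L are searched before the built-in ones, so a
    # copied 3.3.2 tree could in principle be preferred like this:
    ifort -mmic dummy.o -o dummy.x -L$HOME/mpss-3.3.2/linux-k1om-4.7/x86_64-k1om-linux/lib64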

Any help or suggestions would be appreciated. 

Thanks in advance. 

Fiona

Frances_R_Intel
Employee

Ok, let me see if I have this right - if you take code that you compiled on a system with MPSS 3.3.2 and move it over to a system with MPSS 3.4.2, then that code runs faster than code that you compiled on the system with MPSS 3.4.2. In both cases you compiled with the 14.0.2 compiler.

Just to be clear -

  • You talk in your post both about running in native mode and about running in hybrid mode (MPI/OpenMP). Do you mean that you are running one or more MPI ranks on the coprocessor and using OpenMP inside each rank? You are not running one or more MPI ranks on the host and offloading the OpenMP work to the coprocessor?
  • Is it the case that you see the slowdown only in hybrid mode?
  • You have timed both the code running on the host and on the coprocessor and see the slowdown only on the coprocessor?
  • When you ran the code compiled for MPSS 3.3.2 on the MPSS 3.4.2 system, did you get any warnings about version numbers on the relocatable libraries? Did you copy any relocatable libraries from the coprocessor on the MPSS 3.3.2 system to the coprocessor on the MPSS 3.4.2 system? (I know you said you copied /usr/linux-k1om-4.7 on one host to  /usr/linux-k1om-4.7 on the other host, but here I am talking specifically about copying files to the coprocessor.)
  • Are there any environment differences between the runs of the two executables or any compiler flag differences?
  • Are you using the Intel version of MPI or another version?
  • Are you making calls to the MKL library?
  • If you have access to VTune, could you check to see where it thinks the extra time is being spent?
  • Is it possible to provide a small test case showing the difference? I realize this might be asking a lot, but it would be very helpful.

As far as compiling on the MPSS 3.4.2 system as if you were compiling on the MPSS 3.3.2 system, I'm not sure that is a good idea, especially while it isn't clear what the problem is. I am really not convinced that is where the problem lies. But in any event, you would need /opt/mpss/<version>, you would need those libraries installed on the coprocessor, you would need to mess around with /opt/intel/composer_xe_2015/pkg_bin/intel64_mic/x86_64-linux.env, and I don't know what else. There be dragons here. 

Frances_R_Intel
Employee

And just in case any of these ring a bell with you, the following are listed as known issues in the release notes for 3.4 (I didn't list the ones having to do with OFED or COI):

  • [Coprocessor OS] Performance degradation seen when stack size limit is set to unlimited
  • [Performance] DAPL/SCIF performance is slow on message size 8k...128k
  • [Performance] KNC buffer allocation slower for 64 MB data
  • [Performance] Slow data transfer speeds might be noticed from MIC to host
  • [Performance] Transparent Huge Pages doesn't give performance
Fiona_R_
Beginner

Frances Roth (Intel) wrote:

Ok, let me see if I have this right - if you take code that you compiled on a system with MPSS 3.3.2 and move it over to a system with MPSS 3.4.2, then that code runs faster than code that you compiled on the system with MPSS 3.4.2. In both cases you compiled with the 14.0.2 compiler.

That is correct. FWIW I've also tried the 15.0.0 and 15.0.1 compilers on the system running MPSS 3.4.2 and the code still runs slowly. I only reverted to version 14.0.2 because it was the version common to both systems.  

Frances Roth (Intel) wrote:

  • You talk in your post both about running in native mode and about running in hybrid mode (MPI/OpenMP). Do you mean that you are running one or more MPI ranks on the coprocessor and using OpenMP inside each rank? You are not running one or more MPI ranks on the host and offloading the OpenMP work to the coprocessor?

I'm always running in native mode, thus for the MPI/OpenMP version I have e.g. 60 MPI ranks, each running e.g. 2 OpenMP threads on the coprocessor. My code can be run as a pure MPI code, pure OpenMP and mixed mode OpenMP/MPI. 
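
To be concrete about the kind of launch I mean, a sketch only - the hostname, input file, thread counts and Intel MPI settings below are placeholders rather than my actual job script:

    # Hybrid native run: 60 MPI ranks on the coprocessor, 2 OpenMP threads each.
    # Assumes Intel MPI's Hydra launcher, that the card is reachable as mic0,
    # and that the binary and input are visible on the card (e.g. NFS-mounted).
    export I_MPI_MIC=enable
    mpirun -host mic0 -n 60 -env OMP_NUM_THREADS 2 ./cp2k.psmp test.inp

    # Pure OpenMP native run, started directly on the card:
    ssh mic0 "cd $PWD && OMP_NUM_THREADS=120 ./cp2k.ssmp test.inp"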

Frances Roth (Intel) wrote:

 
  • It is the case that you see the slow down only in the hybrid mode?

No, a pure OpenMP version of the code also goes slower on the machine running MPSS 3.4.2. An OpenMP binary built on the machine running 3.3.2 but run on the machine with MPSS 3.4.2 runs in the expected time. The pure MPI version (running on 60 procs) runs in the expected time on both machines. Essentially the code seems to slow down as soon as I use more than one thread.

Frances Roth (Intel) wrote:

    • You have timed both the code running on the host and on the coprocessor and see the slowdown only on the coprocessor?

    Yes, I've checked that the host performance is unaffected. The slowdown definitely only happens on the co-processor. We see a slight (1-2%) slowdown when using compiler version 15.X but I don't mind that. 

    Frances Roth (Intel) wrote:

    • When you ran the code compiled for MPSS 3.3.2 on the MPSS 3.4.2 system, did you get any warnings about version numbers on the relocatable libraries? Did you copy any relocatable libraries from the coprocessor on the MPSS 3.3.2 system to the coprocessor on the MPSS 3.4.2 system? (I know you said you copied /usr/linux-k1om-4.7 on one host to  /usr/linux-k1om-4.7 on the other host, but here I am talking specifically about copying files to the coprocessor.)

    No, no warnings appeared. FWIW I have also tried copying across the entire mkl/ directory from the machine with MPSS 3.3.2 and linking against that instead. I've also tried copying the OpenMP shared library libiomp5.so from the machine with MPSS 3.3.2 over to our Xeon Phi to see if that helped. It didn't. 

    The code is built statically - as soon as I use -mmic I can only obtain a static binary. 
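
    (As a quick cross-check on what actually ended up in a binary, something along the following lines can be used - assuming the k1om binutils are on the PATH; it may help spot a mismatched library or compiler version:)

        # A fully static -mmic binary should report no NEEDED shared libraries:
        x86_64-k1om-linux-readelf -d dbcsr_performance_driver.ssmp | grep NEEDED

        # The .comment section often records the compiler versions that built the
        # linked objects, which can expose a mix of 14.x and 15.x components:
        x86_64-k1om-linux-readelf -p .comment dbcsr_performance_driver.ssmp | sort -u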

    Frances Roth (Intel) wrote:

    • Are there any environment differences between the runs of the two executables or any compiler flag differences?

    Identical compiler flags were used. I've attached the output from "set" on both systems. dommic_mic0_env.txt is the machine running 3.3.2, phi_mic0_env.txt is from the machine running 3.4.2. They look very similar and as far as I can tell there's nothing that should cause a slow down. 

    Frances Roth (Intel) wrote:

    • Are you using the Intel version of MPI or another version?

    I'm using Intel MPI (impi/4.1.3.048) on both systems. As I see the slowdown with a pure OpenMP build I think we can probably eliminate the MPI library. 

    Frances Roth (Intel) wrote:

    • Are you making calls to the MKL library?

    Yes, in particular for FFTW. I'm using the same version of MKL on both systems. 

    Frances Roth (Intel) wrote:

    • If you have access to VTune, could you check to see where it thinks the extra time is being spent?

    I've not tried VTune but I have looked at the output from the code's built-in timers. Basically the slowdown appears to be spread across all routines in the code. 

    I should say that I've seen similarly slow performance with older versions of MPSS/mpirun, where we needed to manually set where the threads (for the OpenMP version) or each MPI process and its associated threads (for the MPI/OpenMP version) ran. I discussed this here: http://www.epcc.ed.ac.uk/blog/2013/07/03/task-placement-intel-xeon-phi. I've tried this approach with my current build and it makes no difference, so I don't believe there's an issue with where the threads are being placed. I've also confirmed via export KMP_AFFINITY=verbose that the threads are getting placed on different cores. 
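
    (Concretely, the sort of affinity settings involved - illustrative values only, not a recommendation:)

        # 'verbose' prints the OpenMP thread -> core mapping at start-up, which is
        # what I used to confirm placement; 'balanced' here is just an example policy:
        export KMP_AFFINITY=verbose,balanced

        # 'compact' and 'scatter' are the other common choices on the coprocessor:
        # export KMP_AFFINITY=verbose,compact
        # export KMP_AFFINITY=verbose,scatter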

    Frances Roth (Intel) wrote:

    • Is it possible to provide a small test case showing the difference? I realize this might be asking a lot, but it would be very helpful.

    I don't have a small test case as I only see the slowdown with this particular code. Other codes appear to be unaffected since we upgraded to MPSS 3.4.2. It just seems to be threaded builds of CP2K that run slowly.

    I'm happy to upload the entire code tree so that someone should be able to easily replicate the problem, but it will be over 50 MB - is that likely to be a problem?

     

    JJK
    New Contributor III

    I'm always interested in code like this :)

    Can you contact me with a URL where I can download the code? I can easily run it on a Xeon Phi 5110P using either MPSS 3.3 or 3.4.

     

    Fiona_R_
    Beginner

    Some code along with compilation instructions etc which show the problem can be downloaded from: 

    http://www2.epcc.ed.ac.uk/~fiona/cp2k.tar.gz

    I've submitted this to IPS too so if I receive a solution via that route I'll be sure to update this thread in case it helps other people with similar issues.

    JJK
    New Contributor III

    Hi,

    I grabbed your code and ran it on my Xeon Phi 5110P with MPSS 3.4.3 and got the same results as you did (for the dbcsr sample); then I recompiled the code on a host with icc+ifort 14 and MPSS 3.4.3 and reran it. I got the same result as for your '_fast' case:

     ../../../exe/phi-intel-mic-14.x/dbcsr_performance_driver.ssmp < test.perf | grep perf
       perf total      =     6.99E+09    2.34E+09    3.68E+09   10.30E+09 FLOPS
       perf per node   =     6.99E+09    2.34E+09    3.68E+09   10.30E+09 FLOPS
       perf per thread =    58.24E+06   19.50E+06   30.67E+06   85.82E+06 FLOPS
    

    which leads me to believe this is a compiler issue, not an MPSS issue. I've seen performance drops before when comparing icc 14 vs 15. With the latest service pack for icc 15 (or possibly icc 16beta) the performance drop should be gone, regardless of the version of the MPSS stack used.

     

    Fiona_R_
    Beginner

    Hi Jan, 

    I've tested compiler versions 14.0.2, 14.0.3, 14.0.4, 15.0.0, 15.0.1 and 15.0.2; all go slow for my threaded builds. The 16.0.0 beta throws an internal compiler error with a segmentation fault at compile time, so I can't even build the code with it. 

    So far I'm aware of several people with the same problem with CP2K and also with other codes. A number of people at Intel have also re-compiled the test case and, like you, find it runs at normal speed for them, which is very odd. 

    I have access to another Xeon Phi machine so will try building there and see what happens. 

     

     

    Fiona

    JJK
    New Contributor III

    Can you try it with my builds? You can find builds for 14.1 and 14.3 at

     http://www.nikhef.nl/~janjust/cp2k/

    (the .tar.gz files contain only the .ssmp binaries)

    I get "normal" results with both the 14.1 and 14.3 builds on a host running MPSS 3.4.2

    Fiona_R_
    Beginner

    I've tried both versions and both run at "normal" speed on my system, e.g. for 14.1 I get: 

    [fiona@phi-mic1 perf]$ /home-hydra/h012/fiona/from_jan/14.1/dbcsr_performance_driver.ssmp < test.perf
       perf total      =     8.01E+09    2.31E+09    4.75E+09   11.28E+09 FLOPS
       perf per node   =     8.01E+09    2.31E+09    4.75E+09   11.28E+09 FLOPS
       perf per thread =   133.55E+06   38.45E+06   79.17E+06  187.93E+06 FLOPS
    

    and 14.3

    [fiona@phi-mic1 perf]$ /home-hydra/h012/fiona/from_jan/14.3/dbcsr_performance_driver.ssmp < test.perf
       perf total      =     7.93E+09    2.36E+09    4.60E+09   11.27E+09 FLOPS
       perf per node   =     7.93E+09    2.36E+09    4.60E+09   11.27E+09 FLOPS
       perf per thread =   132.25E+06   39.27E+06   76.71E+06  187.79E+06 FLOPS           

    I've seen exactly the same when I built the code on a machine running MPSS 3.3.2 using the same compiler versions that are available on our system. Prior to moving from MPSS 3.3.2 to 3.4.2 (we're now running 3.4.3) our system also performed "normally". 

    I assume you used exactly the same ARCH files as I used for my builds? 

    The issue seems to be down to which machine you compile on and something to do with its particular setup. Whether it's the MPSS version or something else in the machine environment/libraries I'm not sure. MPSS jumped out as that was the thing that had changed on our machine and I have access to identical compiler versions on the other machines. 

     

    JJK
    New Contributor III

    That is correct, I used the exact same ARCH and build rules; I can give you a build log if you like.

    I guess we can rule out a runtime issue; it seems like a compile-time thing. Which exact OS are you using and which packages are installed? For example, on my build hosts the fftw library is not installed separately, but the one from the Intel MKL libs is used.

     

    Fiona_R_
    Beginner

    On the host I'm running Scientific Linux release 6.5 (Carbon), uname -a gives:

    Linux phi.hydra 2.6.32-431.1.2.el6.x86_64 #1 SMP Thu Dec 12 13:59:19 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
    

    On the Xeon Phi we have the following (snipped from micinfo):

    		Flash Version 		 : 2.1.02.0390
    		SMC Firmware Version	 : 1.16.5078
    		SMC Boot Loader Version	 : 1.8.4326
    		uOS Version 		 : 2.6.38.8+mpss3.4.3
    

    I use Intel's MKL for the FFT computations but have also tried building CP2K without using any MKL (i.e. self-built BLAS, LAPACK, FFTW, etc.) and the performance is still poor, which suggests that MKL isn't the issue. 

    Fiona_R_
    Beginner

    I believe I've solved the problem. The MPSS version was a red herring. It looks like a compiler version issue after all. 

    The issue appears to lie with the 15.X compiler version. The 14.X versions appear to perform normally. 

    I've been using "source /composer_path/bin/compilervars.sh intel64" to switch between compiler versions, but it appears that this was not fully resetting my paths etc., and thus some elements of the 15.X compiler always remained, despite ifort -v always telling me I was using the compiler version I expected. 

    I've created a reset script to unset all Intel related variables in my environment prior to sourcing the compilervars.sh script and suddenly everything is behaving normally. 
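
    In case it's useful to anyone else, a minimal sketch of the kind of reset I mean - the variable list and the compiler path are illustrative, not my exact script:

        # Illustrative only - source this file (rather than executing it) so the
        # changes affect the current shell. Clear variables that the various
        # compilervars.sh scripts append to, then set up exactly one version.
        unset MKLROOT TBBROOT IPPROOT MIC_LD_LIBRARY_PATH MIC_LIBRARY_PATH
        unset LD_LIBRARY_PATH LIBRARY_PATH CPATH NLSPATH INCLUDE
        export PATH=/usr/local/bin:/usr/bin:/bin   # back to a clean baseline PATH

        # Path is an example; point this at whichever compiler version you want:
        source /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64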

    Intel Premier Support are now looking into the performance issue for me.

    Thank you very much for all your suggestions and for taking the time to help with this one.  

    JJK
    New Contributor III

    Excellent to hear that!

    I'm curious about the icc/ifort v15 solution, as I've been bitten by performance issues in the v15 suite as well.

    Frances_R_Intel
    Employee

    Fiona,

    I am also glad to hear you have found out where the problem was located and that the problem has been passed on to the people at Intel Premier Support. Out of curiosity, have you tried the beta version of the 2016 compiler? 

    As to switching between compiler versions, as you have found, the setup script is not designed for that. What is really needed is something like the module utility. Depending on your Linux distribution, it might be installed already or you might need to download it from sourceforge. Unfortunately, to make it work, you need to write good module configuration files that will remove and add just exactly those parts of each environment variable needed as you switch back and forth. It gets really confusing if you want to switch between tool X1 and tool X2 but you also have tool Y loaded which sets environment variable FOO to 6 if you are using X1 and to 9 if you are using X2. Add to this the need to also look out for environment variables used by your local tools and you can see why Intel would choose to distribute just a simple shell script. You have your own shell script now which meets your needs but I mention this in case others might have run into this problem as well.
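
    For anyone in the same position, typical usage of the module utility looks roughly like this - the module names are site-specific examples, not something Intel ships:

        # Start from a clean slate, then load exactly one compiler stack:
        module purge
        module load intel-compilers/14.0.2 intel-mpi/4.1.3

        # Swapping versions removes the old PATH/LD_LIBRARY_PATH entries and
        # adds the new ones in one step:
        module swap intel-compilers/14.0.2 intel-compilers/15.0.2
        module list    # confirm what is actually loaded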

    Frances

     

    Fiona_R_
    Beginner

    Hi Frances, 

    Many thanks for your advice, all very helpful. 

    I did try the 16.0.0 beta on a machine I had access to briefly, and my code failed to compile, with the compiler throwing an internal error and a segmentation violation. I'll need to wait until our system has this installed before I can retry it. 

    Fiona

     
