
Intel MKL linpack benchmark gets killed on Xeon Phi

JJK
New Contributor III

Hi all,

I've got a weird problem: I wanted to test the GFLOPS performance of the Xeon Phis that are entrusted to me: 2x Xeon Phi 5110P and 1x Xeon Phi 7120. I read that the linpack benchmark is included in Intel's MKL and that it ships with a Xeon Phi version, so I grabbed the binaries and ran them on my Xeon Phis.
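
For reference, copying the benchmark onto a card looked roughly like this (a sketch; the exact paths depend on the MKL/compiler version and install prefix):

# on the host: copy the MIC linpack kit and the MIC OpenMP runtime to the card
scp -r /opt/intel/mkl/benchmarks/linpack mic0:linpack
scp /opt/intel/composerxe/lib/mic/libiomp5.so mic0:linpack/
ssh mic0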

On the 7120 (with mpss 3.3.2) the benchmark runs fine:

Thu Feb 12 16:58:54 CET 2015
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Thu Feb 12 16:58:54 2015

CPU frequency:    1.238 GHz
Number of CPUs: 1
Number of cores: 244
Number of threads: 244

Parameters are set to:

Number of tests: 14
Number of equations to solve (problem size) : 2048  4096  6144  8192  10240 12288 14336 16384 18432 20480 22528 24576 26624 28672
Leading dimension of array                  : 2112  6208  6208  8256  10304 12352 14400 18496 18496 20544 22592 26688 26688 28736
Number of trials to run                     : 3     3     3     3     3     3     3     3     3     3     3     3     3     3    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     4     4     4    

Maximum memory requested that can be used=6591927552, at the size=28672
Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
2048   2112   4       62.4610  89.8029 
4096   6208   4       254.9105 260.5183
6144   6208   4       399.6637 404.3374
8192   8256   4       484.3184 491.6444
10240  10304  4       577.4737 587.8460
12288  12352  4       639.3712 643.3008
14336  14400  4       696.0603 701.3388
16384  18496  4       744.9810 748.8416
18432  18496  4       788.7247 791.7044
20480  20544  4       818.3679 820.8570
22528  22592  4       846.7491 848.7561
24576  26688  4       868.7217 870.2109
26624  26688  4       884.2233 885.7552
28672  28736  4       896.8622 896.9412

Residual checks PASSED

End of test

 

However, on both 5110Ps (with mpss 3.4.2) the benchmark gets killed before it completes!

mic0 $ cd linpack/
mic0 $ export LD_LIBRARY_PATH=$PWD
mic0 $ ./runme_mic 
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Fri Feb 13 10:01:12 CET 2015
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Fri Feb 13 10:01:12 2015

CPU frequency:    1.053 GHz
Number of CPUs: 1
Number of cores: 240
Number of threads: 240

Parameters are set to:

Number of tests: 14
Number of equations to solve (problem size) : 2048  4096  6144  8192  10240 12288 14336 16384 18432 20480 22528 24576 26624 28672
Leading dimension of array                  : 2112  6208  6208  8256  10304 12352 14400 18496 18496 20544 22592 26688 26688 28736
Number of trials to run                     : 3     3     3     3     3     3     3     3     3     3     3     3     3     3    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     4     4     4    

Maximum memory requested that can be used=6591927552, at the size=28672

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
2048   2112   4      0.596      9.6303   4.795780e-12 3.950479e-02   pass
2048   2112   4      0.073      78.7107  4.795780e-12 3.950479e-02   pass
2048   2112   4      0.074      77.8766  4.795780e-12 3.950479e-02   pass
4096   6208   4      0.214      214.2289 2.216840e-11 4.613649e-02   pass
4096   6208   4      0.203      225.7619 2.216840e-11 4.613649e-02   pass
4096   6208   4      0.204      224.5814 2.216840e-11 4.613649e-02   pass
6144   6208   4      0.457      338.6425 3.562570e-11 3.301736e-02   pass
6144   6208   4      0.445      347.2770 3.562570e-11 3.301736e-02   pass
6144   6208   4      0.446      346.9953 3.562570e-11 3.301736e-02   pass
8192   8256   4      0.900      407.1775 7.232445e-11 3.782865e-02   pass
8192   8256   4      0.869      421.7898 7.232445e-11 3.782865e-02   pass
8192   8256   4      0.867      422.8278 7.232445e-11 3.782865e-02   pass
10240  10304  4      1.449      494.0793 1.010026e-10 3.389721e-02   pass
10240  10304  4      1.373      521.5753 1.010026e-10 3.389721e-02   pass
10240  10304  4      1.371      522.2989 1.010026e-10 3.389721e-02   pass
12288  12352  4      2.241      552.0942 1.454923e-10 3.393283e-02   pass
12288  12352  4      2.184      566.5285 1.454923e-10 3.393283e-02   pass
12288  12352  4      2.185      566.1465 1.454923e-10 3.393283e-02   pass
14336  14400  4      3.313      592.9472 2.006193e-10 3.448820e-02   pass
14336  14400  4      3.228      608.5453 2.006193e-10 3.448820e-02   pass
14336  14400  4      3.224      609.3674 2.006193e-10 3.448820e-02   pass
16384  18496  4      4.621      634.5835 2.524725e-10 3.324476e-02   pass
16384  18496  4      4.462      657.1922 2.524725e-10 3.324476e-02   pass
16384  18496  4      4.461      657.3274 2.524725e-10 3.324476e-02   pass
./runme_mic: line 45:  5271 Killed                  ./xlinpack_$arch lininput_$arch
Done: Fri Feb 13 10:05:15 CET 2015

 

How can I debug this? A 'gdb' run shows nothing; it just states that all threads get killed. The "runme_mic" script is from the MKL kit itself:

#!/bin/sh
[....]
echo "This is a SAMPLE run script for SMP LINPACK. Change it to reflect"
echo "the correct number of CPUs/threads, problem input files, etc.."

#    Setting up affinity for better threading performance
export KMP_AFFINITY=explicit,granularity=fine,proclist=[1-$(($(cat /proc/cpuinfo|grep proc|wc -l)-1)),0]
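#    (The explicit proclist binds OpenMP threads to logical CPUs 1..N-1 first
#     and to logical CPU 0 last, so the CPU conventionally reserved for the
#     coprocessor OS is the last one to receive a compute thread.)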

arch=mic
{
  date
  ./xlinpack_$arch lininput_$arch
  echo -n "Done: "
  date
} | tee lin_$arch.txt

What's going wrong, and how can I debug this? I've tried binaries built with both the Intel v14 and Intel v15 compilers.
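
One thing that seems worth checking (a sketch, on the assumption that the kernel's OOM killer is what sends the kill signal) is the card's kernel log right after the run dies:

# on the card, look for out-of-memory kills in the kernel log
ssh mic0 'dmesg | grep -i -E "out of memory|killed process" | tail'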

 

jimdempseyatthecove
Honored Contributor III

Run micsmc on the host and monitor memory usage.

1) First, look at available RAM before starting the program.
2) Second, look at available RAM as the program runs.

You might find that step 1) already shows less available RAM on the card where the benchmark fails. This may be due to the RAM disk having too many files loaded into it.

*** do this for both MICs and observe the difference ***
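
Something like this from the host is usually enough to watch it live (a rough sketch; output formatting varies with the MPSS version):

# refresh coprocessor memory stats every 2 seconds while the benchmark runs
watch -n 2 micsmc -m
# or query a card directly
ssh mic0 'head -n 3 /proc/meminfo'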

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I also notice that the outputs of the two tests are different. Is the second one writing its outputs (one per test) to files on the RAMDISK?

Jim Dempsey

JJK
New Contributor III

Hi Jim,

 

Thanks for the pointer on the RAM usage - I am auto-installing some RPMs on the Phis, and they were eating up just enough memory on the 5110Ps to cause the linpack benchmark to fail. On the 7120 there is 16 GB of RAM, so the problem never occurs there.

With the ramdisk (root partition) as small as possible I can now successfully run the linpack benchmark on the 5110P's as well.

It might be worth mentioning more explicitly in the documentation that all auto-installed RPMs have a direct effect on the RAM available to the applications running on the Phi. I'm using:

# cat /etc/mpss/conf.d/rpm.conf 
Overlay rpm /var/mpss/rpm on
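
For anyone else hitting this, the change on my side amounted to something like the following (a sketch; the exact way to regenerate the card filesystem differs between MPSS releases):

# stop MPSS, disable the RPM overlay, then bring the cards back up
sudo service mpss stop
sudo sed -i 's/^Overlay rpm/#Overlay rpm/' /etc/mpss/conf.d/rpm.conf
sudo service mpss start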

For reference, with the RPMs loaded 'micsmc -m' reports

# micsmc -m

mic0 (mem):
   Free Memory: ............. 7158.93 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 538.68 MB

and without

# micsmc -m

mic0 (mem):
   Free Memory: ............. 7397.18 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 300.43 MB

That extra ~240 MB is just enough to cause linpack to crash, even though it states that it's grabbing "only" 6591927552 bytes (= 6286.55 MB) of memory.

Ticket closed.

jimdempseyatthecove
Honored Contributor III

Don't forget that you also have 240 x the per-thread stack size. If you are in the habit of being overly generous with your stack size, it can eat up memory fast.
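
Back-of-the-envelope: with a 4 MB per-thread stack (a typical OpenMP default), 240 threads already account for roughly 960 MB before the matrix is allocated. If memory is tight, something like this (a sketch) reclaims most of it:

# shrink the per-thread OpenMP stacks before launching the benchmark
export KMP_STACKSIZE=2m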

Jim Dempsey

Raphael_F_
Beginner

I am having the same issue. The lowest Free Memory value I see in 'micsmc -m' while it runs is around 80 MB. All I can get from the output is the following:

Maximum memory requested that can be used=5247612160, at the size=24576

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check

Is there something I have to change in the script? I already tried 'export KMP_STACKSIZE=1M' and 'ulimit -s unlimited'.

Frances_R_Intel
Employee

Your KMP_STACKSIZE is way too small. You should start with the default - which I believe is 4 MB - and go up from there as needed. Obviously that is a problem if you really only have 80 MB to play with.

So, what is your free memory, with nothing running on the coprocessor? 

Raphael_F_
Beginner

Running nothing, I get the following:

   Free Memory: ............. 4678.87 MB
   Total Memory: ............ 5740.88 MB
   Memory Usage: ............ 1062.02 MB

 

JJK
New Contributor III

Looks like you've got a 3000 series Phi with 6 GB of RAM; I actually have no idea how to get linpack to run on that - hopefully someone from Intel will be able to tell us :)

 

Frances_R_Intel
Employee

With no user codes running, a memory usage of 1062 MB is very high. If you are installing extra rpm files, you might want to go back through them and see what you can do without. You might also want to run 'top' and look for any programs you didn't realize were there that are consuming a lot of extra memory. With the kernel and the daemons, you should find maybe a dozen programs using any significant memory; mpssd and coi_daemon will probably be the largest things you find. Be on the lookout for anything using more memory than those.

As far as adapting the Linpack benchmarks for systems with smaller memory, './xlinpack_mic -e' should print out the extended help which tells you how to modify the input files. Basically, omit any of the tests where 8 * problem size * leading dimension won't fit comfortably on your system.
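
For example, a rough way to check what fits and to trim the input file (a sketch; the layout below follows the shipped sample input files, but verify the format against the extended help for your version):

# dominant matrix footprint for one test: bytes ~ 8 * problem_size * leading_dimension
awk 'BEGIN { n = 18432; lda = 18496; printf "%.0f MB\n", 8*n*lda/1048576 }'

# keep only the sizes that leave comfortable headroom, e.g.:
cat > lininput_mic <<'EOF'
Sample Intel(R) Optimized LINPACK Benchmark data file (lininput_mic)
Intel(R) Optimized LINPACK Benchmark data
6                                   # number of tests
2048  4096  6144  8192  10240 12288 # problem sizes
2112  6208  6208  8256  10304 12352 # leading dimensions
3     3     3     3     3     3     # number of trials
4     4     4     4     4     4     # alignment values (KBytes)
EOF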
