Intel® oneAPI Math Kernel Library

Problem with Hpcc, MPIFFT hangs at large scale

xuzheng97
Beginner
Hi,

I successfully compiled HPCC with MKL and ran it on 96 cores with N=120000.

The following two pages helped me a lot:
http://origin-software.intel.com/en-us/articles/performance-tools-for-software-developers-use-of-intel-mkl-in-hpcc-benchmark/
http://software.intel.com/en-us/forums/showthread.php?t=77727&o=d&s=lr

But when I tested on 192 cores with N=208000, the program hangs in the MPIFFT part.
The hpcc processes were still alive, but there was no further output.
The configuration in hpccinf.txt is taken from a result submitted on the HPCC website.

The following is my output:

########################################################################
This is the DARPA/DOE HPC Challenge Benchmark version 1.4.1 October 2003
Produced by Jack Dongarra and Piotr Luszczek
Innovative Computing Laboratory
University of Tennessee Knoxville and Oak Ridge National Laboratory
See the source files for authors of specific codes.
Compiled on Oct 9 2010 at 04:36:49
Current time (1286615973) is Sat Oct 9 05:19:33 2010
Hostname: 'node01'
########################################################################
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 208000
NB : 168
PMAP : Row-major process mapping
P : 6
Q : 32
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 2.220446e-16
- Computational tests pass if scaled residuals are less than 16.0
Begin of MPIRandomAccess section.
Running on 192 processors
Total Main table size = 2^35 = 34359738368 words
PE Main table size = (2^35)/192 = 178956971 words/PE MAX
Default number of updates (RECOMMENDED) = 137438953472
CPU time used = 147.169627 seconds
Real time used = 148.418957 seconds
0.926020208 Billion(10^9) Updates per second [GUP/s]
0.004823022 Billion(10^9) Updates/PE per second [GUP/s]
Verification: CPU time used = 107.623639 seconds
Verification: Real time used = 108.852249 seconds
Found 63008 errors in 34359738368 locations (passed).
Current time (1286616234) is Sat Oct 9 05:23:54 2010
End of MPIRandomAccess section.
Begin of StarRandomAccess section.
Main table size = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used = 32.956989 seconds
Real time used = 32.959574 seconds
0.016288770 Billion(10^9) Updates per second [GUP/s]
Found 0 errors in 134217728 locations (passed).
Node(s) with error 0
Minimum GUP/s 0.016122
Average GUP/s 0.016524
Maximum GUP/s 0.016834
Current time (1286616300) is Sat Oct 9 05:25:00 2010
End of StarRandomAccess section.
Begin of SingleRandomAccess section.
Node(s) with error 0
Node selected 89
Single GUP/s 0.037591
Current time (1286616328) is Sat Oct 9 05:25:28 2010
End of SingleRandomAccess section.
Begin of MPIRandomAccess_LCG section.
Running on 192 processors
Total Main table size = 2^35 = 34359738368 words
PE Main table size = (2^35)/192 = 178956971 words/PE MAX
Default number of updates (RECOMMENDED) = 137438953472
CPU time used = 144.796987 seconds
Real time used = 145.966314 seconds
0.941579942 Billion(10^9) Updates per second [GUP/s]
0.004904062 Billion(10^9) Updates/PE per second [GUP/s]
Verification: CPU time used = 103.628247 seconds
Verification: Real time used = 104.397066 seconds
Found 65536 errors in 34359738368 locations (passed).
Current time (1286616579) is Sat Oct 9 05:29:39 2010
End of MPIRandomAccess_LCG section.
Begin of StarRandomAccess_LCG section.
Main table size = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used = 33.080971 seconds
Real time used = 33.084620 seconds
0.016227205 Billion(10^9) Updates per second [GUP/s]
Found 0 errors in 134217728 locations (passed).
Node(s) with error 0
Minimum GUP/s 0.016004
Average GUP/s 0.016432
Maximum GUP/s 0.016728
Current time (1286616646) is Sat Oct 9 05:30:46 2010
End of StarRandomAccess_LCG section.
Begin of SingleRandomAccess_LCG section.
Node(s) with error 0
Node selected 89
Single GUP/s 0.037095
Current time (1286616673) is Sat Oct 9 05:31:13 2010
End of SingleRandomAccess_LCG section.
Begin of PTRANS section.
M: 104000
N: 104000
MB: 168
NB: 168
P: 6
Q: 32
TIME M N MB NB P Q TIME CHECK GB/s RESID
---- ----- ----- --- --- --- --- -------- ------ -------- -----
WALL 104000 104000 168 168 6 32 2.83 PASSED 30.526 0.00
CPU 104000 104000 168 168 6 32 2.83 PASSED 30.612 0.00
WALL 104000 104000 168 168 6 32 2.93 PASSED 29.512 0.00
CPU 104000 104000 168 168 6 32 2.92 PASSED 29.597 0.00
WALL 104000 104000 168 168 6 32 3.04 PASSED 28.430 0.00
CPU 104000 104000 168 168 6 32 3.03 PASSED 28.524 0.00
WALL 104000 104000 168 168 6 32 2.86 PASSED 28.430 0.00
CPU 104000 104000 168 168 6 32 2.85 PASSED 30.355 0.00
WALL 104000 104000 168 168 6 32 2.93 PASSED 28.430 0.00
CPU 104000 104000 168 168 6 32 2.92 PASSED 29.617 0.00
Finished 5 tests, with the following results:
5 tests completed and passed residual checks.
0 tests completed and failed residual checks.
0 tests skipped because of illegal input values.
END OF TESTS.
Current time (1286616720) is Sat Oct 9 05:32:00 2010
End of PTRANS section.
Begin of StarDGEMM section.
Scaled residual: 0.00407323
Node(s) with error 0
Minimum Gflop/s 10.504862
Average Gflop/s 11.179782
Maximum Gflop/s 11.460022
Current time (1286616851) is Sat Oct 9 05:34:11 2010
End of StarDGEMM section.
Begin of SingleDGEMM section.
Node(s) with error 0
Node selected 178
Single DGEMM Gflop/s 11.575138
Current time (1286616969) is Sat Oct 9 05:36:09 2010
End of SingleDGEMM section.
Begin of StarSTREAM section.
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 75111111, Offset = 0
Total memory required = 1.6789 GiB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 343454 microseconds.
(= 343454 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (GB/s) Avg time Min time Max time
Copy: 3.3981 0.3554 0.3537 0.3571
Scale: 3.3042 0.3696 0.3637 0.3738
Add: 3.3802 0.5340 0.5333 0.5353
Triad: 3.6174 0.5037 0.4983 0.5103
-------------------------------------------------------------
Results Comparison:
Expected : 86625702996855472128.000000 17325140599371094016.000000 23100187465828126720.000000
Observed : 86625703071556763648.000000 17325140607756191744.000000 23100187473264709632.000000
Solution Validates
-------------------------------------------------------------
Node(s) with error 0
Minimum Copy GB/s 3.309907
Average Copy GB/s 3.372523
Maximum Copy GB/s 3.401539
Minimum Scale GB/s 3.281044
Average Scale GB/s 3.363702
Maximum Scale GB/s 3.391891
Minimum Add GB/s 3.368173
Average Add GB/s 3.418982
Maximum Add GB/s 3.486998
Minimum Triad GB/s 3.472837
Average Triad GB/s 3.526081
Maximum Triad GB/s 3.629542
Current time (1286616989) is Sat Oct 9 05:36:29 2010
End of StarSTREAM section.
Begin of SingleSTREAM section.
Node(s) with error 0
Node selected 74
Single STREAM Copy GB/s 8.495534
Single STREAM Scale GB/s 8.332597
Single STREAM Add GB/s 11.092103
Single STREAM Triad GB/s 11.051167
Current time (1286616996) is Sat Oct 9 05:36:36 2010
End of SingleSTREAM section.
Begin of MPIFFT section.

The program hangs here with no further output.

Also, my Make.em64t is as follows:
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = $(arch)
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = ../../..
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /opt/intel/impi/4.0.0.028
MPinc = -I$(MPdir)/include64
MPlib =
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /opt/intel/mkl/lib/em64t
LAinc = -I/opt/intel/mkl/include/fftw
LAlib = $(LAdir)/libfftw2x_cdft_DOUBLE_lp64.a $(LAdir)/libfftw2xc_intel.a -Wl,--start-group $(LAdir)/libmkl_intel_lp64.a $(LAdir)/libmkl_sequential.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_lp64.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group -lpthread
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. **One and only one** option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_ : all lower case and a suffixed underscore (Suns,
# Intel, ...), [default]
# -DNoChange : all lower case (IBM RS6000),
# -DUpCase : all upper case (Cray),
# -DAdd__ : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle : The string address is passed at the string loca-
# tion on the stack, and the string length is then
# passed as an F77_INTEGER after all explicit
# stack arguments, [default]
# -DStringStructPtr : The address of a structure is passed by a
# Fortran 77 string, and the structure is of the
# form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal : A structure is passed by value for each Fortran
# 77 string, and the structure is of the form:
# struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle : Special option for Cray machines, which uses
# Cray fcd (fortran character descriptor) for
# interoperation.
#
F2CDEFS = -DF77_INTEGER=long -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS call the cblas interface;
# -DHPL_CALL_VSIPL call the vsip library;
# -DHPL_DETAILED_TIMING enable detailed timers;
#
# By default HPL will:
# *) not copy L before broadcast,
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
HPL_OPTS =
#
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = mpiicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -O2 -xSSE4.2 -ip -ansi-alias -fno-alias
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = mpiicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------

I have also tried -DF77_INTEGER=int.

Thanks

xuzheng97
Beginner

By the way, I did not apply the fftw2xc_patch.diff part because I am not using hpcc 1.3.1.
Could this be the reason?

xuzheng97
Beginner
I just tested the fftw2xc_patch.diff part, but it did not change anything at all.
Gennady_F_Intel
Moderator
Kevin, what MKL version do you use?
xuzheng97
Beginner
Gennady,

I tried both MKL 10.2.5.035 and the corresponding MKL in Intel Compiler Suite version 11.1.072, but both hang.

Thanks
Vladimir_Petrov__Int
New Contributor III
Hi Kevin,

Increasing the core count (with the corresponding increase of the parameter N) results in a longer vector being used for MPI FFT. As you may have already noticed, HPCC is not very well suited for lengths bigger than MAX_INT.

My first guess is that you are crossing this bound going from 96 to 192 cores.

Please provide the following info:
- how you built the MPI FFTW wrappers
- compile line for file mpifft.c
- link line for the hpcc exe
- mpiexec line
so that I can give you better advice.

BTW, are you setting OMP_NUM_THREADS to 1? You have to do this unless your hpcc exe is compiled and linked with -mt_mpi (assuming you are using Intel MPI).
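If not, something like the following would do it (just a sketch; -genv is Intel MPI's standard way of propagating an environment variable to every rank, and the process count below is only a placeholder):

# keep MKL/OpenMP single-threaded on every rank
export OMP_NUM_THREADS=1
mpiexec -genv OMP_NUM_THREADS 1 -n <nprocs> ./hpcc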

Best regards,
-Vladimir
xuzheng97
Beginner
Vladimir,

Thanks.

Yes, a large N does indeed increase the MPI FFT time. Based on my small-N experience, the FFT time should be less than HPL's, yet I waited more than 3 hours while HPL only took about 1 hour.
Is MAX_INT = 2^31 - 1 = 2,147,483,647?
For 96 cores the MPIFFT N = 2,654,208,000; for 192 cores the MPIFFT N = 5,374,771,200.
It seems both of them exceed MAX_INT, yet the 96-core run passed the test.

I have also tried -DF77_INTEGER=long, so I am not sure whether MAX_INT then becomes 64-bit (2^63 - 1)?
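Just to lay the numbers out against the standard 32-bit limits, here is a throwaway check I compiled separately (nothing HPCC-specific; the two lengths are the MPIFFT vector sizes quoted above):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* MPIFFT vector lengths for the two runs */
    long long n96  = 2654208000LL;   /*  96 cores */
    long long n192 = 5374771200LL;   /* 192 cores */

    printf("INT_MAX  = %d\n", INT_MAX);    /* 2,147,483,647 */
    printf("UINT_MAX = %u\n", UINT_MAX);   /* 4,294,967,295 */
    /* both lengths exceed INT_MAX, but only the 192-core one
       also exceeds UINT_MAX */
    printf("96  cores: %lld > INT_MAX?  %s\n", n96,  n96  > INT_MAX ? "yes" : "no");
    printf("192 cores: %lld > UINT_MAX? %s\n", n192, n192 > (long long)UINT_MAX ? "yes" : "no");
    return 0;
}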

But I did find Intel's 768-core result on the HPCC website with MPIFFT N = 22,932,357,120,
and my parameter configuration was based entirely on Intel's result.

- how you built the MPI FFTW wrappers
make libem64t PRECISION=MKL_DOUBLE
Both with and without fftw2xc_patch.diff were tried.

- compile line for file mpifft.c

mpiicc -o ../../../../FFT/mpifft.o -c ../../../../FFT/mpifft.c -I../../../../include -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235 -I../../../include -I../../../include/em64t -I/mydirectory/hpcc-1.4.1/mkl/include/fftw -I/opt/intel/impi/4.0.0.028/include64 -O2 -xSSE4.2 -ip -ansi-alias -fno-alias

- link line for the hpcc exe
mpiicc -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235 -I../../../include -I../../../include/em64t -I/opt/intel/mkl/include/fftw -I/opt/intel/impi/4.0.0.028/include64 -O2 -xSSE4.2 -ip -ansi-alias -fno-alias -o ../../../../hpcc ../../../lib/em64t/libhpl.a /opt/intel/mkl/lib/em64t/libfftw2x_cdft_DOUBLE_lp64.a /opt/intel/mkl/lib/em64t/libfftw2xc_intel.a -Wl,--start-group /opt/intel/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/mkl/lib/em64t/libmkl_sequential.a /opt/intel/mkl/lib/em64t/libmkl_core.a /opt/intel/mkl/lib/em64t/libmkl_blacs_intelmpi_lp64.a /opt/intel/mkl/lib/em64t/libmkl_cdft_core.a -Wl,--end-group -lpthread

- mpiexec line
mpiexec -perhost 12 -n 192 ./hpcc


I did not set OMP_NUM_THREADS, and it also seems that the hpcc exe was compiled and linked with -mt_mpi.
I will try setting OMP_NUM_THREADS=1 and post an update soon.

Thanks & Best Regards

Vladimir_Petrov__Int
New Contributor III
Kevin,

The problem is here:

- how you built the MPI FFTW wrappers
make libem64t PRECISION=MKL_DOUBLE

Please add "interface=ilp64" like this:
make libem64t PRECISION=MKL_DOUBLE interface=ilp64
which lets hpcc pass 64-bit ints to MKL.
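For reference, the -DMKL_INT=long define already in your F2CDEFS is presumably what widens the integers on the hpcc side; interface=ilp64 builds the wrapper library to match. A trivial standalone check of what that define does (just a sketch, with a hypothetical file name, needing nothing beyond the MKL headers):

/* check_mkl_int.c -- hypothetical name, not part of hpcc */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* expect 8 here when built with -DMKL_INT=long (or -DMKL_ILP64) on 64-bit Linux */
    printf("sizeof(MKL_INT) = %zu bytes\n", sizeof(MKL_INT));
    return 0;
}

Compile it with something like: icc -DMKL_INT=long -I/opt/intel/mkl/include check_mkl_int.c -o check_mkl_int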

Of course this is a mistake in our knowledge base article. Thank you for locating it!

Best regards,
-Vladimir
xuzheng97
Beginner
Vladimir,

I built the MPI MKL FFTW library as you suggested and switched the corresponding library from *_lp64.a to *_ilp64.a.
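Concretely, that swap in the LAlib line would look something like this (a sketch, assuming the interface=ilp64 build names the wrapper archive libfftw2x_cdft_DOUBLE_ilp64.a; the rest of the line is unchanged from the Make.em64t above):

LAlib = $(LAdir)/libfftw2x_cdft_DOUBLE_ilp64.a $(LAdir)/libfftw2xc_intel.a -Wl,--start-group $(LAdir)/libmkl_intel_lp64.a $(LAdir)/libmkl_sequential.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_lp64.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group -lpthread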

It passed the MPIFFT section but exhausted all system memory during the StarFFT section.
The output is as follows:

Begin of MPIFFT section.
Warning: problem size too large: 135000*192*192
Number of nodes: 192
Vector size: 4976640000
Generation time: 1.172
Tuning: 3.690
Computing: 9.326
Inverse FFT: 10.957
max(|x-x0|): 2.914e-15
Gflop/s: 85.944
Current time (1286865020) is Tue Oct 12 02:30:20 2010
End of MPIFFT section.
Begin of StarFFT section.



The system here has 16 nodes; each node has 2x X5670 CPUs, 24 GB of memory, and 4x QDR InfiniBand.
It seems to be the same configuration as Intel's at http://icl.cs.utk.edu/hpcc/hpcc_record.cgi?id=414.
And I am using HPL N=200000, PTRANS N=100000, and NB=168, P=6, Q=32, which is even smaller than the configuration on the website.

Could you give me some further help on this?

Thanks & Best Regards
Vladimir_Petrov__Int
New Contributor III
Kevin,

It's good to hear that the MPIFFT section passes now.

As for the StarFFT section, large memory consumption is unfortunately a known problem in older versions of MKL.
It is fixed in version 10.3.0.

Best regards,
-Vladimir
xuzheng97
Beginner
Vladimir,

Oh, I am using the MKL included in Intel Compiler Suite 11.1.072.

I will try 10.3 soon.

Thanks for your kind help.

Best Regards