Problem with HPCC, Intel MPI and Intel MKL: PTRANS failed

Guillaume_De_Nayer · 10-01-2010

Hi,

We have a small cluster (4 nodes, 12 cores each) and I'm trying to run hpcc on it. I have read:
http://origin-software.intel.com/en-us/articles/performance-tools-for-software-developers-use-of-intel-mkl-in-hpcc-benchmark/
and followed it step by step. My make.arch is:

#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = $(arch)
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = ../../..
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /opt/intel/impi/4.0.0
MPinc = -I$(MPdir)/include64
MPlib = -L$(MPdir)/lib64
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /opt/intel/mkl/10.2.5.035/lib/em64t
LAdir_local = ~/tmp/flops_test/mkl_local/lib/em64t
LAinc = -I/opt/intel/mkl/10.2.5.035/include -I/opt/intel/mkl/10.2.5.035/include/fftw
LAlib = $(LAdir)/libmkl_solver_ilp64.a -Wl,--start-group $(LAdir)/libmkl_intel_ilp64.a $(LAdir)/libmkl_intel_thread.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_ilp64.a $(LAdir_local)/libfftw2x_cdft_DOUBLE.a $(LAdir_local)/libfftw2xc_intel.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group -openmp -lpthread

#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
#
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) -lm
#
# - Compile time options -----------------------------------------------
#
HPL_OPTS = -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS
#
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -O2 -xSSE4.2 -ansi-alias -ip
#
LINKER = mpicc
LINKFLAGS =
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------

It compiles without errors. Then I take the input file and start the program with mpirun -np 8 hpcc on our master. I get:
mpirun -np 8 hpcc

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
rank 6 in job 1 master_34154 caused collective abort of all ranks
exit status of rank 6: killed by signal 9
rank 0 in job 1 master_34154 caused collective abort of all ranks
exit status of rank 0: killed by signal 11

and in the output file is:
########################################################################
This is the DARPA/DOE HPC Challenge Benchmark version 1.4.1 October 2003
Produced by Jack Dongarra and Piotr Luszczek
Innovative Computing Laboratory
University of Tennessee Knoxville and Oak Ridge National Laboratory

See the source files for authors of specific codes.
Compiled on Oct 1 2010 at 09:16:38
Current time (1285919752) is Fri Oct 1 09:55:52 2010

Hostname: 'master'
########################################################################
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 10240
NB : 128
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 2.220446e-16
- Computational tests pass if scaled residuals are less than 16.0

Begin of MPIRandomAccess section.
Running on 8 processors (PowerofTwo)
Total Main table size = 2^26 = 67108864 words
PE Main table size = 2^23 = 8388608 words/PE
Default number of updates (RECOMMENDED) = 268435456
CPU time used = 10.461410 seconds
Real time used = 18.035589 seconds
0.014883653 Billion(10^9) Updates per second [GUP/s]
0.001860457 Billion(10^9) Updates/PE per second [GUP/s]
Verification: CPU time used = 1.407786 seconds
Verification: Real time used = 1.413994 seconds
Found 0 errors in 67108864 locations (passed).
Current time (1285919771) is Fri Oct 1 09:56:11 2010

End of MPIRandomAccess section.
Begin of StarRandomAccess section.
Main table size = 2^23 = 8388608 words
Number of updates = 33554432
CPU time used = 1.022845 seconds
Real time used = 1.023517 seconds
0.032783467 Billion(10^9) Updates per second [GUP/s]
Found 0 errors in 8388608 locations (passed).
Node(s) with error 0
Minimum GUP/s 0.032325
Average GUP/s 0.032776
Maximum GUP/s 0.033038
Current time (1285919773) is Fri Oct 1 09:56:13 2010

End of StarRandomAccess section.
Begin of SingleRandomAccess section.
Node(s) with error 0
Node selected 2
Single GUP/s 0.050258
Current time (1285919775) is Fri Oct 1 09:56:15 2010

End of SingleRandomAccess section.
Begin of MPIRandomAccess_LCG section.
Running on 8 processors (PowerofTwo)
Total Main table size = 2^26 = 67108864 words
PE Main table size = 2^23 = 8388608 words/PE
Default number of updates (RECOMMENDED) = 268435456
CPU time used = 11.008327 seconds
Real time used = 18.597349 seconds
0.014434071 Billion(10^9) Updates per second [GUP/s]
0.001804259 Billion(10^9) Updates/PE per second [GUP/s]
Verification: CPU time used = 1.382789 seconds
Verification: Real time used = 1.386738 seconds
Found 0 errors in 67108864 locations (passed).
Current time (1285919795) is Fri Oct 1 09:56:35 2010

End of MPIRandomAccess_LCG section.
Begin of StarRandomAccess_LCG section.
Main table size = 2^23 = 8388608 words
Number of updates = 33554432
CPU time used = 1.036842 seconds
Real time used = 1.037567 seconds
0.032339536 Billion(10^9) Updates per second [GUP/s]
Found 0 errors in 8388608 locations (passed).
Node(s) with error 0
Minimum GUP/s 0.032164
Average GUP/s 0.032349
Maximum GUP/s 0.032528
Current time (1285919797) is Fri Oct 1 09:56:37 2010

End of StarRandomAccess_LCG section.
Begin of SingleRandomAccess_LCG section.
Node(s) with error 0
Node selected 7
Single GUP/s 0.048609
Current time (1285919798) is Fri Oct 1 09:56:38 2010

End of SingleRandomAccess_LCG section.
Begin of PTRANS section.
M: 5120
N: 5120
MB: 128
NB: 128
P: 2
Q: 4
TIME M N MB NB P Q TIME CHECK GB/s RESID
---- ----- ----- --- --- --- --- -------- ------ -------- -----

So it seems to crash in PTRANS. How can I solve this problem?
Thx a lot,
Regards

Guillaume_De_Nayer · 10-01-2010

I made a mistake:
I just copied make.UNKNOWN from the hpl setup. If I change -DF77_INTEGER=int to -DF77_INTEGER=long, the PTRANS test runs without problems.
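
For reference, this is the one line of my make.arch that changes (a sketch, everything else stays as posted above). My understanding is that F77_INTEGER has to be 64 bits wide because the MKL libraries in LAlib are the ILP64 ones, and HPL_OPTS already sets -DMKL_INT=long:

```make
# F77 / C interface: integers passed to the ILP64 MKL routines must be
# 64-bit, so F77_INTEGER is long here (it was int in make.UNKNOWN).
# Assumes long is 64 bits on this platform, as -DLONG_IS_64BITS states.
F2CDEFS = -DAdd_ -DF77_INTEGER=long -DStringSunStyle
```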

Now I get a *** glibc detected *** hpcc: free(): invalid pointer: 0x0000003647952a88 *** at the beginning of the StarFFT section.
Any ideas?

Regards,

Vladimir_Petrov__Int · 10-01-2010

Hi,

I suggest that you change LAlib so that the fftw2x wrappers come before the MKL interface library, like this:

LAlib = $(LAdir_local)/libfftw2x_cdft_DOUBLE.a $(LAdir_local)/libfftw2xc_intel.a $(LAdir)/libmkl_solver_ilp64.a -Wl,--start-group $(LAdir)/libmkl_intel_ilp64.a $(LAdir)/libmkl_intel_thread.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_ilp64.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group -openmp -lpthread

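The reason behind this is that MKL's interface layer (i.e. libmkl_intel_ilp64.a) contains pre-built FFTW3 wrappers, and their implementations of fftw_free() are not compatible with the FFTW2 interface used by HPCC. With the fftw2x wrapper archives listed first, the linker resolves fftw_free() from them instead of from the interface library.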
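
If it helps to keep the ordering visible in make.arch, the same LAlib can be split into named pieces. This is only a sketch: FFTW2wrap and MKLgroup are hypothetical variable names introduced for illustration, and the library list is the one from the LAlib line above.

```make
# FFTW2 wrapper archives: listed first so their fftw_free() is picked up
# before the FFTW3 wrappers inside libmkl_intel_ilp64.a.
FFTW2wrap = $(LAdir_local)/libfftw2x_cdft_DOUBLE.a $(LAdir_local)/libfftw2xc_intel.a
# MKL archives with mutual dependencies, kept inside one linker group.
MKLgroup  = -Wl,--start-group $(LAdir)/libmkl_intel_ilp64.a $(LAdir)/libmkl_intel_thread.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_ilp64.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group
LAlib     = $(FFTW2wrap) $(LAdir)/libmkl_solver_ilp64.a $(MKLgroup) -openmp -lpthread
```

The expanded LAlib should be identical to the single-line version. The split only makes it harder to move the wrappers back inside the group by accident.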

Best regards,
-Vladimir

Guillaume_De_Nayer · 10-01-2010

perfect! it works! thx a lot!