I want to run HPL on a 2P, 36-core node:
export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=2
mpirun -np 18 --map-by l3cache ./xhpl
but HPL runs with only 18 active threads instead of 2 per each of the 18 tasks; each task uses only one core:
PID   USER PR NI VIRT    RES   SHR   S %CPU  %MEM TIME+   COMMAND
42052 user 20 0  3084748 69668 29116 R 100.0 0.0  0:42.47 xhpl
42058 user 20 0  3084744 69716 29168 R 100.0 0.0  0:42.47 xhpl
42062 user 20 0  3084748 73748 29100 R 100.0 0.0  0:42.47 xhpl
42063 user 20 0  3086792 71724 29128 R 100.0 0.0  0:42.44 xhpl
42067 user 20 0  3084748 73664 29024 R 100.0 0.0  0:42.46 xhpl
42050 user 20 0  53.3g   22.9g 29148 R 99.7  12.2 0:42.46 xhpl
42051 user 20 0  3086796 71744 29144 R 99.7  0.0  0:42.45 xhpl
42053 user 20 0  3084748 69692 29140 R 99.7  0.0  0:42.46 xhpl
42054 user 20 0  3084744 69720 29172 R 99.7  0.0  0:42.45 xhpl
42055 user 20 0  3084744 73768 29124 R 99.7  0.0  0:42.46 xhpl
42056 user 20 0  3084744 73792 29148 R 99.7  0.0  0:42.46 xhpl
42057 user 20 0  3084744 69596 29048 R 99.7  0.0  0:42.45 xhpl
42059 user 20 0  3084744 73724 29080 R 99.7  0.0  0:42.46 xhpl
42060 user 20 0  3088844 75856 29160 R 99.7  0.0  0:42.45 xhpl
42061 user 20 0  3084744 69672 29124 R 99.7  0.0  0:42.46 xhpl
42064 user 20 0  3084744 69700 29152 R 99.7  0.0  0:42.46 xhpl
42065 user 20 0  3086796 71720 29120 R 99.7  0.0  0:42.47 xhpl
42066 user 20 0  3084744 73708 29064 R 99.7  0.0  0:42.45 xhpl
What am I doing wrong?
Some relevant parts of my Make.intel64 file:
MPdir     = $(I_MPI_ROOT)
MPinc     = -I$(MPdir)/intel64/include
MPlib     = $(MPdir)/intel64/lib/release_mt/libmpi.so
LAdir     = $(MKLROOT)/lib/intel64
LAinc     = -I$(MKLROOT)/include
LAlib     = -mkl=cluster
CC        = mpiicc
CCNOOPT   = $(HPL_DEFS)
CCFLAGS   = -qopenmp -xSKYLAKE-AVX512 -fomit-frame-pointer -O3 -funroll-loops $(HPL_DEFS)
LINKER    = mpiicc
LINKFLAGS = $(CCFLAGS)
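One way to double-check which OpenMP settings each rank actually receives is the standard OMP_DISPLAY_ENV variable; this is only a sketch, and the grep pattern is just an example:

export OMP_DISPLAY_ENV=true   # each OpenMP runtime prints its settings at startup
mpirun -np 18 ./xhpl 2>&1 | grep -E "OMP_NUM_THREADS|OMP_PLACES|OMP_PROC_BIND"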
- Tags:
- Parallel Computing
Depending on its default setting (which varies by distro), 100% in top may simply mean that all threads within each process are fully busy. In any case, if you don't set the MKL-specific environment variables, MKL will choose the optimum number of threads for itself, regardless of the calling application's settings. If all of your time is spent in MKL, top is reporting how those functions are running, not the calling code. Questions like this are more likely to see useful discussion in the clusters (https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology) or MKL (https://software.intel.com/en-us/forums/intel-math-kernel-library) forum sections.
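For example (only a sketch; the thread count here is illustrative), the MKL-specific variables can be set explicitly, and top's per-thread view shows whether each rank really runs more than one thread:

export MKL_NUM_THREADS=2      # explicit MKL thread count per rank (illustrative value)
export MKL_DYNAMIC=false      # keep MKL from reducing that count at run time
top -H -p $(pgrep -d, xhpl)   # -H lists individual threads, not just the 18 processes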
The "mpirun -np 18" command tells the system to launch 18 MPI tasks. The OMP_NUM_THREADS=2 tells each task to use 2 threads. This will give a total of 36 application threads, which is the correct number for maximum performance on a 2s node. I would not count on the output of "top" or "ps" to tell you the hardware utilization -- "perf stat" is a better tool for that.
For a 2-socket system, I typically (but not always!) find one MPI task per socket to give the best performance. Something like:
export OMP_NUM_THREADS=18
export KMP_AFFINITY=scatter
export I_MPI_DEBUG=5
perf stat -o "perf_output.txt" -a -A mpirun -np 2 ./xhpl
The output from "perf stat" will be extensive, including separate counts for each logical processor, but looking at the "CPU Cycles" and "Instructions Retired" for all of the logical processors will make it clear exactly how many actually got used for the benchmark run.
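As a small usage note (the event names below are the generic perf aliases, not anything specific to this run), the per-CPU lines can be skimmed with a quick grep:

grep -E " cycles| instructions" perf_output.txt   # one line per logical CPU per event; near-zero counts mark idle CPUs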