Intel® MPI Library

Looking for some benchmarks to run

Greg_R_
Beginner

I have downloaded the latest version of the Intel oneAPI toolkits. We have a cluster of 75 nodes, each with 256 CPUs. I have a user whose job will not run on more than 32 nodes. Are there some simple benchmarks that I can run to verify my installation? Thanks.

HemanthCH_Intel
Moderator

Hi,

Thanks for posting in Intel Communities.

You can run the Intel MPI Benchmarks with the following commands:

source /opt/intel/oneapi/setvars.sh
mpirun -n 3 IMB-MPI1

For more information, please refer to the link below:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-mpi-benchmarks.html

 

Thanks & Regards,

Hemanth

 

Greg_R_
Beginner

[ramos@atlantis1 oneAPI]$ source setvars.sh

:: initializing oneAPI environment ...
-bash: BASH_VERSION = 4.4.20(1)-release
args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: clck -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: inspector -- latest
:: intelpython -- latest
:: ipp -- latest
:: ippcp -- latest
:: ipp -- latest
:: itac -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

[ramos@atlantis1 oneAPI]$ mpirun -n 3 IMB_MPI1
[proxy:0:0@atlantis1] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:151): execvp error on file IMB_MPI1 (No such file or directory)
[proxy:0:0@atlantis1] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:151): execvp error on file IMB_MPI1 (No such file or directory)
[proxy:0:0@atlantis1] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:151): execvp error on file IMB_MPI1 (No such file or directory)
[ramos@atlantis1 oneAPI]$ find . -name IMB_MPI1
[ramos@atlantis1 oneAPI]$

Greg_R_
Beginner

I downloaded the benchmark software and installed it. I have a cluster of 75 nodes, each with 256 cores. I have a user who compiled his code and says it fails if he uses more than 35 nodes. I don't think there is anything wrong with the oneAPI installation. In short, I am looking for some examples I can submit via PBS to run on all nodes. Thanks.

HemanthCH_Intel
Moderator

Hi,


Sorry for the inconvenience. There is a typo in my previous response. Please use "IMB-MPI1" instead of "IMB_MPI1".
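
The `execvp error on file IMB_MPI1 (No such file or directory)` message reported earlier in this thread indicates that the launcher could not find an executable with that name. A quick way to confirm the spelling before launching might look like the sketch below. It assumes a POSIX shell in which `setvars.sh` has already been sourced; `have_binary` is a hypothetical helper name, not part of Intel MPI.

```shell
# Hypothetical helper: succeeds only if the given name resolves to a command on PATH.
have_binary() {
    command -v "$1" >/dev/null 2>&1
}

# The benchmark binary is spelled with a hyphen, not an underscore.
for name in IMB-MPI1 IMB_MPI1; do
    if have_binary "$name"; then
        echo "$name: found at $(command -v "$name")"
    else
        echo "$name: not found on PATH"
    fi
done
```

If neither name is found, the oneAPI environment is probably not loaded in the current shell.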


So, could you please try the below command:

mpirun -n 35 -ppn 1 IMB-MPI1
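
For submitting the same command through PBS, as asked earlier in the thread, a job script along the following lines might serve as a starting point. This is only a sketch under stated assumptions: the `#PBS` directives use PBS Professional syntax (Torque would use `-l nodes=N:ppn=1` instead), the `setvars.sh` path is the one from the first reply, and the job name, walltime, and the `write_imb_job` helper are made up for illustration.

```shell
# Hypothetical helper: print a PBS job script that runs IMB-MPI1 with one rank per node.
# Assumptions: PBS Professional "select"/"place" syntax and the default oneAPI install path.
# Usage: write_imb_job 35 > imb.pbs && qsub imb.pbs
write_imb_job() {
    nodes=$1
    cat <<EOF
#!/bin/bash
#PBS -N imb-mpi1
#PBS -l select=${nodes}:ncpus=1:mpiprocs=1
#PBS -l place=scatter
#PBS -l walltime=00:30:00
#PBS -j oe

cd "\$PBS_O_WORKDIR"
source /opt/intel/oneapi/setvars.sh
mpirun -n ${nodes} -ppn 1 IMB-MPI1
EOF
}

# Print the script for a 35-node run so it can be reviewed before submission.
write_imb_job 35
```

Running inside a scheduler allocation matters here: launched from a login node without a host list, `mpirun` may place every rank on the local host, in which case `-ppn 1` does not spread the ranks across nodes.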


Thanks & Regards,

Hemanth



Greg_R_
Beginner

When I run that command, I get some initial results, and then it crashes:

 

[ramos@atlantis1 mpi-benchmarks-master]$ mpirun -n 35 -ppn 1 IMB-MPI1
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.4, MPI-1 part
#----------------------------------------------------------------
# Date : Tue Jun 7 12:04:44 2022
# Machine : x86_64
# System : Linux
# Release : 4.18.0-348.23.1.el8_5.x86_64
# Version : #1 SMP Tue Apr 12 11:20:32 EDT 2022
# MPI Version : 3.1
# MPI Thread Environment:


# Calling sequence was:

# IMB-MPI1

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 33 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.30 0.00
1 1000 0.30 3.31
2 1000 0.31 6.54
4 1000 0.31 13.05
8 1000 0.29 27.53
16 1000 0.30 53.32
32 1000 0.29 109.59
64 1000 0.29 218.47
128 1000 0.37 343.02
256 1000 0.37 686.27
512 1000 0.40 1274.70
1024 1000 0.43 2372.65
2048 1000 0.59 3475.84
4096 1000 0.74 5554.96
8192 1000 0.98 8348.95
16384 1000 2.17 7535.06
32768 1000 2.50 13104.62
65536 640 3.55 18446.99
131072 320 5.69 23026.18
262144 160 9.73 26935.07
524288 80 24.39 21498.83
1048576 40 30.97 33854.44
2097152 20 65.21 32158.61
4194304 10 270.05 15531.62

#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
# ( 33 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.47 0.00
1 1000 0.47 2.11
2 1000 0.49 4.08
4 1000 0.46 8.60
8 1000 0.47 17.17
16 1000 0.47 34.31
32 1000 0.46 68.96
64 1000 0.47 136.21
128 1000 0.59 218.05
256 1000 0.59 436.37
512 1000 0.60 858.40
1024 1000 0.64 1591.50
2048 1000 0.93 2202.87
4096 1000 1.14 3589.92
8192 1000 1.61 5090.79
16384 1000 3.18 5149.85
32768 1000 3.80 8627.66
65536 640 5.53 11849.65
131072 320 9.18 14272.19
262144 160 16.01 16368.76
524288 80 29.35 17865.81
1048576 40 56.30 18624.12
2097152 20 112.66 18614.68
4194304 10 386.57 10850.04

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 33 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.59 0.59 0.59 0.00
1 1000 0.57 0.57 0.57 3.50
2 1000 0.59 0.59 0.59 6.79
4 1000 0.60 0.60 0.60 13.44
8 1000 0.59 0.59 0.59 27.22
16 1000 0.57 0.57 0.57 55.69
32 1000 0.59 0.59 0.59 109.35
64 1000 0.60 0.60 0.60 214.46
128 1000 0.67 0.67 0.67 380.01
256 1000 0.68 0.68 0.68 753.11
512 1000 0.70 0.70 0.70 1465.66
1024 1000 0.73 0.73 0.73 2814.71
2048 1000 0.98 0.98 0.98 4170.18
4096 1000 1.19 1.19 1.19 6859.13
8192 1000 1.61 1.61 1.61 10164.56
16384 1000 3.39 3.39 3.39 9680.06
32768 1000 4.89 4.89 4.89 13396.62
65536 640 5.71 5.71 5.71 22965.16
131072 320 9.44 9.44 9.44 27780.56
262144 160 16.45 16.45 16.45 31875.15
524288 80 29.67 29.67 29.67 35344.58
1048576 40 56.83 56.83 56.83 36904.07
2097152 20 111.73 111.75 111.74 37534.04
4194304 10 361.45 361.48 361.46 23206.47

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 4
# ( 31 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.66 0.66 0.66 0.00
1 1000 0.66 0.66 0.66 3.03
2 1000 0.66 0.66 0.66 6.09
4 1000 0.65 0.65 0.65 12.33
8 1000 0.65 0.65 0.65 24.44
16 1000 0.65 0.65 0.65 49.05
32 1000 0.71 0.71 0.71 90.57
64 1000 0.70 0.70 0.70 182.29
128 1000 0.87 0.87 0.87 294.64
256 1000 0.87 0.87 0.87 587.54
512 1000 1.10 1.10 1.10 930.47
1024 1000 1.19 1.19 1.19 1723.63
2048 1000 1.69 1.70 1.70 2414.31
4096 1000 2.48 2.48 2.48 3300.90
8192 1000 3.74 3.75 3.75 4370.03
16384 1000 3.57 3.57 3.57 9181.22
32768 1000 4.19 4.19 4.19 15637.31
65536 640 5.84 5.84 5.84 22444.42
131072 320 9.42 9.43 9.42 27802.98
262144 160 16.37 16.41 16.40 31948.49
524288 80 29.78 29.94 29.90 35023.99
1048576 40 56.00 56.65 56.48 37020.90
2097152 20 124.93 127.55 126.81 32882.75
4194304 10 863.94 881.47 876.38 9516.57

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 8
# ( 27 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.64 0.65 0.65 0.00
1 1000 0.64 0.64 0.64 3.12
2 1000 0.65 0.65 0.65 6.12
4 1000 0.64 0.64 0.64 12.45
8 1000 0.64 0.64 0.64 25.15
16 1000 0.64 0.64 0.64 49.91
32 1000 0.68 0.68 0.68 94.37
64 1000 0.69 0.69 0.69 186.49
128 1000 0.87 0.87 0.87 293.63
256 1000 0.98 0.98 0.98 523.78
512 1000 0.99 0.99 0.99 1029.85
1024 1000 1.06 1.06 1.06 1937.55
2048 1000 1.41 1.41 1.41 2899.51
4096 1000 1.83 1.83 1.83 4478.37
8192 1000 2.69 2.70 2.70 6071.90
16384 1000 3.55 3.55 3.55 9232.56
32768 1000 4.18 4.18 4.18 15673.95
65536 640 5.84 5.84 5.84 22426.70
131072 320 9.56 9.59 9.58 27326.06
262144 160 21.97 22.08 22.03 23747.16
524288 80 34.56 34.90 34.76 30041.94
1048576 40 55.67 56.77 56.37 36941.68
2097152 20 135.42 143.22 140.47 29285.19
4194304 10 1562.39 1698.75 1672.93 4938.12

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 16
# ( 19 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.58 0.59 0.59 0.00
1 1000 0.59 0.59 0.59 3.39
2 1000 0.61 0.61 0.61 6.57
4 1000 0.59 0.59 0.59 13.54
8 1000 0.59 0.59 0.59 27.11
16 1000 0.60 0.60 0.60 53.18
32 1000 0.64 0.64 0.64 100.12
64 1000 0.64 0.65 0.65 198.13
128 1000 0.89 0.89 0.89 286.85
256 1000 0.93 0.93 0.93 550.18
512 1000 1.00 1.00 1.00 1022.05
1024 1000 1.05 1.05 1.05 1941.63
2048 1000 1.43 1.43 1.43 2856.06
4096 1000 1.89 1.90 1.89 4320.68
8192 1000 2.97 2.98 2.98 5496.80
16384 1000 3.52 3.53 3.53 9283.26
32768 1000 4.16 4.17 4.17 15711.38
65536 640 5.88 5.90 5.89 22215.97
131072 320 9.46 9.55 9.51 27460.67
262144 160 16.43 16.74 16.61 31320.70
524288 80 27.13 28.04 27.64 37394.73
1048576 40 52.38 56.53 54.74 37097.75
2097152 20 114.09 135.47 127.76 30961.82
4194304 10 1062.18 1732.35 1495.45 4842.33

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 32
# ( 3 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 1.34 1.34 1.34 0.00
1 1000 0.65 0.65 0.65 3.06
2 1000 2.11 2.11 2.11 1.89
4 1000 0.64 0.64 0.64 12.49
8 1000 0.64 0.65 0.64 24.78
16 1000 0.64 0.64 0.64 49.68
32 1000 0.74 0.74 0.74 86.07
64 1000 0.75 0.75 0.75 170.67
128 1000 1.12 1.13 1.13 226.27
256 1000 1.15 1.16 1.16 441.25
512 1000 1.23 1.24 1.23 828.08
1024 1000 1.34 1.35 1.34 1522.47
2048 1000 1.85 1.86 1.85 2200.08
4096 1000 2.65 2.67 2.66 3065.25
8192 1000 4.07 4.12 4.10 3978.60
[atlantis1:3316716:0:3316716] ud_ep.c:270 Fatal: UD endpoint 0x208f090 to <no debug data>: unhandled timeout error
[atlantis1:3316724:0:3316724] ud_ep.c:270 Fatal: UD endpoint 0x208f1a0 to <no debug data>: unhandled timeout error
[atlantis1:3316723:0:3316723] ud_ep.c:270 Fatal: UD endpoint 0x208f1d0 to <no debug data>: unhandled timeout error
[atlantis1:3316744:0:3316744] ud_ep.c:270 Fatal: UD endpoint 0x15fd530 to <no debug data>: unhandled timeout error
[atlantis1:3316733:0:3316733] ud_ep.c:270 Fatal: UD endpoint 0x15fd530 to <no debug data>: unhandled timeout error
[atlantis1:3316719:0:3316719] ud_ep.c:270 Fatal: UD endpoint 0x208f1d0 to <no debug data>: unhandled timeout error
[atlantis1:3316720:0:3316720] ud_ep.c:270 Fatal: UD endpoint 0x208f140 to <no debug data>: unhandled timeout error
[atlantis1:3316730:0:3316730] ud_ep.c:270 Fatal: UD endpoint 0x208f1a0 to <no debug data>: unhandled timeout error
==== backtrace (tid:3316744) ====
0 0x000000000005733e uct_ud_ep_window_release_completed() ???:0
1 0x0000000000055cab ucs_callbackq_get_id() ???:0
2 0x000000000003a29a ucp_worker_progress() ???:0
3 0x000000000000a7a1 mlx_ep_progress() mlx_ep.c:0
4 0x0000000000022b0d ofi_cq_progress() osd.c:0
5 0x0000000000022a97 ofi_cq_readfrom() osd.c:0
6 0x000000000062b3fe fi_cq_read() /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_eq.h:385
7 0x00000000001fa7a1 MPIDI_Progress_test() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:93
8 0x00000000001fa7a1 MPID_Progress_test_impl() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:152
9 0x00000000001fa7a1 MPID_Progress_wait() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:205
10 0x000000000070578f PMPI_Sendrecv() /build/impi/_buildspace/release/../../src/mpi/pt2pt/sendrecv.c:209
11 0x0000000000435fed IMB_sendrecv() /build/impi/_tmp/oneapi_imb/src_cpp/../src_c/IMB_sendrecv.c:158
12 0x000000000043d774 Bmark_descr::IMB_init_buffers_iter() /build/impi/_tmp/oneapi_imb/src_cpp/helpers/helper_IMB_functions.h:627
13 0x0000000000446df3 OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)0>, &IMB_sendrecv>::run() /build/impi/_tmp/oneapi_imb/src_cpp/helpers/original_benchmark.h:209
14 0x0000000000405d64 main() /build/impi/_tmp/oneapi_imb/src_cpp/imb.cpp:347
15 0x000000000003aca3 __libc_start_main() ???:0
16 0x0000000000403be9 _start() ???:0
=================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 3316713 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 3316714 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 3316715 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 3316716 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 3316717 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 3316718 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 3316719 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 3316720 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 8 PID 3316721 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 3316722 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 3316723 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 3316724 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 12 PID 3316725 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 13 PID 3316726 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 14 PID 3316727 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 15 PID 3316728 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 16 PID 3316729 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 17 PID 3316730 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 18 PID 3316731 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 19 PID 3316732 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 20 PID 3316733 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 21 PID 3316734 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 22 PID 3316735 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 23 PID 3316736 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 24 PID 3316737 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 25 PID 3316738 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 26 PID 3316739 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 27 PID 3316740 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 28 PID 3316741 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 29 PID 3316742 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 30 PID 3316743 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 31 PID 3316744 RUNNING AT atlantis1
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 32 PID 3316745 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 33 PID 3316746 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 34 PID 3316747 RUNNING AT atlantis1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
[ramos@atlantis1 mpi-benchmarks-master]$

HemanthCH_Intel
Moderator

Hi,

 

Could you please check the health of your cluster with the command below and share its output?


clck -f <nodefile>

 

For more information refer to the below link:

https://www.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting-started.html
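
The `<nodefile>` argument is a text file listing the hosts to check, one hostname per line. Under PBS, one way to produce it is to deduplicate the scheduler's own node list, as in the sketch below. It assumes the command runs inside a PBS job where `$PBS_NODEFILE` is set; `unique_hosts` and the output file name `nodes` are illustrative choices, not Cluster Checker requirements.

```shell
# Hypothetical helper: reduce a PBS node list (one line per MPI slot) to one hostname per line.
unique_hosts() {
    sort -u "$1"
}

# Inside a PBS job, $PBS_NODEFILE names the file listing the allocated hosts.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    unique_hosts "$PBS_NODEFILE" > nodes
    clck -f nodes
fi
```

Outside a job the `if` block is skipped, so a hand-written host list is needed instead.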


Thanks & Regards,

Hemanth


HemanthCH_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards,

Hemanth


Greg_R_
Beginner

Here is the output of running on all nodes:

 

Intel(R) Cluster Checker 2021 Update 6 (build 20220318)

Running Collect

............................................................
Running Analyze

SUMMARY
Command-line: clck -f nodes
Tests Run: health_base
Overall Result: 3 issues found - FUNCTIONALITY (2), HARDWARE UNIFORMITY (1)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
75 nodes tested: node[0001-0009], node[0010-0075]
0 nodes with no issues:
75 nodes with issues: node[0001-0009], node[0010-0075]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. Port '1' of InfiniBand HCA 'mlx5_1' is in the 'Disabled' physical state, not the 'LinkUp' physical state.
75 nodes: node[0001-0009], node[0010-0075]
2. Port '1' of InfiniBand HCA 'mlx5_1' is in the 'Down' state, not the 'Active' state.
75 nodes: node[0001-0009], node[0010-0075]

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
1. Inconsistent Ethernet firmware version.
1 node: node0069

PERFORMANCE
No issues detected.

SOFTWARE UNIFORMITY
No issues detected.

See the following files for more information: clck_results.log, clck_execution_warnings.log

HemanthCH_Intel
Moderator

Hi,


Could you please share the CPU details of your nodes?

Could you also provide the complete debug log produced by the command below?


I_MPI_DEBUG=10 mpirun -n 2 IMB-MPI1
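
With `I_MPI_DEBUG` set, Intel MPI prints `MPI startup():` lines that report, among other things, the libfabric version and provider, the rank-to-node pinning table, and the `I_MPI_*` variables in effect. When reading a long log, a filter like the sketch below can pull those lines out. The `startup_summary` helper and the log file name are hypothetical, and the pattern only assumes the line shapes Intel MPI typically prints, so it may need adjusting for other versions.

```shell
# Hypothetical helper: keep only the startup lines that identify the fabric,
# the rank placement table rows, and the I_MPI_* settings in effect.
startup_summary() {
    grep -E 'MPI startup\(\): (libfabric (version|provider):|I_MPI_|[0-9]+ +[0-9]+ +[^ ]+ +\{)'
}

# Usage (assumes IMB-MPI1 is on PATH); the full log is still saved by tee:
#   I_MPI_DEBUG=10 mpirun -n 2 IMB-MPI1 2>&1 | tee imb_debug.log | startup_summary
```

The full log should still be shared; the summary only makes the provider and placement easier to spot.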


Thanks & Regards,

Hemanth


Greg_R_
Beginner

processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7713 64-Core Processor
stepping : 1
microcode : 0xa00115d
cpu MHz : 3092.910
cache size : 512 KB
physical id : 0
siblings : 128
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 3992.40
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

 

Greg_R_
Beginner

The benchmark hangs here:

 

[ramos@node0001 ~]$ I_MPI_DEBUG=10 mpirun -n 2 IMB-MPI1
[0] MPI startup(): Intel(R) MPI Library, Version 2021.6 Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/software8/depot/intel/oneAPI/mpi/2021.6.0/etc/tuning_generic_shm-ofi.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 30 (TAG_UB value: 1073741823)
[0] MPI startup(): source bits available: 2 (Maximal number of rank: 3)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1580495 node0001 {0}
[0] MPI startup(): 1 1580496 node0001 {0}
[0] MPI startup(): I_MPI_ROOT=/software8/depot/intel/oneAPI/mpi/2021.6.0
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_HYDRA_IFACE=ib0
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=pbs
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.4, MPI-1 part
#----------------------------------------------------------------
# Date : Fri Jun 24 09:05:30 2022
# Machine : x86_64
# System : Linux
# Release : 4.18.0-348.23.1.el8_5.x86_64
# Version : #1 SMP Tue Apr 12 11:20:32 EDT 2022
# MPI Version : 3.1
# MPI Thread Environment:


# Calling sequence was:

# IMB-MPI1

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_local
# Reduce_scatter
# Reduce_scatter_block
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 384 13016.95 0.00
1 384 13000.02 0.00
2 384 12983.09 0.00
4 380 13034.23 0.00
8 380 13017.12 0.00
16 380 12998.70 0.00
32 372 12982.54 0.00
64 372 12979.86 0.00
128 372 13017.49 0.01
256 372 12982.54 0.02
512 372 12982.55 0.04
1024 372 13000.02 0.08
2048 372 13012.12 0.16
4096 372 13069.91 0.31
8192 372 12982.55 0.63
16384 372 13052.44 1.26
32768 372 13017.50 2.52
65536 372 13000.03 5.04
131072 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 770 13000.02 0.00
1 753 13017.29 0.00
2 753 13000.02 0.00
4 753 13001.19 0.00
8 753 13055.80 0.00
16 753 13063.77 0.00
32 753 13069.08 0.00
64 753 12993.38 0.00
128 753 13086.34 0.01
256 753 12982.75 0.02
512 753 13023.92 0.04
1024 753 13017.28 0.08
2048 753 13017.28 0.16
4096 751 13207.74 0.31
8192 750 13029.36 0.63
16384 385 25979.27 0.63
32768 385 26000.04 1.26
65536 380 26142.15 2.51
131072 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 770 12983.12 12983.13 12983.12 0.00
1 753 12982.73 13017.28 13000.01 0.00
2 753 13017.26 13051.81 13034.53 0.00
4 753 13000.00 13034.54 13017.27 0.00
8 731 13071.13 13106.72 13088.93 0.00
16 731 13142.27 13177.86 13160.06 0.00
32 731 13071.13 13106.72 13088.93 0.00
64 731 13017.78 13053.37 13035.57 0.01
128 731 13000.00 13035.58 13017.79 0.02
256 731 13000.01 13035.59 13017.80 0.04
512 731 13053.35 13088.94 13071.15 0.08
1024 731 12982.22 13017.80 13000.01 0.16
2048 731 12982.22 13017.80 13000.01 0.31
4096 731 13053.35 13101.25 13077.30 0.63
8192 731 13019.15 13054.73 13036.94 1.26
16384 385 26132.45 26132.50 26132.47 1.25
32768 385 25966.22 25966.26 25966.24 2.52
65536 385 26070.11 26070.15 26070.13 5.03
131072 320 25975.00 25975.06 25975.03 10.09
262144 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 751 12982.69 13017.33 13000.01 0.00
1 751 12965.36 13000.02 12982.69 0.00
2 751 13011.98 13046.62 13029.30 0.00
4 751 12965.38 13000.02 12982.70 0.00
8 751 13000.00 13034.64 13017.32 0.00
16 751 13034.62 13069.26 13051.94 0.00
32 751 12981.36 13017.33 12999.35 0.01
64 751 12982.69 13017.33 13000.01 0.02
128 751 12965.38 13000.02 12982.70 0.04
256 751 12982.69 13017.33 13000.01 0.08
512 751 12948.07 12982.71 12965.39 0.16
1024 751 12948.07 12982.71 12965.39 0.32
2048 751 12913.45 12948.09 12930.77 0.63
4096 751 13022.64 13057.28 13039.96 1.25
8192 751 13086.55 13121.19 13103.87 2.50
16384 385 26101.30 26101.35 26101.32 2.51
32768 385 26000.00 26000.04 26000.02 5.04
65536 382 26000.00 26000.05 26000.02 10.08
131072 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.06 0.07 0.06
4 382 25949.41 26021.44 25985.43
8 382 25805.45 26065.33 25935.39
16 381 25838.41 26001.04 25919.72
32 380 25895.99 25992.18 25944.08
64 375 0.99 25999.67 13000.33
128 375 1.15 25965.01 12983.08
256 375 1.07 26103.67 13052.37
512 375 1.10 26138.33 13069.71
1024 375 1.19 26068.96 13035.07
2048 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.07 0.07
4 387 25934.07 25986.83 25960.45
8 382 25972.84 26139.99 26056.42
16 382 0.41 25931.59 12966.00
32 381 25947.66 26073.83 26010.75
64 381 0.42 26136.14 13068.28
128 381 0.54 26117.79 13059.17
256 381 0.52 25999.66 13000.09
512 381 25992.72 26039.29 26016.00
1024 381 25802.45 25848.08 25825.27
2048 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Reduce_local
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
4 1000 0.07 0.07 0.07
8 1000 0.06 0.07 0.07
16 1000 0.06 0.07 0.07
32 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.23 0.26 0.24
4 385 1.05 25999.68 13000.37
8 385 1.01 26088.00 13044.51
16 385 1.01 25999.69 13000.35
32 385 1.04 26168.52 13084.78
64 385 1.08 26051.64 13026.36
128 380 1.17 26307.58 13154.38
256 380 1.22 26065.49 13033.36
512 380 1.31 26170.71 13086.01
1024 380 1.60 26068.10 13034.85
2048 380 2.01 25999.68 13000.85
4096 380 2.22 26033.88 13018.05
8192 380 2.88 26033.86 13018.37
16384 380 25999.91 26073.10 26036.51
32768 373 26125.20 26210.54 26167.87
65536 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Reduce_scatter_block
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.12 0.15 0.14
4 381 0.95 26033.81 13017.38
8 381 0.98 26067.92 13034.45
16 381 0.92 26020.69 13010.80
32 381 0.94 26067.91 13034.42
64 381 0.98 26170.28 13085.63
128 381 1.14 26067.93 13034.53
256 381 1.14 26067.93 13034.53
512 381 1.24 25863.22 12932.23
1024 381 1.49 26238.59 13120.04
2048 380 1.81 25991.86 12996.84
4096 380 2.03 25999.75 13000.89
8192 380 2.75 25991.78 12997.26
16384 380 26103.85 26305.66 26204.75
32768 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Allgather
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.15 0.18 0.16
1 385 0.77 26033.49 13017.13
2 380 0.74 25999.74 13000.24
4 380 0.74 25999.73 13000.23
8 380 0.75 25897.09 12948.92
16 380 0.72 25931.27 12966.00
32 380 0.73 25794.45 12897.59
64 380 0.74 25983.94 12992.34
128 380 0.95 25965.48 12983.21
256 380 1.03 26273.41 13137.22
512 380 1.03 26033.93 13017.48
1024 380 1.17 26102.34 13051.75
2048 380 1.35 26033.91 13017.63
4096 380 1.56 25999.71 13000.63
8192 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Allgatherv
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.17 0.18 0.17
1 381 0.82 25970.85 12985.83
2 381 0.89 25965.60 12983.25
4 380 0.93 25999.72 13000.33
8 380 0.82 26094.47 13047.65
16 380 0.84 26136.57 13068.71
32 380 0.87 26033.93 13017.40
64 380 0.85 26033.94 13017.40
128 380 1.05 26168.14 13084.60
256 380 1.07 26068.17 13034.62
512 371 1.12 26193.46 13097.29
1024 371 1.42 26069.38 13035.40
2048 371 1.76 26013.26 13007.51
4096 371 1.84 26139.89 13070.87
8192 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.13 0.14 0.13
1 385 0.38 25999.74 13000.06
2 385 0.41 25999.73 13000.07
4 385 0.40 25999.77 13000.08
8 385 0.40 25932.26 12966.33
16 385 0.39 25966.03 12983.21
32 385 0.39 26166.03 13083.21
64 384 0.39 25965.91 12983.15
128 384 0.53 26338.31 13169.42
256 384 0.55 26083.10 13041.82
512 384 0.44 26101.33 13050.88
1024 384 0.53 25999.78 13000.15
2048 381 0.91 25965.64 12983.28
4096 381 1.10 25931.53 12966.31
8192 time-out.; Time limit (secs_per_sample * msg_sizes_list_len) is over; use "-time X" or SECS_PER_SAMPLE=X (IMB_settings.h) to increase time limit.

#----------------------------------------------------------------
# Benchmarking Gatherv
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.17 0.27 0.22
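
One way to skim output like the above is to filter the point-to-point tables for rows whose time is far above what a small message should need. The sketch below assumes the log has been saved to a file and that rows have the four-column layout of the PingPong and PingPing tables (`#bytes #repetitions t[usec] Mbytes/sec`). The function name `slow_rows` and the threshold are hypothetical choices for illustration, not part of the Intel MPI Benchmarks.

```shell
#!/bin/sh
# slow_rows THRESHOLD_USEC < imb_output.txt
# Prints "<benchmark> <bytes> <t[usec]>" for each 4-column numeric row whose
# time exceeds the threshold. Tables with more columns (Sendrecv, Exchange,
# the collectives) and the time-out lines are skipped.
# Assumption: the row layout matches the PingPong/PingPing tables in the log.
slow_rows() {
  awk -v limit="$1" '
    /^# Benchmarking / { bench = $3; next }   # remember which table we are in
    /^#/ || NF != 4    { next }               # skip headers and other layouts
    ($1 ~ /^[0-9]+$/) && (($3 + 0) > (limit + 0)) { print bench, $1, $3 }
  '
}

# Example use on a saved log (file name is hypothetical):
#   slow_rows 100 < imb_output.txt
#
# The time-out lines in the log name the "-time" option. A rerun limited to
# one benchmark with a larger limit might look like this (60 is an arbitrary
# example value):
#   mpirun -n 2 IMB-MPI1 -time 60 PingPong
```

With a threshold of 100 usec, every PingPong and PingPing row in the log above would be printed, since even the 0-byte rows take about 13,000 usec.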

 

0 Kudos
Greg_R_
Beginner
4,411 Views

I attached the output as an easier-to-read text file.
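
In the startup table of the log, ranks 0 and 1 are both listed with Pin cpu `{0}` on node0001. Two ranks sharing one core may have to take turns being scheduled, which could account for times near 13,000 usec even for 0-byte messages. That is a guess from the log, not a confirmed diagnosis. A minimal sketch for spotting such overlaps in a saved `I_MPI_DEBUG` log follows. The function name `dup_pins` is hypothetical, and it assumes each pin set is printed as a single `{...}` field without spaces.

```shell
#!/bin/sh
# dup_pins < mpi_debug_log.txt
# Reads startup lines of the form
#   [0] MPI startup(): <rank> <pid> <node> {<cpus>}
# and prints each node/pin-set pair that more than one rank is assigned to.
# Assumption: the pin set is one whitespace-free field, as in the log above.
dup_pins() {
  awk '
    $2 == "MPI" && $3 == "startup():" && NF == 7 &&
    $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
      key = $6 " " $7                       # node name plus pin set
      ranks[key] = ranks[key] sep[key] $4   # collect ranks sharing this key
      sep[key] = ","
      count[key]++
    }
    END {
      for (k in count)
        if (count[k] > 1) print k, "ranks", ranks[k]
    }
  '
}

# Example use on a saved log (file name is hypothetical):
#   dup_pins < imb_output.txt
```

If the function prints nothing, no two ranks in the table share the same node and pin set.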

0 Kudos
JyotsnaK_Intel
Moderator
4,362 Views

Hi Greg,

Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others, which are listed here: Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.

If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification: https://www.oneapi.io/spec/


Best regards,

Jyotsna


0 Kudos
HemanthCH_Intel
Moderator
4,328 Views

Hi,


We are closing this issue. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Hemanth


0 Kudos