Beginner

Varying Intel MPI results using different topologies

Hello, I am compiling and running a massive electronic structure program on an NSF supercomputer. I am compiling with the intel/15.0.2 Fortran compiler and impi/5.0.2, the latest installed Intel MPI library. The program has hybrid parallelization (MPI and OpenMP). When I run the program on a molecule using 4 MPI tasks on a single node (no OpenMP threading anywhere here), I obtain the correct result. However, when I spread the 4 tasks across 2 nodes (still 4 total tasks, just 2 on each node), I get what seem to be numerical-/precision-related errors.

Following Intel's Michael Steyer's slides on Intel MPI conditional reproducibility (http://goparallel.sourceforge.net/wp-content/uploads/2015/06/PUM21-3-Intel_MPI_Library_Conditional_Reproducibility.pdf), I specified that all collective operations be run using topology-unaware algorithms by running mpiexec.hydra with the following flags:

-genv I_MPI_DEBUG 100
-genv I_MPI_ADJUST_ALLGATHER 1
-genv I_MPI_ADJUST_ALLGATHERV 1
-genv I_MPI_ADJUST_ALLREDUCE 2
-genv I_MPI_ADJUST_ALLTOALL 1
-genv I_MPI_ADJUST_ALLTOALLV 1
-genv I_MPI_ADJUST_ALLTOALLW 1
-genv I_MPI_ADJUST_BARRIER 1
-genv I_MPI_ADJUST_BCAST 1
-genv I_MPI_ADJUST_EXSCAN 1
-genv I_MPI_ADJUST_GATHER 1
-genv I_MPI_ADJUST_GATHERV 1
-genv I_MPI_ADJUST_REDUCE 1
-genv I_MPI_ADJUST_REDUCE_SCATTER 1
-genv I_MPI_ADJUST_SCAN 1
-genv I_MPI_ADJUST_SCATTER 1
-genv I_MPI_ADJUST_SCATTERV 1
-genv I_MPI_ADJUST_REDUCE_SEGMENT 1:14000
-genv I_MPI_STATS_SCOPE "topo"
-genv I_MPI_STATS "ipm"

This helps my job proceed further than it did before; however, it still dies with what look like numerical-/precision-related errors. My question is: what other topology-aware settings are there, so that I can try disabling them and obtain the correct results that I get when the MPI tasks run on a single node? I have pored through the Intel MPI manual and haven't seen anything beyond the above.
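(For reference, here is a sketch of how I collect those settings in a small wrapper; this is not my exact script, and "./a.out" stands in for the real binary. It just prints the command instead of launching it.)

```shell
#!/bin/sh
# Build the -genv list for the topology-unaware collective settings in a
# loop instead of one long command line. Variable names follow the Intel
# MPI 5.x reference manual; "./a.out" is a placeholder application.
FLAGS="-genv I_MPI_DEBUG 100"
for op in ALLGATHER ALLGATHERV ALLTOALL ALLTOALLV ALLTOALLW BARRIER \
          BCAST EXSCAN GATHER GATHERV REDUCE REDUCE_SCATTER SCAN \
          SCATTER SCATTERV; do
  FLAGS="$FLAGS -genv I_MPI_ADJUST_$op 1"
done
FLAGS="$FLAGS -genv I_MPI_ADJUST_ALLREDUCE 2"        # algorithm 2, per the slides
FLAGS="$FLAGS -genv I_MPI_ADJUST_REDUCE_SEGMENT 1:14000"
CMD="mpiexec.hydra $FLAGS -n 4 ./a.out"
echo "$CMD"   # dry run: print the command instead of launching it
```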
Please note that using multiple nodes sometimes works, e.g., with 2 MPI tasks total spread over two nodes. It really seems to be a strange topology issue. Another note: builds with the latest versions of Open MPI and MVAPICH2 both consistently die with seg faults, so those libraries aren't really an option here. I get the same issues/results no matter which nodes are allocated to me, and I have tested this many times. Thank you very much in advance for your help! Best, Andrew

8 Replies
Employee

Hello Andrew,

You wrote:

However, when I spread out the 4 tasks on 2 nodes (still 4 total tasks, just 2 on each node), I get what seem to be numerical-/precision-related errors.

Could you please show an example of the error? Is it an application error or something related to MPI?

Could you please try to run the following scenarios with I_MPI_DEBUG=6 and provide the output:
1. 4 MPI tasks on a single node
2. 4 MPI tasks on 2 nodes (still 4 total tasks, just 2 on each node)
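For concreteness, the two scenarios could be launched roughly like this (a sketch only; "./jaguar_app" and the process-per-node counts are placeholders for your Stampede setup, and it prints the commands instead of running them):

```shell
#!/bin/sh
# Dry-run sketch of the two requested scenarios, with I_MPI_DEBUG=6
# passed via -genv. Adjust -ppn and the host allocation to your batch job.
SINGLE_NODE="mpiexec.hydra -genv I_MPI_DEBUG 6 -n 4 -ppn 4 ./jaguar_app"
TWO_NODES="mpiexec.hydra -genv I_MPI_DEBUG 6 -n 4 -ppn 2 ./jaguar_app"
printf '%s\n%s\n' "$SINGLE_NODE" "$TWO_NODES"   # print instead of launching
```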

Beginner

Hi Artem,

Thanks so much for your reply.  The errors are all application errors.

I have attached what you asked for.  Here are snippets of some I_MPI_DEBUG parts:

4 tasks, 1 node:

[0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 2  Build 20141030 (build id: 10994)
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Device_reset_idx=0
[0] MPI startup(): Allgather: 1: 1-2861 & 0-8
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-8
[0] MPI startup(): Allgather: 1: 0-605 & 9-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 9-2147483647
[0] MPI startup(): Allgatherv: 1: 0-2554 & 0-8
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-8
[0] MPI startup(): Allgatherv: 1: 0-272 & 9-16
[0] MPI startup(): Allgatherv: 2: 272-657 & 9-16
[0] MPI startup(): Allgatherv: 1: 657-2078 & 9-16
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 9-16
[0] MPI startup(): Allgatherv: 1: 0-1081 & 17-32
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-32
[0] MPI startup(): Allgatherv: 1: 0-547 & 33-64
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 33-64
[0] MPI startup(): Allgatherv: 1: 0-19 & 65-2147483647
[0] MPI startup(): Allgatherv: 2: 19-239 & 65-2147483647
[0] MPI startup(): Allgatherv: 1: 239-327 & 65-2147483647
[0] MPI startup(): Allgatherv: 4: 327-821 & 65-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 65-2147483647
[0] MPI startup(): Allreduce: 1: 0-5738 & 0-4
[0] MPI startup(): Allreduce: 2: 5738-197433 & 0-4
[0] MPI startup(): Allreduce: 7: 197433-593742 & 0-4
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-4
[0] MPI startup(): Allreduce: 1: 0-5655 & 5-8
[0] MPI startup(): Allreduce: 2: 5655-75166 & 5-8
[0] MPI startup(): Allreduce: 8: 75166-177639 & 5-8
[0] MPI startup(): Allreduce: 3: 177639-988014 & 5-8
[0] MPI startup(): Allreduce: 2: 988014-1643869 & 5-8
[0] MPI startup(): Allreduce: 8: 1643869-2494859 & 5-8
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 5-8
[0] MPI startup(): Allreduce: 1: 0-587 & 9-16
[0] MPI startup(): Allreduce: 2: 587-3941 & 9-16
[0] MPI startup(): Allreduce: 1: 3941-9003 & 9-16
[0] MPI startup(): Allreduce: 2: 9003-101469 & 9-16
[0] MPI startup(): Allreduce: 8: 101469-355768 & 9-16
[0] MPI startup(): Allreduce: 3: 355768-3341814 & 9-16
[0] MPI startup(): Allreduce: 8: 0-2147483647 & 9-16
[0] MPI startup(): Allreduce: 1: 0-795 & 17-32
[0] MPI startup(): Allreduce: 2: 795-146567 & 17-32
[0] MPI startup(): Allreduce: 8: 146567-732118 & 17-32
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 17-32
[0] MPI startup(): Allreduce: 1: 0-528 & 33-64
[0] MPI startup(): Allreduce: 2: 528-221277 & 33-64
[0] MPI startup(): Allreduce: 8: 221277-1440737 & 33-64
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 33-64
[0] MPI startup(): Allreduce: 1: 0-481 & 65-128
[0] MPI startup(): Allreduce: 2: 481-593833 & 65-128
[0] MPI startup(): Allreduce: 8: 593833-2962021 & 65-128
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 65-128
[0] MPI startup(): Allreduce: 1: 0-584 & 129-256
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 129-256
[0] MPI startup(): Allreduce: 1: 0-604 & 257-2147483647
[0] MPI startup(): Allreduce: 2: 604-2997006 & 257-2147483647
[0] MPI startup(): Allreduce: 8: 0-2147483647 & 257-2147483647
[0] MPI startup(): Alltoall: 4: 0-2048 & 0-4
[0] MPI startup(): Alltoall: 2: 2049-8192 & 0-4
[0] MPI startup(): Alltoall: 4: 8193-16384 & 0-4
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 0-4
[0] MPI startup(): Alltoall: 1: 0-0 & 5-8
[0] MPI startup(): Alltoall: 4: 1-8 & 5-8
[0] MPI startup(): Alltoall: 1: 9-2585 & 5-8
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 5-8
[0] MPI startup(): Alltoall: 1: 0-2025 & 9-16
[0] MPI startup(): Alltoall: 4: 2026-3105 & 9-16
[0] MPI startup(): Alltoall: 2: 3106-19194 & 9-16
[0] MPI startup(): Alltoall: 3: 19195-42697 & 9-16
[0] MPI startup(): Alltoall: 4: 42698-131072 & 9-16
[0] MPI startup(): Alltoall: 3: 131073-414909 & 9-16
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 9-16
[0] MPI startup(): Alltoall: 2: 0-0 & 17-32
[0] MPI startup(): Alltoall: 1: 1-1026 & 17-32
[0] MPI startup(): Alltoall: 4: 1027-4096 & 17-32
[0] MPI startup(): Alltoall: 2: 4097-38696 & 17-32
[0] MPI startup(): Alltoall: 4: 38697-131072 & 17-32
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 17-32
[0] MPI startup(): Alltoall: 3: 0-0 & 33-64
[0] MPI startup(): Alltoall: 1: 1-543 & 33-64
[0] MPI startup(): Alltoall: 4: 544-4096 & 33-64
[0] MPI startup(): Alltoall: 2: 4097-16384 & 33-64
[0] MPI startup(): Alltoall: 3: 16385-65536 & 33-64
[0] MPI startup(): Alltoall: 4: 65537-131072 & 33-64
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 33-64
[0] MPI startup(): Alltoall: 1: 0-261 & 65-128
[0] MPI startup(): Alltoall: 4: 262-7180 & 65-128
[0] MPI startup(): Alltoall: 2: 7181-58902 & 65-128
[0] MPI startup(): Alltoall: 4: 58903-65536 & 65-128
[0] MPI startup(): Alltoall: 3: 65537-1048576 & 65-128
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 65-128
[0] MPI startup(): Alltoall: 1: 0-131 & 129-256
[0] MPI startup(): Alltoall: 4: 132-7193 & 129-256
[0] MPI startup(): Alltoall: 2: 7194-16813 & 129-256
[0] MPI startup(): Alltoall: 3: 16814-32768 & 129-256
[0] MPI startup(): Alltoall: 4: 32769-65536 & 129-256
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 129-256
[0] MPI startup(): Alltoall: 1: 0-66 & 257-2147483647
[0] MPI startup(): Alltoall: 4: 67-6568 & 257-2147483647
[0] MPI startup(): Alltoall: 2: 6569-16572 & 257-2147483647
[0] MPI startup(): Alltoall: 3: 16573-32768 & 257-2147483647
[0] MPI startup(): Alltoall: 4: 32769-438901 & 257-2147483647
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 257-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-8
[0] MPI startup(): Alltoallv: 0: 0-4 & 9-2147483647
[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 9-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 2: 0-2147483647 & 0-4
[0] MPI startup(): Barrier: 5: 0-2147483647 & 5-8
[0] MPI startup(): Barrier: 2: 0-2147483647 & 9-32
[0] MPI startup(): Barrier: 4: 0-2147483647 & 33-2147483647
[0] MPI startup(): Bcast: 4: 1-256 & 0-8
[0] MPI startup(): Bcast: 1: 257-17181 & 0-8
[0] MPI startup(): Bcast: 7: 17182-1048576 & 0-8
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-8
[0] MPI startup(): Bcast: 7: 0-2147483647 & 9-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-12 & 0-4
[0] MPI startup(): Reduce_scatter: 5: 12-27 & 0-4
[0] MPI startup(): Reduce_scatter: 3: 27-49 & 0-4
[0] MPI startup(): Reduce_scatter: 1: 49-187 & 0-4
[0] MPI startup(): Reduce_scatter: 3: 187-405673 & 0-4
[0] MPI startup(): Reduce_scatter: 4: 405673-594687 & 0-4
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-4
[0] MPI startup(): Reduce_scatter: 5: 0-24 & 5-8
[0] MPI startup(): Reduce_scatter: 1: 24-155 & 5-8
[0] MPI startup(): Reduce_scatter: 3: 155-204501 & 5-8
[0] MPI startup(): Reduce_scatter: 5: 204501-274267 & 5-8
[0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 5-8
[0] MPI startup(): Reduce_scatter: 1: 0-63 & 9-16
[0] MPI startup(): Reduce_scatter: 3: 63-72 & 9-16
[0] MPI startup(): Reduce_scatter: 1: 72-264 & 9-16
[0] MPI startup(): Reduce_scatter: 3: 264-168421 & 9-16
[0] MPI startup(): Reduce_scatter: 4: 168421-168421 & 9-16
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 9-16
[0] MPI startup(): Reduce_scatter: 3: 0-0 & 17-32
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 17-32
[0] MPI startup(): Reduce_scatter: 1: 4-12 & 17-32
[0] MPI startup(): Reduce_scatter: 5: 12-18 & 17-32
[0] MPI startup(): Reduce_scatter: 1: 18-419 & 17-32
[0] MPI startup(): Reduce_scatter: 3: 419-188739 & 17-32
[0] MPI startup(): Reduce_scatter: 4: 188739-716329 & 17-32
[0] MPI startup(): Reduce_scatter: 5: 716329-1365841 & 17-32
[0] MPI startup(): Reduce_scatter: 2: 1365841-2430194 & 17-32
[0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 17-32
[0] MPI startup(): Reduce_scatter: 3: 0-0 & 33-64
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 33-64
[0] MPI startup(): Reduce_scatter: 5: 4-17 & 33-64
[0] MPI startup(): Reduce_scatter: 1: 17-635 & 33-64
[0] MPI startup(): Reduce_scatter: 3: 635-202937 & 33-64
[0] MPI startup(): Reduce_scatter: 5: 202937-308253 & 33-64
[0] MPI startup(): Reduce_scatter: 4: 308253-1389874 & 33-64
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 33-64
[0] MPI startup(): Reduce_scatter: 3: 0-0 & 65-128
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 65-128
[0] MPI startup(): Reduce_scatter: 5: 4-16 & 65-128
[0] MPI startup(): Reduce_scatter: 1: 16-1238 & 65-128
[0] MPI startup(): Reduce_scatter: 3: 1238-280097 & 65-128
[0] MPI startup(): Reduce_scatter: 5: 280097-631434 & 65-128
[0] MPI startup(): Reduce_scatter: 4: 631434-2605072 & 65-128
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 65-128
[0] MPI startup(): Reduce_scatter: 2: 0-0 & 129-256
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 129-256
[0] MPI startup(): Reduce_scatter: 5: 4-16 & 129-256
[0] MPI startup(): Reduce_scatter: 1: 16-2418 & 129-256
[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 129-256
[0] MPI startup(): Reduce_scatter: 2: 0-0 & 257-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 257-2147483647
[0] MPI startup(): Reduce_scatter: 5: 4-16 & 257-2147483647
[0] MPI startup(): Reduce_scatter: 1: 16-33182 & 257-2147483647
[0] MPI startup(): Reduce_scatter: 3: 33182-3763779 & 257-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 257-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-256
[0] MPI startup(): Reduce: 3: 4-45 & 257-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 257-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-8
[0] MPI startup(): Scatter: 3: 1-140 & 9-16
[0] MPI startup(): Scatter: 1: 141-1302 & 9-16
[0] MPI startup(): Scatter: 3: 0-2147483647 & 9-16
[0] MPI startup(): Scatter: 3: 1-159 & 17-32
[0] MPI startup(): Scatter: 1: 160-486 & 17-32
[0] MPI startup(): Scatter: 3: 0-2147483647 & 17-32
[0] MPI startup(): Scatter: 1: 1-149 & 33-64
[0] MPI startup(): Scatter: 3: 0-2147483647 & 33-64
[0] MPI startup(): Scatter: 1: 1-139 & 65-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 65-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-256
[0] MPI startup(): Scatterv: 2: 0-2147483647 & 257-2147483647
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): Rank    Pid      Node name                          Pin cpu
[0] MPI startup(): 0       125336   c560-802.stampede.tacc.utexas.edu  {8,9,10,11}
[0] MPI startup(): 1       125337   c560-802.stampede.tacc.utexas.edu  {12,13,14,15}
[2] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): 2       125338   c560-802.stampede.tacc.utexas.edu  {0,1,2,3}
[0] MPI startup(): 3       125339   c560-802.stampede.tacc.utexas.edu  {4,5,6,7}
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=2 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
[0] MPI startup(): I_MPI_DEBUG=6
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:1,scif0:-1,mic0:1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 8,1 12,2 0,3 4

4 tasks, 2 nodes:

[0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 2  Build 20141030 (build id: 10994)
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[3] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): shm and dapl data transfer modes
[2] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[2] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[3] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Device_reset_idx=0
[0] MPI startup(): Allgather: 1: 1-490 & 0-8
[0] MPI startup(): Allgather: 2: 491-558 & 0-8
[0] MPI startup(): Allgather: 1: 559-2319 & 0-8
[0] MPI startup(): Allgather: 3: 2320-46227 & 0-8
[0] MPI startup(): Allgather: 1: 46228-2215101 & 0-8
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-8
[0] MPI startup(): Allgather: 1: 1-1005 & 9-16
[0] MPI startup(): Allgather: 2: 1006-1042 & 9-16
[0] MPI startup(): Allgather: 1: 1043-2059 & 9-16
[0] MPI startup(): Allgather: 3: 0-2147483647 & 9-16
[0] MPI startup(): Allgather: 1: 1-2454 & 17-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 17-2147483647
[0] MPI startup(): Allgatherv: 1: 0-3147 & 0-4
[0] MPI startup(): Allgatherv: 2: 3147-5622 & 0-4
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-4
[0] MPI startup(): Allgatherv: 1: 0-975 & 5-8
[0] MPI startup(): Allgatherv: 2: 975-4158 & 5-8
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-8
[0] MPI startup(): Allgatherv: 1: 0-2146 & 9-16
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 9-16
[0] MPI startup(): Allgatherv: 1: 0-81 & 17-32
[0] MPI startup(): Allgatherv: 2: 81-414 & 17-32
[0] MPI startup(): Allgatherv: 1: 414-1190 & 17-32
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-32
[0] MPI startup(): Allgatherv: 2: 0-1 & 33-2147483647
[0] MPI startup(): Allgatherv: 1: 1-3 & 33-2147483647
[0] MPI startup(): Allgatherv: 2: 3-783 & 33-2147483647
[0] MPI startup(): Allgatherv: 4: 783-1782 & 33-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 33-2147483647
[0] MPI startup(): Allreduce: 7: 0-2084 & 0-4
[0] MPI startup(): Allreduce: 1: 2084-15216 & 0-4
[0] MPI startup(): Allreduce: 7: 15216-99715 & 0-4
[0] MPI startup(): Allreduce: 3: 99715-168666 & 0-4
[0] MPI startup(): Allreduce: 2: 168666-363889 & 0-4
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-4
[0] MPI startup(): Allreduce: 1: 0-14978 & 5-8
[0] MPI startup(): Allreduce: 2: 14978-66879 & 5-8
[0] MPI startup(): Allreduce: 8: 66879-179296 & 5-8
[0] MPI startup(): Allreduce: 3: 179296-304801 & 5-8
[0] MPI startup(): Allreduce: 7: 304801-704509 & 5-8
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 5-8
[0] MPI startup(): Allreduce: 1: 0-16405 & 9-16
[0] MPI startup(): Allreduce: 2: 16405-81784 & 9-16
[3] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): Allreduce: 8: 81784-346385 & 9-16
[0] MPI startup(): Allreduce: 7: 346385-807546 & 9-16
[0] MPI startup(): Allreduce: 2: 807546-1259854 & 9-16
[2] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 9-16
[0] MPI startup(): Allreduce: 1: 0-8913 & 17-32
[0] MPI startup(): Allreduce: 2: 8913-103578 & 17-32
[0] MPI startup(): Allreduce: 8: 103578-615876 & 17-32
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 17-32
[0] MPI startup(): Allreduce: 1: 0-1000 & 33-64
[0] MPI startup(): Allreduce: 2: 1000-2249 & 33-64
[0] MPI startup(): Allreduce: 1: 2249-6029 & 33-64
[0] MPI startup(): Allreduce: 2: 6029-325357 & 33-64
[0] MPI startup(): Allreduce: 8: 325357-1470976 & 33-64
[0] MPI startup(): Allreduce: 7: 1470976-2556670 & 33-64
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 33-64
[0] MPI startup(): Allreduce: 1: 0-664 & 65-128
[0] MPI startup(): Allreduce: 2: 664-754706 & 65-128
[0] MPI startup(): Allreduce: 4: 754706-1663862 & 65-128
[0] MPI startup(): Allreduce: 2: 1663862-3269097 & 65-128
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 65-128
[0] MPI startup(): Allreduce: 1: 0-789 & 129-2147483647
[0] MPI startup(): Allreduce: 2: 789-2247589 & 129-2147483647
[0] MPI startup(): Allreduce: 8: 0-2147483647 & 129-2147483647
[0] MPI startup(): Alltoall: 2: 0-1 & 0-2
[0] MPI startup(): Alltoall: 3: 2-64 & 0-2
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 0-2
[0] MPI startup(): Alltoall: 2: 0-0 & 3-4
[0] MPI startup(): Alltoall: 3: 1-119 & 3-4
[0] MPI startup(): Alltoall: 1: 120-256 & 3-4
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 3-4
[0] MPI startup(): Alltoall: 1: 0-1599 & 5-8
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 5-8
[0] MPI startup(): Alltoall: 2: 0-0 & 9-16
[0] MPI startup(): Alltoall: 1: 1-8 & 9-16
[0] MPI startup(): Alltoall: 2: 9-36445 & 9-16
[0] MPI startup(): Alltoall: 3: 36446-163048 & 9-16
[0] MPI startup(): Alltoall: 4: 163049-524288 & 9-16
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 9-16
[0] MPI startup(): Alltoall: 1: 0-789 & 17-32
[0] MPI startup(): Alltoall: 2: 790-78011 & 17-32
[0] MPI startup(): Alltoall: 3: 78012-378446 & 17-32
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 17-32
[0] MPI startup(): Alltoall: 1: 0-517 & 33-64
[0] MPI startup(): Alltoall: 4: 518-4155 & 33-64
[0] MPI startup(): Alltoall: 2: 4156-124007 & 33-64
[0] MPI startup(): Alltoall: 3: 124008-411471 & 33-64
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 33-64
[0] MPI startup(): Alltoall: 1: 0-260 & 65-128
[0] MPI startup(): Alltoall: 4: 261-4618 & 65-128
[0] MPI startup(): Alltoall: 2: 4619-65536 & 65-128
[0] MPI startup(): Alltoall: 3: 65537-262144 & 65-128
[0] MPI startup(): Alltoall: 4: 262145-611317 & 65-128
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 65-128
[0] MPI startup(): Alltoall: 1: 0-133 & 129-2147483647
[0] MPI startup(): Alltoall: 4: 134-5227 & 129-2147483647
[0] MPI startup(): Alltoall: 2: 5228-17246 & 129-2147483647
[0] MPI startup(): Alltoall: 4: 17247-32768 & 129-2147483647
[0] MPI startup(): Alltoall: 3: 32769-365013 & 129-2147483647
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 129-2147483647
[0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2
[0] MPI startup(): Barrier: 3: 0-2147483647 & 3-4
[0] MPI startup(): Barrier: 5: 0-2147483647 & 5-8
[0] MPI startup(): Barrier: 2: 0-2147483647 & 9-32
[0] MPI startup(): Barrier: 3: 0-2147483647 & 33-128
[0] MPI startup(): Barrier: 4: 0-2147483647 & 129-2147483647
[0] MPI startup(): Bcast: 4: 1-806 & 0-4
[0] MPI startup(): Bcast: 7: 807-18093 & 0-4
[0] MPI startup(): Bcast: 6: 18094-51366 & 0-4
[0] MPI startup(): Bcast: 4: 51367-182526 & 0-4
[0] MPI startup(): Bcast: 1: 182527-618390 & 0-4
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-4
[0] MPI startup(): Bcast: 1: 1-24 & 5-8
[0] MPI startup(): Bcast: 4: 25-74 & 5-8
[0] MPI startup(): Bcast: 1: 75-18137 & 5-8
[0] MPI startup(): Bcast: 7: 18138-614661 & 5-8
[0] MPI startup(): Bcast: 1: 614662-1284626 & 5-8
[0] MPI startup(): Bcast: 2: 0-2147483647 & 5-8
[0] MPI startup(): Bcast: 1: 1-1 & 9-16
[0] MPI startup(): Bcast: 7: 2-158 & 9-16
[0] MPI startup(): Bcast: 1: 159-16955 & 9-16
[0] MPI startup(): Bcast: 7: 0-2147483647 & 9-16
[0] MPI startup(): Bcast: 7: 1-242 & 17-32
[0] MPI startup(): Bcast: 1: 243-10345 & 17-32
[0] MPI startup(): Bcast: 7: 0-2147483647 & 17-32
[0] MPI startup(): Bcast: 1: 1-1 & 33-2147483647
[0] MPI startup(): Bcast: 7: 2-737 & 33-2147483647
[0] MPI startup(): Bcast: 1: 738-5340 & 33-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 & 33-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 0-6 & 0-2
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-2
[0] MPI startup(): Reduce_scatter: 4: 0-5 & 3-4
[0] MPI startup(): Reduce_scatter: 5: 5-13 & 3-4
[0] MPI startup(): Reduce_scatter: 3: 13-59 & 3-4
[0] MPI startup(): Reduce_scatter: 1: 59-76 & 3-4
[0] MPI startup(): Reduce_scatter: 3: 76-91488 & 3-4
[0] MPI startup(): Reduce_scatter: 4: 91488-680063 & 3-4
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 3-4
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 5-8
[0] MPI startup(): Reduce_scatter: 5: 4-11 & 5-8
[0] MPI startup(): Reduce_scatter: 1: 11-31 & 5-8
[0] MPI startup(): Reduce_scatter: 3: 31-69615 & 5-8
[0] MPI startup(): Reduce_scatter: 2: 69615-202632 & 5-8
[0] MPI startup(): Reduce_scatter: 5: 202632-396082 & 5-8
[0] MPI startup(): Reduce_scatter: 4: 396082-1495696 & 5-8
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 5-8
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 9-16
[0] MPI startup(): Reduce_scatter: 1: 4-345 & 9-16
[0] MPI startup(): Reduce_scatter: 3: 345-79523 & 9-16
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 9-16
[0] MPI startup(): Reduce_scatter: 3: 0-0 & 17-32
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 17-32
[0] MPI startup(): Reduce_scatter: 1: 4-992 & 17-32
[0] MPI startup(): Reduce_scatter: 3: 992-71417 & 17-32
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 17-32
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 33-64
[0] MPI startup(): Reduce_scatter: 1: 4-1472 & 33-64
[0] MPI startup(): Reduce_scatter: 3: 1472-196592 & 33-64
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 33-64
[0] MPI startup(): Reduce_scatter: 3: 0-0 & 65-128
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 65-128
[0] MPI startup(): Reduce_scatter: 1: 4-32892 & 65-128
[0] MPI startup(): Reduce_scatter: 3: 32892-381072 & 65-128
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 65-128
[0] MPI startup(): Reduce_scatter: 2: 0-0 & 129-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 129-2147483647
[0] MPI startup(): Reduce_scatter: 1: 4-33262 & 129-2147483647
[0] MPI startup(): Reduce_scatter: 3: 33262-1571397 & 129-2147483647
[0] MPI startup(): Reduce_scatter: 5: 1571397-2211398 & 129-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 129-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2
[0] MPI startup(): Reduce: 3: 0-10541 & 3-4
[0] MPI startup(): Reduce: 1: 0-2147483647 & 3-4
[0] MPI startup(): Reduce: 1: 0-2147483647 & 5-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): Rank    Pid      Node name                          Pin cpu
[0] MPI startup(): 0       29454    c557-604.stampede.tacc.utexas.edu  {8,9,10,11,12,13,14,15}
[0] MPI startup(): 1       29455    c557-604.stampede.tacc.utexas.edu  {0,1,2,3,4,5,6,7}
[0] MPI startup(): 2       71567    c558-304.stampede.tacc.utexas.edu  {8,9,10,11,12,13,14,15}
[0] MPI startup(): 3       71568    c558-304.stampede.tacc.utexas.edu  {0,1,2,3,4,5,6,7}
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=6) Fabric(intra=1 inter=4 flags=0x0)
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
[0] MPI startup(): I_MPI_DEBUG=6
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:1,scif0:-1,mic0:1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 8,1 0

Please note that I need to log in again in order to change the task configuration, which is why different nodes are used in the two cases. (But the results are always repeatable anyway.)

Thanks very much for your help!

Best,

Andrew

Beginner

With mixed feelings, I can report that disabling InfiniBand altogether fixes the problem, i.e., export I_MPI_FABRICS=tcp.

Thanks,

Andrew

Employee

Hi Andrew,

As far as I can tell from the provided log files, this doesn't look like an MPI error:

 ERROR 2189: fatal error -- debug output follows    
  Number of non normalized orbitals          194
  Largest normalization error          16747.5406248335    
  Number of non orthogonal pairs           67803
  Largest orthogonalization error      4957.20919383731    
  orbital          126  is not normalized:   2057.38967586926    
  orbital          127  is not normalized:   3530.87444859233    
  orbital          128  is not normalized:   2152.55926608134    
  orbital          129  is not normalized:   1262.41856845017    
******** MANY ERRORS LIKE THESE, INDICATING NUMERICAL/PRECISION ISSUES ********
  orbitals   126 and     1 are not orthogonal: -0.23166245E-01
  orbitals   126 and     2 are not orthogonal: -0.59257524E-02
  orbitals   126 and     3 are not orthogonal: -0.68079900E-02
  orbitals   126 and     4 are not orthogonal:  0.15700180E-01
******** EVEN MORE ERRORS LIKE THESE, INDICATING NUMERICAL/PRECISION ISSUES ********
 ----------------------------------------------------------------------
 Jaguar cannot recover from this error and will now abort.
 For technical support please contact the Schrodinger Support Center at
 http://www.schrodinger.com/supportcenter/ or help@schrodinger.com
 ----------------------------------------------------------------------

I'd check the application parameters (input data and so on).

The only thing we can do from the MPI perspective is to experiment with different collective algorithms. To do that, try running your problem case (4 MPI processes on 2 nodes) with I_MPI_FABRICS=tcp and gathering Intel MPI statistics (with I_MPI_STATS=20) to see which collective operations and message lengths the application uses most. Then try varying the algorithms for the most-used MPI collective operations. The statistics are saved to a stats.txt file by default; please provide it.
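A minimal sketch of that statistics-gathering run (environment variable names are from the Intel MPI 5.x reference; the launch line itself is a commented-out placeholder):

```shell
#!/bin/sh
# Gather Intel MPI native statistics over TCP, then inspect stats.txt to
# see which collectives dominate.
export I_MPI_FABRICS=tcp
export I_MPI_STATS=20               # most detailed native statistics level
export I_MPI_STATS_FILE=stats.txt   # default file name, set explicitly here
# mpiexec.hydra -n 4 -ppn 2 ./jaguar_app   # placeholder launch line
echo "fabrics=$I_MPI_FABRICS stats=$I_MPI_STATS file=$I_MPI_STATS_FILE"
```

Once stats.txt shows which collective and message-length range dominates, the corresponding I_MPI_ADJUST_* variable can be varied one algorithm at a time.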

Beginner

Hi Artem,

Thank you so much for looking into this.

I would have been content with I_MPI_FABRICS=tcp solving the precision/numerical issue I originally posted about. It is apparently a known issue that Jaguar seg-faults when using InfiniBand, at least with Open MPI and now apparently with my Intel MPI build.

However, on a test job using 2 nodes, 1 task per node, and 16 threads per task, I am finding that my Intel MPI build uses only 54% of the total CPU on average. In contrast, a version of the program built on a different machine (but run on this one) with a bundled version of Open MPI uses 70% of the total CPU on average (still not great, obviously).

Would you guess that your suggestion of gathering Intel MPI statistics is the way to increase the efficiency of my Intel MPI build? Should I also try the Intel Trace Analyzer and Collector for this?

Regardless, I will provide you with the stats.txt file ASAP; in the meantime I was just wondering how related this problem is.

Thank you,

Andrew

Beginner

Hi Artem,

I apologize for the delay, but I have been doing lots of testing on this.

First, I got InfiniBand working on most of the subprograms that Jaguar calls, so I am now less concerned about the efficiency of the collectives at the moment.

I am also less concerned about my build using only 54% of the CPUs as opposed to the release version's 70%, because Jerome Vienne at TACC suggested that this sort of metric is hard to interpret and that it is best to compare overall execution time. Since my build is still 20% faster than the release version, I am okay with this.

However, I still tried what you suggested with I_MPI_STATS, but I found that setting it to anything other than 0 produced seg-fault-like errors.

So I figured that using the Intel Trace Analyzer and Collector would give similar results. I am a novice with that software, but upon running a small-to-medium-sized test system on 4 nodes, I found that these were the highest-usage functions (out of a total run time of 1162 s). Please note that the job runs 1 task on each node and uses all 16 CPUs of threading per task:

  • MPI_Barrier: 78s
  • MPI_Allreduce: 46s
  • MPI_Bcast: 19s
  • MPI_Comm_create: 6s

Does this seem reasonable, or would you strongly recommend tuning the collective algorithms? I have attached the STF files, if those will suffice in place of the stats.txt I was unable to generate. I'm not really sure how to interpret these Intel Trace Analyzer results. Note that I also generated the "ideal" configurations in order to produce the imbalance diagrams...just "ls *.stf" and you'll see what I mean.
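As a quick sanity check on those numbers (copied from the list above), the four functions together account for only a modest share of the wall time:

```python
# Share of the 1162 s total run time spent in the top MPI calls listed above.
mpi_seconds = {
    "MPI_Barrier": 78,
    "MPI_Allreduce": 46,
    "MPI_Bcast": 19,
    "MPI_Comm_create": 6,
}
total_runtime = 1162                   # seconds, from the ITAC run
mpi_total = sum(mpi_seconds.values())  # 149 s
share = mpi_total / total_runtime
print(f"{mpi_total} s of {total_runtime} s in MPI ({share:.1%})")
```

So roughly an eighth of the run is in these collectives; tuning them could shave some time, but any gain is bounded by that fraction.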

Thanks very much for your help, Artem.

Best,

Andrew

Employee

Hi Andrew,

There are a lot of STF files in your archive; which one should I look at?

I'd also recommend trying the latest Intel MPI Library 5.1.1 and Intel Trace Analyzer and Collector (ITAC) 9.1.1 if possible; that may prevent the crashes you saw when using I_MPI_STATS. There is also a tool called MPI Performance Snapshot (part of ITAC) for preliminary analysis; it may help you determine the problem area (if any).

Accepted solution

Beginner

Hi Artem,

I really apologize for my late reply. At this point I'm fine with how everything is working, so no worries about the STF files. And thank you for your suggestions regarding the performance-analysis tools.

Thanks very much for your fast and helpful replies, Artem!

Best,

Andrew
