CStac · ‎10-09-2013

Greetings,

We provide the full Intel Cluster compiler suite of software on our cluster which uses Torque/Moab. Last week one of my users complained that his Hybrid OpenMP/MPI code wasn't running properly. The OpenMP portion was running great, but the MPI wasn't splitting the job up across nodes. So I dug into it a bit. Sure enough, the job launches on $X nodes but each node gets the full range of work and isn't split up.

To ensure that this was not a problem in his code, I slapped together a really basic hello world program with MPI and OpenMP and confirmed the same behaviour. Not only that, but it works just fine if I compile it with GCC instead! Hrm. Well, I haven't upgraded the Intel tool set in a few months and I know of at least one update since then; maybe that is the problem. So I updated all of the toolsets that I have access to. _Everything_ is now up to date (as of yesterday). I tried again and got the exact same results.

[bash]
$ mpif90 --version
GNU Fortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
$ mpiifort --version
ifort (IFORT) 14.0.0 20130728
[/bash]

Well, maybe it is my code. I am more of a sysadmin than a programmer. I found this code snippet out in the wild and tried it: http://www.rcac.purdue.edu/userinfo/resources/common/compile/hybrid_hello.f90

Compile: mpiifort -openmp -mt_mpi hybrid_hello.f90
Run with the option of two hosts and two OpenMP threads [ `export OMP_NUM_THREADS=2` ] for testing.
Output of run:
[plain]
SERIAL REGION:     Runhost:node03                           Rank:           0 of            1 ranks, Thread:           0 of            1 threads   hello, world
PARALLEL REGION:   Runhost:node03                           Rank:           0 of            1 ranks, Thread:           0 of            2 threads   hello, world
PARALLEL REGION:   Runhost:node03                           Rank:           0 of            1 ranks, Thread:           1 of            2 threads   hello, world
SERIAL REGION:     Runhost:node03                           Rank:           0 of            1 ranks, Thread:           0 of            1 threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0 of            1 ranks, Thread:           0 of            1 threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0 of            1 ranks, Thread:           0 of            2 threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0 of            1 ranks, Thread:           1 of            2 threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0 of            1 ranks, Thread:           0 of            1 threads   hello, world
[/plain]

I am given two hosts by Torque/Moab and I get two OpenMP threads, but there is only 1 rank! To quote Adam Savage, "Well, there's the problem!" For whatever reason, each node seems to think that it is the only MPI rank. This is pretty much what I had been seeing, but it is much better code than mine so I feel better about showing its results. :-)

What happens with GCC?
Compile: mpif90 -lgomp -fopenmp hybrid_hello.f90
Run with the exact same script/submission process as before.
Output of run:
[plain]
SERIAL REGION:     Runhost:node01                           Rank:           0 of            2 ranks, Thread:           0 of            1 threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0 of            2 ranks, Thread:           0 of            2 threads   hello, world
PARALLEL REGION:   Runhost:node01                           Rank:           0 of            2 ranks, Thread:           1 of            2 threads   hello, world
SERIAL REGION:     Runhost:node01                           Rank:           0 of            2 ranks, Thread:           0 of            1 threads   hello, world
SERIAL REGION:     Runhost:node02                           Rank:           1 of            2 ranks, Thread:           0 of            1 threads   hello, world
PARALLEL REGION:   Runhost:node02                           Rank:           1 of            2 ranks, Thread:           0 of            2 threads   hello, world
PARALLEL REGION:   Runhost:node02                           Rank:           1 of            2 ranks, Thread:           1 of            2 threads   hello, world
SERIAL REGION:     Runhost:node02                           Rank:           1 of            2 ranks, Thread:           0 of            1 threads   hello, world
[/plain]

Well, look at that. It runs just fine and as expected with GCC but the Intel compiler isn't running the MPI ranking right at all. At this point, I am fairly certain it is an Intel compiler issue. Knowing that, I crashed the boards and info that Intel provides looking for answers. I found a lot but nothing really jumped out at me until I found this nifty hello world application: http://software.intel.com/en-us/articles/beginning-hybrid-mpiopenmp-development

Now for retesting using the Intel provided code. Surely this will run right. After all, the guide uses the same compile options I have been using!

Compile: mpiifort -openmp -mt_mpi hybrid-hello.f90
Run with the option of two hosts and two OpenMP threads [ `export OMP_NUM_THREADS=2` ] for testing.
Output of run:
[plain]
Hello from thread   0 of   2 in rank   0 of   1 on node01
Hello from thread   1 of   2 in rank   0 of   1 on node01
Hello from thread   0 of   2 in rank   0 of   1 on node03
Hello from thread   1 of   2 in rank   0 of   1 on node03
[/plain]

Not a great start. Same output I have been getting. What happens with GCC?

Compile: mpif90 -lgomp -fopenmp hybrid-hello.f90
Run with the exact same script/submission process as before.
Output of run:
[plain]
Hello from thread   1 of   2 in rank   0 of   1 on node02
Hello from thread   0 of   2 in rank   0 of   1 on node02
Hello from thread   0 of   2 in rank   1 of   2 on node03
Hello from thread   1 of   2 in rank   1 of   2 on node03
[/plain]

It works with GCC! What? I am now zero for three on the Hybrid OpenMP/MPI problem (Well, zero for four if you count the user who brought this to my attention). What else could it be? I wonder if it doesn't like the mpirun that I got with Torque/Moab. I do have access to (and have already installed) the Intel MPI toolsets. Well instead of using the Torque mpirun, I will try the Intel mpirun!

And....no. Not only does it not have the pernode parameter, but it seems to be missing a few other features as well... I finally get it to run with `mpirun -bynode -np 2 a.out`, because Torque is allocating to the job 2 cores on 2 hosts and I want it to launch *only* one MPI process per host. Anyway, it finally runs... with the exact same output as before. (I am still not convinced that I have this limited version of mpirun from Intel configured with the right options, and I can't use mpiexec because the Intel mpiexec doesn't appear to recognize the Torque/Moab directives.)

So the question is, what am I doing wrong? I can't seem to get the Intel-compiled version of this code to run in a proper Hybrid OpenMP/MPI configuration. It obviously is working for GCC, so I am fairly convinced it isn't the code. The cluster is running great with other Intel MPI jobs using the Torque-provided mpirun, so I don't think it is that (though I might get better performance if I can get the Intel mpirun tweaked and working properly). That really only leaves the Intel compiler, and I am stumped on why it isn't working.

Any help would be greatly appreciated.

Thank you!

James_T_Intel · ‎10-09-2013

Hi Chris,

Ok, let's start with a few basic checks. What is the output from each of the following commands?

[plain]
which mpirun
which mpif90
env | grep I_MPI
mpirun -genv I_MPI_DEBUG 5 ./hello
[/plain]
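
The point of the first two checks is to see whether the launcher and the compiler wrapper come from the same MPI installation. As a rough sketch (not an Intel tool; `same_mpi_prefix` is a made-up helper name), the comparison can be scripted so it can be dropped into a job script. It only compares directories, on the assumption that a launcher and wrapper from one MPI install sit in the same `bin` directory:

```shell
#!/bin/sh
# Sketch: report whether an MPI launcher and a compiler wrapper live in the
# same install directory. same_mpi_prefix is a hypothetical helper name,
# not part of any MPI distribution.
same_mpi_prefix() {
    launcher_dir=$(dirname "$1")
    wrapper_dir=$(dirname "$2")
    if [ "$launcher_dir" = "$wrapper_dir" ]; then
        echo "match: $launcher_dir"
    else
        echo "MISMATCH: launcher in $launcher_dir, wrapper in $wrapper_dir"
        return 1
    fi
}

# Typical use inside a job script, after loading the module:
#   same_mpi_prefix "$(command -v mpirun)" "$(command -v mpiifort)"
```

A mismatch does not prove anything by itself, but a binary built against one MPI and started by another MPI's launcher is a common reason for every process reporting itself as rank 0 of 1.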

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

CStac · ‎10-09-2013

Greetings,

Well, so there are a couple of different options depending on which module I load. For the GCC module:
[plain]
$ which mpirun
/usr/bin/mpirun
$ which mpif90
/usr/bin/mpif90
$ env | grep I_MPI
<nothing>
[/plain]

And for the Intel module:
[plain]
$ which mpirun
/software/intel/impi/4.1.1.036/intel64/bin/mpirun
$ which mpif90
/software/intel/impi/4.1.1.036/intel64/bin/mpif90
$ which mpiifort
/software/intel/impi/4.1.1.036/intel64/bin/mpiifort
$ env | grep I_MPI
I_MPI_CC=icc
I_MPI_ROOT=/software/intel/impi/4.1.1.036
[/plain]

As for the mpirun, I got errors about passing the "-genv" flag with the Torque/Moab mpirun. However, since it just sets an environment variable for the run, I passed it in at job submission instead. It didn't do anything for the GCC run (not that it was expected to, but just to be thorough). Here is the output for the Intel run:
[plain]
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1e23070
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name Pin cpu
[0] MPI startup(): 0       100525   node04     +1
[0] MPI startup(): I_MPI_DEBUG=5
BEG SERIAL REGION:     Runhost:node04                           Rank:           0 of            1 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           0 of            1 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           0 of            1 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node04                           Rank:           0 of            1 ranks, Thread:           0 of            1   threads   hello, world
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1a56070
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name Pin cpu
[0] MPI startup(): 0       100236   node05     +1
[0] MPI startup(): I_MPI_DEBUG=5
BEG SERIAL REGION:     Runhost:node05                           Rank:           0 of            1 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node05                           Rank:           0 of            1 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node05                           Rank:           0 of            1 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node05                           Rank:           0 of            1 ranks, Thread:           0 of            1   threads   hello, world
[/plain]
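
One way to read that output: the pin table is printed twice, and both times it lists only rank 0. A small sketch that counts how often each rank ID shows up in the `I_MPI_DEBUG` pin table makes this easy to spot in longer logs. `count_rank_rows` is a hypothetical helper name, and the field layout is assumed from the lines above (rank, PID, node name, pin):

```shell
#!/bin/sh
# Sketch: read I_MPI_DEBUG output on stdin and print "<rank> <times reported>"
# for every rank ID found in the pin table. count_rank_rows is a hypothetical
# helper name. Pin-table rows are recognised by a numeric rank and numeric PID
# right after "MPI startup():".
count_rank_rows() {
    awk '$2 == "MPI" && $3 == "startup():" && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
             seen[$4]++
         }
         END { for (r in seen) print r, seen[r] }' | sort -n
}

# Typical use:
#   count_rank_rows < job_output.txt
```

If any rank is reported more than once, the log probably contains several independent single-rank launches rather than one multi-rank job.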

If there is anything else, just let me know.

Thanks!

[Edit: proofreading fail...]

CStac · ‎10-09-2013

I have an update. Reading through some of the documentation I found that the switch I think I should be using for the Intel mpirun should be '-perhost 1'. I gave it a try and got this output:
[plain]
[0] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat.so.1
[1] MPI startup(): cannot open dynamic library libdat2.so
[2] MPI startup(): cannot open dynamic library libdat2.so.2
[1] MPI startup(): cannot open dynamic library libdat.so.1
[3] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[0] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so: cannot open shared object file: No such file or directory

[1] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so: cannot open shared object file: No such file or directory

[2] MPI startup(): cannot open dynamic library libdat2.so
[3] MPI startup(): cannot open dynamic library libdat2.so
[2] MPI startup(): cannot open dynamic library libdat.so.1
[3] MPI startup(): cannot open dynamic library libdat.so.1
[2] MPI startup(): cannot open dynamic library libdat.so
[3] MPI startup(): cannot open dynamic library libdat.so
[2] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so: cannot open shared object file: No such file or directory

[3] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so: cannot open shared object file: No such file or directory

[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank    Pid      Node name Pin cpu
[0] MPI startup(): 0       90339    node02     {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       90340    node02     {8,9,10,11,12,13,14,15}
[0] MPI startup(): 2       101139   node04     {0,1,2,3,4,5,6,7}
[0] MPI startup(): 3       101140   node04     {8,9,10,11,12,13,14,15}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
BEG SERIAL REGION:     Runhost:node04                           Rank:           2 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node02                           Rank:           0 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           2 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           2 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node04                           Rank:           2 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           0 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           0 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node02                           Rank:           0 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node02                           Rank:           1 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node04                           Rank:           3 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           3 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node04                           Rank:           3 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node04                           Rank:           3 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           1 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           1 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node02                           Rank:           1 of            4 ranks, Thread:           0 of            1   threads   hello, world
[/plain]

So the good news is that it finally ran an MPI job! The bad news is that it completely disregards the parameters I gave it. The last thing I need is users finding ways to get access to more resources than they were allocated and screwing over other users in the queue... I have enough to do as is. :-D
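
For reference, a quick way to see what the allocation actually contains is to summarise the node file. This is only a sketch: `allocation_summary` is a made-up helper name, and it assumes the Torque node file (normally `$PBS_NODEFILE`) has one line per allocated core.

```shell
#!/bin/sh
# Sketch: summarise a Torque-style node file (assumed: one line per allocated
# core). allocation_summary is a hypothetical helper name.
allocation_summary() {
    nodefile=$1
    slots=$(grep -c . "$nodefile")              # non-empty lines = core slots
    nodes=$(sort -u "$nodefile" | grep -c .)    # distinct host names
    echo "nodes=$nodes slots=$slots"
}

# Typical use inside a job script:
#   allocation_summary "$PBS_NODEFILE"
```

With 2 cores on each of 2 hosts this reports 2 nodes and 4 slots, which would match the 4 ranks above: one rank per slot rather than one per node. For a one-rank-per-node hybrid run, the node count is the number to compare against the rank count in the `I_MPI_DEBUG` pin table.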

Any idea why it isn't playing nicely with the resources it was given?

Thank you!

[Edit] I also just added the libibverbs-devel package to the puppet module, so it should push out to all the nodes shortly and remove that error. Not sure why that is coming up, but it is an easy fix.
[Edit 2] The genv passes just fine to the Intel mpirun (which shouldn't be a surprise, but I thought I would state it for completeness' sake).

James_T_Intel · ‎10-09-2013

Hi Chris,

Try setting the following environment variables:

[plain]
I_MPI_HYDRA_BOOTSTRAP=jmi
I_MPI_HYDRA_RMK=pbs
[/plain]

James.

CStac · ‎10-10-2013

Greetings,

I tried that but it didn't seem to work.

[plain][mpiexec@node02] HYDT_bsci_jmi_load_library (./tools/bootstrap/jmi/jmi_init.c:68): Cannot open libjmi.so library, error=libjmi.so: cannot open shared object file: No such file or directory
[mpiexec@node02] HYDT_bsci_launcher_jmi_init (./tools/bootstrap/jmi/jmi_init.c:36): Error while loading JMI library.
[mpiexec@node02] HYDT_bsci_init (./tools/bootstrap/src/bsci_init.c:172): launcher init returned error
[mpiexec@node02] main (./ui/mpich/mpiexec.c:514): unable to initialize the bootstrap server
[/plain]

So I went looking for the jmi library but this is all I found.
[plain]
$ find /software/intel -type f -iname "*jmi*"
/software/intel/impi/4.1.0.030/intel64/lib/libjmi_pbs.so.1.0
/software/intel/impi/4.1.0.030/intel64/lib/libjmi_slurm.so.1.1
/software/intel/impi/4.1.1.036/intel64/lib/libjmi_pbs.so.1.0
/software/intel/impi/4.1.1.036/intel64/lib/libjmi_slurm.so.1.1
[/plain]

I decided to try without the jmi setting:
[plain]
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
[0] MPI startup(): RLIMIT_MEMLOCK too small
[0] MPI startup(): RLIMIT_MEMLOCK too small
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
[1] MPI startup(): RLIMIT_MEMLOCK too small
[1] MPI startup(): RLIMIT_MEMLOCK too small
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
[2] MPI startup(): RLIMIT_MEMLOCK too small
[2] MPI startup(): RLIMIT_MEMLOCK too small
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
[3] MPI startup(): RLIMIT_MEMLOCK too small
[3] MPI startup(): RLIMIT_MEMLOCK too small
[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank    Pid      Node name Pin cpu
[0] MPI startup(): 0       104731   node01     {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       104732   node01     {8,9,10,11,12,13,14,15}
[0] MPI startup(): 2       52013    node02     {0,1,2,3,4,5,6,7}
[0] MPI startup(): 3       52014    node02     {8,9,10,11,12,13,14,15}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,21,21,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
BEG SERIAL REGION:     Runhost:node02                           Rank:           2 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           2 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           2 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node02                           Rank:           2 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node02                           Rank:           3 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node01                           Rank:           1 of            4 ranks, Thread:           0 of            1   threads   hello, world
BEG SERIAL REGION:     Runhost:node01                           Rank:           0 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           1 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node01                           Rank:           3 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node02                           Rank:           3 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node02                           Rank:           3 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node01                           Rank:           1 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node01                           Rank:           1 of            4 ranks, Thread:           0 of            1   threads   hello, world
PARALLEL REGION:       Runhost:node01                           Rank:           0 of            4 ranks, Thread:           0 of            2   threads   hello, world
PARALLEL REGION:       Runhost:node01                           Rank:           0 of            4 ranks, Thread:           1 of            2   threads   hello, world
END SERIAL REGION:     Runhost:node01                           Rank:           0 of            4 ranks, Thread:           0 of            1   threads   hello, world
[/plain]

I am not sure what those errors mean at the moment, but I will search online and see what turns up.
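
As a starting point, two things can be checked directly on a compute node. `RLIMIT_MEMLOCK` is the locked-memory limit that `ulimit -l` reports, and the librdmacm message suggests it found no RDMA devices. The sketch below assumes the usual Linux sysfs location `/sys/class/infiniband` for listing InfiniBand devices; `rdma_device_count` and `memlock_limit` are made-up helper names.

```shell
#!/bin/sh
# Sketch: two quick node-level checks. Both function names are hypothetical.

# Count entries under the sysfs directory where the Linux InfiniBand stack
# lists its devices (assumed default: /sys/class/infiniband).
rdma_device_count() {
    dir=${1:-/sys/class/infiniband}
    if [ -d "$dir" ]; then
        ls -1 "$dir" | grep -c . || true
    else
        echo 0
    fi
}

# Print the locked-memory limit for the current shell (in kB, or "unlimited").
memlock_limit() {
    ulimit -l
}

# Typical use on a compute node:
#   echo "RDMA devices: $(rdma_device_count)"
#   echo "max locked memory: $(memlock_limit)"
```

A device count of 0 would be consistent with the run falling back to the "shm and tcp data transfer modes" shown in the output.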

Thanks!

Hybrid OpenMP/MPI doesn't work with the Intel compiler