Hi there,
I am using Intel MPI (4.1.0.024 from ICS 2013.0.028) to run my parallel application (Gromacs 4.6.1 molecular dynamics) on an SGI cluster with CentOS 6.2 and Torque 2.5.12.
When I submit an MPI job with Torque to start and run on 2 nodes, MPI startup fails to negotiate with InfiniBand (IB) and internode communication falls back to Ethernet. This is my job script:
#PBS -l nodes=n001:ppn=32+n002:ppn=32
#PBS -q normal
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
source /opt/progs/gromacs/bin/GMXRC.bash
cd $PBS_O_WORKDIR/
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out
and this is the output:
[54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
....
[45] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
....
[33] MPI startup(): DAPL provider <NULLstring> on rank 0:n001 differs from ofa-v2-mlx4_0-1(v2.0) on rank 33:n002
...
[0] MPI startup(): shm and tcp data transfer modes
However, MPI negotiates fine with IB if I run the same mpiexec.hydra line from the console, either logged in to n001 (one of the running nodes) or logged in to another node, say the admin node. It also works fine if I submit the TORQUE job using a different start node than the running nodes (-machinefile macs points to n001 and n002), say using #PBS -l nodes=n003 and the rest identical to the script above (sketched after the output below). This is a successful (IB) output:
[55] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
...
[29] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
...
[0] MPI startup(): shm and dapl data transfer modes
...
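For reference, the alternative job script mentioned above is identical except for the node request; it looks roughly like this (n003 is just an example of a start node outside the machinefile):
[plain]
#PBS -l nodes=n003
#PBS -q normal
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
source /opt/progs/gromacs/bin/GMXRC.bash
cd $PBS_O_WORKDIR/
mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out
[/plain]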
Any tips on what is going wrong? Please let me know if you need more info. This has also been posted to the TORQUE user list, but your help is welcome, too.
Cheers,
Guilherme
Any help here is appreciated. We have purchased an ICS license. Is there another channel for support?
Hi Guilherme,
I apologize for the delay. If you would prefer, we can handle this issue via Intel® Premier Support (https://premier.intel.com).
I would recommend using the latest version of the Intel® MPI Library, Version 4.1 Update 1. You can download this from the Intel® Registration Center (https://registrationcenter.intel.com). We have improved integration with Torque* in this update.
Is /etc/dat.conf identical on all of the nodes? What happens if you use mpirun instead of mpiexec.hydra?
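For example, a quick way to compare the files (using the node names from your script; adjust as needed) would be something like:
[plain]
# compare dat.conf between the two compute nodes
diff <(ssh n001 cat /etc/dat.conf) <(ssh n002 cat /etc/dat.conf)
[/plain]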
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi James,
Thanks for the reply.
Yes, dat.conf is the same on all nodes.
If I use mpirun, the same thing happens, as it eventually calls mpiexec.hydra on the compute node.
I am downloading and will install the new IMPI version.
Cheers,
Guilherme
Hi again,
Just an update to let you know that the same behaviour occurs after installing the Intel MPI Library, Version 4.1 Update 1 (4.1.1.036) you suggested.
Any tips?
Cheers,
Guilherme
Hi Guilherme,
Please add
[plain]-verbose -genv I_MPI_DEBUG 5[/plain]
to your mpiexec.hydra arguments and attach the output.
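Based on your original script, the full command would look something like:
[plain]
mpiexec.hydra -verbose -genv I_MPI_DEBUG 5 -machinefile macs -np 64 mdrun_mpi >& md.out
[/plain]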
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
James,
I upgraded TORQUE to release 3.0.6 and the problem has disappeared.
Since the new install overwrote the old, problematic one, I am unable to provide you with the requested output.
Thanks for your help.
Guilherme
Hi Guilherme,
Understood. I'm glad everything is working now.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools