Ct_Sun
Beginner
96 Views

System hangs at intel_mpi_rt test of Intel Cluster Checker.

Hello,
I ran into a problem when running Intel Cluster Checker 1.8. The system stopped responding while executing the intel_mpi_rt test module. The log file stops right after reporting the MPI library version.
The system configuration is:
OS : RHEL 5.5
Intel Cluster Runtime : 3.2-1
Intel MPI Library : 4.0.2.003
Intel MKL Library : 10.3.4.191
C++ compiler : mpicc
Are there any checks I can run to diagnose the problem?
Thanks.
Best regards,
CT
6 Replies
Andres_M_Intel4
Employee
96 Views

CT,
Thanks for your report.
What's your network setup? The test module runs a local hello world with MPI, so chances are that the DAPL providers are wrongly configured if you are running OFED software.
I would first suggest running the dat_conf test module (it can also be added as a dependency of intel_mpi_rt).
Another option is to manually run an MPI hello world example to reproduce what the tool is doing. You can set the I_MPI_DEBUG environment variable to get more details on what's missing.
I'm adding some background details below, just in case.
-- Andres
[man clck-intel_mpi_rt]
By default, the test module exercises 4 MPI processes over different network devices by using the shm and
the sock I_MPI_DEVICES (or the shm and tcp I_MPI_FABRICS). Furthermore, if the /etc/dat.conf file or the
DAT_OVERRIDE variable are present it also locally exercises the rdma (or dapl) fabric device.
The I_MPI_FABRICS style is used if Intel MPI Library 4.x or later is detected.
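As a rough illustration of the version rule quoted above, the shell sketch below picks the variable name from a version string. The function name fabric_var_for_version is hypothetical (it is not part of Cluster Checker or Intel MPI), and it only looks at the major version number.

```shell
# Hypothetical helper: print the variable style that matches an
# Intel MPI version string, following the rule in the man page excerpt
# (I_MPI_FABRICS for 4.x or later, I_MPI_DEVICE otherwise).
fabric_var_for_version() {
    major=${1%%.*}            # keep only the text before the first dot
    if [ "$major" -ge 4 ]; then
        echo I_MPI_FABRICS
    else
        echo I_MPI_DEVICE
    fi
}

fabric_var_for_version 4.0.2.003   # prints I_MPI_FABRICS
fabric_var_for_version 3.1         # prints I_MPI_DEVICE
```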
[what the test is trying to do]
command: sh -c "source /opt/intel/impi/3.1//bin64/mpivars.sh; mpiexec -n 4 -env I_MPI_FALLBACK_DEVICE 0 -env I_MPI_DEVICE rdssm /tmp/clck-intel_mpi_rt.ic7884/test.impi"
output:
Hello world: rank 0 of 4 running on compute-00-00.local
Hello world: rank 1 of 4 running on compute-00-00.local
Hello world: rank 2 of 4 running on compute-00-00.local
Hello world: rank 3 of 4 running on compute-00-00.local
[running dat_conf]
/opt/intel/clck/1.8/cluster-check aic.xml --include_only dat_conf --verbose 5
[hello world example in a similar MPI installation]
/opt/intel/impi/4.0.2.003/test/test.c: ASCII C program text
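The manual hello world run suggested above could be prepared along these lines. This is only a sketch: the helper manual_mpi_check_cmd is hypothetical, it prints the command instead of running it, and the mpivars.sh location and the /tmp/hello output path in the comments are assumptions based on the paths shown in this post.

```shell
# Hypothetical helper: print (do not run) an mpiexec command line for a
# manual hello world check with debug output enabled.
# Arguments: fabric spec, I_MPI_DEBUG level, process count, path to binary.
manual_mpi_check_cmd() {
    fabric=$1; debug=$2; nprocs=$3; binary=$4
    printf 'mpiexec -n %s -genv I_MPI_DEBUG %s -genv I_MPI_FABRICS %s %s\n' \
        "$nprocs" "$debug" "$fabric" "$binary"
}

# Build the test program first (paths are assumptions, adjust to your install):
#   source /opt/intel/impi/4.0.2.003/bin64/mpivars.sh
#   mpicc -o /tmp/hello /opt/intel/impi/4.0.2.003/test/test.c
manual_mpi_check_cmd shm:tcp 5 4 /tmp/hello
```

Running the printed command by hand should show which fabric the library actually selects, or where it stalls.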
Ct_Sun
Beginner
96 Views

Hi Andres,
Thanks for your useful recommendations, I will try to fix the networking settings first.
Best regards,
CT
Ct_Sun
Beginner
96 Views

Hi Andres,
After checking the fabric settings, I can pass the intel_mpi_rt test module. I just changed the setting from "rdssm" to "shm". Sorry that I didn't mention that I have set up only one machine (the head node) for testing; maybe this is the reason why it should be set to "shm" (I guess...).
Furthermore, I still have two questions:
1. Where can I get detailed information and definitions for the settings sock, shm, ssm, rdma, and rdssm?
2. If my cluster nodes are connected by Ethernet (no InfiniBand or iWARP devices) and no DAPL or OFED software is installed, how should I set up the dat.conf file to pass the dat_conf test?
Thanks.
Best regards,
CT
Dmitry_K_Intel2
Employee
96 Views

Dear CT,

It would be better to use I_MPI_FABRICS with Intel MPI Library 4.x instead of I_MPI_DEVICE.
The format is:
export I_MPI_FABRICS=shm:dapl
or
mpirun -genv I_MPI_FABRICS shm:dapl -np 222 ./a.out
So, the format is local_fabric:remote_fabric. local_fabric can be any of {shm, dapl, tcp, ofa, tmi}; remote_fabric can be any of {dapl, tcp, ofa, tmi}.

I hope that this format is more informative and doesn't require additional comments.
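For anyone scripting this, the local:remote format described above can be checked before launching a job. The sketch below is not part of Intel MPI; is_valid_fabrics is a hypothetical helper that only encodes the two lists given in this post, and it assumes a single fabric name without a colon (such as "shm") is also acceptable.

```shell
# Hypothetical helper: return 0 if the argument looks like a valid
# I_MPI_FABRICS value according to the lists in this post, 1 otherwise.
is_valid_fabrics() {
    spec=$1
    case $spec in
        *:*:*) return 1 ;;                      # more than one colon
        *:*)
            local_f=${spec%%:*}
            remote_f=${spec#*:}
            case $remote_f in
                dapl|tcp|ofa|tmi) ;;            # allowed remote fabrics
                *) return 1 ;;
            esac
            ;;
        *)  local_f=$spec ;;                    # single-fabric form (assumption)
    esac
    case $local_f in
        shm|dapl|tcp|ofa|tmi) return 0 ;;       # allowed local fabrics
        *) return 1 ;;
    esac
}

for spec in shm:dapl shm:tcp rdssm; do
    if is_valid_fabrics "$spec"; then
        echo "$spec: ok"
    else
        echo "$spec: not a valid I_MPI_FABRICS value"
    fi
done
```

Old I_MPI_DEVICE names such as rdssm are rejected on purpose, since they belong to the older variable.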

/etc/dat.conf lists all available providers on the node, and this list depends on the InfiniBand cards installed on that particular node.
By setting I_MPI_DAPL_PROVIDER you can select the provider you need from the list of available providers.

If there are no IB cards, or DAPL (OFED) was not installed, there will be no /etc/dat.conf file on the node and you need to use I_MPI_FABRICS=shm:tcp. And of course there will be no dat_conf test.
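The rule above could be scripted roughly as follows. This is a sketch under assumptions: choose_fabrics is a hypothetical helper, it treats the mere presence of the file (or a non-empty DAT_OVERRIDE variable) as a sign that DAPL is usable, and it does not check that any provider in the file actually works.

```shell
# Hypothetical helper: print shm:dapl when a DAT configuration is present,
# shm:tcp otherwise. The optional argument overrides the default file path.
choose_fabrics() {
    dat_conf=${1:-/etc/dat.conf}
    if [ -f "$dat_conf" ] || [ -n "${DAT_OVERRIDE:-}" ]; then
        echo shm:dapl
    else
        echo shm:tcp
    fi
}

# Pick a fabric for the current node and export it for later mpirun calls.
export I_MPI_FABRICS="$(choose_fabrics)"
echo "I_MPI_FABRICS=$I_MPI_FABRICS"
```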

Regards!
Dmitry
Ct_Sun
Beginner
96 Views

Dear Dmitry,
Thanks for your answer.
Although I have set "export I_MPI_FABRICS=shm:tcp" for my system, Cluster Checker still runs the "dat_conf" test automatically. Obviously the test fails because no IB cards or DAPL are installed. So, should I skip this test by setting an exclude for dat_conf, or are there settings in my XML file that should be changed?
Best regards,
CT
Andres_M_Intel4
Employee
96 Views

CT,
I think your exclude setting is the best approach to avoid running the test module since, as you mention, it is not applicable in your setup.
As Dmitry mentions, the preferred syntax is the one with I_MPI_FABRICS.
You can find more details here, on page 74.