Intel® MPI Library

Intel Cluster Checker hangs indefinitely

Kevin_McGrattan

I have oneAPI 2021.2.0 installed on a small Linux cluster. We recently rebooted the cluster after a power outage, and there are some lingering connectivity issues. Jobs still run and everything appears to be OK, but I cannot run the Intel Cluster Checker to see whether problems remain. When I invoke it, it just hangs:

$ clck -f nodefile
Intel(R) Cluster Checker 2021 Update 2 (build 20210301)

Running Collect

Nothing happens beyond this. It is quite possible that there are network issues on our cluster. Is there a verbose or debug option that can tell me what is causing clck to hang?

 

ShivaniK_Intel
Moderator

Hi,


Thanks for reaching out to us.


>>Is there a verbose or debug option that can alert me to what is causing clck to hang?


-l / --log-level: Specifies the output level. Recognized values are (in increasing order of verbosity): alert, critical, error, warning, notice, info, and debug. The default log level is error.
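
As a rough sketch of how this option might be used to capture a hang: the function below runs clck at debug verbosity, copies everything to a log file, and sends a single SIGINT if the run exceeds a time limit. The function name run_clck_debug and the CLCK_BIN override are hypothetical helpers, not part of Cluster Checker; timeout and tee are assumed to be the GNU coreutils versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run Cluster Checker with "-l debug", keep a log,
# and interrupt the run once if it exceeds a time limit.
# CLCK_BIN is an assumed override for dry runs; it defaults to clck.
run_clck_debug() {
    local nodefile=$1 logfile=$2 limit=${3:-600}
    # timeout sends one SIGINT after $limit seconds, like a single Ctrl-C.
    timeout -s INT "$limit" "${CLCK_BIN:-clck}" -f "$nodefile" -l debug 2>&1 | tee "$logfile"
    # Report the status of timeout (124 means the limit was reached), not of tee.
    return "${PIPESTATUS[0]}"
}
```

For example, run_clck_debug nodefile clck_debug.log 300 would interrupt a run that is still going after five minutes and leave the debug output in clck_debug.log.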


For more details, refer to the link below:


https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/configuring-intel-cluster-checker.html


If the issue persists, please provide the error log so that we can investigate further.


Could you also provide your system environment details (OS version)?


Thanks & Regards

Shivani


Kevin_McGrattan

CentOS Linux release 7.8.2003 (Core)

 

I ran clck with -l debug. It hung as before, and I then pressed Ctrl-C to abort. I got messages like this:

 

data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/pdsh -b -K -w burn035 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempS1Fykb/pidfile_burn035);echo $?'

burn035: 1

burn035: head: cannot open ‘/home4/mcgratta/.clck/clck-collect-tempS1Fykb/pidfile_burn035’ for reading: No such file or directory
burn035: bash: line 0: kill: -: arguments must be process or job IDs
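
For reference, the kill error follows directly from the missing pidfile: head prints nothing, so -$(...) expands to a bare "-", which kill rejects. With a valid number, the leading minus would have addressed the whole process group. A guarded version of the same idea might look like the sketch below; signal_collect_group is a hypothetical helper, not something clck provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: send SIGINT to the process group recorded in a pidfile,
# but only if the pidfile exists and its first line is a number.
signal_collect_group() {
    local pidfile=$1 pgid
    if [[ ! -r $pidfile ]]; then
        echo "no readable pidfile: $pidfile" >&2
        return 1
    fi
    pgid=$(head -n 1 "$pidfile")
    if [[ ! $pgid =~ ^[0-9]+$ ]]; then
        echo "unexpected pidfile content in $pidfile" >&2
        return 2
    fi
    # A negative argument makes kill target the whole process group.
    kill -s INT -- "-$pgid"
}
```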

 

I checked the .clck directory, and then the temporary directory corresponding to my session. Some of the node pidfiles were missing: of the 36 nodes, 4 had none (like burn035 shown above). So it appears that not all of the pidfiles are being created. This may well be a problem with our cluster, but I cannot see what is unusual about the nodes without pidfiles, other than that they are at the end of the sequence: burn001 through burn032 have pidfiles, burn033 through burn036 do not.

 

Is there something I can check on the bad nodes to see why they are not writing pidfiles?
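
To make the comparison repeatable, a small sketch like the following could list the nodes that have no pidfile in a given collect directory. It assumes the nodefile has one hostname per line (blank lines and "#" comments are skipped) and relies on the pidfile_<node> naming seen in the output above; missing_pidfiles is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every node from the nodefile that has no
# pidfile_<node> entry in the given clck-collect temporary directory.
missing_pidfiles() {
    local nodefile=$1 collect_dir=$2 node
    # "|| [[ -n $node ]]" also handles a last line without a trailing newline.
    while read -r node _ || [[ -n $node ]]; do
        # Skip blank lines and comment lines.
        [[ -z $node || $node == \#* ]] && continue
        [[ -e $collect_dir/pidfile_$node ]] || printf '%s\n' "$node"
    done < "$nodefile"
}
```

It would be called with the nodefile and the session's clck-collect-temp directory under ~/.clck while the run is hung, since the directory may be cleaned up afterwards.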

ShivaniK_Intel
Moderator


Hi,


Could you please run the commands below and let us know if you face any issues?


mpirun -n 36 -ppn 1 -f hostfile hostname

mpirun -bootstrap pdsh -n 36 -ppn 1 -f hostfile hostname


The commands above simply check whether you have access to all the nodes.
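
If mpirun is not usable from the command line, a similar reachability check can be sketched with plain ssh. This is only an illustration: check_ssh_nodes is a hypothetical helper, the nodefile is assumed to hold one hostname per line, and ConnectTimeout only bounds the connection attempt, so a login that stalls after connecting would still block.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try a non-interactive ssh login to every node in a
# nodefile (assumed format: one hostname per line; "#" comments are skipped).
check_ssh_nodes() {
    local nodefile=$1 node
    while read -r node _ || [[ -n $node ]]; do
        [[ -z $node || $node == \#* ]] && continue
        # -n keeps ssh from consuming the rest of the nodefile on stdin;
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1; then
            printf '%s ok\n' "$node"
        else
            printf '%s FAILED\n' "$node"
        fi
    done < "$nodefile"
}
```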


Could you also run the command below and provide us with the output?


cat hostfile


Thanks & Regards

Shivani


Kevin_McGrattan

Our cluster uses a psm libfabric that we build ourselves. Invoking mpirun at the command line does not work, but we can run jobs through Slurm job control scripts. All of our nodes work, and we can run jobs on all of them. Our only problem is that clck does not create pidfiles for 4 of the 36 nodes (the last 4). This seems to cause the hang, but I cannot figure out what is different about these 4 nodes.

Kevin_McGrattan

I tried running the Cluster Checker with only two nodes. Both nodes produce a pidfile, yet the run still hangs, so I don't think the missing pidfiles are the problem. Here is the session:

 

[mcgratta@burn ~]$ clck -f nodefile2 -l debug
Intel(R) Cluster Checker 2021 Update 2 (build 20210301)

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/health_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/health_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/cpu_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/cpu_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/cpu_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/cpu_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): cpuid, cpuinfo, cpupower, hwloc_dump_hwdata, kernel_tools, lscpu,
numactl, uname
analyzer extension: cpu
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/environment_variables_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/environment_variables_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): printenv, uname
analyzer extension: environment
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/ethernet.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/ethernet.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): ethtool, ethtool_show_coalesce, ipaddr, uname
analyzer extension: ethernet
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/infiniband_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/infiniband_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/infiniband_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/infiniband_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): datconf, ibstat, ibv_devinfo, lspci, ofedinfo, ulimit, uname
analyzer extension: infiniband
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/dapl_fabric_providers_present.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/dapl_fabric_providers_present.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): datconf, ibstat, ipaddr, uname
analyzer extension: datconf
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/network_time_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/network_time_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): chronyc, ntpq, uname
analyzer extension: ntp
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/node_process_status.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/node_process_status.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): ps, uname
analyzer extension: process
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/opa_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/opa_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/opa_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.2.0/etc/fwd/opa_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.2.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.2.0/etc/providers
provider(s): fw_ver, lspci, opahfirev, opatools, ulimit, uname
analyzer extension: opa
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.2.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.2.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.2.0/kb/data/msg_schema.xml

Postprocessor config file: /opt/intel/oneapi/clck/2021.2.0/etc/postprocessor/table.xml
Database: clck_default, at: "$HOME/.clck/2021.2.1/clck.db".

Running Collect
provider(s): getent, ip, uname
about to copy to shared location
clck-collect temp-shared location created
about to copy env to shared location
clck-collect copied env variables to temp-shared location
accumulate endpoint = tcp://129.6.159.109:49153
Accumulate server started
Starting pre-check............
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/pdsh -b -K -w burn001,burn002 'if [[ ! -d /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_ND;elif [[ ! -w /home4/mcgratta/.clck ]] || [[ ! -x /home4/mcgratta/.clck ]] || [[ ! -r /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_NON_RW;else echo CLCK_PRECHECK_OK;fi;stat -c "#####SHAREDDIR_INODE %i#####" /home4/mcgratta/.clck;'


Pre-check completed successfully
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/pdsh -b -K -w burn001,burn002 ' . /home4/mcgratta/.clck/env_prop-tempG0UTv5/env_file1qwMgU; /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/clck_run_provider -c /home4/mcgratta/.clck/clck-collect-tempnOKnfD/temp-configjtO1Kg -e tcp://129.6.159.109:49153 -l debug -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ipaddr.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ethtool.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ibstat.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ibv_devinfo.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ps.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ofedinfo.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ulimit.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/opatools.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/getent.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/cpuinfo.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/uname.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/cpuid.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/fw_ver.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/lspci.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/hwloc_dump_hwdata.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/cpupower.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/printenv.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/opahfirev.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/kernel_tools.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/datconf.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ntpq.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/chronyc.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/lscpu.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/numactl.xml -f /opt/intel/oneapi/clck/2021.2.0/etc/providers/ethtool_show_coalesce.xml -f 
/opt/intel/oneapi/clck/2021.2.0/etc/providers/ip.xml'

^C
Caught Ctrl-C. Cleaning up.
DO NOT HIT CTRL-C AGAIN. Sending Ctrl-C multiple times halts the cleaning process which can leave processes running on the nodes.
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/pdsh -b -K -w burn001 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempnOKnfD/pidfile_burn001);echo $?'

Draining queue of accumulated data providers
Received SIGINT / SIGTERM. Cleaning up and stopping...
burn001: 0


data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.2.0/libexec/intel64/pdsh -b -K -w burn002 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempnOKnfD/pidfile_burn002);echo $?'

burn002: 0


[mcgratta@burn ~]$
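
One thing this output makes checkable is the "accumulate endpoint" address that is passed to clck_run_provider with -e. Whether that connection has anything to do with the hang is not established here; the sketch below only pulls the host and port out of a saved debug log and offers a way to test, from a compute node and while clck is still running, whether a TCP connection to it can be opened. Both function names are hypothetical, and can_connect assumes a bash built with /dev/tcp support plus GNU timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "host port" for the first accumulate endpoint
# found in a saved clck debug log.
accumulate_endpoint() {
    sed -n 's|^accumulate endpoint = tcp://\([^:]*\):\([0-9]*\).*|\1 \2|p' "$1" | head -n 1
}

# Hypothetical helper: succeed if a TCP connection to host:port opens within
# 5 seconds. Uses bash's /dev/tcp redirection; meant to be run on a compute
# node while clck is still running on the head node.
can_connect() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

The port differs from run to run, so the endpoint has to be read from the log of the run that is currently hanging.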

ShivaniK_Intel
Moderator

Hi,


Could you please set I_MPI_HYDRA_BOOTSTRAP=SLURM and then run the clck command?


After that, set I_MPI_HYDRA_BOOTSTRAP=SSH, run the clck command again, and provide the complete logs of both runs.
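
Note that in bash a bare "set NAME=value" does not place the variable in the environment of child processes; it only changes the shell's positional parameters. export is what makes the variable visible to a command started afterwards. A minimal demonstration, assuming a bash shell (child_view and demo are hypothetical helper names):

```shell
#!/usr/bin/env bash
# Print the value a child process would see for the variable, or "unset".
child_view() {
    bash -c 'printf "%s\n" "${I_MPI_HYDRA_BOOTSTRAP-unset}"'
}

demo() {
    unset I_MPI_HYDRA_BOOTSTRAP
    # In bash this only sets the positional parameter $1.
    set I_MPI_HYDRA_BOOTSTRAP=SLURM
    child_view    # prints: unset
    # export places the variable in the environment of child processes.
    export I_MPI_HYDRA_BOOTSTRAP=SLURM
    child_view    # prints: SLURM
}

demo
```

In csh or tcsh the counterpart of export is setenv.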


Thanks & Regards 

Shivani


Kevin_McGrattan

 

[mcgratta@burn ~]$ set I_MPI_HYDRA_BOOTSTRAP=SLURM
[mcgratta@burn ~]$ clck -f nodefile2 -l debug
Intel(R) Cluster Checker 2021 Update 3 (build 20210615)

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/health_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/health_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): cpuid, cpuinfo, cpupower, hwloc_dump_hwdata, kernel_tools, lscpu,
numactl, uname
analyzer extension: cpu
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/environment_variables_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/environment_variables_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): printenv, uname
analyzer extension: environment
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/ethernet.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/ethernet.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): ethtool, ethtool_show_coalesce, ipaddr, uname
analyzer extension: ethernet
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): datconf, ibstat, ibv_devinfo, lspci, ofedinfo, ulimit, uname
analyzer extension: infiniband
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/dapl_fabric_providers_present.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/dapl_fabric_providers_present.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): datconf, ibstat, ipaddr, uname
analyzer extension: datconf
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/network_time_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/network_time_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): chronyc, ntpq, uname
analyzer extension: ntp
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/node_process_status.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/node_process_status.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): ps, uname
analyzer extension: process
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): fw_ver, lspci, opahfirev, opatools, ulimit, uname
analyzer extension: opa
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Postprocessor config file: /opt/intel/oneapi/clck/2021.3.0/etc/postprocessor/table.xml
Database: clck_default, at: "$HOME/.clck/2021.3.1/clck.db".

Running Collect
provider(s): getent, ip, uname
about to copy to shared location
clck-collect temp-shared location created
about to copy env to shared location
clck-collect copied env variables to temp-shared location
accumulate endpoint = tcp://129.6.159.109:49152
Accumulate server started
Starting pre-check............
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001,burn002 'if [[ ! -d /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_ND;elif [[ ! -w /home4/mcgratta/.clck ]] || [[ ! -x /home4/mcgratta/.clck ]] || [[ ! -r /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_NON_RW;else echo CLCK_PRECHECK_OK;fi;stat -c "#####SHAREDDIR_INODE %i#####" /home4/mcgratta/.clck;'


Pre-check completed successfully
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001,burn002 ' . /home4/mcgratta/.clck/env_prop-tempXTCxtx/env_fileYDuWK7; /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/clck_run_provider -c /home4/mcgratta/.clck/clck-collect-tempF9zzDM/temp-configgKcacX -E /home4/mcgratta/.clck/env_prop-tempXTCxtx/env_fileYDuWK7 -e tcp://129.6.159.109:49152 -l debug -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/numactl.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/uname.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ulimit.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/hwloc_dump_hwdata.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpuid.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/opatools.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpupower.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/datconf.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ethtool.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ethtool_show_coalesce.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ntpq.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ibv_devinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/chronyc.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/kernel_tools.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/lscpu.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/printenv.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpuinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ps.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ibstat.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ipaddr.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ofedinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/fw_ver.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/lspci.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/opahfirev.xml -f 
/opt/intel/oneapi/clck/2021.3.0/etc/providers/getent.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ip.xml'

^C
Caught Ctrl-C. Cleaning up.
DO NOT HIT CTRL-C AGAIN. Sending Ctrl-C multiple times halts the cleaning process which can leave processes running on the nodes.
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempF9zzDM/pidfile_burn001);echo $?'

Draining queue of accumulated data providers
Received SIGINT / SIGTERM. Cleaning up and stopping...
burn001: 0


data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn002 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempF9zzDM/pidfile_burn002);echo $?'

burn002: 0

 

Kevin_McGrattan

[mcgratta@burn ~]$ set I_MPI_HYDRA_BOOTSTRAP=SSH
[mcgratta@burn ~]$ clck -f nodefile2 -l debug
Intel(R) Cluster Checker 2021 Update 3 (build 20210615)

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/health_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/health_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/cpu_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): cpuid, cpuinfo, cpupower, hwloc_dump_hwdata, kernel_tools, lscpu,
numactl, uname
analyzer extension: cpu
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/environment_variables_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/environment_variables_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): printenv, uname
analyzer extension: environment
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/ethernet.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/ethernet.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): ethtool, ethtool_show_coalesce, ipaddr, uname
analyzer extension: ethernet
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/infiniband_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): datconf, ibstat, ibv_devinfo, lspci, ofedinfo, ulimit, uname
analyzer extension: infiniband
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/dapl_fabric_providers_present.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/dapl_fabric_providers_present.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): datconf, ibstat, ipaddr, uname
analyzer extension: datconf
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/network_time_uniformity.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/network_time_uniformity.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): chronyc, ntpq, uname
analyzer extension: ntp
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/node_process_status.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/node_process_status.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): ps, uname
analyzer extension: process
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_user.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_user.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
Include: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_base.xml
Opening Fwd: /opt/intel/oneapi/clck/2021.3.0/etc/fwd/opa_base.xml
provider aux path: /opt/intel/oneapi/clck/2021.3.0/provider/share
provider path: /opt/intel/oneapi/clck/2021.3.0/etc/providers
provider(s): fw_ver, lspci, opahfirev, opatools, ulimit, uname
analyzer extension: opa
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
analyzer extension: ulimit
analyzer extension path: /opt/intel/oneapi/clck/2021.3.0/analyzer/intel64/cpp
message catalog path: /opt/intel/oneapi/clck/2021.3.0/kb/data/
message catalog: msg_en.xmc

message schema: /opt/intel/oneapi/clck/2021.3.0/kb/data/msg_schema.xml

Postprocessor config file: /opt/intel/oneapi/clck/2021.3.0/etc/postprocessor/table.xml
Database: clck_default, at: "$HOME/.clck/2021.3.1/clck.db".

Running Collect
provider(s): getent, ip, uname
about to copy to shared location
clck-collect temp-shared location created
about to copy env to shared location
clck-collect copied env variables to temp-shared location
accumulate endpoint = tcp://129.6.159.109:49152
Accumulate server started
Starting pre-check............
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001,burn002 'if [[ ! -d /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_ND;elif [[ ! -w /home4/mcgratta/.clck ]] || [[ ! -x /home4/mcgratta/.clck ]] || [[ ! -r /home4/mcgratta/.clck ]]; then echo CLCK_PRECHECK_NON_RW;else echo CLCK_PRECHECK_OK;fi;stat -c "#####SHAREDDIR_INODE %i#####" /home4/mcgratta/.clck;'


Pre-check completed successfully
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001,burn002 ' . /home4/mcgratta/.clck/env_prop-temppRilnd/env_filemtwajy; /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/clck_run_provider -c /home4/mcgratta/.clck/clck-collect-tempet0Zzc/temp-configAxKwrS -E /home4/mcgratta/.clck/env_prop-temppRilnd/env_filemtwajy -e tcp://129.6.159.109:49152 -l debug -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/numactl.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/uname.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/chronyc.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ethtool_show_coalesce.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ulimit.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ofedinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ibv_devinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/datconf.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ibstat.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpupower.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/lscpu.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/kernel_tools.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpuid.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ntpq.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/printenv.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/cpuinfo.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ps.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ethtool.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/hwloc_dump_hwdata.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ipaddr.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/opahfirev.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/lspci.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/fw_ver.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/getent.xml -f 
/opt/intel/oneapi/clck/2021.3.0/etc/providers/opatools.xml -f /opt/intel/oneapi/clck/2021.3.0/etc/providers/ip.xml'

^C
Caught Ctrl-C. Cleaning up.
DO NOT HIT CTRL-C AGAIN. Sending Ctrl-C multiple times halts the cleaning process which can leave processes running on the nodes.
data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn001 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempet0Zzc/pidfile_burn001);echo $?'

Draining queue of accumulated data providers
Received SIGINT / SIGTERM. Cleaning up and stopping...
burn001: 0


data collection command:
PDSH_SSH_ARGS_APPEND="$PDSH_SSH_ARGS_APPEND $PDSH_SSH_ARGS -oStrictHostKeyChecking=no -oLogLevel=FATAL" PDSH_SSH_ARGS="" /opt/intel/oneapi/clck/2021.3.0/libexec/intel64/pdsh -b -K -w burn002 'kill -SIGINT -- -$(head -n 1 /home4/mcgratta/.clck/clck-collect-tempet0Zzc/pidfile_burn002);echo $?'

burn002: 0

 

ShivaniK_Intel
Moderator

Hi,


We are working on it and will get back to you soon.


Thanks & Regards

Shivani


Kevin_O_Intel1
Employee

I am escalating the issue.


Regards


Kevin_O_Intel1
Employee

"Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel."

