Intel® Parallel Studio XE installations on HPC cluster environment

ahmad__haseeb · ‎03-09-2020

Dear Intel team and users, I was trying to install Intel Parallel Studio XE 2019 Cluster Edition on my university HPC cluster. The head node already had this Intel product installed, and we later added 6 compute nodes to it, so we now have a 6-node cluster with a head node. I followed the Intel installation guide, but the setup failed with an error and created a log file (attached here). The setup correctly picked up the nodes from the machines.LINUX file, and we ran it as the root user. The only thing that makes me doubtful is that the setup was looking for /opt/intel on the nodes, while on the head node the product was installed in a different directory (/export/installs/), although this is a common, shared directory. Any help in this regard will be highly appreciated.

I tested for the presence of ifort on the nodes. Interestingly, sometimes it correctly prints the path and loads the Intel environment on the nodes, and sometimes it produces an error! Can someone please guide me on the procedure and tell me what I am doing wrong here?

Regards,

Haseeb Ahmad

PrasanthD_intel · ‎03-10-2020

Hi,

From the log file, it seems that the installation failed due to insufficient disk space. Is there enough space on the nodes to install the software? The minimum requirement for the Cluster Edition is 16 GB (https://software.intel.com/en-us/parallel-studio-xe/documentation/system-requirements).

Could you please check on this.

This might be a reason for failure of installation.
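
One way to run this check is sketched below. It assumes a POSIX `df`; the helper name `has_free_gib` is hypothetical and not part of the Intel installer, and `/` only stands in for whatever directory the installer targets on a node.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file system holding PATH has at
# least MIN_GIB gibibytes available. Not part of the Intel installer.
has_free_gib() {
    local path=$1 min_gib=$2 avail_kib
    # df -P prints one POSIX-format line per file system; -k reports 1024-byte blocks.
    avail_kib=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kib" ] && [ "$avail_kib" -ge $(( min_gib * 1024 * 1024 )) ]
}

# 16 GB is the minimum quoted above; "/" is a placeholder for the install target.
if has_free_gib / 16; then
    echo "enough space on $(uname -n)"
else
    echo "less than 16 GiB free on $(uname -n)"
fi
```

To cover every compute node, the same script could be run over ssh for each host listed in machines.LINUX.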

Thanks

Prasanth

PrasanthD_intel · ‎03-17-2020

Hi,

Is your problem resolved? Have you checked for free space on your nodes?

Please reach out to us if you are facing any problems.

Thanks

Prasanth

ahmad__haseeb · ‎03-22-2020

Dear Prasanth, yes, I have checked the disk status of all the nodes in my cluster. I found that plenty of free space was available on each node, but the partitions of some of the nodes were not showing in the NFS/shared partition list, so I thought that might be the problem. To test this, I modified my machines.LINUX file to list only those nodes that have the shared partition active, then ran the Intel setup again. This time the setup didn't find anything to install, which means that it was originally installed on the nodes the setup could access! Am I right? I am also unsure whether my installation was successful or not.

Q2: If I add more nodes in the future, should I update the machines.LINUX file and run the setup again? And what if I want to update ifort to a new version? Should I first delete /opt/intel from the head node and redo all the installations?

Moreover, I was getting some MPI abort errors, which I will post if I can't solve that issue.

Many thanks,

Haseeb Ahmad.

ahmad__haseeb · ‎03-26-2020

Dear Prasanth,

I am facing the following MPI error. I was using two nodes via a Slurm job script, but the output lists more nodes as well:

Abort(67708674) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7fd292591700, rbuf=0x7fdb9d6f2d00, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(134817538) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7f6c2d84c700, rbuf=0x7f753884ad00, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(403252994) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7f3e62e00700, rbuf=0x7f477d812d00, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(604579586) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7f85328e4700, rbuf=0x7f8e28456d00, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
[cli_2]: readline failed

PrasanthD_intel · ‎03-27-2020

Hi Haseeb,

Do the following:

1) Install Parallel Studio once, in the shared file system (/nfs).

2) Mount /nfs on all the other nodes.

You don't need to install it on each node. The installation lives in the shared file system, and if you add a new node in the future, you just have to mount the file system on that node.

To check whether the installation is successful, try to run a simple MPI hello world application on all nodes.

ifort comes with Parallel Studio. If you want another version, you can install it separately and source that version's scripts to load that Fortran compiler.

Thanks

Prasanth

ahmad__haseeb · ‎03-29-2020

Dear Prasanth,

Thanks for the help. Now I am having an issue (possibly with Intel MPI).

Some MPI processes are demanding a huge amount of virtual memory, and hence the job is not progressing! Could there be a memory leak in Intel MPI? I am allocating more memory per task than required via mem-per-cpu in the Slurm job file. Using the top command during execution, I found that with a small increase in RES memory, the virtual memory increases very sharply!

The error is something like this:

slurmstepd: error: Step 1893.1 exceeded virtual memory limit (185029212 > 138412032), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: Step 1893.2 exceeded virtual memory limit (185017888 > 138412032), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: compute-0-2: task 0: Killed
slurmstepd: error: *** STEP 1893.1 ON compute-0-2 CANCELLED AT 2020-03-30T00:48:21 ***
slurmstepd: error: *** STEP 1893.2 ON compute-0-3 CANCELLED AT 2020-03-30T00:48:21 ***
srun: error: compute-0-3: task 0: Killed
[mpiexec@compute-0-2.local] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2047): assert (exitcodes != NULL) failed

Thanks,

Haseeb Ahmad

PrasanthD_intel · ‎04-02-2020

Hi Haseeb,

We need some details to debug the error:

1) The virtual memory limit you have set.

2) How many nodes are active.

3) The commands you are using to launch the job (mpirun or srun).

4) The log after exporting I_MPI_DEBUG=5.
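
A minimal sketch of how the log in item 4 could be captured. Intel MPI prints I_MPI_DEBUG output to standard output by default, so in a Slurm batch job it should end up in the file named by --output. The launch line, the binary name `./my_mpi_app`, and the file name `mpi_debug.log` are illustrative assumptions, not commands taken from this thread.

```shell
#!/bin/bash
# Ask Intel MPI to print level-5 debug information when the job starts.
export I_MPI_DEBUG=5

# Illustrative launch line (assumption): replace ./my_mpi_app with the real
# binary. Left commented out so the snippet is safe to paste anywhere.
# mpirun -np 8 ./my_mpi_app 2>&1 | tee mpi_debug.log

echo "I_MPI_DEBUG=$I_MPI_DEBUG"
```

The variable has to be exported before mpirun is called, so in a batch script it belongs above the launch line.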

Thanks

Prasanth

ahmad__haseeb · ‎04-02-2020

Dear Prasanth, thanks for looking into this.

For the first question, there is no virtual memory limit in the system, as can be seen in the output of the following command:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514532
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

---------------------------------------------------------------------------------------------------

The second and third questions are answered by the following Slurm job script. There are actually 6 nodes in the cluster, but three are shut down, which is why I am explicitly giving the node names in the batch file (although Slurm takes care of these things by default, I think).

#!/bin/bash
#SBATCH --job-name=H1
#SBATCH --partition=debug
#SBATCH --output=job.%J.out
#SBATCH -t 999:00:00
#SBATCH --tasks-per-node=4
#SBATCH --nodes=2
#SBATCH --nodelist=compute-0-2,compute-0-3
#SBATCH --mem-per-cpu=30G
#SBATCH --cpus-per-task=1 ### Number of threads per task (OMP threads)
#SBATCH --error=job.%J.err

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

nodes=2
tasks_per_node=4
nthreads=1
ncpu=`echo $nodes $tasks_per_node | awk '{print $1*$2}'`

bindir=/export/installs/Yambo_intel/Yambo4.5.1_OpenMP/yambo-4.5.1/bin
echo "Running on $ncpu MPI, $nthreads OpenMP threads"
srun hostname
source /export/installs/intelcc/parallel_studio_xe_2019.1.053/bin/psxevars.sh
which mpirun

export I_MPI_HYDRA_TOPOLIB=ipl
mpirun -np $ncpu $bindir/yambo -F bse_inv.in -C bse/130-200/
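
The limit in the slurmstepd message quoted earlier in the thread (138412032) can be reproduced from this script's memory request, which suggests that it is imposed by Slurm rather than by ulimit. The sketch below assumes that the figure is in KiB and that slurm.conf sets VSizeFactor=110 (the virtual memory limit as a percentage of the job's real memory limit). Neither assumption is confirmed anywhere in the thread.

```shell
#!/bin/bash
mem_per_cpu_gib=30     # from --mem-per-cpu=30G
tasks_per_node=4       # from --tasks-per-node=4, one CPU per task
vsize_factor=110       # ASSUMPTION: VSizeFactor in slurm.conf, not shown in the thread

# Real memory granted per node, converted from GiB to KiB.
real_limit_kib=$(( mem_per_cpu_gib * tasks_per_node * 1024 * 1024 ))
# Virtual memory limit derived from it as a percentage.
virt_limit_kib=$(( real_limit_kib * vsize_factor / 100 ))

echo "real memory limit per node:    $real_limit_kib KiB"
echo "virtual memory limit per node: $virt_limit_kib KiB"
```

If this is what is happening, `scontrol show config | grep -i VSizeFactor` on the cluster should show the setting.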

About your fourth question (Also, provide the log after exporting I_MPI_DEBUG=5):

I really apologize, I don't know about this! How can I get this log? I mean, where is it stored?

Regards,

Haseeb Ahmad

ahmad__haseeb · ‎04-03-2020

Dear Prasanth,

If I run a small job to ensure that the memory issue never comes into the picture, then the following frightening-looking error arises!

Intel(R) Parallel Studio XE 2019 Update 1 for Linux*
Copyright (C) 2009-2018 Intel Corporation. All rights reserved.
/export/installs/intelcc/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpirun
Abort(604579586) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7f741079e2c0, rbuf=0x7f78ccc9ece0, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(671688450) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7fc6b1095500, rbuf=0x7fcb8cc56ce0, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(805906178) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7fb9b3b42c00, rbuf=0x7fbe67682ce0, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
Abort(805906178) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(448): MPI_Allreduce(sbuf=0x7f9a85f3de40, rbuf=0x7f9f4ec16ce0, count=-2104727296, MPI_COMPLEX, MPI_SUM, MPI_COMM_WORLD) failed
PMPI_Allreduce(389): Negative count, value is -2104727296
[cli_0]: readline failed

Thanks,

Haseeb Ahmad

PrasanthD_intel · ‎04-06-2020

Hi Haseeb,

We are forwarding this issue to the respective team.

Thanks

Prasanth

Michael_Intel · ‎04-08-2020

Hi Haseeb,

I see there are basically two issues in this support request:

1) Cluster setup issue

2) Application specific issue

In the future, please open one case per issue so that we can track them better.

For the first, cluster-setup-related issue, you need to make sure that the Intel developer software components sit in the same absolute path on each node. This can be achieved either by installing the tools in the same directory structure on each node or by installing them on the head node, which exports the file system so that all compute nodes mount it at the same path. Please note that the same is true for the user home directories.
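
A small sketch of how the "same absolute path on each node" condition could be checked. The helper `all_same` is hypothetical, the srun line assumes a Slurm allocation in which psxevars.sh has already been sourced, and the paths in the demonstration are made up.

```shell
#!/bin/bash
# Hypothetical helper: read one path per line on stdin and succeed only if
# there is at least one non-empty line and all lines are identical.
all_same() {
    [ "$(sort -u | grep -c .)" -eq 1 ]
}

# On the cluster (assumption), each node reports where it finds ifort:
#   srun --ntasks-per-node=1 which ifort | all_same && echo "paths agree"

# Self-contained demonstration with made-up paths:
printf '%s\n' /export/installs/intelcc/bin/ifort /export/installs/intelcc/bin/ifort \
    | all_same && echo "paths agree"
printf '%s\n' /export/installs/intelcc/bin/ifort /opt/intel/bin/ifort \
    | all_same || echo "paths differ"
```

A node where ifort is missing prints nothing to standard output, so the number of lines should also be compared with the number of nodes.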

With regard to the second, application-specific issue, the Intel MPI Library already points out that you have an invalid (negative) count in your MPI_Allreduce operation, so please watch out for possible integer overflows. Here you may use the ITAC message checker to identify coding issues that violate the MPI standard. Please set LD_PRELOAD=libVTmc.so to use the message checker feature from ITAC.
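
The negative count in the error stack is consistent with a 32-bit signed integer wrapping around once. As a worked sketch under that assumption (the application's real element count is not given in this thread), a count of 2190240000 exceeds the largest 32-bit signed value, 2147483647, and wraps to the reported value. The helper name `wrap_int32` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reduce a non-negative 64-bit value to what a 32-bit
# signed integer (the type of the MPI_Allreduce count argument) would hold.
wrap_int32() {
    echo $(( ( ($1 + 2147483648) % 4294967296 ) - 2147483648 ))
}

wrap_int32 2147483647   # largest count that still fits, printed unchanged
wrap_int32 2190240000   # ASSUMED true count; prints -2104727296
```

If the application really needs that many elements in one reduction, a common workaround is to split the call into chunks of fewer than 2^31 elements each.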

Furthermore, please let me know if this issue is resolved.

Best regards,

Michael