I am currently facing a problem while trying to scale my application. I am not sure what the exact issue is, but I will try to explain it. I compiled the WRF model with ifort, icc, mpiifort, and mpiicc from Intel Parallel Studio 2020.0.166 (compiler version 19.1.0.166). Up to this point, everything is fine. The problems start when I try to run the WRF model on more than 16 nodes.
To do so, I have to use the following command:
export FI_PROVIDER=sockets; export I_MPI_HYDRA_BRANCH_COUNT=0; mpirun -machinefile hosts.txt -np 720 ./wrf.exe
If I do not set the variable I_MPI_HYDRA_BRANCH_COUNT, I can only run jobs on up to 16 nodes.
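To make the two cases easy to reproduce side by side, here is a minimal sketch of a helper that only prints the launch line, with and without I_MPI_HYDRA_BRANCH_COUNT. The function name wrf_launch_cmd is hypothetical and exists only for illustration; the variables, hosts file, and process count are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of WRF or Intel MPI): builds the launch line
# and prints it instead of running it, so both variants can be compared.
wrf_launch_cmd() {
    hosts_file=$1
    nprocs=$2
    branch_count=${3-}   # empty means: leave I_MPI_HYDRA_BRANCH_COUNT unset

    env_part="FI_PROVIDER=sockets"
    if [ -n "$branch_count" ]; then
        env_part="$env_part I_MPI_HYDRA_BRANCH_COUNT=$branch_count"
    fi

    printf '%s mpirun -machinefile %s -np %s ./wrf.exe\n' \
        "$env_part" "$hosts_file" "$nprocs"
}

# Variant that works beyond 16 nodes for me (branch count set to 0).
wrf_launch_cmd hosts.txt 720 0
# Variant that only works up to 16 nodes (variable left unset).
wrf_launch_cmd hosts.txt 720
```

Each printed line can be pasted into the shell as is, since the variables are given as a prefix to mpirun rather than exported.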
However, if I use Intel Parallel Studio 2016.2.181 (compiler version 16.0.2), I can run the WRF model without setting I_MPI_HYDRA_BRANCH_COUNT. At the same time, the WRF model performs better by a factor of 2 compared to the build with the newer version of the Intel compiler.
I wonder whether this is an issue with my shell environment (I am using CentOS 7), or whether the newer version of the Intel compiler requires new flags or something similar.
Let me know your thoughts.
Thank you in advance.
You are aware, I hope, that "everyone" will probably not have access to a multiprocessor system with the ability to run 720 processes, let alone the experience to pinpoint the reason for the slowdown!