Floating Point Exception Overflow and OpenMPI equiv tunning?

Stone · ‎09-01-2019

I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is version shipped with star ccm+)
Cisco UCS cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed

On runs less than 5 hours, everything works flawlessly and is quite fast.

However when running with 280 cores at or around 5 hours into a job, the longer jobs die with the floating point exception.
The same job completes fine with 140 cores, but takes about 14 hours to finish.
Also I am using PBS Pro with 99 hour wall time

------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------

I have been doing some reading and some say that using other MPI are more stable with Star CCM.

I have not ruled out that I am missing some parameters or tuning with Intel MPI as this is a new cluster.

I am also trying to make Open MPI work. I have openmpi compiled and it runs, however only with very small number of CPU. Anything over about 2 cores per node it hangs indefinately.

I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what Star CCM version I am running supports. I am telling star to use the open mpi that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools. Note that star also ships with openmpi however

I am thinking that I need to tune OpenMPI, which was also requried with Intel MPI.

With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:

reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parms I can scale to 280 cores and it runs very fast, up until the point where it gets the floating point exception.

I am struggling trying to find equivelant turning parms for Open MPI or resolve the floating point overflow.

I have listed all the MCA available with Open using MCA, and have tried setting these parms with no success.

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1

Does anyone have any advice or ideas for:

1.) The floating point overflow issue
and
2.) Know of equivelant tuning parms for Open MPI

Many thanks in advance