
Floating Point Exception Overflow and Open MPI equivalent tuning?

Stone

I am working with STAR-CCM+ 2019.1.1 Build 14.02.012
CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with STAR-CCM+)
Cisco UCS cluster using the usNIC fabric over 10GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed


On runs shorter than about 5 hours, everything works flawlessly and is quite fast.

However, when running with 280 cores, the longer jobs die with a floating point exception at or around 5 hours into the job.
The same job completes fine with 140 cores, but takes about 14 hours to finish.
I am also using PBS Pro with a 99-hour wall time.

------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow].  The specific cause cannot be identified.  Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
   error: Server Error
------------------

I have been doing some reading, and some reports suggest that other MPI implementations are more stable with STAR-CCM+.

I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

I am also trying to make Open MPI work. I have Open MPI compiled and it runs, but only with a very small number of CPUs; anything over about 2 cores per node hangs indefinitely.

I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is the version my STAR-CCM+ release supports. I am telling STAR-CCM+ to use the Open MPI that I installed so it can support the Cisco usNIC fabric, which I can verify using Cisco native tools. Note that STAR-CCM+ also ships with its own Open MPI; however, I am using my own build so that usNIC support is included.
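
For reference, this is roughly how I built it (a sketch, not my exact commands; the install prefix and the /usr libfabric location are assumptions for my setup):

# Sketch of the Open MPI 3.1.3 build; prefixes and paths are assumptions for this cluster.
tar xjf openmpi-3.1.3.tar.bz2
cd openmpi-3.1.3

# Point configure at the Cisco libfabric install so the usnic BTL gets built in.
./configure --prefix=/opt/openmpi-3.1.3 --with-libfabric=/usr

make -j 16 && make install

# Verify the usnic BTL is actually present in the resulting build.
/opt/openmpi-3.1.3/bin/ompi_info | grep -i usnic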

I am thinking that I need to tune Open MPI, just as was required with Intel MPI.

With Intel MPI, jobs with more than about 100 cores would hang until I added the parameters below (wired into my job script as sketched after the list):

reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
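
For completeness, this is roughly how those variables sit in my PBS job script, together with the debug and fabric settings I use to confirm that DAPL UD is actually selected (a sketch; the node/core geometry and the STAR-CCM+ launch line are assumptions abbreviated for this post):

#PBS -l walltime=99:00:00
#PBS -l select=7:ncpus=40:mpiprocs=40   # assumption: 7 nodes x 40 cores = 280 ranks

# Print fabric selection at startup so I can confirm the DAPL UD path is really taken.
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_UD=enable

# ...the eight I_MPI_DAPL_UD_* exports listed above go here...

# Launch line abbreviated; the exact starccm+ flags depend on the site setup.
starccm+ -np 280 -batch run mysim.sim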

After adding these parameters I can scale to 280 cores and it runs very fast, right up until the point where it hits the floating point exception.

I am struggling to find equivalent tuning parameters for Open MPI, and to resolve the floating point overflow.

I have listed all the MCA parameters available in Open MPI and have tried setting the ones below (applied as sketched after the list), with no success.

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
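
In case how I apply them matters, this is roughly how the MCA parameters are being passed (a sketch; the per-user config file, the explicit BTL selection, and the test program name are my own choices/placeholders, not anything STAR-CCM+ requires):

# Option 1: per-user defaults in $HOME/.openmpi/mca-params.conf,
# which my Open MPI build reads automatically.
btl = usnic,vader,self
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208

# Option 2: the same settings on the mpirun command line, with verbose
# BTL output so I can see whether the usnic BTL is selected and where it stalls.
mpirun --mca btl usnic,vader,self \
       --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 \
       --mca btl_base_verbose 100 \
       -np 4 ./mpi_hello    # placeholder test program, not the real STAR-CCM+ job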


Does anyone have any advice or ideas on:

1.) the floating point overflow issue,
and
2.) equivalent tuning parameters for Open MPI?

Many thanks in advance
