I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (the version shipped with STAR-CCM+)
Cisco UCS cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores
enic RPM version kmod-enic-184.108.40.206-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-220.127.116.11-738.18.rhel7u6.x86_64 installed
enic modinfo version: 18.104.22.168
enic loaded module version: 22.214.171.124
usnic_verbs modinfo version: 126.96.36.199
usnic_verbs loaded module version: 188.8.131.52
libdaplusnic RPM version 2.0.39cisco184.108.40.206 installed
libfabric RPM version 1.6.0cisco220.127.116.11.rhel7u6 installed
On runs shorter than about 5 hours, everything works flawlessly and is quite fast.
However, jobs running on all 280 cores die with a floating point exception at roughly the 5-hour mark.
The same job completes fine on 140 cores, but takes about 14 hours to finish.
I am also using PBS Pro with a 99-hour wall time.
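For context, the job is submitted roughly like this (a sketch only: the resource line matches my 7-node/280-core layout, but the paths, file names, and STAR-CCM+ options shown are illustrative, not my exact script):

```shell
#!/bin/bash
#PBS -N starccm-run
#PBS -l select=7:ncpus=40:mpiprocs=40   # 7 nodes x 40 cores = 280 ranks
#PBS -l walltime=99:00:00               # 99-hour wall time

cd "$PBS_O_WORKDIR"

# Illustrative launch: sim file and macro names are placeholders.
# -machinefile hands STAR-CCM+ the node list PBS allocated.
starccm+ -batch -np 280 -machinefile "$PBS_NODEFILE" sim_file.sim
```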
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
error: Server Error
I have read that other MPI implementations can be more stable with STAR-CCM+.
I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.
I am also trying to make Open MPI work. I have Open MPI compiled and it runs, but only with a very small number of CPUs; anything over about 2 cores per node hangs indefinitely.
I compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is the version my STAR-CCM+ release supports. I point STAR-CCM+ at my own build so it can use the Cisco usNIC fabric, which I can verify with Cisco's native tools. (Note that STAR-CCM+ also ships with its own Open MPI.)
I suspect I need to tune Open MPI, just as I had to with Intel MPI.
With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:
After adding these parameters I can scale to 280 cores and it runs very fast, right up until the floating point exception occurs.
I am struggling to find equivalent tuning parameters for Open MPI, or to resolve the floating point overflow.
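For anyone hitting the same hang, the kind of Intel MPI 2018 knobs involved look like this (a sketch only: the variable names are real Intel MPI 2018 settings, but the values are illustrative and not necessarily what I used):

```shell
# Sketch: Intel MPI 2018 fabric selection / DAPL tuning (illustrative values).
# On the Cisco usNIC fabric, Intel MPI reaches the NIC through DAPL (libdaplusnic).
export I_MPI_FABRICS=shm:dapl    # shared memory intra-node, DAPL inter-node
export I_MPI_DAPL_UD=enable      # connectionless UD mode scales better at high rank counts
export I_MPI_DEBUG=5             # print fabric/provider selection at startup to confirm usNIC is used
```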
I have listed all the available MCA parameters with ompi_info, and have tried setting the following with no success:
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
Does anyone have any advice or ideas for:
1.) The floating point overflow issue
2.) Equivalent tuning parameters for Open MPI
Many thanks in advance