I have a problem with the parallel version of a numerical simulation program for smoke and heat transport from fires, called FDS (see www.fds-smv.net). The program is written in Fortran90 (except for one routine in C) and compiled with the Intel 10.1 compiler. FDS can be run both serially and in parallel. Over the last few years I have used FDS on a Linux cluster of up to 18 Intel Xeon nodes without notable problems. Now I am porting FDS to a new Windows cluster. The first problem was that I had to add several compiler options (/Qftz /fpe3 /Qzero) to get the code to run at all.
Now the code starts on any number of processors of the Windows cluster without problems, but its runtime behaviour differs from that on the Linux cluster: the code runs for many hours (typical applications may take one or more weeks!), then it stalls while still showing the status 'Running' in the job manager of the Windows HPC cluster. There appears to be a problem with the MPI communication routines: some processes seem to wait for others (while still appearing busy), while others iterate much faster, even though each iteration contains an MPI_BARRIER call for synchronization!
This problem does not occur at all on the Linux cluster, where the same code, also compiled with the Intel compiler, runs for weeks without any changes to the code! After searching and debugging for many days I have no idea what to try next, and I would be very grateful if somebody could advise me on how to solve this. Is this a known problem, and has anyone else seen similar behaviour?
Many thanks in advance,
Could you please clarify which MPI implementation you use? Do I understand correctly that you have a Windows HPC 2008 based cluster?
I'm not familiar with your particular issue, but the problem may even be caused by an OS issue. For instance, there was a known problem in Windows Server 2003: "A network communication program may stop responding when you use Windows Server 2003 to implement Winsock Direct in a fast SAN environment" (http://support.microsoft.com/?kbid=910481).
Many thanks for your reply! Yes, we use a Windows HPC 2008 cluster with Microsoft's MPI implementation, MS-MPI. It is probably worthwhile to try another standard MPI implementation. It really does seem to be a communication problem, or at least related to one, because it should not be possible for individual processes to start the next iteration while others have not yet passed the MPI_BARRIER call of the preceding iteration. But exactly this happens, with some processes hundreds of iterations ahead of others! I will follow your hint and check whether there is a connection.
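For reference, the synchronization pattern described above boils down to the following. This is only a minimal sketch, not the actual FDS source; the loop bound and the placement of the computation are placeholders:

```fortran
PROGRAM BARRIER_LOOP
   USE MPI          ! assumption: an MPI module is available; older codes use INCLUDE 'mpif.h'
   IMPLICIT NONE
   INTEGER :: IERR, MYID, N

   CALL MPI_INIT(IERR)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)

   DO N = 1, 1000
      ! ... per-iteration computation and point-to-point exchanges ...

      ! By the MPI standard, no rank may return from this call until
      ! every rank in MPI_COMM_WORLD has entered it, so no rank should
      ! be able to begin iteration N+1 before all ranks finish iteration N.
      CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
   END DO

   CALL MPI_FINALIZE(IERR)
END PROGRAM BARRIER_LOOP
```

If ranks nevertheless drift hundreds of iterations apart, either the barrier is not being honored by the MPI layer, or the iteration counts being observed (e.g. via buffered diagnostic output) are misleading; both point away from the application logic itself.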
You may also want to try other MPI implementations and their various tuning options. Personally, I would be interested to see how the Intel MPI Library behaves in your environment. You can get an evaluation version at http://www.intel.com/go/mpi if you decide to try it.
Yes, I will, because it can hardly be a problem in the code itself; it must be related to some system property, and the biggest difference from the working Linux setup is the MPI implementation. I will test the evaluation version you mentioned.
Many thanks again