ICS 2011 is 15% slower than ICT 2010 on the same cluster with "basic install"
On our small cluster (12 nodes, 144 cores) I have installed the new Intel Cluster Studio (ICS) 2011. I did not uninstall Intel Cluster Toolkit (ICT) 2010. All of our programs (Fortran and C++ codes) are 15% slower when we start them with the mpirun of ICS 2011. I don't understand why... I did a normal installation and I didn't notice any problem during it. I installed ICS 2011 with the same method I used for ICT 2010.
About our hardware/software: master: Intel Xeon CPU E5620; nodes: Intel Xeon CPU X5650; OS: CentOS 5.5. We are using InfiniBand DDR (OFED-1.5.1 driver); I_MPI_FABRIC is set to shm:ofa; pinning is disabled.
We have recompiled our programs with the Intel 12.0 and Intel 11.1 compilers and the problem appears in both cases... so it is not a compiler problem.
First of all, could you check performance with default parameters? Please run IMB-MPI1 for both cases and compare; let me know if the difference is still that big. Note that there is an 'S' at the end of I_MPI_FABRICS. Please run with I_MPI_DEBUG=9 and compare the fabrics selected at run time in both cases. You can also compare the settings for the collective operations. If you cannot find the reason for the different behaviour, please submit a tracker at premier.intel.com and attach the log files you got with I_MPI_DEBUG=9.
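For reference, a sketch of such a comparison run; the hostfile name ("hosts") and the mpivars.sh path are assumptions, adjust them to your install:

```shell
# Sketch: benchmark one MPI installation with default settings.
# Source the environment of the install under test (path is an assumption).
source /opt/intel/impi/4.0.1.007/bin64/mpivars.sh
export I_MPI_DEBUG=9          # print fabric and pinning choices at start-up
export I_MPI_FABRICS=shm:ofa  # note the trailing 'S'
# "hosts" is a placeholder machine file listing the cluster nodes.
mpirun -f hosts -n 48 IMB-MPI1 2>&1 | tee imb_ics2011.log
# Repeat after sourcing the other install's mpivars.sh, then diff the
# I_MPI_DEBUG output in the two logs to compare the selected fabrics.
```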
Thanks for the 'S' at the end of I_MPI_FABRICS... it is now corrected, but it doesn't seem to solve the performance problem. Before I copy/paste the results of the IMB benchmark, I would like to be sure that I'm using all the Intel MPI defaults. How can I be sure that I don't use mpitune-optimized data files?
Mpitune settings are used only if the '-tune' option is passed to mpiexec. Also, please check the environment with 'set | grep I_MPI_'. Ideally you should see only I_MPI_ROOT.
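As a small sketch of that check (the helper name check_impi_env is mine, not part of any Intel tool), you can flag anything besides I_MPI_ROOT automatically:

```shell
# List every I_MPI_ variable currently set and flag anything other than
# I_MPI_ROOT, since stray settings override the library defaults we want
# to benchmark against.
check_impi_env() {
  set | grep '^I_MPI_' | while IFS= read -r line; do
    case "$line" in
      I_MPI_ROOT=*) printf 'ok:    %s\n' "$line" ;;
      *)            printf 'check: %s\n' "$line" ;;
    esac
  done
}
check_impi_env
```

Anything printed with a "check:" prefix should be unset (or deliberately kept) before comparing the two installations.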
I've just found your question about 3 programs (you start 3 programs with mpirun) on the MKL forum... I should mention that the Intel MPI Library uses internal pinning (it's ON by default), so rank 0 of every program will be pinned to processor #0 and you can get performance degradation. If you run more than one MPI job on the same nodes, you need to switch pinning OFF with 'export I_MPI_PIN=0'.
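A minimal sketch of that workaround, assuming two placeholder programs sim_a and sim_b sharing the same nodes:

```shell
# Disable Intel MPI's internal pinning before launching several jobs on
# the same nodes; otherwise rank 0 of each job is pinned to the same core.
export I_MPI_PIN=0
mpirun -n 16 ./sim_a > sim_a.log 2>&1 &   # ./sim_a, ./sim_b are placeholders
mpirun -n 16 ./sim_b > sim_b.log 2>&1 &
wait
```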
So here is my environment for ICS 2011; 'set | grep I_MPI_' gives:
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_F77=ifort
I_MPI_F90=ifort
I_MPI_FABRICS=shm:ofa
I_MPI_FC=ifort
I_MPI_MPD_RSH=ssh
I_MPI_PIN=1
I_MPI_ROOT=/opt/intel/impi/4.0.1.007
I_MPI_TUNER_DATA_DIR=/opt/intel/impi/4.0.1/etc64/
With ICT 2010 here is the output:
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_F77=ifort
I_MPI_F90=ifort
I_MPI_FABRICS=shm:ofa
I_MPI_FC=ifort
I_MPI_MPD_RSH=ssh
I_MPI_PIN=0
I_MPI_ROOT=/opt/intel/impi/4.0.0.028
I_MPI_TUNER_DATA_DIR=/opt/intel/impi/4.0.0/etc64/
The big difference is the pinning. With ICT I had to disable it: I had problems with pinning when I started several jobs on the same node. With ICS this problem seems to be solved, so pinning is activated.
So I have started the Intel IMB benchmark on 12 nodes with ppn=4, i.e. 48 processes. I can't use the whole cluster... important simulations are running on it. I attach the 2 logs.
We can see that ICS 2011 gets the best results almost everywhere (except Barrier). But the problem is still there: with ICT 2010 and pinning disabled, our simulations run faster (between 10 and 15%) than with ICS 2011 and pinning activated!?!
I think the problem is solved; it was not a problem with the Intel software. We had problems with the "master" node of the cluster, so my tests were running with a degraded master (12 GB of RAM instead of 48 GB). The tests ran on the compute nodes, not on the master, but it seems that the master's memory influences the results (the disks are mounted from the master)... at the moment everything is OK.