I'm in the process of setting up a new development environment for myself; I'm working mostly on Fortran programs, some C/C++, which are parallelised with either OpenMP or MPI. These programs are computational electro-magnetic simulations for geophysical applications: lots of number-crunching and large data sets.
We use the Intel compilers and aim to have our programs build and execute on both Windows desktops and on the company's large Linux clusters. I'm leaning towards installing VTune on my Windows desktop (dual quad-core Xeons right now, plenty of RAM) rather than the Linux development system for a variety of reasons.
My question is this: if I use VTune (etc.) to optimise the performance of a program on a Windows PC, will the optimisations I make also optimise it for execution on the Linux clusters? I can see a number of potential differences arising from process-process communication in MPI programs, but what else should I be on the lookout for?
Any opinions and suggestions welcome.
This is a complex question; the answer includes many ifs:
If you optimize for general single-threaded execution on Windows, you will most probably see the improvement on the Linux cluster as well.
If you optimize for low-level microarchitecture features, such as the size of the L2 and higher-level caches, the improvement on the Linux machine might look different, because the machine itself might be different. So you have to keep the optimization reasonably general (e.g. increase data locality rather than tune data structures to a particular cache size).
If you optimize communication and input/output on Windows, the gains may not carry over to Linux at all.
If you optimize for multithreading on the Windows platform, you might not find the expected improvement on the Linux cluster. Otherwise, you would have to consider the MPI-based application on Windows as well.
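To illustrate the point above about improving data locality rather than tuning for one cache size, here is a minimal sketch in C. The function names `sum_memory_order` and `sum_strided` are made up for this example and do not come from any library. Both functions sum the same matrix; only the loop order differs. The first walks memory contiguously, which tends to help on any cache hierarchy, so a gain measured on the Windows desktop should be more likely to carry over to the cluster nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example functions, not part of any library.
   The matrix is n-by-n and stored in row-major order (the C convention):
   element (i, j) lives at a[i * n + j]. */

/* Inner loop walks consecutive addresses, so each fetched cache line
   is fully used regardless of the actual cache size. */
double sum_memory_order(const double *a, size_t n)
{
    assert(a != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            total += a[i * n + j];
    return total;
}

/* Same mathematical sum (floating-point rounding may differ because the
   order of additions differs), but the inner loop jumps n elements at a
   time, so for large n most accesses touch a different cache line. */
double sum_strided(const double *a, size_t n)
{
    assert(a != NULL);
    double total = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            total += a[i * n + j];
    return total;
}
```

Note that Fortran arrays are column-major, so in Fortran code the roles are reversed: the first index should vary fastest in the innermost loop.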
I can see a number of potential differences arising from process-process communication in MPI programs, but what else should I be on the lookout for?
FYI, VTune doesn't support the MPI-based multiprocessing model. Developers usually use VTune for single-node optimization of an application that runs on a cluster. This improvement will scale to the rest of the nodes.
If communication is the performance bottleneck of the MPI application, we use Intel Trace Collector/Analyzer, which is part of the Intel Cluster Toolkit.
Hope this helps.
Thanks Vladimir, that certainly does help. Yes, it's a complex question with many ifs and buts. But it's very useful to have input such as yours to make me think this through in detail. Thanks too for pointing out the limitations of VTune with respect to MPI.