If you add -pg to both your compile and link options, each process should write gmon.out data at normal termination, into the working directory given by that process's own file system environment (not necessarily the environment where you launched MPI). Processes that end up in the same directory will race to write the same gmon.out file, so one can overwrite another, but you can often still get useful data. You may get timing data for statically linked functions that weren't built with -pg, but no call graph for them. Execution time associated with Intel MPI functions is spread more broadly in 4.0 than in 3.2.2, and will be affected by your SPIN environment settings.
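One way around the shared-directory race is to have every process write its own file. The sketch below is only an illustration and assumes a glibc-based Linux system, where the GMON_OUT_PREFIX environment variable makes each process write `<prefix>.<pid>` instead of gmon.out. It also assumes a gprof that supports `-s` for summing several profile files into gmon.sum. The helper names are made up for this example.

```python
import glob
import os


def profiling_env(prefix, base_env=None):
    """Return a copy of the environment with GMON_OUT_PREFIX set.

    Assumption: glibc's gmon support then writes <prefix>.<pid> per process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GMON_OUT_PREFIX"] = prefix
    return env


def find_profiles(prefix, directory="."):
    """List the per-process profile files written with the given prefix."""
    pattern = os.path.join(glob.escape(directory), glob.escape(prefix) + ".*")
    return sorted(glob.glob(pattern))


def gprof_sum_command(binary, profile_files):
    """Build a 'gprof -s' command that merges the profiles into gmon.sum."""
    if not profile_files:
        raise ValueError("no profile files given")
    return ["gprof", "-s", binary] + sorted(profile_files)
```

You would pass `profiling_env("gmon.out")` as the `env` of whatever launches the MPI job, then run the command from `gprof_sum_command("./app", find_profiles("gmon.out"))` and inspect gmon.sum with gprof as usual. Whether the launcher forwards the variable to remote ranks is worth checking; with Intel MPI, `-genv GMON_OUT_PREFIX gmon.out` on the mpirun line may be needed.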
It seems to me that a dedicated tool, such as Intel Trace Analyzer and Collector, would be a better choice. Of course, that depends on the purpose of your profiling. As a first step, you can try setting the I_MPI_STATS environment variable and then look for the stats.txt file it produces.
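For the I_MPI_STATS suggestion, a minimal sketch of setting the variable for a run might look like this. The level 4 default, the `mpirun -n` command line, and the helper names are assumptions for illustration. The accepted values depend on your Intel MPI version, so check its reference manual.

```python
import os


def stats_env(level=4, base_env=None):
    """Return a copy of the environment with I_MPI_STATS set.

    Assumption: the installed Intel MPI accepts a numeric statistics level.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_STATS"] = str(level)
    return env


def launch_command(binary, nranks):
    """Build an mpirun command line (assumed launcher syntax)."""
    return ["mpirun", "-n", str(nranks), binary]
```

Running `subprocess.run(launch_command("./app", 4), env=stats_env(), check=True)` should then leave a stats.txt in the working directory if the library supports the variable.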