I'm currently profiling a parallel Fortran program (using Intel MPI) for performance tuning. I compiled the program with the "-profile-functions -profile-loops=all" options, then ran it with 65 processes. After the run I got a single loop_prof_funcs_xxxx.dump file and a single loop_prof_loops_xxxx.dump file, showing the execution cost (in time) of functions and loops for the job as a whole. Is there any way to generate the same files per process, so that we can study the performance of individual processes?
Please advise. Your guidance is highly appreciated!