I have been compiling a fairly large Fortran application with the Intel v8.1 compiler, using OpenMP directives for parallelism on dual-socket IA-32 machines.
All subroutines (a few are in ANSI C) have been instrumented with the -pg compile switch so that I can run gprof later to check on performance. According to the "flat profile" in the gprof output, the code runs twice as fast on two processors as on a single processor.
However, the code also calls difftime (from C) to give me the elapsed execution time for the software. Comparing the elapsed times gives me only a 30% improvement in speed. According to gprof, I should expect a 50% improvement.
I don't understand why the two methods of clocking the parallelised software are so different.
It does not seem likely that it is due to OpenMP overhead, because gprof measures the time spent in each subroutine. The overhead should be included in the gprof profile stats, unless I am misinterpreting the gprof method. The same reasoning applies to memory access. About 10% of the run seems to be spent on Fortran I/O, based on the output of the "top" utility (procs are "idle" or running "system calls"), but again, this should be included in the gprof stats, since the I/O is called from inside the profiled subroutines.
I would be grateful if someone could comment on what might be causing the difference in clocking methods.
gprof attempts to measure the CPU time spent in each reported subroutine. In the call graph profile, it attempts to show the time spent in a function plus the time spent in the functions it calls, provided those are instrumented for gprof. Thus the function calls inserted by OpenMP and the Fortran run-time library functions would show up only as separate entries, if you are lucky.