- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am profiling a code with gprof, and gprof seems to be reporting things incorrectly. The code is compiled like this:
ifort -g -pg -fno-omit-frame-pointer -O3 -O3 -align -traceback -sox -warn all
I am on x86_64 X 4 cores @ 2.67GHz RHEL 5, running an oldish kernel because of the lustre fs mounted on the machine and various bugs, stability issues, etc require the use of a specific kernel version.
In the code I read in about 70GB worth of data then process it. While, there's no doubt the IO takes a long time, I have been increasing the number of iterations in the inner most computational kernel by MANY orders of magnitude to make sure that the time spent here is not negligable compared to the rest of the program (i.e. the IO). As I increase the number of iterations program execution time takes longer. I look at my watch and it goes from 5 minutes to ten minutes to twenty. The profiling results do change though. gprof reports the same amount of time spent in _MAIN, and reports 0 time spent in the computational kernel even though I have added many many many many more iterations here. Does anyone have any ideas as to what is going on here? I could see that if the computational kernel is very quick there might be some underflow issues as gprof tries to add many small numbers through each loop, resulting in zero execution time for the kernel, but _MAIN should reflect the increased execution time I see on the wall clock right? If anybody has any insight into the source of this problem please let me know.
ifort -g -pg -fno-omit-frame-pointer -O3 -O3 -align -traceback -sox -warn all
I am on x86_64 X 4 cores @ 2.67GHz RHEL 5, running an oldish kernel because of the lustre fs mounted on the machine and various bugs, stability issues, etc require the use of a specific kernel version.
In the code I read in about 70GB worth of data then process it. While, there's no doubt the IO takes a long time, I have been increasing the number of iterations in the inner most computational kernel by MANY orders of magnitude to make sure that the time spent here is not negligable compared to the rest of the program (i.e. the IO). As I increase the number of iterations program execution time takes longer. I look at my watch and it goes from 5 minutes to ten minutes to twenty. The profiling results do change though. gprof reports the same amount of time spent in _MAIN, and reports 0 time spent in the computational kernel even though I have added many many many many more iterations here. Does anyone have any ideas as to what is going on here? I could see that if the computational kernel is very quick there might be some underflow issues as gprof tries to add many small numbers through each loop, resulting in zero execution time for the kernel, but _MAIN should reflect the increased execution time I see on the wall clock right? If anybody has any insight into the source of this problem please let me know.
1 Solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
it dumps a text file with extension .dump after the run. And an XML file is created that can be viewed by an included java viewer. File names are loop_prof_.xml and .dump
loopprofileviewer.csh.xml
assuming you have java and X11 forwarding set up.
ron
loopprofileviewer.csh
assuming you have java and X11 forwarding set up.
ron
Link Copied
11 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you wished to capture time spent in I/O or other library functions, you would likely link with -static. I'm not sure if that may be what you mean by a problem. You would still see only cpu time, not idle wait time.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
HMMMM that may be the source of the issue. I'll report back. thanks for noticing that. What about intrinsic functions like SUM? Without -static will they be profiled correctly?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The sum intrinsic is expanded in line, so will be included in the total for the subroutine where it appears, not affected by -static link. Intrinsics from the Intel compiler library, like functions from glibc et al, will likely not be itemized unless the static library version is linked.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The -static flag seems to have no effect on the results I am seeing. I suspect perhaps some serious underflow on gprof's part (each of the routines I'm interested in must run quite quickly, but I call them each many many times). But the strange thing is that the amount of reported cumulative time spent in MAIN__ is so much less time than it actually takes to run, and my sytem load is really low other than this benchmark/test. I have also included my own instrumentation via the CPU_TIME intrinsic which also seems to be giving suspect results. I'm giving up on this issue for the time being because i have more urgent issues to attend to, but am frustrated by the non-repeatability of my testing and the seemingly strange results.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Write a trivial program that computes for about 10 seconds.
Then run the gprof on it.
If you get the same goofy results then assume that there is an issue with gprof and your installed libraries.
If this works as expected....
Then is your program with the gprof problems multi-threaded?
If so, then is something else on the system accounting for the compute time (and gprof doesn't think it is part of your program so it does not include the sample data).
Browse www.amd.com to see if they have a version of CodeAnalyst. CodeAnalyst will run timer based profiling (at least on Windows) for Intel processors (it won't run event based profiling). The purpose of the test is to see both exhibit the same problem.
Jim Dempsey
Then run the gprof on it.
If you get the same goofy results then assume that there is an issue with gprof and your installed libraries.
If this works as expected....
Then is your program with the gprof problems multi-threaded?
If so, then is something else on the system accounting for the compute time (and gprof doesn't think it is part of your program so it does not include the sample data).
Browse www.amd.com to see if they have a version of CodeAnalyst. CodeAnalyst will run timer based profiling (at least on Windows) for Intel processors (it won't run event based profiling). The purpose of the test is to see both exhibit the same problem.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
There was a range of binutils versions for which gprof was broken; I thought that was corrected a while ago as far as RHEL5 (and 6, which is all I have) were concerned. It's sometimes worth checking out current binutils.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This could be the issue. I'm on a strange/old kernel b/c of lustrefs.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The 12.0 compiler has -profile-functions and -profile-loops options to consider as an alternative. Compile with -g to get symbols and disable inlining with -fno-inline-functions. But like gprof, I doubt the IO wait time will be capture - but I haven't tried this.
ron
ron
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
How do you see the profiling output? Do you need Vtune?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
it dumps a text file with extension .dump after the run. And an XML file is created that can be viewed by an included java viewer. File names are loop_prof_.xml and .dump
loopprofileviewer.csh.xml
assuming you have java and X11 forwarding set up.
ron
loopprofileviewer.csh
assuming you have java and X11 forwarding set up.
ron
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Epic. Thanks Ron.

Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page