- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
I am profiling a code with gprof, and gprof seems to be reporting things incorrectly. The code is compiled like this:
ifort -g -pg -fno-omit-frame-pointer -O3 -O3 -align -traceback -sox -warn all
I am on x86_64 X 4 cores @ 2.67GHz RHEL 5, running an oldish kernel because of the lustre fs mounted on the machine and various bugs, stability issues, etc require the use of a specific kernel version.
In the code I read in about 70GB worth of data then process it. While, there's no doubt the IO takes a long time, I have been increasing the number of iterations in the inner most computational kernel by MANY orders of magnitude to make sure that the time spent here is not negligable compared to the rest of the program (i.e. the IO). As I increase the number of iterations program execution time takes longer. I look at my watch and it goes from 5 minutes to ten minutes to twenty. The profiling results do change though. gprof reports the same amount of time spent in _MAIN, and reports 0 time spent in the computational kernel even though I have added many many many many more iterations here. Does anyone have any ideas as to what is going on here? I could see that if the computational kernel is very quick there might be some underflow issues as gprof tries to add many small numbers through each loop, resulting in zero execution time for the kernel, but _MAIN should reflect the increased execution time I see on the wall clock right? If anybody has any insight into the source of this problem please let me know.
ifort -g -pg -fno-omit-frame-pointer -O3 -O3 -align -traceback -sox -warn all
I am on x86_64 X 4 cores @ 2.67GHz RHEL 5, running an oldish kernel because of the lustre fs mounted on the machine and various bugs, stability issues, etc require the use of a specific kernel version.
In the code I read in about 70GB worth of data then process it. While, there's no doubt the IO takes a long time, I have been increasing the number of iterations in the inner most computational kernel by MANY orders of magnitude to make sure that the time spent here is not negligable compared to the rest of the program (i.e. the IO). As I increase the number of iterations program execution time takes longer. I look at my watch and it goes from 5 minutes to ten minutes to twenty. The profiling results do change though. gprof reports the same amount of time spent in _MAIN, and reports 0 time spent in the computational kernel even though I have added many many many many more iterations here. Does anyone have any ideas as to what is going on here? I could see that if the computational kernel is very quick there might be some underflow issues as gprof tries to add many small numbers through each loop, resulting in zero execution time for the kernel, but _MAIN should reflect the increased execution time I see on the wall clock right? If anybody has any insight into the source of this problem please let me know.
1 解決策
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
it dumps a text file with extension .dump after the run. And an XML file is created that can be viewed by an included java viewer. File names are loop_prof_.xml and .dump
loopprofileviewer.csh.xml
assuming you have java and X11 forwarding set up.
ron
loopprofileviewer.csh
assuming you have java and X11 forwarding set up.
ron
コピーされたリンク
11 返答(返信)
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
If you wished to capture time spent in I/O or other library functions, you would likely link with -static. I'm not sure if that may be what you mean by a problem. You would still see only cpu time, not idle wait time.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
HMMMM that may be the source of the issue. I'll report back. thanks for noticing that. What about intrinsic functions like SUM? Without -static will they be profiled correctly?
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
The sum intrinsic is expanded in line, so will be included in the total for the subroutine where it appears, not affected by -static link. Intrinsics from the Intel compiler library, like functions from glibc et al, will likely not be itemized unless the static library version is linked.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
The -static flag seems to have no effect on the results I am seeing. I suspect perhaps some serious underflow on gprof's part (each of the routines I'm interested in must run quite quickly, but I call them each many many times). But the strange thing is that the amount of reported cumulative time spent in MAIN__ is so much less time than it actually takes to run, and my sytem load is really low other than this benchmark/test. I have also included my own instrumentation via the CPU_TIME intrinsic which also seems to be giving suspect results. I'm giving up on this issue for the time being because i have more urgent issues to attend to, but am frustrated by the non-repeatability of my testing and the seemingly strange results.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
Write a trivial program that computes for about 10 seconds.
Then run the gprof on it.
If you get the same goofy results then assume that there is an issue with gprof and your installed libraries.
If this works as expected....
Then is your program with the gprof problems multi-threaded?
If so, then is something else on the system accounting for the compute time (and gprof doesn't think it is part of your program so it does not include the sample data).
Browse www.amd.com to see if they have a version of CodeAnalyst. CodeAnalyst will run timer based profiling (at least on Windows) for Intel processors (it won't run event based profiling). The purpose of the test is to see both exhibit the same problem.
Jim Dempsey
Then run the gprof on it.
If you get the same goofy results then assume that there is an issue with gprof and your installed libraries.
If this works as expected....
Then is your program with the gprof problems multi-threaded?
If so, then is something else on the system accounting for the compute time (and gprof doesn't think it is part of your program so it does not include the sample data).
Browse www.amd.com to see if they have a version of CodeAnalyst. CodeAnalyst will run timer based profiling (at least on Windows) for Intel processors (it won't run event based profiling). The purpose of the test is to see both exhibit the same problem.
Jim Dempsey
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
There was a range of binutils versions for which gprof was broken; I thought that was corrected a while ago as far as RHEL5 (and 6, which is all I have) were concerned. It's sometimes worth checking out current binutils.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
This could be the issue. I'm on a strange/old kernel b/c of lustrefs.
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
The 12.0 compiler has -profile-functions and -profile-loops options to consider as an alternative. Compile with -g to get symbols and disable inlining with -fno-inline-functions. But like gprof, I doubt the IO wait time will be capture - but I haven't tried this.
ron
ron
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
How do you see the profiling output? Do you need Vtune?
- 新着としてマーク
- ブックマーク
- 購読
- ミュート
- RSS フィードを購読する
- ハイライト
- 印刷
- 不適切なコンテンツを報告
it dumps a text file with extension .dump after the run. And an XML file is created that can be viewed by an included java viewer. File names are loop_prof_.xml and .dump
loopprofileviewer.csh.xml
assuming you have java and X11 forwarding set up.
ron
loopprofileviewer.csh
assuming you have java and X11 forwarding set up.
ron
