I compiled the same program with two different sets of compiler options:
1. icc -qopt-report -g -O2 MD.c util.c control.c coord.h -c -lm
2. icc -qopt-report -g -O2 -ipo MD.c util.c control.c coord.h -c -lm //add "-ipo"
Then I compared each corresponding .optrpt file from the two builds. The reports are identical except that the second build's report contains
-inline-max-per-routine: disabled
-inline-max-per-compile: disabled
while the first build's report contains
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000
From this I expected the two builds to perform the same. But the amazing result is that the second is three times faster than the first!
So what is the reason? Can anyone help me explain it?
- Tags:
- CC++
- Development Tools
- General Support
- Intel® C++ Compiler
- Intel® Parallel Studio XE
- Intel® System Studio
- Optimization
- Parallel Computing
- Vectorization
Hi,
Could you provide the ICC compiler version, the OS, and a sample test case that reproduces this behavior?
--Rahul
Hi,
A quick reminder to provide a sample test case.
--Rahul
>>But the amazing result is that the second is three times faster than the first!
If I were to make a guess....
The first compilation was "inline everything that can be inlined".
The second compilation placed upper limits on the degree of inlining.
New programmers, on discovering that inlining helps in one case, often naively assume that inlining to the max must be better.
There are a few issues with over-aggressive inlining (and loop unrolling):
1) The level 1 instruction cache has a limited size. A loop with several calls to the same function can, once those calls are inlined, spill out of the L1 instruction cache, whereas the same loop with the calls left out of line can produce a loop plus function that fits within it. In this example, the non-inlined version will run faster than the inlined version.
2) Overuse of inlining can at times result in over-subscription of the available registers, forcing values to be spilled to memory.
You often need to be more judicious (less aggressive) in where you perform inlining and/or how/where you perform ipo.
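As a minimal sketch of what more selective inlining can look like in source code: the example below uses the GCC-style `noinline` and `always_inline` attributes, on the assumption that the compiler in use accepts them (icc on Linux generally does). The function names `pair_energy`, `square`, and `total_energy` are hypothetical and not taken from MD.c; they only illustrate keeping a larger routine out of line while forcing a tiny helper inline.

```c
#include <assert.h>
#include <stddef.h>

/* GCC-style attributes; assumed to be accepted by the compiler in use. */
#define NOINLINE    __attribute__((noinline))
#define FORCEINLINE static inline __attribute__((always_inline))

/* Hypothetical larger helper: kept out of line so that a loop calling it
 * repeatedly stays small and shares a single copy of this code. */
NOINLINE double pair_energy(double r2)
{
    assert(r2 > 0.0);                      /* squared distance must be positive */
    double inv_r2 = 1.0 / r2;
    double inv_r6 = inv_r2 * inv_r2 * inv_r2;
    return 4.0 * (inv_r6 * inv_r6 - inv_r6);   /* Lennard-Jones form, reduced units */
}

/* Tiny helper: the call overhead would dominate, so force it inline. */
FORCEINLINE double square(double x)
{
    return x * x;
}

/* Sums the pair energy over all distinct pairs of 1-D coordinates. */
double total_energy(const double *x, size_t n)
{
    double e = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            e += pair_energy(square(x[i] - x[j]));
    return e;
}
```

Whether either attribute actually helps depends on the code and the target CPU, so it is worth timing both variants rather than assuming one is faster.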
Jim Dempsey
Hi Lei,
Kindly confirm whether your query is resolved; otherwise, please provide a sample test case so that we can get back to you with the actual explanation for this behavior.
--Rahul
We are closing this thread. Feel free to post a new question if your issue persists.
--Rahul