Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Why are the -qopt-report outputs of two builds the same when their performance is very different?

sun__lei
Beginner
892 Views

I compiled the same program with two different sets of compiler options:

1. icc -qopt-report -g -O2 MD.c util.c control.c coord.h -c -lm

2. icc -qopt-report -g -O2 -ipo MD.c util.c control.c coord.h -c -lm //adds "-ipo"

Then I compared each pair of corresponding .optrpt files from the two builds. Their contents are identical, except that the second build's reports show

-inline-max-per-routine: disabled

-inline-max-per-compile: disabled

while the first build's reports show

-inline-max-per-routine: 10000

-inline-max-per-compile: 500000

Based on the reports, I expected the two builds to perform the same. Surprisingly, the second build runs about three times faster than the first!

What is the reason? Can anyone help explain it?

5 Replies
RahulV_intel
Moderator

Hi,

Could you provide the ICC compiler version, the OS, and a sample test case that reproduces this behavior?

--Rahul

RahulV_intel
Moderator

Hi,

A quick reminder to provide a sample test case.

--Rahul

jimdempseyatthecove
Honored Contributor III

>>But the amazing result is that the second is three times speed up than the first!

If I were to make a guess....

The first compilation was "inline everything that can be inlined".

The second compilation placed upper limits on the degree of inlining.

New programmers who discover that inlining helps in one case often naively assume that inlining as much as possible must be even better.

There are a few issues with overly aggressive inlining (and loop unrolling):

1) The level 1 instruction cache has a limited size. A loop with several calls to the same function can, once those calls are inlined, grow into a loop that spills out of the L1 instruction cache, whereas the same loop with the calls left out of line can produce a loop plus function that fits within it. In this example, the non-inlined version will run faster than the inlined one.
2) Overuse of inlining can at times oversubscribe the available registers, which forces values to be spilled to memory.

You often need to be more judicious (less aggressive) in where you perform inlining and/or how/where you perform ipo.
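As a minimal sketch of what being selective can look like in C source: the function names and the energy formula below are hypothetical and are not taken from the poster's MD.c. The `__attribute__((noinline))` spelling is the GCC-style attribute, which is assumed here to be accepted by icc on Linux. The small helper is left to the compiler's own inlining heuristics, while the larger routine is kept as a single out-of-line copy.

```c
/* Hypothetical small, hot helper: cheap to inline, so the decision
 * is left to the compiler's heuristics. */
static inline double square(double x)
{
    return x * x;
}

/* Hypothetical larger routine. Marking it noinline keeps one shared copy
 * of its body instead of duplicating it at every call site.
 * GCC-style attribute, assumed to be accepted by icc on Linux. */
__attribute__((noinline))
static double pair_energy(double r2)
{
    double inv_r2 = 1.0 / r2;
    double inv_r6 = inv_r2 * inv_r2 * inv_r2;
    return 4.0 * (inv_r6 * inv_r6 - inv_r6);
}

/* Sums the pair energy over n distances. */
double total_energy(const double *dist, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += pair_energy(square(dist[i]));
    }
    return sum;
}
```

Whether a given function is better inlined or kept out of line depends on the code and the target processor, so measure both variants before settling on one.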

Jim Dempsey

RahulV_intel
Moderator

Hi Lei,

Kindly confirm whether your query is resolved. If it is not, please provide a sample test case so that we can investigate and explain this behavior.

--Rahul

RahulV_intel
Moderator

We are closing this thread. Feel free to post a new question if your issue persists.

--Rahul
