Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Missed compiler optimization opportunity

jimdempseyatthecove
Honored Contributor III
I posted a thread on the AVX Instruction forum illustrating a missed compiler optimization issue

http://software.intel.com/en-us/forums/showthread.php?t=107112&o=a&s=lr

Jim Dempsey
Georg_Z_Intel
Employee
Hello Jim,

I've taken a look, with the following results:

  1. It's not related to AVX; it occurs with SSE, too. In the following I'll only refer to SSE; for AVX the argument is similar.
  2. When not using IPO, both loops are equal in performance. When turning on IPO, the first loop (the auto-vectorized version) becomes faster. Comparing the assembly of both loops then reveals nearly no difference. So, why the difference in performance? Let's take a closer look at the memory layout...
  3. The first loop profits from the contiguous elements of the three arrays ("a1", "a2" & "a3") - array "a4" is not used at all.
    The reason is that the first loop can pre-fetch and keep array elements on full cache-lines - each array very likely occupies its own cache-lines, which amplifies the effect further.
    The second loop uses an array of structures and needs to load a cache-line with every iteration. You hit exactly the worst case here: your structure is exactly 64 bytes (SSE version) - the size of a cache-line. For AVX it's even two cache-lines. In the SSE case, remove member "a4" and you'll notice a speedup already.
Conclusion:
The effect you're seeing is that IPO brings both loops to the same level of optimization. However, the first loop profits from caching effects and is thus faster.

Edit:
I should also mention that, in total, the number of fetched cache-lines is identical for both loops. However, the timing is different. Processors are quite good at detecting contiguous access to data in memory and pre-fetch the next cache-line before it is actually needed. The load instructions in the loop then don't add latency.
For arrays of structures that's less likely, since the processor has no knowledge of the structure layout. It can only load the data once a load instruction supplies the exact (non-contiguous) address. Inside a loop this adds undesired latency.
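For illustration, here's a minimal sketch (with plain float lanes standing in for F32vec4/F32vec8) of why the structure covers exactly one 64-byte cache-line in the SSE case and two in the AVX case, while the dedicated arrays give the pre-fetcher contiguous streams:

[cpp]#include <cstdio>

// Stand-ins for the "aosoa" structure, using plain float lanes instead of F32vec4/F32vec8.
struct aosoa_sse { float a1[4], a2[4], a3[4], a4[4]; };  // 4 x 16 B = 64 B  -> one cache-line
struct aosoa_avx { float a1[8], a2[8], a3[8], a4[8]; };  // 4 x 32 B = 128 B -> two cache-lines

int main()
{
    std::printf("SSE struct: %zu bytes, AVX struct: %zu bytes (cache-line: 64 bytes)\n",
                sizeof(aosoa_sse), sizeof(aosoa_avx));
    return 0;
}[/cpp]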

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
Georg,

You are not quite right. The reason loop 1 runs faster is that it interleaves its fetches.

[plain]AOS
Block 41:
0x1400014bb  74  vmovups ymm0, ymmword ptr [r9+rcx*4]
0x1400014c1  74  vmovups xmm4, xmmword ptr [r8+rcx*4]
0x1400014c7  74  vmulps ymm3, ymm0, ymm0
0x1400014cb  74  vmulps ymm5, ymm3, ymm2
0x1400014cf  74  vinsertf128 ymm9, ymm4, xmmword ptr [r8+rcx*4+0x10], 0x1
0x1400014d7  74  vmulps ymm3, ymm5, ymm9
0x1400014dc  74  vsubps ymm5, ymm3, ymm0
0x1400014e0  74  vmovups xmm0, xmmword ptr [r10+rcx*4]
0x1400014e6  74  vinsertf128 ymm4, ymm0, xmmword ptr [r10+rcx*4+0x10], 0x1
0x1400014ee  74  vmulps ymm9, ymm1, ymm4
0x1400014f2  74  vaddps ymm0, ymm5, ymm9
0x1400014f7  72  vmovups xmmword ptr [r8+rcx*4], xmm0
0x1400014fd  72  vextractf128 xmmword ptr [r8+rcx*4+0x10], ymm0, 0x1
0x140001505  72  add rcx, 0x8
0x140001509  72  cmp rcx, rdx
0x14000150c  72  jb 0x1400014bb
[/plain]
See how the fetch into ymm0 is interleaved with the fetch into xmm4 _before_ ymm0 is used.
Similarly, other instructions are inserted between the fetch into xmm4 and its later use as ymm4.
The next uses of xmm0 and ymm0 are not interleaved.
And the stores do not matter as much.
The above has four interleaved reads. This avoids 4 pipeline stalls waiting for data.

Loop 2 has

[plain]PAOS
Block 45:
0x14000154d  81  vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
0x140001553  79  inc rcx
0x140001556  81  vmulps ymm2, ymm5, ymm5
0x14000155a  81  vmulps ymm3, ymm2, ymm1
0x14000155e  81  vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
0x140001564  81  vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
0x140001569  81  vsubps ymm2, ymm4, ymm5
0x14000156d  81  vaddps ymm4, ymm2, ymm3
0x140001571  81  vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
0x140001577  79  add rax, 0x80
0x14000157d  79  cmp rcx, 0x800000
0x140001584  79  jl 0x14000154d
[/plain]
Note that (other than the inc rcx) all the reads are immediately followed by a use of the read's target register. This causes a pipeline stall waiting for data.

The way to fix this is to unroll the loop to 2x. Something like the following:

[plain]PAOS (unrolled 2x)
Block 45:
0x14000154d  81  vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
                 vmovups ymm15, ymmword ptr [rax+rdx*1+0x20+0x80]
0x140001553  79  add rcx,2
0x140001556  81  vmulps ymm2, ymm5, ymm5
                 vmulps ymm12, ymm15, ymm15
0x14000155a  81  vmulps ymm3, ymm2, ymm1
                 vmulps ymm13, ymm12, ymm11
0x14000155e  81  vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
                 vmulps ymm14, ymm13, ymmword ptr [rax+rdx*1+0x40+0x80]
0x140001564  81  vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
                 vmulps ymm13, ymm10, ymmword ptr [rax+rdx*1+0x80]
0x140001569  81  vsubps ymm2, ymm4, ymm5
                 vsubps ymm12, ymm14, ymm15
0x14000156d  81  vaddps ymm4, ymm2, ymm3
                 vaddps ymm14, ymm12, ymm13
0x140001571  81  vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
                 vmovups ymmword ptr [rax+rdx*1+0x40+0x80], ymm14
0x140001577  79  add rax, 0x80+0x80
0x14000157d  79  cmp rcx, 0x800000
0x140001584  79  jl 0x14000154d
[/plain]
In the above, for clarity, I +10'd the register numbers for the second section of the unrolled loop.
Also, I showed ...+0x80 (e.g. +0x20+0x80) where the compiler would combine it with the base into a single displacement (0xA0).
Now when you read the code you can see the reads into the ymm registers are fully interleaved.
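In source terms, just as a sketch in raw AVX intrinsics (an __m256-based stand-in for the F32vec8 struct, not the compiler's actual output), the 2x unroll with the loads hoisted would look something like:

[cpp]#include <immintrin.h>

// Stand-in for the F32vec8-based structure from the test program.
struct aosoa { __m256 a1, a2, a3, a4; };

// s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2, unrolled 2x,
// with all loads issued up front so no arithmetic instruction immediately consumes
// a register that was loaded right before it. Nvecs is assumed to be even.
void paos_unrolled2(aosoa* s, int Nvecs, float coeff1, float coeff2)
{
    const __m256 c1 = _mm256_set1_ps(coeff1);
    const __m256 c2 = _mm256_set1_ps(coeff2);
    for (int i = 0; i < Nvecs; i += 2) {
        __m256 a1_0 = s[i].a1,   a2_0 = s[i].a2,   a3_0 = s[i].a3;    // loads for iteration i
        __m256 a1_1 = s[i+1].a1, a2_1 = s[i+1].a2, a3_1 = s[i+1].a3;  // loads for iteration i+1

        __m256 r0 = _mm256_add_ps(
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(_mm256_mul_ps(a2_0, a2_0), c1), a3_0), a2_0),
            _mm256_mul_ps(a1_0, c2));
        __m256 r1 = _mm256_add_ps(
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(_mm256_mul_ps(a2_1, a2_1), c1), a3_1), a2_1),
            _mm256_mul_ps(a1_1, c2));

        s[i].a3   = r0;
        s[i+1].a3 = r1;
    }
}[/cpp]

The register allocation and final schedule are of course up to the compiler; the point is only that the independent loads are available for interleaving.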

The compiler should easily be able to do this, since it can already unroll loops .AND. it is already smart enough to interleave instructions.

I think this case may be a trivial fix.

Jim Dempsey
Georg_Z_Intel
Employee
Hello Jim,

you're assuming that the latency of loads can be amortized by interleaving just a few non-load/store operations. That's not always true. If you're transferring lots of (different) data, which is the case in your example, cache misses are high and the latency is even worse. Interleaving only helps if the data is already in cache. Any access further down the cache hierarchy towards main memory adds another order of magnitude of latency.

In your example you're comparing apples with oranges: consecutive (homogeneous) data in dedicated arrays vs. a single array of structures (AoS).

Let's do the following and compare test/loop 1 with something that has the same memory access pattern:

[bash]F32vec8* b1; // [Nvecs];
F32vec8* b2; // [Nvecs];
F32vec8* b3; // [Nvecs];
F32vec8* b4; // [Nvecs];
...
b1 = new F32vec8[Nvecs];
b2 = new F32vec8[Nvecs];
b3 = new F32vec8[Nvecs];
b4 = new F32vec8[Nvecs];
...
// Added after test 2:

// test 3
for (int i=0; i<Nvecs; i++) {
    b3[i] = b2[i]*b2[i]*coeff1*b3[i] - b2[i] + b1[i]*coeff2;
}
double t3 = omp_get_wtime();
totPAOS2 += t3 - t2;
...[/bash]
You'll notice that test/loop 3 has the same performance as test/loop 1. The access pattern is the same here, but it definitely is not for test/loop 2!

Does this answer your question?

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
>>you're assuming that the latency of loads can be amortized by interleaving just a few non-load/store operations

No Georg,

I am stating (assuming) that the latency of loads can be amortized by interleaving with a load/store that is not dependent on the prior load/store.

Looking at the higher performing AOS code:

[plain]vmovups ymm0, ymmword ptr [r9+rcx*4]
vmovups xmm4, xmmword ptr [r8+rcx*4]
vmulps ymm3, ymm0, ymm0
vmulps ymm5, ymm3, ymm2
vinsertf128 ymm9, ymm4, xmmword ptr [r8+rcx*4+0x10], 0x1
[/plain]
You find a load of cache/mem into ymm0, and ymm0 is not used until after the load of cache/mem into xmm4. The execution of the request to load xmm4 is practically free (it returns while the fetch into xmm4 is still pending). The last instruction may or may not stall, or stalls to a lesser degree, waiting for ymm0, depending on which cache level (if any) holds ymmword ptr [r8+rcx*4]. And the last instruction's memory reference, [r8+rcx*4+0x10], has a 3/4 probability of being in L1. And the pipeline designer may have been smart enough to recognize the vmovups xmm...vinsertf128...xmmword ptr from the same cache line and bypass the load from L1 (saving possibly 3 clock ticks).

The last instruction above uses ymm4 (the result of the load issued by the second instruction) 3 (2+) instruction times after the load request.

This lack of stall (or lesser degree of stall) waiting for fetch to complete is one of the reasons for the performance difference.

A secondary issue, looking at the complete AOS assembly in the post above, is that the AOS is using 3 data streams (off of r8, r9, r10) whereas the PAOS (in this example) is using 1 data stream (off of rax). The combination of interleaved loads/stores .AND. three streams increases the L1 hit ratio. Introducing interleaved loads/stores into the PAOS code would reduce pipeline stalls.

The loop need not be unrolled 2x as in the earlier post. Consider the original and two alternative proposals:

[plain]PAOS
---- original ----
vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
inc rcx
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymmword ptr [rax+rdx*1+0x40]
vmulps ymm3, ymm0, ymmword ptr [rax+rdx*1]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
add rax, 0x80
cmp rcx, 0x800000
jl 0x14000154d

---- proposed 1 ----
vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
vmovups ymm15, ymmword ptr [rax+rdx*1+0x40]   ; load lifted
inc rcx
vmulps ymm2, ymm5, ymm5
vmovups ymm16, ymmword ptr [rax+rdx*1]        ; load lifted
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymm15                      ; was ymmword ptr [rax+rdx*1+0x40]
vmulps ymm3, ymm0, ymm16                      ; was ymmword ptr [rax+rdx*1]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
add rax, 0x80
cmp rcx, 0x800000
jl 0x14000154d

---- proposed 2 ----
vmovups ymm16, ymmword ptr [rax+rdx*1]        ; load lifted
vmovups ymm5, ymmword ptr [rax+rdx*1+0x20]
vmovups ymm15, ymmword ptr [rax+rdx*1+0x40]   ; load lifted
inc rcx
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymm15                      ; was ymmword ptr [rax+rdx*1+0x40]
vmulps ymm3, ymm0, ymm16                      ; was ymmword ptr [rax+rdx*1]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
vmovups ymmword ptr [rax+rdx*1+0x40], ymm4
add rax, 0x80
cmp rcx, 0x800000
jl 0x14000154d
[/plain]
The second one lifts a load further but also results in a sequential access pattern (+0, +0x20, +0x40) which may aid the hardware prefetcher.
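At the source level the same intent can be expressed with the dvec.h class by reading all three members into locals before any of them is consumed (whether the compiler preserves this exact schedule is up to it; this is only a sketch of the intent):

[cpp]#include "dvec.h"   // F32vec8, as used in the test program

struct aosoa { F32vec8 a1, a2, a3, a4; };   // layout from the test program

void paos_lifted(aosoa* s, int Nvecs, float coeff1, float coeff2)
{
    for (int i = 0; i < Nvecs; i++) {
        F32vec8 v1 = s[i].a1;   // all three loads issued back to back,
        F32vec8 v2 = s[i].a2;   // none of them consumed immediately
        F32vec8 v3 = s[i].a3;
        s[i].a3 = v2*v2*coeff1*v3 - v2 + v1*coeff2;
    }
}[/cpp]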

Jim Dempsey
Georg_Z_Intel
Employee
Hello Jim,

you still see test 1 performing differently than test 3? That surprises me, because both the SSE(2) and AVX versions are at the very same level for me.

I've attached your initial example, extended by test 3 (compile with "-DUSE_AVX" & "-xAVX" for AVX, nothing else for SSE2).

I got these results (all 64-bit Linux systems - the only systems I have at hand right now):

Core i7-2600 CPU @ 3.40GHz (AVX):
$ icpc openmp.cpp -ipo -o openmp -openmp -DUSE_AVX -xAVX # AVX
$ ./openmp
1.76201 6.20015 1.74814


$ icpc openmp.cpp -ipo -o openmp -openmp # SSE2
$ ./openmp
1.84652 6.30253 1.8449

Intel Core i7 CPU X 980 @ 3.33GHz
$ icpc openmp.cpp -ipo -o openmp -openmp
$ ./openmp
2.24444 7.25693 2.18517


Of course, all systems have Intel Hyper-Threading, SpeedStep & Intel Turbo Boost disabled to provide comparable results.
Please note that I'm using optimization level 2 and IPO.

Discussing details like the timing of instructions/dependencies/(SW-)pipelining etc. is difficult, because the underlying processors do lots of optimizations that only show up at run-time and even change across generations.
I'd first like to see hard numbers to reproduce. So far I'm seeing the serial version (baseline) at the same level as the revised C++ class library version (test 3). Thus the C++ class library works well with the compiler optimizations.

In case the improvements you described above are more hypothetical, I'd like to ask for an example (e.g. with in-line assembly) we can work with.

Thank you & best regards,

Georg Zitzlsberger

Modified source:
openmp.cpp
jimdempseyatthecove
Honored Contributor III
Georg,

Thanks for the edits.

I receive

3.18165 5.98651 3.29903

Core i7 2600K (Turbo on, HT on), standard clock setting. Unwilling to reboot at this time to run the test w/o Turbo or HT.

Your edits construct a scenario of 4 independent arrays (3 used, as you stated earlier), resulting in 3 input streams. Essentially the same configuration as AOS. Performance should be approximately the same.

My system showed test 3 slightly slower than test 1, yours slightly faster.

In looking at the assembler code we see:
[plain]Block 64:
vmovups ymm5, ymmword ptr [rax+rcx*1]
inc r9
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymmword ptr [rdx]
vmulps ymm3, ymm0, ymmword ptr [rax+r8*1]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
add rax, 0x20
vmovups ymmword ptr [rdx], ymm4
add rdx, 0x20
cmp r9, 0x800000
jl 0x1400016af
[/plain]
The load of ymm5 (line 1) via the rax+rcx stream is used "immediately" thereafter (the inc occurs concurrently in a different path), so a stall could occur there every other iteration. The result of the second memory reference (the rdx stream) is not used until after the fetch is issued for the rax+r8 stream.

This instruction sequence has approximately the same interleaved-fetch characteristics as loop 1 has.

The following loop (hand written) may yield better performance:
[bash]Block 64:
vmovups ymm5, ymmword ptr [rax+rcx*1]
vmovups ymm15, ymmword ptr [rdx]        ; lifted load
vmovups ymm16, ymmword ptr [rax+r8*1]   ; lifted load
inc r9
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymm15                ; was ymmword ptr [rdx]
vmulps ymm3, ymm0, ymm16                ; ymmword ptr [rax+r8*1]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
add rax, 0x20
vmovups ymmword ptr [rdx], ymm4
add rdx, 0x20
cmp r9, 0x800000
jl 0x1400016af
[/bash]
Though placement of the 3rd read may need to be deferred one or two instructions, depending on the architecture.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
Georg,

Here is something new to play with.

I added out-of-line functions for the three loops that can be conditionally compiled in or out (see the #defines near the top).

I also added an out-of-line function with an __asm block that optimizes the fetching for Test2 (the original PAOS). This produces the 4th output number.

As configured, on my system (no out-of-line functions, multi-file IPO, full optimization, AVX, #pragmas disabled) I see:

3.181 5.92313 3.29127 2.97794

So the minor code change I introduced into the (single-stream) PAOS doubled the performance, and it is now faster than the prior AOS and the four-stream variant you produced for PAOS2.

Now then, for some interesting observations:

The complete program is a single file.

Compiling for multi-file IPO produces much faster code than compiling for single-file IPO. Why???
(Note there is only 1 procedure/function.)

Compiling with single-file IPO and with USE_functionCalls causes the dvec.h member functions to not be in-lined. Why?

This file may be a good test program for your optimization team to use to figure out what is going on (and to improve your product). Depending on options (all with high optimization levels) there is a wide variation in the performance of each loop (up to a 6x difference).
[cpp]// Felix.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "dvec.h"
#include "omp.h"
#include <iostream>

#define USE_AVX
#define USE_functionCalls
//#define USE_pragmas
// 3.181 5.92313 3.29127 2.97794

#ifdef USE_AVX
struct aosoa
{
    F32vec8 a1;
    F32vec8 a2;
    F32vec8 a3;
    F32vec8 a4;
};
const int VecWidth = 8;
#else
struct aosoa
{
    F32vec4 a1;
    F32vec4 a2;
    F32vec4 a3;
    F32vec4 a4;
};
const int VecWidth = 4;
#endif

const int N = 64*1024*1024;
const int Nvecs = N / VecWidth;

float* a1; // [N];
float* a2; // [N];
float* a3; // [N];
float* a4; // [N];

#ifdef USE_AVX
F32vec8* b1; // [Nvecs];
F32vec8* b2; // [Nvecs];
F32vec8* b3; // [Nvecs];
F32vec8* b4; // [Nvecs];
#else
F32vec4* b1; // [Nvecs];
F32vec4* b2; // [Nvecs];
F32vec4* b3; // [Nvecs];
F32vec4* b4; // [Nvecs];
#endif

aosoa* s; // [Nvecs];

float coeff1 = 1.2345f;
float coeff2 = .987654321f;

#ifdef USE_functionCalls
void Test1()
{
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
    for (int i=0; i<N; i++) {
        a3[i] = a2[i]*a2[i]*coeff1*a3[i] - a2[i] + coeff2*a1[i];
    }
}

void Test2()
{
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
    for (int i=0; i<Nvecs; i++) {
        s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2;
    }
}

void Test3()
{
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
    for (int i=0; i<Nvecs; i++) {
        b3[i] = b2[i]*b2[i]*coeff1*b3[i] - b2[i] + b1[i]*coeff2;
    }
}
#endif

void PAOShandFix()
{
    __asm {
        sub rsp, 40
;;; for (int i=0; i<Nvecs; i++) {
;;;     s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2;
        vbroadcastss ymm1, DWORD PTR [coeff1]
        vbroadcastss ymm0, DWORD PTR [coeff2]
        mov rdx, QWORD PTR [s]
        xor eax, eax
        vmovups ymm6, YMMWORD PTR [rax+rdx]
        vmovups ymm7, YMMWORD PTR [64+rax+rdx]
B4_2:
        vmovups ymm5, YMMWORD PTR [32+rax+rdx]
        add rcx, 2
        vmulps ymm2, ymm5, ymm5
        vmulps ymm3, ymm2, ymm1
        vmulps ymm4, ymm3, ymm7                  ; was YMMWORD PTR [64+rax+rdx]
        vmovups ymm7, YMMWORD PTR [192+rax+rdx]
        vmulps ymm3, ymm0, ymm6                  ; was YMMWORD PTR [rax+rdx]
        vmovups ymm6, YMMWORD PTR [128+rax+rdx]
        vsubps ymm2, ymm4, ymm5
        vaddps ymm4, ymm2, ymm3
        vmovups YMMWORD PTR [64+rax+rdx], ymm4
        add rax, 128
        cmp rcx, 8388608
        jl B4_2
;;; }
        vzeroupper                               ;86.1
        add rsp, 40                              ;86.1
    } // __asm__
} // PAOShandFix

int _tmain(int argc, _TCHAR* argv[])
{
    a1 = new float[N];
    a2 = new float[N];
    a3 = new float[N];
    a4 = new float[N];
    s = new aosoa[Nvecs];
#ifdef USE_AVX
    b1 = new F32vec8[Nvecs];
    b2 = new F32vec8[Nvecs];
    b3 = new F32vec8[Nvecs];
    b4 = new F32vec8[Nvecs];
#else
    b1 = new F32vec4[Nvecs];
    b2 = new F32vec4[Nvecs];
    b3 = new F32vec4[Nvecs];
    b4 = new F32vec4[Nvecs];
#endif

    // Overwrite junk data from allocation to avoid NaN
    // Not using single constant as the compiler may NOP the test loops
    for(int i=0; i<N; i++) {
        a1[i] = 1.0f / float(i);
        a2[i] = 2.0f / float(i);
        a3[i] = 3.0f / float(i);
        a4[i] = 4.0f / float(i);
    }
    for(int j=0; j<Nvecs; j++) {
        for(int i=0; i<VecWidth; i++)
            s[j].a1[i] = (float(i) + 1.0f) / float(j*VecWidth + i);
    }

    double totAOS = 0.0;
    double totPAOS = 0.0;
    double totPAOS2 = 0.0;
    double totPAOShandFix = 0.0;

    for(int iRep=0; iRep < 50; ++iRep)
    {
        // test 1
        double t0 = omp_get_wtime();
#ifdef USE_functionCalls
        Test1();
#else
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
        for (int i=0; i<N; i++) {
            a3[i] = a2[i]*a2[i]*coeff1*a3[i] - a2[i] + coeff2*a1[i];
        }
#endif
        double t1 = omp_get_wtime();
        totAOS += t1 - t0;

        // test 2
#ifdef USE_functionCalls
        Test2();
#else
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
        for (int i=0; i<Nvecs; i++) {
            s[i].a3 = s[i].a2*s[i].a2*coeff1*s[i].a3 - s[i].a2 + s[i].a1*coeff2;
        }
#endif
        double t2 = omp_get_wtime();
        totPAOS += t2 - t1;

        // test 3 (the apple :-))
#ifdef USE_functionCalls
        Test3();
#else
#ifdef USE_pragmas
#pragma ivdep
#pragma vector always
#endif
        for (int i=0; i<Nvecs; i++) {
            b3[i] = b2[i]*b2[i]*coeff1*b3[i] - b2[i] + b1[i]*coeff2;
        }
#endif
        double t3 = omp_get_wtime();
        totPAOS2 += t3 - t2;

        PAOShandFix();
        double t4 = omp_get_wtime();
        totPAOShandFix += t4 - t3;
    }

    std::cout << totAOS << " " << totPAOS << " " << totPAOS2 << " " << totPAOShandFix << std::endl;
    return 0;
}[/cpp]

Jim Dempsey


Georg_Z_Intel
Employee
Hello Jim,

why did you increment by 2?

[plain]B4_2:
vmovups ymm5, YMMWORD PTR [32+rax+rdx]
add rcx, 2                               <= HERE; should be "incq rcx"?!
vmulps ymm2, ymm5, ymm5
vmulps ymm3, ymm2, ymm1
vmulps ymm4, ymm3, ymm7                  ; was YMMWORD PTR [64+rax+rdx]
vmovups ymm7, YMMWORD PTR [192+rax+rdx]
vmulps ymm3, ymm0, ymm6                  ; was YMMWORD PTR [rax+rdx]
vmovups ymm6, YMMWORD PTR [128+rax+rdx]
vsubps ymm2, ymm4, ymm5
vaddps ymm4, ymm2, ymm3
vmovups YMMWORD PTR [64+rax+rdx], ymm4
add rax, 128
cmp rcx, 8388608
jl B4_2
[/plain]
The original only increments by 1. So you'd have computed only half of the values...

Regarding IPO:
Even though (multi-file) IPO shouldn't make a difference compared to the single-file version, there are still different optimization paths executed in the compiler, thus resulting in different binaries. In theory both variants should show identical results, but reality is different. It's the same as with optimization level 2 vs. 3 - sometimes 2 is better than 3 and sometimes vice versa. It always depends on the use-case, and it is nothing we could ultimately guarantee with small changes/fixes to the compiler.

Regarding in-lining:
The compiler might not in-line functions if it does not see any benefit. You can override that with various options (e.g. /Qinline-forceinline or /Qinline-factor), or you can use the pragmas [my personal favorites, btw.]:
#pragma [force]inline
#pragma noinline
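
For example (a tiny sketch; the function names are made up), each pragma applies to the call(s) in the statement that follows it:

[cpp]static float square(float x) { return x * x; }   // made-up helper

float demo(float v)
{
#pragma forceinline     // request in-lining of the call below even if the heuristic declines
    float a = square(v);

#pragma noinline        // keep this call out of line, e.g. to preserve it for debugging
    float b = square(v);

    return a + b;
}[/cpp]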

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
>>why did you increment by 2?

Oops - error on my part, good catch.

I tried the 2x unroll and other variations and missed resetting the "incq" (new instruction :). With the corrected code, the single-stream pre-fetch is not showing a significant difference. So this confirms that having multiple streams improves the performance.

Thank you for taking the time to address this. The insight on the use of multi-stream input will be a valuable lesson in future assessments of performance issues (at least for Sandy Bridge).

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
RE: no/mf/sf IPO and inlining

>> different optimization paths executed in the compiler

Granted; however, it appears that when a given optimization is added to mf IPO, it is not determined whether that particular optimization would also be suitable for sf IPO and/or no IPO - most likely because this would require time for regression testing.

I think it is a false assumption on Intel's part that mf IPO is the assumed preference for optimization.

Consider the situation of a software provider. Examples:

Intel with MKL, IPP, etc...
A central IT department providing libraries
An Independent Software Vendor providing libraries

Now consider a bug report coming in whereby the problem report contains the source files exhibiting the error, and the error interacts with the provider's libraries. The provider's support person will want source-level debugging for their libraries .AND. a compilation equivalent to that of the user site, which does not have the source to the library. The support person does NOT want mf IPO to include files from the libraries to which the user does not have source access.

I may have a configuration issue here, but on my site, when I compile a test program using mf IPO with a solution NOT containing the projects for the "private" library, VS (or Intel C++) is able to locate the project and sources of the "private" library and include these in the mf IPO. I do not want to be placed in the position of requiring two test machines and copying object files around. Yet I want (require) full source-level debugging.

Jim
Georg_Z_Intel
Employee
Hello Jim,

I think there is a basic misunderstanding regarding multi-file IPO:
IPO does not in-line more code than what the linker would collect anyway. IPO only works on what's already been put together by the linker.

So, if you see anything being added to your application binary that should not be there, it's a matter of setting up the linker. Same, vice versa.

And anything that was compiled w/o IPO won't be subject to IPO optimization. This is especially true for 3rd party libraries (which you don't have the source code for and hence did not compile with IPO).

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
Georg,

Are you saying single-file IPO is done post-linker?

The problem is when the party behind the 3rd party library is yourself, and when you want to use mf-IPO yet keep the library out of the IPO consideration - because you are trying to compile with the same optimization selections that the user has, yet have full debug symbols for the user app + your library (not co-mingled by the optimizations).

Jim
Georg_Z_Intel
Employee
Hello Jim,

not single-file IPO; only multi-file IPO is done during the linker stage.

Did you try "/Qipo-c" (or "-ipo-c")? This allows you to optimize a library with IPO but drops all the intermediate representation (IR) afterwards. The final (single & optimized) object file can easily be put into a library using the usual tools. As it won't contain IR, it won't be part of the IPO of your main program, yet it still takes advantage of (partial*) IPO.
Of course, if you use "/Qipo" (or "-ipo") for every module/library/component then you also enable all of them for IPO. There are lots of other options that allow you to control how IPO should be applied.

*: Partial, because the inner core of the library will be optimized by IPO but not the interfaces.
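
For example, on the Linux command line (hypothetical file and library names) the flow could look like this:

[bash]# Build the "private" library with IPO applied internally; -ipo-c emits a single optimized
# object (ipo_out.o by default) with the IR dropped, so it cannot be pulled into the app's IPO.
icpc -O2 -ipo-c mylib1.cpp mylib2.cpp
ar rcs libmylib.a ipo_out.o

# Build the application with regular multi-file IPO; the library takes part only as plain object code.
icpc -O2 -ipo app.cpp -L. -lmylib -o app
[/bash]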

Best regards,

Georg Zitzlsberger
jimdempseyatthecove
Honored Contributor III
Georg,

I did not try /Qipo-c (I didn't know it was available). The pull-down in Visual Studio does not include this option in the project properties. This should be considered something to add (although the -c part may be a librarian function). My preference would be to include this choice on the:

(Project) | Optimization [Intel C++] | Interprocedural Optimization

property page (to keep it in one place). Can you pass this request on to the development team?

Also consider including it on

(file.cpp) | Optimization [Intel C++] | Interprocedural Optimization

This would permit /Qipo-c to incorporate multi-file IPO into file.cpp while excluding the resultant output from being incorporated into callers of file.cpp's .obj functions (procedures).

Thanks for your tip.

Jim Dempsey