I have a very simple code that I use as a test input for a binary analysis tool. The code performs a naive matrix multiplication. I am compiling this code on an Intel Xeon E5-2690 machine using the options -O3 -g -xHost.
The Intel compiler 13.x with the above options would perform loop interchange, moving the recurrence from the innermost loop to an outer loop. Version 14.0.0 of the compiler does not perform this transformation. Both versions unroll one of the loops 16 times, filling up 4 AVX vectors. However, version 14.0 also generates many more address arithmetic instructions in the innermost loop. The end result is that the code produced by version 14.0 takes 50% to 3x longer to execute for matrix sizes >= 40.
There is no value in this naive matrix multiply code, but I am trying to understand what changed with the new compiler that it failed to interchange the loops, and I wonder if this change can possibly affect other codes whose performance is actually relevant. Are there any command line flags that would enable the previous behavior?
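For readers without the original source: the thread never shows matmul.c, so the following is only a minimal sketch of the transformation being discussed, assuming square row-major matrices stored in flat arrays. The function names `matmul_ijk` and `matmul_ikj` are hypothetical. In the first version the innermost loop carries the recurrence on a single element of `c`; in the interchanged version each innermost iteration updates a different element, so the recurrence sits on the outer `k` loop.

```c
#include <assert.h>
#include <stddef.h>

/* Naive i-j-k order (hypothetical stand-in for the original code).
 * The innermost k loop accumulates into the same c[i*n + j] on every
 * iteration (a recurrence), and b is walked with a stride of n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Interchanged i-k-j order. The innermost j loop touches a different
 * c element on each iteration, and both b and c are read contiguously. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k]; /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both versions add the products for a given element of `c` in the same `k` order, so they compute the same result; only the memory access pattern and the position of the recurrence differ.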
Thanks, Gabriel
PS: I am not allowed to attach files. Pasting the code in the message body, or including a link to pastebin triggers the spam filter. What's the appropriate way to include sample code?
I'm trying to reproduce this. It would help if you could also paste screenshots or the assembly code showing the differences between the two versions, which would enable more people to help you.
I am comparing the performance of two binaries compiled with
- icc (ICC) 13.1.3 20130607
- icc (ICC) 14.0.0 20130728
I am attaching the full objdump of the two binaries. The 'compute' routine has been inlined into 'main'.
The four-level loop nest in the compute routine maps to the following object code loop structure. All addresses are in hex. The end address is the address of the instruction immediately after the loop branch.
Binary produced with icc 13.1.3, flags -O3 -g -xHost:
- L,Lev1: 400e29 : 401050
- L,Lev2: 400e84 : 40102a
- L,Lev3: 400e9c : 401013
- L,Lev4: 400ef2 : 400f0c
- L,Lev4: 400f1e : 400fae (X)
- L,Lev4: 400fdc : 400ff6
Binary produced by icc 14.0.0, flags -O3 -g -xHost:
- L,Lev1: 400f9f : 401295
- L,Lev2: 400ffd : 401272
- L,Lev3: 40100a : 40125a
- L,Lev4: 40106b : 401088
- L,Lev4: 4010e5 : 4011e1 (X)
- L,Lev4: 40122c : 401249
The loops marked with (X) are the unrolled loops. The other level 4 loops handle the odd iterations. I am inlining in this post just the body of the unrolled level 4 loops. Full disassembly is attached.
[plain]
Binary mm_i13:
400f1e: c4 c1 79 10 0c d3 vmovupd (%r11,%rdx,8),%xmm1
400f24: c4 c1 79 10 6c d3 20 vmovupd 0x20(%r11,%rdx,8),%xmm5
400f2b: c4 41 79 10 4c d3 40 vmovupd 0x40(%r11,%rdx,8),%xmm9
400f32: c4 41 79 10 6c d3 60 vmovupd 0x60(%r11,%rdx,8),%xmm13
400f39: c4 c3 55 18 74 d3 30 vinsertf128 $0x1,0x30(%r11,%rdx,8),%ymm5,%ymm6
400f40: 01
400f41: c4 c3 75 18 54 d3 10 vinsertf128 $0x1,0x10(%r11,%rdx,8),%ymm1,%ymm2
400f48: 01
400f49: c4 43 35 18 54 d3 50 vinsertf128 $0x1,0x50(%r11,%rdx,8),%ymm9,%ymm10
400f50: 01
400f51: c4 43 15 18 74 d3 70 vinsertf128 $0x1,0x70(%r11,%rdx,8),%ymm13,%ymm14
400f58: 01
400f59: c5 fd 59 da vmulpd %ymm2,%ymm0,%ymm3
400f5d: c5 fd 59 fe vmulpd %ymm6,%ymm0,%ymm7
400f61: c4 41 7d 59 da vmulpd %ymm10,%ymm0,%ymm11
400f66: c4 41 7d 59 fe vmulpd %ymm14,%ymm0,%ymm15
400f6b: c4 c1 65 58 24 d7 vaddpd (%r15,%rdx,8),%ymm3,%ymm4
400f71: c4 41 45 58 44 d7 20 vaddpd 0x20(%r15,%rdx,8),%ymm7,%ymm8
400f78: c4 41 25 58 64 d7 40 vaddpd 0x40(%r15,%rdx,8),%ymm11,%ymm12
400f7f: c4 c1 05 58 4c d7 60 vaddpd 0x60(%r15,%rdx,8),%ymm15,%ymm1
400f86: c4 c1 7d 11 24 d7 vmovupd %ymm4,(%r15,%rdx,8)
400f8c: c4 41 7d 11 44 d7 20 vmovupd %ymm8,0x20(%r15,%rdx,8)
400f93: c4 41 7d 11 64 d7 40 vmovupd %ymm12,0x40(%r15,%rdx,8)
400f9a: c4 c1 7d 11 4c d7 60 vmovupd %ymm1,0x60(%r15,%rdx,8)
matmul.c:16
400fa1: 48 83 c2 10 add $0x10,%rdx
400fa5: 49 3b d4 cmp %r12,%rdx
400fa8: 0f 82 70 ff ff ff jb 400f1e <main+0x57e>
[/plain]
[plain]
Binary mm_i14:
4010e5: 4d 63 d2 movslq %r10d,%r10
matmul.c:17
4010e8: 41 83 c7 10 add $0x10,%r15d
matmul.c:18
4010ec: 4e 8d 2c d3 lea (%rbx,%r10,8),%r13
4010f0: c4 c1 7b 10 65 00 vmovsd 0x0(%r13),%xmm4
4010f6: c4 c1 59 16 74 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm4,%xmm6
4010fd: 4d 8d 6c 4d 00 lea 0x0(%r13,%rcx,2),%r13
401102: c4 c1 7b 10 6d 00 vmovsd 0x0(%r13),%xmm5
401108: c4 c1 51 16 7c 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm5,%xmm7
40110f: 46 8d 2c 17 lea (%rdi,%r10,1),%r13d
401113: 4d 63 ed movslq %r13d,%r13
401116: c4 63 4d 18 c7 01 vinsertf128 $0x1,%xmm7,%ymm6,%ymm8
40111c: c4 21 3d 59 0c f0 vmulpd (%rax,%r14,8),%ymm8,%ymm9
401122: 4e 8d 2c ea lea (%rdx,%r13,8),%r13
401126: c4 41 7b 10 55 00 vmovsd 0x0(%r13),%xmm10
40112c: c4 41 29 16 64 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm10,%xmm12
401133: c4 c1 65 58 d9 vaddpd %ymm9,%ymm3,%ymm3
401138: 4d 8d 6c 4d 00 lea 0x0(%r13,%rcx,2),%r13
40113d: c4 41 7b 10 5d 00 vmovsd 0x0(%r13),%xmm11
401143: c4 41 21 16 6c 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm11,%xmm13
40114a: 47 8d 2c 10 lea (%r8,%r10,1),%r13d
40114e: 4d 63 ed movslq %r13d,%r13
401151: c4 43 1d 18 f5 01 vinsertf128 $0x1,%xmm13,%ymm12,%ymm14
401157: c4 21 0d 59 7c f0 20 vmulpd 0x20(%rax,%r14,8),%ymm14,%ymm15
40115e: 4e 8d 2c ea lea (%rdx,%r13,8),%r13
401162: c4 c1 7b 10 65 00 vmovsd 0x0(%r13),%xmm4
401168: c4 c1 59 16 74 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm4,%xmm6
40116f: c5 85 58 d2 vaddpd %ymm2,%ymm15,%ymm2
401173: 4d 8d 6c 4d 00 lea 0x0(%r13,%rcx,2),%r13
401178: c4 c1 7b 10 6d 00 vmovsd 0x0(%r13),%xmm5
40117e: c4 c1 51 16 7c 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm5,%xmm7
401185: 47 8d 2c 11 lea (%r9,%r10,1),%r13d
matmul.c:17
401189: 45 03 d4 add %r12d,%r10d
matmul.c:18
40118c: 4d 63 ed movslq %r13d,%r13
40118f: c4 63 4d 18 c7 01 vinsertf128 $0x1,%xmm7,%ymm6,%ymm8
401195: c4 21 3d 59 4c f0 40 vmulpd 0x40(%rax,%r14,8),%ymm8,%ymm9
40119c: 4e 8d 2c ea lea (%rdx,%r13,8),%r13
4011a0: c4 41 7b 10 55 00 vmovsd 0x0(%r13),%xmm10
4011a6: c4 41 29 16 64 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm10,%xmm12
4011ad: c5 b5 58 c9 vaddpd %ymm1,%ymm9,%ymm1
4011b1: 4d 8d 6c 4d 00 lea 0x0(%r13,%rcx,2),%r13
4011b6: c4 41 7b 10 5d 00 vmovsd 0x0(%r13),%xmm11
4011bc: c4 41 21 16 6c 0d 00 vmovhpd 0x0(%r13,%rcx,1),%xmm11,%xmm13
4011c3: c4 43 1d 18 f5 01 vinsertf128 $0x1,%xmm13,%ymm12,%ymm14
4011c9: c4 21 0d 59 7c f0 60 vmulpd 0x60(%rax,%r14,8),%ymm14,%ymm15
matmul.c:17
4011d0: 49 83 c6 10 add $0x10,%r14
matmul.c:18
4011d4: c5 85 58 c0 vaddpd %ymm0,%ymm15,%ymm0
matmul.c:17
4011d8: 45 3b fb cmp %r11d,%r15d
4011db: 0f 82 04 ff ff ff jb 4010e5 <main+0x6f5>
[/plain]
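The loop structure listed above (one unrolled level 4 loop plus extra level 4 loops for the leftover iterations) can be illustrated roughly in C. This is a sketch, not the compiler's actual output: `row_update`, `unrolled_end`, and `UNROLL` are hypothetical names, the row update `c[j] += s * b[j]` is an assumed stand-in for the real loop body, and only a trailing remainder loop is modeled even though each binary has level 4 loops both before and after the unrolled one.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 16 /* 16 doubles per iteration = 4 AVX vectors of 4 doubles */

/* Index at which the unrolled loop stops and the remainder loop begins. */
size_t unrolled_end(size_t n)
{
    return n - n % UNROLL;
}

/* c[j] += s * b[j] for j in [0, n), written as an unrolled main loop
 * followed by a remainder loop for the leftover (n % UNROLL) iterations. */
void row_update(size_t n, double s, const double *b, double *c)
{
    assert(b != NULL && c != NULL);
    size_t j = 0;
    const size_t end = unrolled_end(n);

    /* Main loop: each trip handles UNROLL consecutive elements. The
     * fixed-trip inner loop stands in for the fully unrolled body. */
    for (; j < end; j += UNROLL)
        for (size_t u = 0; u < UNROLL; u++)
            c[j + u] += s * b[j + u];

    /* Remainder loop: one element per iteration. */
    for (; j < n; j++)
        c[j] += s * b[j];
}
```

For n = 37, for example, the main loop covers indices 0 to 31 in two trips and the remainder loop handles the last five elements.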
Hello Gabriel,
I reproduced the behavior you described, even after adding "#pragma noinline" before the call to compute().
The difference also shows up in the optimization reports:
<u487046.c;14:14;hlo_linear_trans;compute;0> //13.1
LOOP INTERCHANGE in loops at line: 14 15 16 17
Loopnest permutation ( 1 2 3 4 ) --> ( 2 1 4 3 )
<u487046.c;14:14;hlo_linear_trans;main;0> //SP1 update1
LOOP INTERCHANGE in loops at line: 14 15
Loopnest permutation ( 1 2 3 4 ) --> ( 2 1 3 4 )
So this is a High Level Optimizer regression. I will submit a bug report and will keep you posted on any news from the developers as soon as possible.
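To make the two permutations in the report concrete, here is a hedged sketch of a four-level nest in each order. The thread does not show the source, so the role of each loop is an assumption: loop 1 is taken to be a repetition loop and loops 2 to 4 the usual i, j, k of a matrix multiply. The permutation is read as the new nesting order, outermost first, in terms of the original loop numbers. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original order ( 1 2 3 4 ): r, i, j, k (assumed roles, see above). */
void nest_1234(size_t reps, size_t n,
               const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t r = 0; r < reps; r++)              /* loop 1 */
        for (size_t i = 0; i < n; i++)             /* loop 2 */
            for (size_t j = 0; j < n; j++)         /* loop 3 */
                for (size_t k = 0; k < n; k++)     /* loop 4 */
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* ( 2 1 4 3 ), as reported for 13.1: i, r, k, j.
 * The innermost loop is now j, so no recurrence remains in it. */
void nest_2143(size_t reps, size_t n,
               const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)                 /* loop 2 */
        for (size_t r = 0; r < reps; r++)          /* loop 1 */
            for (size_t k = 0; k < n; k++)         /* loop 4 */
                for (size_t j = 0; j < n; j++)     /* loop 3 */
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* ( 2 1 3 4 ), as reported for the newer compiler: i, r, j, k.
 * Only the two outer loops are swapped; k stays innermost. */
void nest_2134(size_t reps, size_t n,
               const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)                 /* loop 2 */
        for (size_t r = 0; r < reps; r++)          /* loop 1 */
            for (size_t j = 0; j < n; j++)         /* loop 3 */
                for (size_t k = 0; k < n; k++)     /* loop 4 */
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Under these assumptions all three orders add the same products to each element of `c` in the same sequence, so they differ only in memory access pattern and in which loop carries the recurrence.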
Thank you.
-- QIAOMIN.Q
Intel Developer Support
Qiaomin, thanks for confirming this and passing it along to the right people.
Gabriel
Hello Gabriel,
Thank you for reporting the issue. I will post an update here as soon as a fix is available from the development team.
Thank you.
-- QIAOMINQ.
Intel Developer Support
