Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Performance regression with Intel C 14.0

Gabriel_M_3
Beginner
577 Views

I have a very simple program that I use as a test input for a binary analysis tool. It performs a naive matrix multiplication. I am compiling it on an Intel Xeon E5-2690 machine with the options -O3 -g -xHost.

With the above options, the Intel compiler 13.x performs loop interchange, moving the recurrence from the innermost loop to an outer loop. Version 14.0.0 of the compiler does not perform this transformation. Both versions unroll one of the loops 16 times, filling four AVX vectors. However, version 14.0 also generates many more address arithmetic instructions in the innermost loop. The end result is that the code produced by version 14.0 takes 50% to 3x longer to execute for matrix sizes >= 40.
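For readers without the attachment, here is a minimal sketch of the kind of interchange being described. It is not the attached matmul.c (which is not reproduced in this thread and has a four-level loop nest); the function names, the row-major layout, and the three-level nest are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: C += A * B for n x n row-major matrices.
   The k loop is innermost, so every c[i*n+j] is a reduction carried
   by the innermost loop, and b is read with a stride of n doubles. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Same computation with the j and k loops interchanged.
   The reduction over k now sits in an outer loop, and the innermost
   loop walks b and c with unit stride, which vectorizes directly. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k]; /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions accumulate into c, so the caller is expected to zero c first. The two orders produce the same result for small integer-valued inputs; with general floating-point data the summation order per element is the same here (k ascending), only the memory access pattern differs.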

There is no value in this naive matrix multiply code itself, but I am trying to understand what changed in the new compiler that makes it fail to interchange the loops, and I wonder whether this change can affect other codes whose performance actually matters. Are there any command line flags that would enable the previous behavior?

Thanks, Gabriel

PS: I am not allowed to attach files. Pasting the code in the message body, or including a link to pastebin triggers the spam filter. What's the appropriate way to include sample code?

6 Replies
Gabriel_M_3
Beginner

This is about the Linux C compiler. And it looks like I can attach files now. Source code is attached.

QIAOMIN_Q_
New Contributor I

I'm trying to reproduce this. It would help if you could also post screenshots or the assembly code showing the differences between the output of the two versions; that would allow more people to help you.

Gabriel_M_3
Beginner

I am comparing the performance of two binaries compiled with

  1. icc (ICC) 13.1.3 20130607
  2. icc (ICC) 14.0.0 20130728

I am attaching the full objdump of the two binaries. The 'compute' routine has been inlined into 'main'.

The four-level loop in the compute routine maps to the following loop structure in the object code. All addresses are in hex. The end address is the address of the instruction immediately after the loop branch.

Binary produced with icc 13.1.3, flags -O3 -g -xHost:

  • L,Lev1: 400e29 : 401050
  • L,Lev2: 400e84 : 40102a
  • L,Lev3: 400e9c : 401013
  • L,Lev4: 400ef2 : 400f0c
  • L,Lev4: 400f1e : 400fae (X)
  • L,Lev4: 400fdc : 400ff6

Binary produced by icc 14.0.0, flags -O3 -g -xHost:

  • L,Lev1: 400f9f : 401295
  • L,Lev2: 400ffd : 401272
  • L,Lev3: 40100a : 40125a
  • L,Lev4: 40106b : 401088
  • L,Lev4: 4010e5 : 4011e1 (X)
  • L,Lev4: 40122c : 401249

The loops marked with (X) are the unrolled loops. The other level 4 loops handle the odd iterations. I am including inline in this post only the bodies of the unrolled level 4 loops. The full disassembly is attached.

[plain]

Binary mm_i13:

  400f1e:       c4 c1 79 10 0c d3       vmovupd (%r11,%rdx,8),%xmm1
  400f24:       c4 c1 79 10 6c d3 20    vmovupd 0x20(%r11,%rdx,8),%xmm5
  400f2b:       c4 41 79 10 4c d3 40    vmovupd 0x40(%r11,%rdx,8),%xmm9
  400f32:       c4 41 79 10 6c d3 60    vmovupd 0x60(%r11,%rdx,8),%xmm13
  400f39:       c4 c3 55 18 74 d3 30    vinsertf128 $0x1,0x30(%r11,%rdx,8),%ymm5,%ymm6
  400f40:       01
  400f41:       c4 c3 75 18 54 d3 10    vinsertf128 $0x1,0x10(%r11,%rdx,8),%ymm1,%ymm2
  400f48:       01
  400f49:       c4 43 35 18 54 d3 50    vinsertf128 $0x1,0x50(%r11,%rdx,8),%ymm9,%ymm10
  400f50:       01
  400f51:       c4 43 15 18 74 d3 70    vinsertf128 $0x1,0x70(%r11,%rdx,8),%ymm13,%ymm14
  400f58:       01
  400f59:       c5 fd 59 da             vmulpd %ymm2,%ymm0,%ymm3  
  400f5d:       c5 fd 59 fe             vmulpd %ymm6,%ymm0,%ymm7
  400f61:       c4 41 7d 59 da          vmulpd %ymm10,%ymm0,%ymm11
  400f66:       c4 41 7d 59 fe          vmulpd %ymm14,%ymm0,%ymm15
  400f6b:       c4 c1 65 58 24 d7       vaddpd (%r15,%rdx,8),%ymm3,%ymm4
  400f71:       c4 41 45 58 44 d7 20    vaddpd 0x20(%r15,%rdx,8),%ymm7,%ymm8
  400f78:       c4 41 25 58 64 d7 40    vaddpd 0x40(%r15,%rdx,8),%ymm11,%ymm12
  400f7f:       c4 c1 05 58 4c d7 60    vaddpd 0x60(%r15,%rdx,8),%ymm15,%ymm1
  400f86:       c4 c1 7d 11 24 d7       vmovupd %ymm4,(%r15,%rdx,8)
  400f8c:       c4 41 7d 11 44 d7 20    vmovupd %ymm8,0x20(%r15,%rdx,8)
  400f93:       c4 41 7d 11 64 d7 40    vmovupd %ymm12,0x40(%r15,%rdx,8)
  400f9a:       c4 c1 7d 11 4c d7 60    vmovupd %ymm1,0x60(%r15,%rdx,8)
matmul.c:16
  400fa1:       48 83 c2 10             add    $0x10,%rdx
  400fa5:       49 3b d4                cmp    %r12,%rdx
  400fa8:       0f 82 70 ff ff ff       jb     400f1e <main+0x57e>   

[/plain]

[plain]

Binary mm_i14:

  4010e5:       4d 63 d2                movslq %r10d,%r10
matmul.c:17
  4010e8:       41 83 c7 10             add    $0x10,%r15d
matmul.c:18
  4010ec:       4e 8d 2c d3             lea    (%rbx,%r10,8),%r13
  4010f0:       c4 c1 7b 10 65 00       vmovsd 0x0(%r13),%xmm4
  4010f6:       c4 c1 59 16 74 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm4,%xmm6
  4010fd:       4d 8d 6c 4d 00          lea    0x0(%r13,%rcx,2),%r13
  401102:       c4 c1 7b 10 6d 00       vmovsd 0x0(%r13),%xmm5
  401108:       c4 c1 51 16 7c 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm5,%xmm7
  40110f:       46 8d 2c 17             lea    (%rdi,%r10,1),%r13d
  401113:       4d 63 ed                movslq %r13d,%r13
  401116:       c4 63 4d 18 c7 01       vinsertf128 $0x1,%xmm7,%ymm6,%ymm8
  40111c:       c4 21 3d 59 0c f0       vmulpd (%rax,%r14,8),%ymm8,%ymm9
  401122:       4e 8d 2c ea             lea    (%rdx,%r13,8),%r13
  401126:       c4 41 7b 10 55 00       vmovsd 0x0(%r13),%xmm10
  40112c:       c4 41 29 16 64 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm10,%xmm12
  401133:       c4 c1 65 58 d9          vaddpd %ymm9,%ymm3,%ymm3  
  401138:       4d 8d 6c 4d 00          lea    0x0(%r13,%rcx,2),%r13
  40113d:       c4 41 7b 10 5d 00       vmovsd 0x0(%r13),%xmm11
  401143:       c4 41 21 16 6c 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm11,%xmm13
  40114a:       47 8d 2c 10             lea    (%r8,%r10,1),%r13d
  40114e:       4d 63 ed                movslq %r13d,%r13
  401151:       c4 43 1d 18 f5 01       vinsertf128 $0x1,%xmm13,%ymm12,%ymm14
  401157:       c4 21 0d 59 7c f0 20    vmulpd 0x20(%rax,%r14,8),%ymm14,%ymm15
  40115e:       4e 8d 2c ea             lea    (%rdx,%r13,8),%r13
  401162:       c4 c1 7b 10 65 00       vmovsd 0x0(%r13),%xmm4
  401168:       c4 c1 59 16 74 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm4,%xmm6
  40116f:       c5 85 58 d2             vaddpd %ymm2,%ymm15,%ymm2
  401173:       4d 8d 6c 4d 00          lea    0x0(%r13,%rcx,2),%r13
  401178:       c4 c1 7b 10 6d 00       vmovsd 0x0(%r13),%xmm5
  40117e:       c4 c1 51 16 7c 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm5,%xmm7
  401185:       47 8d 2c 11             lea    (%r9,%r10,1),%r13d
matmul.c:17
  401189:       45 03 d4                add    %r12d,%r10d
matmul.c:18
  40118c:       4d 63 ed                movslq %r13d,%r13
  40118f:       c4 63 4d 18 c7 01       vinsertf128 $0x1,%xmm7,%ymm6,%ymm8
  401195:       c4 21 3d 59 4c f0 40    vmulpd 0x40(%rax,%r14,8),%ymm8,%ymm9
  40119c:       4e 8d 2c ea             lea    (%rdx,%r13,8),%r13
  4011a0:       c4 41 7b 10 55 00       vmovsd 0x0(%r13),%xmm10
  4011a6:       c4 41 29 16 64 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm10,%xmm12
  4011ad:       c5 b5 58 c9             vaddpd %ymm1,%ymm9,%ymm1
  4011b1:       4d 8d 6c 4d 00          lea    0x0(%r13,%rcx,2),%r13
  4011b6:       c4 41 7b 10 5d 00       vmovsd 0x0(%r13),%xmm11
  4011bc:       c4 41 21 16 6c 0d 00    vmovhpd 0x0(%r13,%rcx,1),%xmm11,%xmm13
  4011c3:       c4 43 1d 18 f5 01       vinsertf128 $0x1,%xmm13,%ymm12,%ymm14
  4011c9:       c4 21 0d 59 7c f0 60    vmulpd 0x60(%rax,%r14,8),%ymm14,%ymm15
matmul.c:17
  4011d0:       49 83 c6 10             add    $0x10,%r14
matmul.c:18
  4011d4:       c5 85 58 c0             vaddpd %ymm0,%ymm15,%ymm0
matmul.c:17
  4011d8:       45 3b fb                cmp    %r11d,%r15d
  4011db:       0f 82 04 ff ff ff       jb     4010e5 <main+0x6f5>

[/plain]
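As a rough reading of the two listings above (an interpretation of the disassembly, not the attached source): the 13.1.3 loop loads 16 contiguous doubles, multiplies them by a value held in ymm0, adds the result to contiguous memory, and stores it back. The 14.0.0 loop instead assembles each vector from four scalar loads spaced by the stride in %rcx (vmovsd/vmovhpd pairs plus lea), multiplies by contiguous data, and accumulates into ymm0 to ymm3. The scalar C sketch below shows those two inner-loop shapes; the function names are hypothetical and the single accumulator is a simplification of the four vector accumulators.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the 13.1.3 inner loop as read from the listing:
   a unit-stride update dst[j] += s * src[j], with s loop-invariant. */
void axpy_unit_stride(size_t n, double s, const double *src, double *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t j = 0; j < n; j++)
        dst[j] += s * src[j];
}

/* Shape of the 14.0.0 inner loop as read from the listing:
   a dot product in which one operand is read with a non-unit stride,
   so each element needs its own address computation and scalar load. */
double dot_strided(size_t n, const double *x, const double *y, size_t stride)
{
    assert(x != NULL && y != NULL && stride > 0);
    double acc = 0.0; /* the listing keeps four vector accumulators */
    for (size_t k = 0; k < n; k++)
        acc += x[k] * y[k * stride];
    return acc;
}
```

The extra lea and movslq instructions in the 14.0.0 listing correspond to the per-element index arithmetic that `y[k * stride]` implies, which the unit-stride form avoids.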

QIAOMIN_Q_
New Contributor I

Hello Gabriel,

I reproduced the behavior you described, even after adding "#pragma noinline" before the call to compute().

The difference also shows up in the optimization report:

<u487046.c;14:14;hlo_linear_trans;compute;0>     //13.1

LOOP INTERCHANGE in loops at line: 14 15 16 17

Loopnest permutation ( 1 2 3 4 ) --> ( 2 1 4 3 )

<u487046.c;14:14;hlo_linear_trans;main;0>     //SP1 update1

LOOP INTERCHANGE in loops at line: 14 15

Loopnest permutation ( 1 2 3 4 ) --> ( 2 1 3 4 )

So this is a High Level Optimizer regression. I will submit a bug report for it and will keep you posted on any news from the developers.
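To make the two report lines easier to compare, here is a small sketch that applies a printed permutation to the loop header lines 14 to 17. It assumes that position p of the printed permutation names the original loop placed at depth p of the new nest (outermost first); that reading of the report notation is an assumption, although for these two permutations the inverse reading gives the same answer. The helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* perm[] is 1-based, as printed in the report. out_lines[p] receives the
   source line of the loop that ends up at depth p (0 = outermost). */
void permute_loops(size_t depth, const int *perm,
                   const int *src_lines, int *out_lines)
{
    for (size_t p = 0; p < depth; p++) {
        assert(perm[p] >= 1 && (size_t)perm[p] <= depth);
        out_lines[p] = src_lines[perm[p] - 1];
    }
}
```

Under that assumption, ( 2 1 4 3 ) reorders lines 14 15 16 17 to 15 14 17 16, so the loop from line 16 becomes innermost, while ( 2 1 3 4 ) gives 15 14 16 17 and leaves the line 17 loop innermost. This agrees with the source annotations in the disassembly earlier in the thread, where the 13.1.3 inner loop control is tagged matmul.c:16 and the 14.0.0 one matmul.c:17.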

 

Thank you.

-- QIAOMIN.Q

Intel Developer Support

Gabriel_M_3
Beginner

Qiaomin, thanks for confirming and passing this along to the right people.

Gabriel

QIAOMIN_Q_
New Contributor I

Hello Gabriel,

Thank you for submitting the issue. I will post here as soon as an update with a fix is available from the development team.

Thank you.

-- QIAOMIN.Q

Intel Developer Support
