Solved: Loop Versioning in ICC

zakaria-bendifallah · ‎07-24-2011

Hi all,
I have a question about loop versioning in the Intel compiler.
I compiled SPEC 2006 with ICC -g3, and for one of the
codes (bwaves, to be more specific) a loop was versioned.
The generated code was:

=================================================================
404802: 48 f7 c2 0f 00 00 00 test $0xf,%rdx
404809: 0f 84 71 00 00 00 je 404880
40480f: 90 nop
404810: f2 42 0f 10 1c c3 movsd (%rbx,%r8,8),%xmm3
404816: f2 42 0f 10 64 c3 10 movsd 0x10(%rbx,%r8,8),%xmm4
40481d: 66 42 0f 16 5c c3 08 movhpd 0x8(%rbx,%r8,8),%xmm3
404824: 66 42 0f 16 64 c3 18 movhpd 0x18(%rbx,%r8,8),%xmm4
40482b: 66 43 0f 59 1c c1 mulpd (%r9,%r8,8),%xmm3
404831: 66 43 0f 59 64 c1 10 mulpd 0x10(%r9,%r8,8),%xmm4
404838: 66 0f 58 d3 addpd %xmm3,%xmm2
40483c: 66 0f 58 cc addpd %xmm4,%xmm1
404840: f2 42 0f 10 6c c3 20 movsd 0x20(%rbx,%r8,8),%xmm5
404847: f2 42 0f 10 74 c3 30 movsd 0x30(%rbx,%r8,8),%xmm6
40484e: 66 42 0f 16 6c c3 28 movhpd 0x28(%rbx,%r8,8),%xmm5
404855: 66 42 0f 16 74 c3 38 movhpd 0x38(%rbx,%r8,8),%xmm6
40485c: 66 43 0f 59 6c c1 20 mulpd 0x20(%r9,%r8,8),%xmm5
404863: 66 43 0f 59 74 c1 30 mulpd 0x30(%r9,%r8,8),%xmm6
40486a: 66 0f 58 d5 addpd %xmm5,%xmm2
40486e: 66 0f 58 ce addpd %xmm6,%xmm1
404872: 49 83 c0 08 add $0x8,%r8
404876: 4c 3b c1 cmp %rcx,%r8
404879: 72 95 jb 404810
40487b: eb 4e jmp 4048cb
40487d: 48 89 f6 mov %rsi,%rsi
404880: 42 0f 28 1c c3 movaps (%rbx,%r8,8),%xmm3
404885: 42 0f 28 64 c3 10 movaps 0x10(%rbx,%r8,8),%xmm4
40488b: 66 43 0f 59 1c c1 mulpd (%r9,%r8,8),%xmm3
404891: 66 43 0f 59 64 c1 10 mulpd 0x10(%r9,%r8,8),%xmm4
404898: 66 0f 58 d3 addpd %xmm3,%xmm2
40489c: 66 0f 58 cc addpd %xmm4,%xmm1
4048a0: 42 0f 28 6c c3 20 movaps 0x20(%rbx,%r8,8),%xmm5
4048a6: 42 0f 28 74 c3 30 movaps 0x30(%rbx,%r8,8),%xmm6
4048ac: 66 43 0f 59 6c c1 20 mulpd 0x20(%r9,%r8,8),%xmm5
4048b3: 66 43 0f 59 74 c1 30 mulpd 0x30(%r9,%r8,8),%xmm6
4048ba: 66 0f 58 d5 addpd %xmm5,%xmm2
4048be: 66 0f 58 ce addpd %xmm6,%xmm1
4048c2: 49 83 c0 08 add $0x8,%r8
4048c6: 4c 3b c1 cmp %rcx,%r8
4048c9: 72 b5 jb 404880
4048cb: 49 3b cc cmp %r12,%rcx
4048ce: 0f 83 64 14 00 00 jae 405d38
4048d4: f2 0f 10 1c cb movsd (%rbx,%rcx,8),%xmm3
=========================================================================
The first version starts at 40480f,
the second at 404880.
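
For readers following the listing, the choice between the two versions can be modeled as below. This is only a sketch: it assumes %rdx holds the address of the array that is indexed through %rbx (the listing does not show where %rdx is set), and the function name is made up for illustration.

```python
def pick_loop_version(addr: int) -> str:
    """Model of `test $0xf,%rdx` followed by `je 404880`.

    `test` ANDs its operands and sets ZF when the result is zero;
    `je` then jumps to the second version of the loop.
    Assumption: `addr` stands for whatever value %rdx holds.
    """
    if addr & 0xF == 0:
        return "aligned"    # 404880: movaps loads
    return "unaligned"      # 404810: movsd + movhpd loads


print(pick_loop_version(0x1000))  # low 4 bits are zero
print(pick_loop_version(0x1008))  # only 8-byte aligned
```

Under that assumption, the first version handles data that is not 16-byte aligned, and the second handles 16-byte-aligned data.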

Two strange instructions appear there:
40480f a nop
40487d a mov %rsi,%rsi

The jump to the second version of the loop targets
404880, not the mov instruction at 40487d.
So I would like to know why these two instructions were generated,
and why the second one is a mov and not a nop.
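
The branch targets mentioned above can be checked against the encoded bytes in the listing. A minimal sketch, assuming the usual x86 rule that a relative jump lands at the address of the next instruction plus the sign-extended displacement (the helper name is illustrative):

```python
def rel_target(addr: int, length: int, disp: int, disp_bits: int) -> int:
    """Target of a relative jump: next-instruction address plus the
    sign-extended displacement."""
    if disp >= 1 << (disp_bits - 1):
        disp -= 1 << disp_bits          # sign-extend negative displacements
    return addr + length + disp


# (address, instruction length, raw displacement, displacement width in bits)
je_top = rel_target(0x404809, 6, 0x71, 32)     # 0f 84 71 00 00 00
jb_first = rel_target(0x404879, 2, 0x95, 8)    # 72 95
jmp_exit = rel_target(0x40487B, 2, 0x4E, 8)    # eb 4e
jb_second = rel_target(0x4048C9, 2, 0xB5, 8)   # 72 b5

for name, target in [("je", je_top), ("jb (first loop)", jb_first),
                     ("jmp", jmp_exit), ("jb (second loop)", jb_second)]:
    print(f"{name}: {target:#x}")
```

No decoded target equals 40480f or 40487d, so within this listing nothing branches to either of the two instructions in question.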

Thanks in advance for your answers :)

jimdempseyatthecove · ‎07-24-2011

40480f: 90 nop

*** note the address of the top of the loop is now aligned to a multiple of 8 (and 16) bytes
404810: f2 42 0f 10 1c c3 movsd (%rbx,%r8,8),%xmm3
...
404879: 72 95 jb 404810
---------------------------------------------
40487d: 48 89 f6 mov %rsi,%rsi
*** the above is effectively a 3-byte nop (faster than three 1-byte nops)
*** and note the address of the top of the following loop is now aligned to a multiple of 8 (and 16, ...) bytes
404880: 42 0f 28 1c c3 movaps (%rbx,%r8,8),%xmm3
...
4048c9: 72 b5 jb 404880
--------------------------------------------

Depending on the processor model, aligning the top of a loop to a multiple of the word size (8 bytes in this case) tends to improve performance. The overhead of the nop (or the equivalent 3-byte mov %rsi,%rsi) is recouped by the second iteration of the loop.
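
The padding sizes can be checked with a little arithmetic. The sketch below assumes a 16-byte alignment target, which is inferred from the addresses in the listing rather than from any compiler documentation; the function name is illustrative.

```python
def padding_to_align(addr: int, alignment: int = 16) -> int:
    """Bytes of filler needed so that addr + padding is a multiple of alignment."""
    return -addr % alignment


# Addresses where the filler instructions sit in the listing above.
for filler_addr in (0x40480F, 0x40487D):
    pad = padding_to_align(filler_addr)
    print(f"{filler_addr:#x}: {pad} byte(s) of padding -> {filler_addr + pad:#x}")
```

The results match the listing: one byte (the nop, 90) before 404810 and three bytes (the mov %rsi,%rsi, 48 89 f6) before 404880.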

Jim Dempsey

zakaria-bendifallah · ‎07-25-2011

Okay, loop alignment: this is the information I was missing.

Well, thanks a lot.

Regards,

Zakaria Bendifallah