New Contributor II

loop was unrolled by 2: is it sufficient?

Greetings,

I use ICC under MSVC with /QxHOST on Haswell (AVX2, 256-bit vectors).

My code uses the __m256 type for my own memcpy; ICC generates a correct result and it works well.

But when I look at the assembler output: is it sufficient to unroll only by 2 when I have:

#define PACKET_SIZE_MIN             128
#define PACKET_SIZE_AVG             512
#define PACKET_SIZE_MAX             2048

...

#if defined(__INTEL_COMPILER)
#   pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endif
#   pragma unroll

and the assembler output reads:

.B1.8::                         ; Preds .B1.6 .B1.8
L4::            ; optimization report
                ; LOOP WAS UNROLLED BY 2
                ; %s was not vectorized: operation cannot be vectorized
$LN15:
  00022 48 ff c1         inc rcx                                ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN16:
  00025 c5 fe 6f 04 10   vmovdqu ymm0, YMMWORD PTR [rax+rdx]    ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN17:
  0002a c5 fe 6f 4c 10 
        20               vmovdqu ymm1, YMMWORD PTR [32+rax+rdx] ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN18:
  00030 c4 a1 7e 7f 04 
        08               vmovdqu YMMWORD PTR [rax+r9], ymm0     ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN19:
  00036 c4 a1 7e 7f 4c 
        08 20            vmovdqu YMMWORD PTR [32+rax+r9], ymm1  ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN20:
  0003d 48 83 c0 40      add rax, 64                            ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN21:
  00041 49 3b c8         cmp rcx, r8                            ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN22:
  00044 72 dc            jb .B1.8 ; Prob 63%                    ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN23:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15

 

PS: I need to "#undef" "min" and "max" because MSVC's headers define these names as conflicting macros...

TIA, best

12 Replies

New Contributor II

My first code snippet was malformed. Here is the for loop:

#define PACKET_SIZE_MIN             128
#define PACKET_SIZE_AVG             512
#define PACKET_SIZE_MAX             2048

...
static unsigned g_num_frames=512; // set *ONCE* at program start; sometimes 128, but usually 512
...

const unsigned num_frames=g_num_frames;

#if defined(__INTEL_COMPILER)
#   pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endif
#if defined(__INTEL_COMPILER)
#   pragma unroll
#endif
for(unsigned i1=0; i1<num_frames; i1++) {


 

New Contributor II

IMO it should be unrolled by at least 4 (as MIN suggests), or by 8 (as MAX suggests) guarded by a condition...


If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two ymm registers' worth of data, i.e. the two halves of a 64-byte cache line. In other words, each iteration moves one cache line.

BTW, it's

#pragma loop_count

not

#pragma loop count

Jim Dempsey
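
To illustrate the corrected spelling, a compilable sketch, guarded so non-Intel compilers simply ignore the pragma (the `scale` function and its body are placeholders of my own; only the pragma line matters):

```cpp
#include <cstddef>

// Placeholder loop: the point is the pragma spelling "loop_count",
// which the Intel compiler recognizes; "loop count" is silently ignored.
static void scale(float *a, std::size_t n)
{
#if defined(__INTEL_COMPILER)
#   pragma loop_count min(128) avg(512) max(2048)
#endif
    for (std::size_t i = 0; i < n; ++i)
        a[i] *= 2.0f;
}
```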

New Contributor II

jimdempseyatthecove wrote:

If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two ymm registers' worth of data, i.e. the two halves of a 64-byte cache line. In other words, each iteration moves one cache line.

OIC.

jimdempseyatthecove wrote:

BTW, it's

#pragma loop_count

not

#pragma loop count

Oh, thank you for noticing this. The compiler never threw a warning, and I have this bug in many places (nearly 300), so I'm about to correct it and then measure the performance of the corrected code.

New Contributor II

And as a side note: now that the "#pragma loop_count" is correct in those other places too, maybe the compiler will unroll those loops by a bigger factor, especially where I also use "#pragma unroll".

Black Belt

Unless your memcpy() routine is scheduled to run on all CPU cores, I think that unrolling by 2 can be sufficient, mainly because each core can perform 2 loads and 1 store per cycle.*

*Edited to correct the answer.


For the given processor (AVX2), I suggest that an unroll of 4 might be better. The reason: each source-visible iteration copies half a cache line, so an unroll of 2 copies one cache line. An unroll of 4 may allow later loads to be issued while waiting for a prior load to complete.

*** caveat ***

You would like to construct the asm code to perform the 4 loads followed by 4 stores:

Load, Load, Load, Load, Store, Store, Store, Store

Generate some code that calls __intel_fast_memcpy (spelling may be off), then step into it in the disassembly view. This will show you the best interleaving of the loads and stores, as well as the loop unroll count.

I haven't checked on an AVX2 system, but for SSE it performed 8 loads followed by 8 stores. AVX2 may only need half of that... unless you are on a system with 4 memory channels, in which case 8/8 may be better.

Jim Dempsey

 

New Contributor II

Intel staff: this is a bug report.

See: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-30B361...

when I use:

...
#elif MY_USE_MM==256
#   define MY_USE_VDM_MEMCPY_TYPE   __m256i
#   define MY_STORE_SI              _mm256_store_si256
#   define MY_LOAD_SI               _mm256_load_si256
#   define MY_ZERO_SI               _mm256_setzero_si256
#endif

...

#if defined(__INTEL_COMPILER)
#   pragma vector nontemporal(d)
#   pragma vector always //assert
#   pragma vector vecremainder
#endif
#if defined(__INTEL_COMPILER) && 1 // VDM auto patch
#   pragma ivdep
#   pragma swp
#   pragma unroll(4)
#endif
    for(size_t i=0; i<cnt; ++i) {
#if defined(MY_USE_AVX_INTRIN) && MY_USE_AVX_INTRIN
        MY_STORE_SI(&d[i],MY_ZERO_SI());
#else
        d[i]=MY_ZERO_SI();
#endif
    }

When all pointers and the trip count are divisible by 64 (i.e. aligned malloc with an alignment of sizeof(__m512)) on Haswell with MSVC and /QxHOST (i.e. AVX2), the assembler output reports that the loop was unrolled by 2, even though I explicitly requested 4.

It doesn't matter whether "MY_USE_AVX_INTRIN" is defined to 1 or not defined at all; the loop is always unrolled by 2.

Or am I missing something?
 

EDIT: fixed wrong part of code snippet

New Contributor II

ping?

New Contributor II

I just want to experiment with unrolling by 4, without a manual rewrite, since I know "cnt" will always be greater than 4.

New Contributor II

FWIW, I have finally managed it by giving the compiler an assumed lower bound for "cnt":

#ifdef _MSC_VER
#   define MY_ASSUME(cond) __assume(cond)
#else
#   define MY_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

MY_ASSUME(cnt>=...);
#pragma unroll(4)

and now it unrolls the loop by a factor of 4.
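
Put together, the pattern looks roughly like this (the lower bound of 128 and the zeroing body are my assumptions based on the earlier posts; the elided bound in the original stays unknown):

```cpp
#include <cstddef>

#ifdef _MSC_VER
#   define MY_ASSUME(cond) __assume(cond)
#else
#   define MY_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

static void zero_frames(float *d, std::size_t cnt)
{
    MY_ASSUME(cnt >= 128);   // promise a lower trip count so deeper unrolling pays off
#if defined(__INTEL_COMPILER)
#   pragma unroll(4)
#endif
    for (std::size_t i = 0; i < cnt; ++i)
        d[i] = 0.0f;
}
```

Note that MY_ASSUME is a promise, not a check: calling zero_frames with cnt < 128 would be undefined behavior.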

New Contributor II

Of course, I should add that I coded a length dispatcher for my standard lengths ("cnt", see above); unrolling by factors from 2 up to 64 gives a performance boost of circa 20% over Intel's default unroll of 2, across all of my memcpy, fused memcpy (two arrays at once), and memset routines.
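
A sketch of what such a length dispatcher might look like (the standard byte counts 128/512/2048 come from the PACKET_SIZE_* macros earlier in the thread; the names `copy_fixed`/`copy_dispatch` and the use of std::memcpy as the copy kernel are illustrative assumptions):

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size copy: the compile-time constant N lets the compiler
// fully unroll and vectorize each specialization.
template <std::size_t N>
static void copy_fixed(void *dst, const void *src)
{
    std::memcpy(dst, src, N);
}

// Dispatch on the known standard lengths; fall back for odd sizes.
static void copy_dispatch(void *dst, const void *src, std::size_t bytes)
{
    switch (bytes) {
    case  128: copy_fixed<128>(dst, src);  break;
    case  512: copy_fixed<512>(dst, src);  break;
    case 2048: copy_fixed<2048>(dst, src); break;
    default:   std::memcpy(dst, src, bytes); break;
    }
}
```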
