Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7944 Discussions

loop was unrolled by 2: is it sufficient?

Marián__VooDooMan__M
New Contributor II
701 Views

Greetings,

I use MSVC and /QxHOST on Haswell (AVX-256).

I have code under MSVC that is using __m256 type for my own memcpy, and ICC generates correct result, and it is working well.

But when I look at the assembler output, is it sufficient to unroll ONLY by 2 ?! when I have:

#define PACKET_SIZE_MIN             128
#define PACKET_SIZE_AVG             512
#define PACKET_SIZE_MAX             2048

...

#if defined(__INTEL_COMPILER)
#   pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endi
#   pragma unroll

and the assembler output reads:

.B1.8::                         ; Preds .B1.6 .B1.8
L4::            ; optimization report
                ; LOOP WAS UNROLLED BY 2
                ; %s was not vectorized: operation cannot be vectorized
$LN15:
  00022 48 ff c1         inc rcx                                ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN16:
  00025 c5 fe 6f 04 10   vmovdqu ymm0, YMMWORD PTR [rax+rdx]    ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN17:
  0002a c5 fe 6f 4c 10 
        20               vmovdqu ymm1, YMMWORD PTR [32+rax+rdx] ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN18:
  00030 c4 a1 7e 7f 04 
        08               vmovdqu YMMWORD PTR [rax+r9], ymm0     ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN19:
  00036 c4 a1 7e 7f 4c 
        08 20            vmovdqu YMMWORD PTR [32+rax+r9], ymm1  ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN20:
  0003d 48 83 c0 40      add rax, 64                            ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN21:
  00041 49 3b c8         cmp rcx, r8                            ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN22:
  00044 72 dc            jb .B1.8 ; Prob 63%                    ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN23:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15

 

PS: I need to "#undef" the "min" and the "max" because of MSVC defining these symbols in the other way...

TIA, best

0 Kudos
12 Replies
Marián__VooDooMan__M
New Contributor II
701 Views

My first code code was malformed. There is for loop:

#define PACKET_SIZE_MIN             128
#define PACKET_SIZE_AVG             512
#define PACKET_SIZE_MAX             2048

...
static unsigned g_num_frames=512; // this is set *ONCE* at program start, sometimes could be 128, but often it is always 512
...

const unsigned num_frames=g_num_frames;

#if defined(__INTEL_COMPILER)
#   pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endif
#if defined(__INTEL_COMPILER)
#   pragma unroll
#endif
for(unsigned i1=0; i1<num_frames; i1++) {


 

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

IMO it should be unrolled by 4 at least (as MIN says), or by 8 (as MAX says) with a condition...

0 Kudos
jimdempseyatthecove
Honored Contributor III
701 Views

If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two halves (ymm register worth) of a cache line. IOW each iteration moves a cache line.

BTW, its

#pragma loop_count

not

#pragma loop count

Jim Dempsey

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

jimdempseyatthecove wrote:

If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two halves (ymm register worth) of a cache line. IOW each iteration moves a cache line.

OIC.

jimdempseyatthecove wrote:

BTW, its

#pragma loop_count

not

#pragma loop count

Oh, thank you for noticing this, compiler never thrown a warning and I have this bug in MANY places (nearly 300 times), so I'm about to correct it, and see performance for corrected code.

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

And as a side note, maybe when I will have correct "#pragma loop_count" now, in other places, maybe compiler will unroll loops by bigger count, besides when I'm using "#pragma unroll' as well.

0 Kudos
Bernard
Valued Contributor I
701 Views

Unless your memcpy() routine will be scheduled to run on all CPU cores I think that unrolling by 2 can be sufficient, mainly because each core can perform 2 loads and 1 store per *cycle.

*Edited - correcting the answer.

0 Kudos
jimdempseyatthecove
Honored Contributor III
701 Views

I recommend for the given processor (AVX2), that unroll of 4 might be better. Reason being: each "source code" visible iteration copies 1/2 of a cache line. Unroll of 2 copies one cache line. Performing unroll of 4 may (can) provide for loads to be issued while waiting for prior load to complete.

*** caveat ***

You would like to construct the asm code to perform the 4 loads followed by 4 stores:

Load, Load, Load, Load, Store, Store, Store, Store

Generate some code using __intel_fast_memcpy (spelling may be off), then step into this with disassembly. This will show you the best juxtapositioning of the loads and stores, as well as loop unroll count.

I haven't checked on AVX2 system, but for SSE they performed 8 loads followed by 8 stores. AVX2 may need only perform half of this.... unless you are on a system with 4 memory channels, and in which case 8, 8 may be better.

Jim Dempsey

 

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

Intel staff: this is a bug report.

See: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-30B36136-E399-4D7A-9BF8-06D96B8536E9.htm

when I use:

...
#elif MY_USE_MM==256
#   define MY_USE_VDM_MEMCPY_TYPE   __m256i
#   define MY_STORE_SI              _mm256_store_si256
#   define MY_LOAD_SI               _mm256_load_si256
#   define MY_ZERO_SI               _mm256_setzero_si256
#endif

...

#if defined(__INTEL_COMPILER)
#   pragma vector nontemporal(d)
#   pragma vector always //assert
#   pragma vector vecremainder
#endif
#if defined(__INTEL_COMPILER) && 1 // VDM auto patch
#   pragma ivdep
#   pragma swp
#   pragma unroll(4)
#endif
    for(size_t i=0; i<cnt; ++i) {
#if defined(MY_USE_AVX_INTRIN) && MY_USE_AVX_INTRIN
        MY_STORE_SI(&d,MY_ZERO_SI());
#else
        d=MY_ZERO_SI();
#endif
    }

when both all pointers and trip count is divisible by 64 without remainder (i.e. aligned malloc with align of sizeof(__mm512)) on Haswell with MSVC and /QxHOST (i.e. AVX2), assembler output reports that loop was unrolled by 2, even when I forcibly specified to by 4.

it doesn't matter whether "MY_USE_AVX_INTRIN" is defined to 1 or not defined, it always unrolls by 2.

Or am I missing something?
 

EDIT: fixed wrong part of code snippet

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

ping?

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

I just want to experiment with unrolling by 4, without manual re-write, since I know "cnt" will always be more than 4.

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

FWIW I have finally managed it by using assumed "cnt" count:

#ifdef _MSC_VER
#   define MY_ASSUME(cond) __assume(cond)
#else
#   define MY_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

MY_ASSUME(cnt>=...);
#pragma unroll(4)

and now it unrolls the loop by factor of 4.

0 Kudos
Marián__VooDooMan__M
New Contributor II
701 Views

of course I need to say I coded length dispatcher of my standard lengths ("cnt" <- see above) and unrolling by from 2 to 64 gives me performance boost of circa 20% compared to default Intel unrolling by 2, for all my memcpy, fused memcpy (2 arrays at once) and memset.

0 Kudos
Reply