Greetings,
I build under MSVC with ICC and /QxHOST on Haswell (AVX2).
My code uses the __m256 type for my own memcpy; ICC generates a correct result and it works well.
But when I look at the assembler output: is it really sufficient to unroll ONLY by 2?! when I have:

#define PACKET_SIZE_MIN 128
#define PACKET_SIZE_AVG 512
#define PACKET_SIZE_MAX 2048
...
#if defined(__INTEL_COMPILER)
# pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endif
# pragma unroll
and the assembler output reads:
.B1.8::                         ; Preds .B1.6 .B1.8
L4::
        ; optimization report
        ; LOOP WAS UNROLLED BY 2
        ; %s was not vectorized: operation cannot be vectorized
$LN15:
  00022 48 ff c1              inc rcx                                  ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN16:
  00025 c5 fe 6f 04 10        vmovdqu ymm0, YMMWORD PTR [rax+rdx]      ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN17:
  0002a c5 fe 6f 4c 10 20     vmovdqu ymm1, YMMWORD PTR [32+rax+rdx]   ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.14
$LN18:
  00030 c4 a1 7e 7f 04 08     vmovdqu YMMWORD PTR [rax+r9], ymm0       ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN19:
  00036 c4 a1 7e 7f 4c 08 20  vmovdqu YMMWORD PTR [32+rax+r9], ymm1    ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:149.9
$LN20:
  0003d 48 83 c0 40           add rax, 64                              ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN21:
  00041 49 3b c8              cmp rcx, r8                              ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN22:
  00044 72 dc                 jb .B1.8                 ; Prob 63%      ;c:\Users\vdmn1.vdmn\Documents\develop\Recorder7.1\Recorder7_Processor\src\wav/my_frame.h:145.5
$LN23:
        ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
PS: I need to #undef "min" and "max" because MSVC defines these names as macros...
TIA, best
My first code snippet was malformed. Here is the for loop:

#define PACKET_SIZE_MIN 128
#define PACKET_SIZE_AVG 512
#define PACKET_SIZE_MAX 2048
...
static unsigned g_num_frames=512; // set *ONCE* at program start; sometimes 128, but usually 512
...
const unsigned num_frames=g_num_frames;
#if defined(__INTEL_COMPILER)
# pragma loop count min(PACKET_SIZE_MIN) avg(PACKET_SIZE_AVG) max(PACKET_SIZE_MAX)
#endif
#if defined(__INTEL_COMPILER)
# pragma unroll
#endif
for(unsigned i1=0; i1<num_frames; i1++)
{
IMO it should be unrolled by at least 4 (as MIN suggests), or by 8 (as MAX suggests) behind a condition...
If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two halves (ymm register worth) of a cache line. IOW each iteration moves a cache line.
BTW, it's
#pragma loop_count
not
#pragma loop count
Jim Dempsey
jimdempseyatthecove wrote:
If you look at your code you will notice "add rax, 64". Each iteration of the loop copies two halves (ymm register worth) of a cache line. IOW each iteration moves a cache line.
OIC.
jimdempseyatthecove wrote:
BTW, it's
#pragma loop_count
not
#pragma loop count
Oh, thank you for noticing this; the compiler never threw a warning, and I have this bug in MANY places (nearly 300), so I'm about to correct it and measure the performance of the corrected code.
And as a side note: now that I have the correct "#pragma loop_count" in the other places too, maybe the compiler will unroll those loops by a bigger count, especially where I'm also using "#pragma unroll".
Unless your memcpy() routine will be scheduled to run on all CPU cores, I think that unrolling by 2 can be sufficient, mainly because each core can perform 2 loads and 1 store per cycle.*
*Edited - correcting the answer.
I recommend, for the given processor (AVX2), that an unroll of 4 might be better. Reason: each source-visible iteration copies half a cache line, so an unroll of 2 copies one cache line per trip. An unroll of 4 may (can) allow further loads to be issued while waiting for prior loads to complete.
*** caveat ***
You would like the asm code to perform the 4 loads followed by the 4 stores:
Load, Load, Load, Load, Store, Store, Store, Store
Generate some code using __intel_fast_memcpy (spelling may be off), then step into it in disassembly. This will show you the best interleaving of the loads and stores, as well as the loop unroll count.
I haven't checked on an AVX2 system, but for SSE it performed 8 loads followed by 8 stores. AVX2 may only need half of that... unless you are on a system with 4 memory channels, in which case 8/8 may be better.
Jim Dempsey
Intel staff: this is a bug report.
when I use:
...
#elif MY_USE_MM==256
# define MY_USE_VDM_MEMCPY_TYPE __m256i
# define MY_STORE_SI _mm256_store_si256
# define MY_LOAD_SI _mm256_load_si256
# define MY_ZERO_SI _mm256_setzero_si256
#endif
...
#if defined(__INTEL_COMPILER)
# pragma vector nontemporal(d)
# pragma vector always //assert
# pragma vector vecremainder
#endif
#if defined(__INTEL_COMPILER) && 1 // VDM auto patch
# pragma ivdep
# pragma swp
# pragma unroll(4)
#endif
for(size_t i=0; i<cnt; ++i)
{
#if defined(MY_USE_AVX_INTRIN) && MY_USE_AVX_INTRIN
    MY_STORE_SI(&d,MY_ZERO_SI());
#else
    d=MY_ZERO_SI();
#endif
}
when all pointers and the trip count are divisible by 64 without remainder (i.e. aligned malloc with an alignment of sizeof(__m512), 64 bytes), on Haswell with MSVC and /QxHOST (i.e. AVX2), the assembler output reports that the loop was unrolled by 2, even though I forcibly specified 4.
It doesn't matter whether "MY_USE_AVX_INTRIN" is defined to 1 or not defined; it always unrolls by 2.
Or am I missing something?
EDIT: fixed wrong part of code snippet
ping?
I just want to experiment with unrolling by 4, without a manual rewrite, since I know "cnt" will always be greater than 4.
FWIW, I have finally managed it by telling the compiler an assumed lower bound on "cnt":

#ifdef _MSC_VER
# define MY_ASSUME(cond) __assume(cond)
#else
# define MY_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#endif

MY_ASSUME(cnt>=...);
#pragma unroll(4)

and now it unrolls the loop by a factor of 4.
Of course I should mention that I coded a length dispatcher over my standard lengths ("cnt", see above), and unrolling by factors from 2 to 64 gives me a performance boost of circa 20% over Intel's default unroll of 2, for all of my memcpy, fused memcpy (two arrays at once) and memset routines.