Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

_mm_xxx intrinsics and optimization

jimdempseyatthecove
Honored Contributor III
The other day I was experimenting with SSE intrinsics and came across an issue that warrants mentioning. I have not submitted an issue with Premier Support as it has little impact on me, but this may be of interest to the other readers (including Intel) of this forum.

One of the techniques used in hand-optimizing an application is to manipulate the scheduling of memory accesses, reads as well as writes. To manage that scheduling, the programmer re-orders instruction sequences; in this case, by re-ordering the _mm_xxx intrinsic sequences in the source code.

The problem is that, with compiler optimizations enabled, the sequence of the _mm_xxx intrinsics is re-arranged from that in the source code. When the programmer is not trying to schedule memory references, this re-arrangement is generally a good thing. But when the programmer is deliberately scheduling the sequence of memory references, the optimizer interferes with the programmer's declared sequence.

The correction for this behavior is NOT to disable optimizations (e.g. with #pragma...). Although disabling optimization fixes the instruction sequencing, it also disables the registerization of the integer indexes used in the address calculations.

I think the proper way to handle this is to have an option and a #pragma that specify that you wish to maintain the source-code sequence of the _mm_xxx intrinsics while optimizing everything else.
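
To make the issue concrete, here is a minimal sketch (not my actual code; the array names are placeholders, and 16-byte-aligned data with a trip count that is a multiple of 2 or 4 is assumed) of the kind of source-level re-ordering I mean. Both variants compute the same partial dot product, and with optimization enabled the compiler is free to emit the same instruction schedule for either:

[cpp]
#include <emmintrin.h>   // SSE2: __m128d, _mm_load_pd, _mm_mul_pd, _mm_add_pd

// Variant A: each load is written immediately before the multiply that uses it.
__m128d dot_adjacent(const double* a, const double* b, int n, __m128d acc)
{
	for (int i = 0; i < n; i += 2)
	{
		__m128d va = _mm_load_pd(&a[i]);
		__m128d vb = _mm_load_pd(&b[i]);
		acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
	}
	return acc;
}

// Variant B: the loads are written ahead of the arithmetic, i.e. the programmer
// has "scheduled" the memory reads earlier in the source. Nothing obliges the
// optimizer to preserve that ordering in the generated code.
__m128d dot_hoisted(const double* a, const double* b, int n, __m128d acc)
{
	for (int i = 0; i < n; i += 4)
	{
		__m128d va0 = _mm_load_pd(&a[i]);
		__m128d va1 = _mm_load_pd(&a[i + 2]);
		__m128d vb0 = _mm_load_pd(&b[i]);
		__m128d vb1 = _mm_load_pd(&b[i + 2]);
		acc = _mm_add_pd(acc, _mm_mul_pd(va0, vb0));
		acc = _mm_add_pd(acc, _mm_mul_pd(va1, vb1));
	}
	return acc;
}
[/cpp]

Whether the compiler keeps Variant B's load placement is exactly the issue described above.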

Jim Dempsey
3 Replies
Om_S_Intel
Employee

What kind of performance degradation do you see here? It would be nice if you could share the test case.

Om

jimdempseyatthecove
Honored Contributor III
Om,

I cannot measure the performance degradation because I cannot compile the code with optimizations enabled (for the integer/address optimizations) without having the compiler re-order my memory references. While I could produce an ASM source file and then fix the memory ordering by hand, that becomes problematic (error prone).
See the *** annotations in the ASM listing.

With optimizations enabled, the compiler thwarts my efforts at pre-buffering memory fetches (there are sufficient xmm registers to buffer this way). My intention was that once this works with prefetch into xmm registers, I can hand-tweak the optimization by moving the _mm_mul_pd closer to the loads (IOW, interleave the memory fetching with the multiplication).

[bash]	for(; i < fourWayLoopEnd; i += 4)
	{
		// do all of v1 buffers first followed by v2 buffers
		// in the event that v1 and v2 align at cache false sharing
		// location we will at least attain L1 hit on 2nd, 3rd and 4th
		// doublet for each of v1 and v2.
		_v1_b0 = _v1[i];
		_v2_b0 = _v2[i];
		_v1_b1 = _v1[i+1];
		_v2_b1 = _v2[i+1];
		_v1_b2 = _v1[i+2];
		_v2_b2 = _v2[i+2];
		_v1_b3 = _v1[i+3];
		_v2_b3 = _v2[i+3];
		_v1_b0 = _mm_mul_pd(_v1_b0, _v2_b0);
		_v1_b1 = _mm_mul_pd(_v1_b1, _v2_b1);
		_v1_b2 = _mm_mul_pd(_v1_b2, _v2_b2);
		_v1_b3 = _mm_mul_pd(_v1_b3, _v2_b3);
		_temp_0 = _mm_add_pd(_temp_0, _v1_b0);
		_temp_1 = _mm_add_pd(_temp_1, _v1_b1);
		_temp_2 = _mm_add_pd(_temp_2, _v1_b2);
		_temp_3 = _mm_add_pd(_temp_3, _v1_b3);
	}
================================================
;;; 	for(; i < fourWayLoopEnd; i += 4)

        jle       .B66.5        ; Prob 10%                      ;170.2
$LN13831:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B66.2::                        ; Preds .B66.1
        movaps    XMMWORD PTR [48+rsp], xmm6                    ;
        movaps    XMMWORD PTR [32+rsp], xmm7                    ;
$LN13832:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B66.3::                        ; Preds .B66.3 .B66.2
$LN13833:

;;; 	{
;;; 		// do all of v1 buffers first followed by v2 buffers
;;; 		// in the event that v1 and v2 align at cache false sharing
;;; 		// location we will at least attain L1 hit on 2nd, 3rd and 4th
;;; 		// doublet for each of v1 and v2.
;;; 		_v1_b0 = _v1[i];

        movaps    xmm4, XMMWORD PTR [r10+rcx]                   ;176.3
$LN13834:
        add       rax, 4                                        ;170.28
$LN13835:

;;; 		_v2_b0 = _v2[i];
*** _v2[i] not copied to xmm register for _v2_b0
;;; 		_v1_b1 = _v1[i+1];

        movaps    xmm5, XMMWORD PTR [16+r10+rcx]                ;178.3
$LN13836:

;;; 		_v2_b1 = _v2[i+1];
*** _v2[i+1] not copied to xmm register for _v2_b1
;;; 		_v1_b2 = _v1[i+2];

        movaps    xmm6, XMMWORD PTR [32+r10+rcx]                ;180.3
$LN13837:

;;; 		_v2_b2 = _v2[i+2];
*** _v2[i+2] not copied to xmm register for _v2_b2
;;; 		_v1_b3 = _v1[i+3];

        movaps    xmm7, XMMWORD PTR [48+r10+rcx]                ;182.3
$LN13838:

;;; 		_v2_b3 = _v2[i+3];
*** _v2[i+3] not copied to xmm register for _v2_b3
;;; 		_v1_b0 = _mm_mul_pd(_v1_b0, _v2_b0);

        mulpd     xmm4, XMMWORD PTR [r10+rdx]                   ;184.3
*** Using XMMWORD PTR [r10+rdx] (_v2[i]) instead of intended buffered _v2_b0
$LN13839:

;;; 		_v1_b1 = _mm_mul_pd(_v1_b1, _v2_b1);

        mulpd     xmm5, XMMWORD PTR [16+r10+rdx]                ;185.3
*** Using XMMWORD PTR [16+r10+rdx] (_v2[i+1]) instead of intended buffered _v2_b1 
$LN13840:

;;; 		_v1_b2 = _mm_mul_pd(_v1_b2, _v2_b2);

        mulpd     xmm6, XMMWORD PTR [32+r10+rdx]                ;186.3
*** Using XMMWORD PTR [32+r10+rdx] (_v2[i+2]) instead of intended buffered _v2_b2 
$LN13841:

;;; 		_v1_b3 = _mm_mul_pd(_v1_b3, _v2_b3);

        mulpd     xmm7, XMMWORD PTR [48+r10+rdx]                ;187.3
*** Using XMMWORD PTR [48+r10+rdx] (_v2[i+3]) instead of intended buffered _v2_b3
$LN13842:

;;; 		_temp_0 = _mm_add_pd(_temp_0, _v1_b0);

        addpd     xmm0, xmm4                                    ;188.3
$LN13843:

;;; 		_temp_1 = _mm_add_pd(_temp_1, _v1_b1);

        addpd     xmm2, xmm5                                    ;189.3
$LN13844:

;;; 		_temp_2 = _mm_add_pd(_temp_2, _v1_b2);

        addpd     xmm3, xmm6                                    ;190.3
$LN13845:

;;; 		_temp_3 = _mm_add_pd(_temp_3, _v1_b3);

        addpd     xmm1, xmm7                                    ;191.3
$LN13846:
        add       r10, 64                                       ;170.28
$LN13847:
        cmp       rax, r11                                      ;170.2
$LN13848:
        jl        .B66.3        ; Prob 82%                      ;170.2
$LN13849:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B66.4::                        ; Preds .B66.3
        movaps    xmm6, XMMWORD PTR [48+rsp]                    ;
        movaps    xmm7, XMMWORD PTR [32+rsp]                    ;
$LN13850:
                                ; LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B66.5::                        ; Preds .B66.4 .B66.1
$LN13851:

;;; 	}
[/bash]
jimdempseyatthecove
Honored Contributor III
Note, the comments just inside the for scope do not jibe with the preloads. I interleaved these for you (Om) to observe the symptom. The code I currently use has
[bash]		_v1_b0 = _v1[i];
		_v1_b1 = _v1[i+1];
		_v1_b2 = _v1[i+2];
		_v1_b3 = _v1[i+3];
		_v2_b0 = _v2[i];
		_v2_b1 = _v2[i+1];
		_v2_b2 = _v2[i+2];
		_v2_b3 = _v2[i+3];
[/bash]

As commented, the compiler is not buffering the _v2_b... values as intended.

The intention is to interleave the memory fetches with the multiplication (of previously fetched data).
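
Roughly the interleaving I have in mind (an untested sketch using the same variable names as above, assuming _v1 and _v2 are 16-byte-aligned __m128d arrays and fourWayLoopEnd is a multiple of 4): each multiply consumes a doublet loaded earlier in the body, so the later fetches can overlap the earlier arithmetic.

[cpp]
#include <emmintrin.h>

__m128d dot4(const __m128d* _v1, const __m128d* _v2, int fourWayLoopEnd)
{
	__m128d _temp_0 = _mm_setzero_pd();
	__m128d _temp_1 = _mm_setzero_pd();
	__m128d _temp_2 = _mm_setzero_pd();
	__m128d _temp_3 = _mm_setzero_pd();
	for (int i = 0; i < fourWayLoopEnd; i += 4)
	{
		__m128d _v1_b0 = _v1[i];
		__m128d _v2_b0 = _v2[i];
		__m128d _v1_b1 = _v1[i+1];
		__m128d _v2_b1 = _v2[i+1];
		_v1_b0 = _mm_mul_pd(_v1_b0, _v2_b0);   // multiply doublet 0 while doublet 2 is fetched
		__m128d _v1_b2 = _v1[i+2];
		__m128d _v2_b2 = _v2[i+2];
		_v1_b1 = _mm_mul_pd(_v1_b1, _v2_b1);   // multiply doublet 1 while doublet 3 is fetched
		__m128d _v1_b3 = _v1[i+3];
		__m128d _v2_b3 = _v2[i+3];
		_v1_b2 = _mm_mul_pd(_v1_b2, _v2_b2);
		_v1_b3 = _mm_mul_pd(_v1_b3, _v2_b3);
		_temp_0 = _mm_add_pd(_temp_0, _v1_b0);
		_temp_1 = _mm_add_pd(_temp_1, _v1_b1);
		_temp_2 = _mm_add_pd(_temp_2, _v1_b2);
		_temp_3 = _mm_add_pd(_temp_3, _v1_b3);
	}
	return _mm_add_pd(_mm_add_pd(_temp_0, _temp_1), _mm_add_pd(_temp_2, _temp_3));
}
[/cpp]

The open question is whether the optimizer will keep that placement of the loads and multiplies once full optimization is turned on.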

Jim
