Hi
I noticed there are two popular ways to write intrinsics that move data into ymm registers. I'll use a simple vector addition example to clarify my question. Assuming a[], b[], c[] are three aligned memory buffers, I would like to compute "c[] = a[] + b[]".
First option, use pointers:
__m256* vecAp = (__m256*)a;
__m256* vecBp = (__m256*)b;
__m256* vecCp = (__m256*)c;
for (int i = 0; i < ARR_SIZE; i += 8)
{
    *vecCp = _mm256_add_ps(*vecAp, *vecBp);
    vecAp++;
    vecBp++;
    vecCp++;
}
Second option, use _mm256_load_ps():
for (int i = 0; i < ARR_SIZE; i += 8)
{
    __m256 vecA = _mm256_load_ps(&a[i]);
    __m256 vecB = _mm256_load_ps(&b[i]);
    __m256 res = _mm256_add_ps(vecA, vecB);
    _mm256_store_ps(&c[i], res);
}
My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.
Thanks
In the first example, I suggest you use a restrictive scope {} to ensure that the scope of vecAp, vecBp, and vecCp is limited to the immediately following for loop. In other words, place { before the declaration of vecAp and place } after the closing } of the for loop. Doing this informs the compiler that the lifetime of vecAp, vecBp, and vecCp is limited to this enclosing scope, which may give the compiler better opportunities to keep these pointers in registers. Note that in your specific test case you may have wrapped these statements inside the scope of a timing loop, which already encloses vecAp, vecBp, and vecCp. But when you later reuse these statements, you may have additional code at the same scope level as vecAp, vecBp, and vecCp, and may potentially reuse these pointers. That reuse can cause the compiler to assume the pointers have a lifetime that exceeds the processing for loop and, as a consequence, to generate code that performs the vecAp++ in memory as opposed to a register.
If you run several tests, I do not think it is material to use a critical section and a priority bump (unless your runtime per iteration is rather long).
Jim Dempsey
Nadav S. wrote:
> My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.
The 2nd option is generally slightly faster because there is a single induction variable instead of four (though a good compiler may well replace the four increments with a simplified construct). It will not show in your timings, whatever the measurement methodology, in this particular example, since it is clearly cache bandwidth bound: L1D bound on Sandy Bridge/Ivy Bridge, mostly L2 bound on Haswell. If you have a lot of LLC misses it will be even worse, since you will end up system memory bound.
The only potential optimization I see in this example is to replace the 256-bit moves with 128-bit moves (I know it's pretty counterintuitive). It will give you a nice speedup (around 10%) on Ivy Bridge and Sandy Bridge for some workset sizes, particularly with a high L1D miss rate but a low LLC miss rate. A sensible choice may be to use 128-bit moves for your AVX code path and 256-bit moves for an alternate, future-proof AVX2 path, since Haswell handles 256-bit moves much better, including unaligned moves.
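For concreteness, a sketch of what that 128-bit code path could look like (buffer definitions, ARR_SIZE, and the function name are my assumptions; whether it actually wins ~10% depends on the workset size and microarchitecture, as described above):

```cpp
#include <immintrin.h>

// Hypothetical buffers standing in for the poster's a[], b[], c[].
constexpr int ARR_SIZE = 1024;
alignas(32) float a[ARR_SIZE], b[ARR_SIZE], c[ARR_SIZE];

// Same arithmetic as the 256-bit loop, but every load and store
// touches only 16 bytes, stepping 4 floats per iteration.
void add_arrays_128()
{
    for (int i = 0; i < ARR_SIZE; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
}
```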
Concerning the notation, I suppose it's mainly a matter of personal taste; I'd go for the second one myself if I were using intrinsics directly. Note that instead of the convoluted notation "&a[i]", you can simply write "a+i".
Also, as Jim mentioned above, it's always a good idea to restrict the scope of the pointers to help the compiler with register allocation. All in all, the best option IMO would be along these lines:
[cpp]
{ // as local as possible scope for all variables used in your loops
const float *va = a, *vb = b; // always use const where it applies (may help the compiler in more complex examples)
float *vc = c;
for (int i=0; i<ARR_SIZE; i+=8) // single induction variable
_mm256_store_ps(vc+i,_mm256_add_ps(_mm256_load_ps(va+i),_mm256_load_ps(vb+i)));
}
[/cpp]
Sergey Kostrov, thank you for your reply.
I did not use the RDTSC clock to measure performance. I run my code on large vectors, and the entire function is wrapped in a loop that runs thousands of times, so the entire thing takes several seconds. If I implement it in plain C, or use IPP, I notice a real difference of several seconds in run time. Yet the two AVX options I stated above take about the same time.
I also liked your advice about looking at the assembler code. I pasted the assembler code below, and to me the two options look very similar even in assembler (both have six instructions inside the loop), so I still don't know which of the two options to use.
First option assembler code (using pointers):
for (int i=0; i < ARR_SIZE ; i+=8)
000000013F5A1081 sub rdi,rax
000000013F5A1084 mov ecx,200h
000000013F5A1089 sub rbx,rax
000000013F5A108C nop dword ptr [rax]
{
*vecCp = _mm256_add_ps(*vecAp, *vecBp);
000000013F5A1090 vmovaps ymm0,ymmword ptr [rbx+rax]
vecAp++;
vecBp++;
000000013F5A1095 add rax,20h
000000013F5A1099 dec rcx
000000013F5A109C vaddps ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2 vmovaps ymmword ptr [rax-20h],ymm1
000000013F5A10A7 jne wmain+90h (13F5A1090h)
vecCp++;
}
Second option assembler code (using _mm256_load_ps):
for (int i=0; i < ARR_SIZE ; i+=8)
{
__m256 vecA = _mm256_load_ps(&a[i]);
__m256 vecB = _mm256_load_ps(&b[i]);
000000013FA51070 vmovaps ymm0,ymmword ptr [rbp+rbx]
000000013FA51076 add rbx,20h
000000013FA5107A dec rcx
__m256 res = _mm256_add_ps(vecA,vecB);
000000013FA5107D vaddps ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
_mm256_store_ps(&c[i],res);
000000013FA51086 vmovaps ymmword ptr [rbx+rax-20h],ymm0
000000013FA5108C jne wmain+70h (13FA51070h)
}
Hi jimdempseyatthecove
Thanks for the advice about declaring the pointers inside the loop. What about the two options I stated in my question? Any thoughts about which one is better?
>>>I pasted the assembler code below and to me the two options look very similar>>>
Those two vector addition operations, written in a high-level language, can be represented by almost the same code at the machine level.
>>>I did not use the RDTSC clock to measure performance>>>
Those two assembly loops contain almost the same instructions; the only difference is in which general purpose registers are used to address the arrays' values.
Let me ask you a question -- how are you going to write more complex code using pointers? Where will you store intermediate values? Your example is too simple to show the differences between writing with pointers and writing with load/store (registers).
My advice for you is to try writing a more complex algorithm using both approaches and then evaluate the following:
1. How long did it take you to write each version?
2. Which code has better readability?
3. Is there a difference in the generated assembler?
4. Is there a difference in performance?
After answering those questions you will understand which "style" is better.
In looking at the generated code, you find some subtle differences that lead to different loop sizes.
1st)
000000013F5A1090 vmovaps ymm0,ymmword ptr [rbx+rax]
vecAp++;
vecBp++;
000000013F5A1095 add rax,20h
versus the 2nd)
000000013FA51070 vmovaps ymm0,ymmword ptr [rbp+rbx]
000000013FA51076 add rbx,20h
Ignore the assembler comment for the ++ of the two pointers; instead look at the byte address of the add instruction. In the first case it is +5 bytes from the start of the loop (...90); in the second case it is +6 bytes from the start of the loop (...70). Using rbp as a base register costs an extra encoding byte.
Next look at the vaddps
000000013F5A109C vaddps ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2 vmovaps ymmword ptr [rax-20h],ymm1
versus
000000013FA5107D vaddps ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
000000013FA51086 vmovaps ymmword ptr [rbx+rax-20h],ymm0
Note that the displacement in the first case is -20h, which fits in disp8 (one byte), making the vaddps 6 bytes.
The displacement in the second case is 3FE0h, which requires disp32 (four bytes), making the vaddps 9 bytes.
The use of the (registerized) pointers permitted the use of shorter byte length instructions.
Jim Dempsey