Intel® ISA Extensions

Use pointer to __m256 or use _mm256_load_ps

Nadav_S_
Beginner

Hi

I noticed there are two popular ways of moving data into ymm registers when writing intrinsics. I'll use a simple vector addition example to clarify my question. Assuming a[], b[], c[] are three aligned memory buffers, I would like to do "c[] = a[] + b[]".

First option, use pointers:

__m256* vecAp = (__m256*)a;
__m256* vecBp = (__m256*)b;
__m256* vecCp = (__m256*)c;

for (int i=0; i < ARR_SIZE ; i+=8)
{
    *vecCp  = _mm256_add_ps(*vecAp, *vecBp);
    vecAp++;
    vecBp++;
    vecCp++;
}

Second option, use _mm256_load_ps():

for (int i=0; i < ARR_SIZE ; i+=8)
{
    __m256 vecA = _mm256_load_ps(&a[i]);
    __m256 vecB = _mm256_load_ps(&b[i]);
    __m256 res  = _mm256_add_ps(vecA, vecB);
    _mm256_store_ps(&c[i], res);
}

My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.

Thanks 

SergeyKostrov
Valued Contributor II
My question is: did you use the RDTSC instruction to measure the performance of these two test cases? Here is the pseudo-code:

RaisePriorityToREALTIME
EnterCriticalSection
UseRDTSCtoTimeStart
DoCalculationsVersion1
UseRDTSCtoTimeEnd
LeaveCriticalSection
LowerPriorityToNORMAL
Print( TimeEnd - TimeStart ) InClockCycles

and then the same sequence with DoCalculationsVersion2.

A verification / comparison of the generated assembler code for both versions would also answer your question.
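A minimal sketch of that measurement pattern on Windows with MSVC (DoCalculationsVersion1/2 are hypothetical wrappers around the two loops, and the critical-section step is omitted here):

[cpp]
#include <windows.h>
#include <intrin.h>   // __rdtsc

// hypothetical wrappers around the two loop versions being compared
void DoCalculationsVersion1(void);
void DoCalculationsVersion2(void);

unsigned __int64 TimeInClockCycles( void (*pfnVersion)(void) )
{
    SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS );  // RaisePriorityToREALTIME
    unsigned __int64 uiStart = __rdtsc();                              // UseRDTSCtoTimeStart
    pfnVersion();                                                      // DoCalculationsVersionN
    unsigned __int64 uiEnd = __rdtsc();                                // UseRDTSCtoTimeEnd
    SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS );    // LowerPriorityToNORMAL
    return uiEnd - uiStart;                                            // ( TimeEnd - TimeStart ) in clock cycles
}

// usage:
//   printf( "Version1: %I64u clock cycles\n", TimeInClockCycles( DoCalculationsVersion1 ) );
//   printf( "Version2: %I64u clock cycles\n", TimeInClockCycles( DoCalculationsVersion2 ) );
[/cpp]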
jimdempseyatthecove
Honored Contributor III

In the first example, I suggest you use a restrictive scope {} to ensure that the lifetime of vecAp, vecBp, vecCp is limited to the immediately following for loop. IOW, place { before the declaration of vecAp and place } after the closing } of the for loop. Doing this informs the compiler that the lifetime of vecAp, vecBp, vecCp is limited to this enclosing scope, and thus may give the compiler better opportunities to registerize these pointers. Note that in your specific test case you may have wrapped these statements inside the scope of a timing loop, which already encloses the scope of vecAp, vecBp, vecCp; but when you later use these statements elsewhere, you may have additional code at the same scope level as vecAp, vecBp, vecCp and may potentially reuse those pointers. That reuse may in turn cause the compiler to assume the pointers have a lifetime that exceeds the processing loop and, as a consequence, generate code that performs vecAp++ in memory as opposed to in a register.
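A minimal sketch of that structure, assuming the same a[], b[], c[] buffers and ARR_SIZE from the question (immintrin.h included as usual):

[cpp]
{   // restrictive scope: vecAp, vecBp, vecCp exist only for this loop
    __m256* vecAp = (__m256*)a;
    __m256* vecBp = (__m256*)b;
    __m256* vecCp = (__m256*)c;
    for (int i = 0; i < ARR_SIZE; i += 8)
    {
        *vecCp = _mm256_add_ps(*vecAp, *vecBp);
        vecAp++;
        vecBp++;
        vecCp++;
    }
}   // the pointers go dead here, so the compiler is free to keep them in registers only
[/cpp]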

If you run several tests, I do not think it is material to use a critical section and a priority bump (unless your runtime per iteration is rather long).

Jim Dempsey

bronxzv
New Contributor II

Nadav S. wrote:

My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.

the 2nd option is generally slightly faster because there is a single induction variable instead of 4 (but a good compiler may well replace the 4 increments with a simplified construct). it will not show in your timings (whatever the measurement methodology) in this particular example, since it is clearly cache-bandwidth bound: L1D bound on Sandy Bridge/Ivy Bridge, mostly L2 bound on Haswell. if you have a lot of LLC misses it will be even worse, since you'll end up system-memory bound

the only potential optimization I see in this example is to replace the 256-bit moves with 128-bit moves (I know it's pretty counterintuitive). it will give you a nice speedup (around 10%) on Ivy Bridge and Sandy Bridge for some workset sizes, particularly with a high L1D miss rate but a low LLC miss rate. a sensible choice may be to use 128-bit moves for your AVX code path and 256-bit moves for an alternate, future-proof, AVX2 path, since Haswell handles 256-bit moves much better, including unaligned moves (a sketch of this variant follows the code below)

concerning the notation, I suppose it's mainly a matter of personal taste. I'd go for the second one myself if I were using intrinsics directly; note that instead of the convoluted notation "&a[i]", you can simply write "a+i"

also, as mentioned above by Jim, it's always a good idea to restrict the scope of the pointers to help the compiler with register allocation. all in all, the best option IMO would be along these lines:

[cpp]
{ // as local as possible scope for all variables used in your loops
  const float *va = a, *vb = b;  // always use const where it applies (may help the compiler in more complex examples)
  float *vc = c;
  for (int i=0; i<ARR_SIZE; i+=8) // single induction variable
    _mm256_store_ps(vc+i, _mm256_add_ps(_mm256_load_ps(va+i), _mm256_load_ps(vb+i)));
}
[/cpp]
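For completeness, a rough sketch of the 128-bit-move variant mentioned above, under one possible interpretation: the arithmetic stays 256-bit and only the loads and stores are split into 128-bit halves (whether the compiler actually keeps these as 128-bit moves is worth checking in the generated assembly):

[cpp]
{ // same aligned buffers as above; 128-bit loads/stores, 256-bit vaddps
  const float *va = a, *vb = b;
  float *vc = c;
  for (int i=0; i<ARR_SIZE; i+=8)
  {
    // build each ymm operand from two 128-bit aligned loads
    __m256 vecA = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(va+i)), _mm_load_ps(va+i+4), 1);
    __m256 vecB = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(vb+i)), _mm_load_ps(vb+i+4), 1);
    __m256 res  = _mm256_add_ps(vecA, vecB);
    // store the result as two 128-bit halves
    _mm_store_ps(vc+i,   _mm256_castps256_ps128(res));
    _mm_store_ps(vc+i+4, _mm256_extractf128_ps(res, 1));
  }
}
[/cpp]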

Nadav_S_
Beginner

Sergey Kostrov, Thank you for your reply.

I did not use the RDTSC clock to measure performance. I run my code on large vectors, and the entire function is wrapped in a loop that runs thousands of times. The whole run takes several seconds. If I implement it in plain C, or use IPP, I notice a real difference of several seconds in run time. Yet the two AVX options I stated above take about the same time.

I also liked your advice about looking at the assembler code. I pasted the assembler code below, and to me the two options look very similar even in assembler (both have 6 instructions inside the loop), so I still don't know which of the two options to use.

First option assembler code (using pointers):

                    for (int i=0; i < ARR_SIZE ; i+=8)
000000013F5A1081  sub         rdi,rax
000000013F5A1084  mov         ecx,200h
000000013F5A1089  sub         rbx,rax
000000013F5A108C  nop         dword ptr [rax]
                    {
                        *vecCp  = _mm256_add_ps(*vecAp, *vecBp);
000000013F5A1090  vmovaps     ymm0,ymmword ptr [rbx+rax]
                        vecAp++;
                        vecBp++;
000000013F5A1095  add         rax,20h
000000013F5A1099  dec         rcx
000000013F5A109C  vaddps      ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2  vmovaps     ymmword ptr [rax-20h],ymm1
000000013F5A10A7  jne         wmain+90h (13F5A1090h)
                        vecCp++;
                    }

Second option assembler code (using _mm256_load_ps):

                    for (int i=0; i < ARR_SIZE ; i+=8)
                    {
                        __m256 vecA = _mm256_load_ps(&a[i]);
                        __m256 vecB = _mm256_load_ps(&b[i]);
000000013FA51070  vmovaps     ymm0,ymmword ptr [rbp+rbx]
000000013FA51076  add         rbx,20h
000000013FA5107A  dec         rcx
                        __m256 res  = _mm256_add_ps(vecA,vecB);
000000013FA5107D  vaddps      ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
                        _mm256_store_ps(&c[i],res);
000000013FA51086  vmovaps     ymmword ptr [rbx+rax-20h],ymm0
000000013FA5108C  jne         wmain+70h (13FA51070h)
                    }

Nadav_S_
Beginner

Hi jimdempseyatthecove

Thanks for the advice about restricting the scope of the pointers. What about the two options I stated in my question? Any thoughts on which one is better?

Bernard
Valued Contributor I

>>>I pasted the assembler code below and to me the two options look very similar>>>

Those two vector addition operations, written in a high-level language, can be represented by almost the same code at the machine-code level.

Bernard
Valued Contributor I

>>>I did not use the RDTSC clock to measure performance>>>

Those two assembly loops contain almost the same instructions; the only difference is which general-purpose registers are used to address the arrays' values.

levicki
Valued Contributor I

Let me ask you a question -- how are you going to write more complex code using pointers? Where will you store intermediate values? Your example is too simple to show the differences between writing with pointers and writing with loads/stores (registers).

My advice for you is to try writing a more complex algorithm using both approaches and then evaluate the following:

1. How long did it take you to write each version?
2. Which code has better readability?
3. Is there a difference in the generated assembler?
4. Is there a difference in performance?

After answering those questions you will understand which "style" is better.

SergeyKostrov
Valued Contributor II
Thanks for the assembler codes.

>>...I pasted the assembler code below and to me the two options look very similar even in assembler (both have 6 instructions inside the loop), so I still don't know which of the two options to use...

If both versions are fast and there are no differences in performance (let's say not greater than 0.5%; also, you did not provide any numbers), then use the version you like!

PS: I like the 2nd version.
jimdempseyatthecove
Honored Contributor III

Looking at the generated code, you can find some subtle differences that lead to different loop sizes.

1st)

000000013F5A1090  vmovaps     ymm0,ymmword ptr [rbx+rax]
                        vecAp++;  
                                vecBp++;
 000000013F5A1095  add         rax,20h

versus 2nd)

000000013FA51070  vmovaps     ymm0,ymmword ptr [rbp+rbx]
 000000013FA51076  add         rbx,20h
 

Ignore the assembler comments for the ++ of the two pointers; instead look at the byte address of the add instruction. In the first case it is +5 bytes from the start of the loop at ...90; in the second case it is +6 bytes from the start of the loop at ...70. Apparently using rbp as the base register costs an extra encoding byte (rbp cannot be encoded as a base without a displacement).

Next look at the vaddps

000000013F5A109C  vaddps      ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2  vmovaps     ymmword ptr [rax-20h],ymm1
versus
000000013FA5107D  vaddps      ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
000000013FA51086  vmovaps     ymmword ptr [rbx+rax-20h],ymm0

Note that the displacement in the first case is -20h, which fits in a disp8 (one byte), making the vaddps 6 bytes (0A2h - 09Ch).
The displacement in the second case is 3FE0h, which requires a disp32 (4 bytes), making the vaddps 9 bytes (86h - 7Dh).

The use of the (registerized) pointers permitted shorter instruction encodings.

Jim Dempsey
