Hi
I noticed there are two popular ways to write intrinsics that move data into ymm registers. I'll use a simple vector addition example to clarify my question. Assuming a[], b[], c[] are three aligned memory buffers, I would like to compute "c[] = a[] + b[]".
First option, use pointers:
__m256* vecAp = (__m256*)a;
__m256* vecBp = (__m256*)b;
__m256* vecCp = (__m256*)c;
for (int i = 0; i < ARR_SIZE; i += 8)
{
    *vecCp = _mm256_add_ps(*vecAp, *vecBp);
    vecAp++;
    vecBp++;
    vecCp++;
}
Second option, use _mm256_load_ps():
for (int i = 0; i < ARR_SIZE; i += 8)
{
    __m256 vecA = _mm256_load_ps(&a[i]);
    __m256 vecB = _mm256_load_ps(&b[i]);
    __m256 res = _mm256_add_ps(vecA, vecB);
    _mm256_store_ps(&c[i], res);
}
My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.
Thanks
In the first example, I suggest you use a restrictive scope {} to ensure that the scope of vecAp, vecBp, and vecCp is limited to the immediately following for loop. In other words, place { before the declaration of vecAp and place } after the closing } of the for loop. Doing this informs the compiler that the lifetime of vecAp, vecBp, and vecCp is limited to this enclosing scope, which may give the compiler better opportunities to keep these pointers in registers. Note that in your specific test case you may have wrapped these statements inside the scope of a timing loop, which already encloses vecAp, vecBp, and vecCp. But when you later reuse these statements, you may have additional code at the same scope level as vecAp, vecBp, and vecCp, and may potentially reuse these pointers. That reuse can cause the compiler to assume the pointers have a lifetime that exceeds the processing for loop and, as a consequence, to generate code that performs the vecAp++ in memory as opposed to a register.
If you run several tests, I do not think it is material to use a critical section and a priority bump (unless your runtime per iteration is rather long).
Jim Dempsey
Nadav S. wrote:
> My question is: which of the above options is better? They both seem to compile and work, and in this simple example they give similar performance.
The 2nd option is generally slightly faster because there is a single induction variable instead of four (though a good compiler may well replace the four increments with a simplified construct). It will not show in your timings, whatever the measurement methodology, in this particular example, since it is clearly cache bandwidth bound: L1D bound on Sandy Bridge/Ivy Bridge, mostly L2 bound on Haswell. If you have a lot of LLC misses it will be even worse, since you will end up system memory bound.
The only potential optimization I see in this example is to replace the 256-bit moves with 128-bit moves (I know it's pretty counterintuitive). It will give you a nice speedup (around 10%) on Ivy Bridge and Sandy Bridge for some workset sizes, particularly with a high L1D miss rate but a low LLC miss rate. A sensible choice may be to use 128-bit moves for your AVX code path and 256-bit moves for an alternate, future-proof AVX2 path, since Haswell handles 256-bit moves much better, including unaligned moves.
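For concreteness, a sketch of what that 128-bit code path could look like (buffer definitions, ARR_SIZE, and the function name are my assumptions; whether it actually wins ~10% depends on the workset size and microarchitecture, as described above):

```cpp
#include <immintrin.h>

// Hypothetical buffers standing in for the poster's a[], b[], c[].
constexpr int ARR_SIZE = 1024;
alignas(32) float a[ARR_SIZE], b[ARR_SIZE], c[ARR_SIZE];

// Same arithmetic as the 256-bit loop, but every load and store
// touches only 16 bytes, stepping 4 floats per iteration.
void add_arrays_128()
{
    for (int i = 0; i < ARR_SIZE; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
    }
}
```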
Concerning the notation, I suppose it's mainly a matter of personal taste; I'd go for the second one myself if I were using intrinsics directly. Note that instead of the convoluted notation "&a[i]", you can simply write "a+i".
Also, as Jim mentioned above, it's always a good idea to restrict the scope of the pointers to help the compiler with register allocation. All in all, the best option IMO would be along these lines:
[cpp]
{ // as local as possible scope for all variables used in your loops
const float *va = a, *vb = b; // always use const where it applies (may help the compiler in more complex examples)
float *vc = c;
for (int i=0; i<ARR_SIZE; i+=8) // single induction variable
_mm256_store_ps(vc+i,_mm256_add_ps(_mm256_load_ps(va+i),_mm256_load_ps(vb+i)));
}
[/cpp]
Sergey Kostrov, thank you for your reply.
I did not use the RDTSC clock to measure performance. I run my code on large vectors, and the entire function is wrapped in a loop that runs thousands of times, so the entire thing takes several seconds. If I implement it in plain C, or use IPP, I notice a real difference of several seconds in run time. Yet the two AVX options I stated above take about the same time.
I also liked your advice about looking at the assembler code. I pasted the assembler code below, and to me the two options look very similar even in assembler (both have six instructions inside the loop), so I still don't know which of the two options to use.
First option assembler code (using pointers):
for (int i=0; i < ARR_SIZE ; i+=8)
000000013F5A1081 sub rdi,rax
000000013F5A1084 mov ecx,200h
000000013F5A1089 sub rbx,rax
000000013F5A108C nop dword ptr [rax]
{
*vecCp = _mm256_add_ps(*vecAp, *vecBp);
000000013F5A1090 vmovaps ymm0,ymmword ptr [rbx+rax]
vecAp++;
vecBp++;
000000013F5A1095 add rax,20h
000000013F5A1099 dec rcx
000000013F5A109C vaddps ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2 vmovaps ymmword ptr [rax-20h],ymm1
000000013F5A10A7 jne wmain+90h (13F5A1090h)
vecCp++;
}
Second option assembler code (using _mm256_load_ps):
for (int i=0; i < ARR_SIZE ; i+=8)
{
__m256 vecA = _mm256_load_ps(&a[i]);
__m256 vecB = _mm256_load_ps(&b[i]);
000000013FA51070 vmovaps ymm0,ymmword ptr [rbp+rbx]
000000013FA51076 add rbx,20h
000000013FA5107A dec rcx
__m256 res = _mm256_add_ps(vecA,vecB);
000000013FA5107D vaddps ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
_mm256_store_ps(&c[i],res);
000000013FA51086 vmovaps ymmword ptr [rbx+rax-20h],ymm0
000000013FA5108C jne wmain+70h (13FA51070h)
}
Hi jimdempseyatthecove
Thanks for the advice about declaring the pointers inside the loop. What about the two options I stated in my question? Any thoughts about which one is better?
>>>I pasted the assembler code below and to me the two options look very similar>>>
Those two vector addition operations, written in a high-level language, can be represented by almost the same code at the machine level.
>>>I did not use the RDTSC clock to measure performance>>>
Those two assembly loops contain almost the same instructions; the only difference is in which general purpose registers are used to address the arrays' values.
Let me ask you a question -- how are you going to write more complex code using pointers? Where will you store intermediate values? Your example is too simple to show the differences between writing with pointers and writing with load/store (registers).
My advice for you is to try writing a more complex algorithm using both approaches and then evaluate the following:
1. How long did it take you to write each version?
2. Which code has better readability?
3. Is there a difference in the generated assembler?
4. Is there a difference in performance?
After answering those questions you will understand which "style" is better.
In looking at the generated code, you find some subtle differences that lead to different loop sizes.
1st)
000000013F5A1090 vmovaps ymm0,ymmword ptr [rbx+rax]
vecAp++;
vecBp++;
000000013F5A1095 add rax,20h
versus the 2nd)
000000013FA51070 vmovaps ymm0,ymmword ptr [rbp+rbx]
000000013FA51076 add rbx,20h
Ignore the assembler comment for the ++ of the two pointers; instead look at the byte address of the add instruction. In the first case it is +5 bytes from the start of the loop (...90); in the second case it is +6 bytes from the start of the loop (...70). Using rbp as a base register costs an extra encoding byte.
Next look at the vaddps
000000013F5A109C vaddps ymm1,ymm0,ymmword ptr [rdi+rax-20h]
000000013F5A10A2 vmovaps ymmword ptr [rax-20h],ymm1
versus
000000013FA5107D vaddps ymm0,ymm0,ymmword ptr [rbp+rbx+3FE0h]
000000013FA51086 vmovaps ymmword ptr [rbx+rax-20h],ymm0
Note that the displacement in the first case is -20h, which fits in disp8 (one byte), making the vaddps 6 bytes.
The displacement in the second case is 3FE0h, which requires disp32 (four bytes), making the vaddps 9 bytes.
The use of the (registerized) pointers permitted the use of shorter byte length instructions.
Jim Dempsey