Hi,
I am trying to run the following code using the AVX instruction set.
It compiles without any problem but fails when I run it:
./vec_avx.x
"Segmentation fault (core dumped)"
Reviewing the code, the problem is in this instruction:
_mm256_store_ps(&total,acc); //Error
Could someone point out what is wrong?
Thank you
P.S.:
I compile with the following command:
gcc -O3 vec_avx.c -mavx -o vec_avx.x
And the main code is as follows:
DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
int i;
float *total;
__m256 v1, v2, v3, acc;
acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
for (i=0; i<(ARRAY_SZ-8); i+=8){
v1 = _mm256_loadu_ps(a+i);
v2 = _mm256_loadu_ps(b+i);
v3 = _mm256_mul_ps(v1, v2);
acc = _mm256_add_ps(acc, v3);
}
acc = _mm256_hadd_ps(acc,acc);
acc = _mm256_hadd_ps(acc,acc);
_mm256_store_ps(&total,acc); /////////////ERROR///////////////////////
for (; i<ARRAY_SZ; i++)
total += a[i] * b[i];
return total;
}
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Probably float *total is not aligned on a 32-byte boundary. Did you try zero-filling the total array?
I don't see where you allocate memory for "total". Is this omitted from your code snippet?
Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:
_mm256_store_ps(total,acc);
Alternatively, you can use a
__m256 total_m256
and then store your intermediate result to this variable:
_mm256_store_ps((float*)&total_m256,acc);
Kind regards
Thomas
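A minimal sketch of the first suggestion, assuming GCC or Clang on x86-64. The helper name `store_all_lanes` and the `AVX_FN` macro are hypothetical and only there for illustration. The point it tries to show: `_mm256_store_ps` writes 32 bytes and expects a 32-byte-aligned destination, so the target has to be real storage for eight floats, not the address of a pointer variable.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute lets this compile even
   without -mavx; with -mavx it is redundant but harmless. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: stores the vector {1,...,8} into an aligned
   buffer and returns lane k (0..7). */
AVX_FN static float store_all_lanes(int k)
{
    /* 32 bytes of real storage, aligned as _mm256_store_ps expects. */
    _Alignas(32) float buf[8];

    /* _mm256_setr_ps takes its arguments in memory order (lane 0 first). */
    __m256 acc = _mm256_setr_ps(1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f);

    _mm256_store_ps(buf, acc);   /* pass the buffer itself, not &pointer */
    return buf[k];
}
```

If the destination cannot be guaranteed to be 32-byte aligned, `_mm256_storeu_ps` is the unaligned counterpart.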
If you look at the disassembly, you will probably see that the float *total pointer is declared but not initialized. IIRC, initialization could be done by loading &total[0] with the help of LEA REG,ADDR and filling it with 0.0, for example.
This is probably not an alignment issue, since some compilers translate _mm256_store_ps to VMOVUPS, which works for both aligned and unaligned addresses (the intrinsic itself is still documented as requiring a 32-byte-aligned address).
The problem is that your total should be of type "float" (instead of float *, because it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation), and the _mm256_store_ps should be replaced with a scalar store instruction (I don't know if there is one) or something like:
_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);
This is not very efficient, but it saves you from the access violation (I will let you find a more efficient way to write your algorithm).
Best regards
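A small sketch of what the masked store above does, assuming GCC or Clang. The function name `low_lane_via_maskstore` and the `AVX_FN` macro are hypothetical. Only the lane whose mask element has its top bit set is written, so a single `float` is a large enough destination.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Allows compiling without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: loads eight floats and writes only lane 0
   to a scalar via a masked store. */
AVX_FN static float low_lane_via_maskstore(const float *src)
{
    float total = 0.0f;
    __m256 v = _mm256_loadu_ps(src);

    /* _mm256_set_epi32 lists elements from highest to lowest, so the
       final ~0 enables lane 0 and the seven zeros disable lanes 1..7. */
    __m256i mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0);

    /* Masked-off lanes are not written, so nothing beyond the four
       bytes of total is touched. */
    _mm256_maskstore_ps(&total, mask, v);
    return total;
}
```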
emmanuel.attia wrote:
should be replaced with a store scalar instruction (i don't know if there is one)
_mm_store_ss(float* mem_addr, __m128 a) is the one to use, IMO. It takes a __m128, so pass the low half of acc:
float total = 0.0f;
_mm_store_ss(&total, _mm256_castps256_ps128(acc));
Hi Emmanuel. Thank you for your help.
Here is my final algorithm:
DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
int i;
float total;
__m256 v1, v2, v3, acc;
acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
for (i=0; i<(ARRAY_SZ-8); i+=8){
v1 = _mm256_loadu_ps(a+i);
v2 = _mm256_loadu_ps(b+i);
v3 = _mm256_mul_ps(v1, v2);
acc = _mm256_add_ps(acc, v3);
}
acc = _mm256_hadd_ps(acc,acc);
acc = _mm256_hadd_ps(acc,acc);
_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);
for (; i<ARRAY_SZ; i++)
total += a[i] * b[i];
return total;
}
--------------------------------------------------------------------------------------------------------------------------------
I have two further questions:
1)
About the end result I get (1000016.000000):
Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 19.93
Avg Rate GB/s : 18.85
Median Rate GB/s: 18.74
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 1000016.000000
But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value "2" (below is the result when I run my algorithm without vectorization):
Kernel name : inner_prod
Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 7.03
Avg Rate GB/s : 6.68
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 2000000.000000
-------------------------------------------------------------------------------------------------------------------------------------------------
2) About a more efficient way to write:
_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);
Could it use some cast or convert intrinsic instead?
-------------------------------------------------------------------------------------------------------------------------------------------------
Thank you so much
Oh yes, the right solution would be:
_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0));
Thanks for the improvement
Hi Emmanuel,
I compile with:
gcc -O3 vec_avx.c -mavx -o vec_avx.x
but with your instruction I now get the following error:
vec_avx.c: In function ‘inner_prod_vec’:
vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’
My final code is:
DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
int i;
float total;
__m256 v1, v2, v3, acc;
acc = _mm256_setzero_ps();
for (i=0; i<(ARRAY_SZ-8); i+=8){
v1 = _mm256_loadu_ps(a+i);
v2 = _mm256_loadu_ps(b+i);
v3 = _mm256_mul_ps(v1, v2);
acc = _mm256_add_ps(acc, v3);
}
acc = _mm256_hadd_ps(acc,acc);
acc = _mm256_hadd_ps(acc,acc);
_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0)); //////ERROR/////////////////////////////////////////////////////////////////
for (; i<ARRAY_SZ; i++)
total += a[i] * b[i];
return total;
}
Thank you
You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes a __m256 as its first argument instead of a __m256i.
It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.
Best regards
Thomas
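A hedged sketch of the extraction step, assuming GCC or Clang. The helper names `low_lane` and `lane4` and the `AVX_FN` macro are hypothetical. One caveat worth knowing: `_mm_extract_ps` returns the selected element's bit pattern as an `int`, so for lane 0 it is often simpler to use `_mm_cvtss_f32`, which returns a `float` directly.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Allows compiling without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: returns lane 0 of the eight floats at src. */
AVX_FN static float low_lane(const float *src)
{
    __m256 v  = _mm256_loadu_ps(src);
    __m128 lo = _mm256_extractf128_ps(v, 0);  /* lanes 0..3 */
    return _mm_cvtss_f32(lo);                 /* lane 0 as a float */
}

/* Hypothetical helper: returns lane 4, the first lane of the upper half. */
AVX_FN static float lane4(const float *src)
{
    __m256 v  = _mm256_loadu_ps(src);
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* lanes 4..7 */
    return _mm_cvtss_f32(hi);
}
```

For the lower half, `_mm256_castps256_ps128(v)` is an alternative to `_mm256_extractf128_ps(v, 0)`; it only reinterprets the register.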
lex wrote:
But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value "2" (below is the result when I run my algorithm without vectorization):
It looks like there is a missing horizontal add in your code.
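To spell out the missing step: `_mm256_hadd_ps` adds neighbours within each 128-bit half separately, so after two of them lane 0 holds only the sum of lanes 0..3, and the sum of lanes 4..7 sits in lane 4. Assuming each product is 4 and ARRAY_SZ is 500000, the posted loop runs 62499 times, so the lower half contributes 62499 × 4 × 4 = 999984, and the scalar tail adds 8 × 4 = 32, which gives exactly the reported 1000016. Below is a minimal sketch of one way to fold both halves together, assuming GCC or Clang; the function name `inner_prod_avx`, its `n` parameter, and the `AVX_FN` macro are hypothetical and not the original poster's interface.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption: GCC/Clang. Allows compiling without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: dot product of a[0..n) and b[0..n). */
AVX_FN static float inner_prod_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Process eight floats per iteration while a full block remains. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    /* Fold the upper 128-bit half onto the lower one first. */
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);

    /* Two horizontal adds then reduce the four partial sums to one. */
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    float total = _mm_cvtss_f32(sum);

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; i++)
        total += a[i] * b[i];
    return total;
}
```

Unlike the posted loop condition `i<(ARRAY_SZ-8)`, the condition `i + 8 <= n` also vectorizes the last full block of eight, so the scalar tail only handles a genuine remainder.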
Thomas Willhalm (Intel) wrote:
You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes a __m256 as its first argument instead of a __m256i.
It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.
Best regards
Thomas
Yes, sorry about my overly quick answer.
As for extract_ps vs. store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (for instance, it doesn't force you to put a float variable on the stack when you only need it as a return value).