AVX _mm256_store_ps

Anonymous18 · 06-01-2014

Hi

I want to run the following code using the AVX instruction set.

It compiles without any problem, but it generates an error when I run it:

./vec_avx.x

"Segmentation fault (core dumped)"

Reviewing the code, the problem is in this instruction:

_mm256_store_ps(&total,acc); //Error

Could someone point out what the problem might be?

Thank you

P.S.:

I compile with the following command:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

And the main code is as follows:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float *total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);

    _mm256_store_ps(&total,acc); /////////////ERROR///////////////////////

    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];
    return total;
}

Bernard · 06-01-2014

Probably float *total is not aligned on a 32-byte boundary. Did you try zero-filling the total array?

Thomas_W_Intel · 06-02-2014

I don't see where you allocate memory for "total". Is this omitted from your code snippet?

Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:

_mm256_store_ps(total,acc);

Alternatively, you can use a

__m256 total_m256

and then store your intermediate result to this variable:

_mm256_store_ps((float*)&total_m256,acc);

Kind regards

Thomas
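
To illustrate the point that the destination must be real storage for eight floats, here is a minimal sketch that is not from the thread. It stores a vector into a local 32-byte-aligned buffer and then sums the eight lanes in scalar code. The helper name sum_lanes_via_buffer and the AVX_FN macro are assumptions made for illustration; the macro wraps a GCC/Clang target attribute so the sketch builds without -mavx, and with an older compiler such as the GCC 4.4.7 seen later in the thread it is presumably simpler to drop it and compile with -mavx. Note that _mm256_store_ps is documented as requiring a 32-byte-aligned address, while _mm256_storeu_ps has no such requirement.

```c
#include <immintrin.h>

/* Assumption of this sketch: GCC/Clang extension that enables AVX
   intrinsics inside the function even without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: loads 8 floats from p and returns their sum. */
AVX_FN float sum_lanes_via_buffer(const float *p)
{
    /* Real storage for all 8 lanes, aligned as _mm256_store_ps requires. */
    _Alignas(32) float buf[8];
    __m256 acc = _mm256_loadu_ps(p);   /* unaligned load is fine for p */
    float total = 0.0f;
    int k;

    _mm256_store_ps(buf, acc);         /* writes 32 bytes into buf */
    for (k = 0; k < 8; k++)
        total += buf[k];               /* scalar reduction over the lanes */
    return total;
}
```

Summing through a buffer is not the fastest reduction, but it makes the memory requirement explicit: the store always writes 32 bytes, so the destination has to be at least that large.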

Bernard · 06-02-2014

If you look at the disassembly, the float *total pointer will probably be declared but not initialized. IIRC, initialization could be done by loading &total[0] with the help of LEA REG,ADDR and filling it with 0.0, for example.

emmanuel_attia · 06-02-2014

This is probably not an alignment issue, since _mm256_store_ps is probably translated to VMOVUPS, which works for both aligned and unaligned addresses.

The problem is that your total should be of type "float" (instead of float *, because it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation), and the _mm256_store_ps should be replaced with a store scalar instruction (I don't know if there is one) or something like:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

This is not very efficient, but it saves you from the access violation (I'll let you find a more efficient way to write your algorithm).

Best regards

bronxzv · 06-02-2014

emmanuel.attia wrote:
should be replaced with a store scalar instruction (I don't know if there is one)

_mm_store_ss (float* mem_addr, __m128 a) is the one to use IMO

float total = 0.0f; _mm_store_ss(&total,acc);
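
A note on types: with the prototype quoted above, _mm_store_ss takes a __m128, so passing the __m256 accumulator directly is a type mismatch. A minimal sketch of one way around this, assuming the helper name low_lane_store_ss and the AVX_FN macro (a GCC/Clang target attribute used so the sketch builds without -mavx), is to reinterpret the low 128 bits with _mm256_castps256_ps128 first:

```c
#include <immintrin.h>

/* Assumption of this sketch: GCC/Clang extension that enables AVX
   intrinsics inside the function even without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: loads 8 floats from p and returns lane 0 only. */
AVX_FN float low_lane_store_ss(const float *p)
{
    __m256 acc = _mm256_loadu_ps(p);
    float total = 0.0f;

    /* The cast only reinterprets the low 128 bits as a __m128;
       _mm_store_ss then writes a single float (4 bytes) to &total. */
    _mm_store_ss(&total, _mm256_castps256_ps128(acc));
    return total;
}
```

Only lane 0 is written, so a plain float on the stack is a sufficient destination.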

Anonymous18 · 06-02-2014

Hi Emmanuel. Thank you for your help.

Here is my final algorithm:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);

    _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];

    return total;
}

--------------------------------------------------------------------------------------------------------------------------------

I have two further questions:

1)

About the end result I show (1000016.000000):

Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 19.93
Avg Rate GB/s : 18.85
Median Rate GB/s: 18.74
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 1000016.000000

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value 2 (here is the result when I run my algorithm without vectorization):

Kernel name : inner_prod
Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 7.03
Avg Rate GB/s : 6.68
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 2000000.000000

-------------------------------------------------------------------------------------------------------------------------------------------------

2) About a more efficient way to do:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

Could it use some kind of cast or convert intrinsic instead?

-------------------------------------------------------------------------------------------------------------------------------------------------

Thank you so much

emmanuel_attia · 06-02-2014

Oh yes, the right solution would be:

_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0));

Thanks for the improvement

Anonymous18 · 06-02-2014

Hi Emmanuel,

I compile with:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

but with your instruction I now get the following error:

vec_avx.c: In function ‘inner_prod_vec’:
vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’

My final code is:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps();
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);
    _mm_store_ss(&total, _mm256_extractf128_si256(acc, 0)); // ERROR
    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];
    return total;
}

Thank you

Thomas_W_Intel · 06-02-2014

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

Best regards

Thomas

bronxzv · 06-04-2014

lex wrote:

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value 2 (here is the result when I run my algorithm without vectorization):

It looks like there is a missing horizontal add in your code.
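
The reported number is consistent with this diagnosis. _mm256_hadd_ps adds within each 128-bit half independently, so after two calls element 0 holds only the sum of the lower four lanes. With 500000 elements whose products are 4, the vector loop in the posted code covers 499992 elements, the lower half accumulates 999984, and the scalar tail adds 8 * 4 = 32, giving 1000016. A minimal sketch of a reduction that also folds in the upper half follows; the function name inner_prod_avx, the explicit length parameter n, and the AVX_FN macro (a GCC/Clang target attribute so the sketch builds without -mavx) are assumptions for illustration, not the poster's final code.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumption of this sketch: GCC/Clang extension that enables AVX
   intrinsics inside the function even without -mavx. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical variant of inner_prod_vec with an explicit length. */
AVX_FN float inner_prod_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    __m128 lo, hi, sum;
    float total;
    size_t i = 0;

    /* Multiply and accumulate 8 floats per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 v1 = _mm256_loadu_ps(a + i);
        __m256 v2 = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v1, v2));
    }

    /* Fold the upper 128-bit half onto the lower half first ... */
    lo  = _mm256_castps256_ps128(acc);
    hi  = _mm256_extractf128_ps(acc, 1);
    sum = _mm_add_ps(lo, hi);
    /* ... then two horizontal adds leave the full sum in lane 0. */
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    total = _mm_cvtss_f32(sum);

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        total += a[i] * b[i];
    return total;
}
```

Floating-point addition is not associative, so for general data the vectorized sum may differ from the scalar one in the last bits; for the all-2 test vectors in the thread both are exact.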

emmanuel_attia · 06-16-2014

Thomas Willhalm (Intel) wrote:

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

Best regards

Thomas

Yes, sorry about my overly quick answer.

As for extract_ps vs store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (for instance, it doesn't force you to put a float variable on the stack when you only need it as a return value).
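
One caveat when choosing between these: _mm_extract_ps (SSE4.1) returns an int holding the raw bit pattern of the selected lane rather than a float, so the bits still have to be reinterpreted. For lane 0, _mm_cvtss_f32 returns a float directly and also needs no stack variable. A minimal sketch of both, assuming the helper names lane2_via_extract_ps and lane0_via_cvtss and the AVX_FN macro (a GCC/Clang target attribute so the sketch builds without -mavx):

```c
#include <immintrin.h>
#include <string.h>

/* Assumption of this sketch: GCC/Clang extension that enables AVX
   (and therefore SSE4.1) intrinsics inside the function. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: loads 4 floats from p and returns lane 2. */
AVX_FN float lane2_via_extract_ps(const float *p)
{
    __m128 v = _mm_loadu_ps(p);
    int bits = _mm_extract_ps(v, 2);   /* raw IEEE-754 bits, not a value */
    float f;

    memcpy(&f, &bits, sizeof f);       /* reinterpret the bits as a float */
    return f;
}

/* Hypothetical helper: loads 4 floats from p and returns lane 0. */
AVX_FN float lane0_via_cvtss(const float *p)
{
    return _mm_cvtss_f32(_mm_loadu_ps(p));
}
```

Casting the int result with (float) would perform a numeric conversion of the bit pattern, which is why the sketch copies the bytes instead.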