Anonymous18
Beginner
299 Views

AVX _mm256_store_ps

Hi

I am trying to run the following code using the AVX instruction set.

It compiles without any problems, but it crashes when I run it:

./vec_avx.x 

"Segmentation fault (core dumped)"

After reviewing the code, I found that the problem is in this instruction:

  _mm256_store_ps(&total,acc); //Error

Could someone point me to what the problem might be?

Thank you

P.S.:

I compile with the following command: 

gcc -O3 vec_avx.c -mavx -o vec_avx.x

And the main code is as follows:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
  int i;
  float *total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();  // acc = |0|0|0|0|0|0|0|0|
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);
 
  _mm256_store_ps(&total,acc); /////////////ERROR///////////////////////
  
  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];
  return total;
}

 

 

11 Replies
Bernard
Black Belt
299 Views

float *total is probably not aligned on a 32-byte boundary. Did you try zero-filling the total array?

Thomas_W_Intel
Employee
299 Views

I don't see where you allocate memory for "total". Is this omitted in your code snippet?

Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:

 _mm256_store_ps(total,acc);

Alternatively, you can use a

__m256 total_m256;

and then store your intermediate result to this variable:

_mm256_store_ps((float*)&total_m256,acc);
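
A minimal sketch of the second suggestion, assuming a local 8-float buffer instead of an uninitialized pointer. _mm256_store_ps writes 32 bytes and is documented as requiring a 32-byte-aligned address, so the buffer is declared with _Alignas(32). The names AVX_FN, dot8_via_buffer, and lanes are made up for this example and do not come from the original code.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for a single function so the
   example also builds without -mavx; with -mavx it is redundant. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: dot product of exactly 8 floats. */
AVX_FN float dot8_via_buffer(const float *a, const float *b)
{
    assert(a && b);
    __m256 acc = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));

    _Alignas(32) float lanes[8];   /* real, 32-byte-aligned storage */
    _mm256_store_ps(lanes, acc);   /* aligned store of all 8 lanes */

    float total = 0.0f;
    for (int k = 0; k < 8; k++)    /* scalar sum of the stored lanes */
        total += lanes[k];
    return total;
}
```

If the alignment of the destination cannot be guaranteed, _mm256_storeu_ps accepts any address.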

 

Kind regards

Thomas

Bernard
Black Belt
299 Views

If you look at the disassembly, the float *total pointer will probably be declared but not initialized. IIRC, initialization could be done by loading &total[0] with the help of LEA REG,ADDR and filling it with 0.0, for example.

emmanuel_attia
Beginner
299 Views

This is probably not an alignment issue, since _mm256_store_ps is probably translated to VMOVUPS, which works for both aligned and unaligned addresses.

The problem is that your total should be of type "float" (instead of float *, because it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation), and the _mm256_store_ps should be replaced with a store scalar instruction (I don't know if there is one) or something like:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

This is not very efficient, but it saves you from the access violation (I'll let you find a more efficient way to write your algorithm).
 

Best regards

bronxzv
New Contributor II
299 Views

emmanuel.attia wrote:
should be replaced with a store scalar instruction (I don't know if there is one)

_mm_store_ss (float* mem_addr, __m128 a) is the one to use IMO

float total = 0.0f; _mm_store_ss(&total,acc);
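
As a sketch of this suggestion: _mm_store_ss takes a __m128, while acc in the code above is a __m256, so the lower half has to be selected first. The example below assumes _mm256_castps256_ps128 for that step (it only reinterprets the register). The helper name store_low_lane and the AVX_FN macro are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for a single function so the
   example also builds without -mavx; with -mavx it is redundant. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: returns element 0 of the 8 floats at v. */
AVX_FN float store_low_lane(const float *v)
{
    assert(v);
    __m256 acc = _mm256_loadu_ps(v);                    /* 8 floats, any alignment */
    float total = 0.0f;
    _mm_store_ss(&total, _mm256_castps256_ps128(acc));  /* writes 4 bytes only */
    return total;
}
```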

Anonymous18
Beginner
299 Views

Hi Emmanuel. Thank you for your help.

Here is my final algorithm: 

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)

{
  int i;
  float total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();  // acc = |0|0|0|0|0|0|0|0|
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);

  _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];

  return total;
}

--------------------------------------------------------------------------------------------------------------------------------

I have two more questions:

1)

About the final result that I get (1000016.000000):

 

Array datatype  : float
# of runs       : 1000
Arrays size     : 500000
Best Rate GB/s  :  19.93
Avg  Rate GB/s  :  18.85
Median Rate GB/s:  18.74
Avg time        :   0.00
Min time        :   0.00
Max time        :   0.00
Product Result  : 1000016.000000

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized with the value 2 (I am attaching the result I get when I run my algorithm without vectorization):

Kernel name     : inner_prod
Array datatype  : float
# of runs       : 1000
Arrays size     : 500000
Best Rate GB/s  :   7.03
Avg  Rate GB/s  :   6.68
Avg time        :   0.00
Min time        :   0.00
Max time        :   0.00
Product Result  : 2000000.000000

 

-------------------------------------------------------------------------------------------------------------------------------------------------

 

2) About a more efficient way of writing:

 

  _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

 

Could it use some CAST or CONVERT intrinsic instead?

 

-------------------------------------------------------------------------------------------------------------------------------------------------

 

Thank you so much

 

 

emmanuel_attia
Beginner
299 Views

Oh yes, the right solution would be:

_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0));

Thanks for the improvement

Anonymous18
Beginner
299 Views

Hi Emmanuel,

I compile with:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

but with your instruction I now get the following error:

vec_avx.c: In function ‘inner_prod_vec’:
vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’

 

My final code is:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
  int i;
  float total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);
  _mm_store_ss(&total, _mm256_extractf128_si256(acc, 0)); //////ERROR/////////////////////////////////////////////////////////////////
  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];
  return total;
}

Thank you

Thomas_W_Intel
Employee
299 Views

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

 

Best regards

Thomas

bronxzv
New Contributor II
299 Views

lex wrote:

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized with the value 2 (I am attaching the result I get when I run my algorithm without vectorization)

It looks like there is a missing horizontal add in your code.
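
To illustrate the missing step: _mm256_hadd_ps adds neighbouring elements within each 128-bit half separately, so after two calls element 0 holds only the sum of elements 0 to 3. Assuming both arrays are filled with 2.0, the vector loop runs 62499 times and the low half accumulates 62499 x 4 x 4.0 = 999984; the scalar tail then adds 8 x 4.0 = 32, which gives the reported 1000016. The sketch below adds the upper half to the lower half before the horizontal adds. The names AVX_FN, hsum8, and inner_prod_avx, and the explicit length parameter n (in place of ARRAY_SZ), are assumptions made for this example.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for a single function so the
   example also builds without -mavx; with -mavx it is redundant. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: adds all eight elements of v. */
static AVX_FN float hsum8(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);           /* 4 partial sums */
    s = _mm_hadd_ps(s, s);                    /* 2 partial sums */
    s = _mm_hadd_ps(s, s);                    /* full sum in element 0 */
    return _mm_cvtss_f32(s);
}

/* Hypothetical signature: n replaces the ARRAY_SZ macro of the thread. */
AVX_FN float inner_prod_avx(const float *a, const float *b, int n)
{
    assert(a && b && n >= 0);
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {              /* full blocks of 8 */
        __m256 v1 = _mm256_loadu_ps(a + i);
        __m256 v2 = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v1, v2));
    }
    float total = hsum8(acc);                 /* all 8 lanes, not just 4 */
    for (; i < n; i++)                        /* scalar tail */
        total += a[i] * b[i];
    return total;
}
```

The loop condition i + 8 <= n also lets the vector loop process the last full block of eight, which i < ARRAY_SZ-8 leaves to the scalar tail.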

emmanuel_attia
Beginner
299 Views

Thomas Willhalm (Intel) wrote:

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

It is also possible to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

 

Best regards

Thomas

Yes, sorry about my overly quick answer.

As for extract_ps vs. store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (it doesn't force you to put a float variable on the stack when you only need it as a return value, for instance).
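
A small sketch of the return-value case mentioned above. Note that _mm_extract_ps is an SSE4.1 intrinsic declared as returning int (the raw bits of the selected element), so this example assumes _mm_cvtss_f32 instead, which returns element 0 as a float without an explicit stack variable. The helper name low_lane_value and the AVX_FN macro are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. Enables AVX for a single function so the
   example also builds without -mavx; with -mavx it is redundant. */
#define AVX_FN __attribute__((target("avx")))

/* Hypothetical helper: returns element 0 of the 8 floats at v. */
AVX_FN float low_lane_value(const float *v)
{
    assert(v);
    __m128 lo = _mm256_extractf128_ps(_mm256_loadu_ps(v), 0);  /* lower 128 bits */
    return _mm_cvtss_f32(lo);                                  /* element 0 as float */
}
```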