Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

AVX _mm256_store_ps

Anonymous18
Beginner
2,965 Views

Hi

I am trying to run the following code using the AVX instruction set.

It compiles without any problem, but it generates an error when I run it:

./vec_avx.x 

"Segmentation fault (core dumped)"

Reviewing the code, the problem is in this instruction:

  _mm256_store_ps(&total,acc); //Error

Could someone point me to what the problem might be?

Thank you

P.S.:

I compile with the following command: 

gcc -O3 vec_avx.c -mavx -o vec_avx.x

And the main code is as follows:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
  int i;
  float *total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();  // acc = |0|0|0|0|0|0|0|0|
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);
 
  _mm256_store_ps(&total,acc); /////////////ERROR///////////////////////
  
  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];
  return total;
}

 

 

0 Kudos
11 Replies
Bernard
Valued Contributor I
2,965 Views

Probably float *total is not aligned on a 32-byte boundary. Did you try zero-filling the total array?

0 Kudos
Thomas_W_Intel
Employee
2,965 Views

I don't see where you allocate memory for "total". Is this omitted in your code snippet?

Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:

 _mm256_store_ps(total,acc);

Alternatively, you can use a

__m256 total_m256;

and then store your intermediate result to this variable:

_mm256_store_ps((float *)&total_m256, acc);
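To make the two requirements above concrete (memory for eight floats, aligned on 32 bytes), here is a minimal sketch. The helper name sum8_via_aligned_store and the target attribute are my own assumptions, not something from the original code; the attribute is GCC/Clang-specific and only there so the snippet also builds without -mavx.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the original post): stores all 8 lanes
   of an AVX register to memory and sums them in scalar code. */
__attribute__((target("avx")))   /* GCC/Clang only; redundant when building with -mavx */
float sum8_via_aligned_store(const float *src)
{
    /* _mm256_store_ps needs a destination with room for 8 floats,
       aligned on a 32-byte boundary. */
    float buf[8] __attribute__((aligned(32)));

    __m256 v = _mm256_loadu_ps(src);   /* unaligned load, works for any src */
    _mm256_store_ps(buf, v);           /* aligned store into the local buffer */

    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += buf[k];
    return total;
}
```

If the destination cannot be guaranteed to be aligned, _mm256_storeu_ps is the unaligned counterpart.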

 

Kind regards

Thomas

0 Kudos
Bernard
Valued Contributor I
2,966 Views

If you look at the disassembly, you will probably see that the float *total pointer is declared but never initialized. IIRC, initialization could be done by loading &total[0] with the help of LEA REG,ADDR and filling it with 0.0, for example.

0 Kudos
emmanuel_attia
Beginner
2,966 Views

This is probably not an alignment issue, since _mm256_store_ps is probably translated to VMOVUPS, which works equally well for aligned and unaligned addresses.

The problem is that total should be of type "float" instead of float *: it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation. The _mm256_store_ps call should then be replaced with a scalar store instruction (I don't know if there is one) or something like:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

This is not very efficient, but it saves you from the access violation (I will let you find a more efficient way to write your algorithm).
 

Best regards

0 Kudos
bronxzv
New Contributor II
2,965 Views

emmanuel.attia wrote:
should be replaced with a store scalar instruction (i don't know if there is one)

_mm_store_ss (float* mem_addr, __m128 a) is the one to use IMO

float total = 0.0f; _mm_store_ss(&total,acc);
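Since _mm_store_ss takes an __m128, a 256-bit accumulator has to be narrowed first. A minimal sketch of one way to do that, assuming the value of interest sits in element 0; the helper name low_element is hypothetical and the target attribute is a GCC/Clang-specific convenience:

```c
#include <immintrin.h>

/* Hypothetical helper (not from the thread): writes element 0 of a
   256-bit register into a scalar float with _mm_store_ss. */
__attribute__((target("avx")))   /* GCC/Clang only; redundant when building with -mavx */
float low_element(const float *src)
{
    __m256 acc = _mm256_loadu_ps(src);
    float total = 0.0f;

    /* _mm256_castps256_ps128 reinterprets the low 128 bits as an __m128;
       it is only a type cast for the compiler. */
    _mm_store_ss(&total, _mm256_castps256_ps128(acc));
    return total;
}
```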

0 Kudos
Anonymous18
Beginner
2,966 Views

Hi Emmanuel. Thank you for your help.

Here is my final algorithm: 

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)

{
  int i;
  float total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();  // acc = |0|0|0|0|0|0|0|0|
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);

  _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];

  return total;
}

--------------------------------------------------------------------------------------------------------------------------------

I have two more questions:

1)

About the end result shown below (1000016.000000):

 

Array datatype  : float
# of runs       : 1000
Arrays size     : 500000
Best Rate GB/s  :  19.93
Avg  Rate GB/s  :  18.85
Median Rate GB/s:  18.74
Avg time        :   0.00
Min time        :   0.00
Max time        :   0.00
Product Result  : 1000016.000000

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized with the value 2 (I am attaching the result I get when I run my algorithm without vectorization):

Kernel name     : inner_prod
Array datatype  : float
# of runs       : 1000
Arrays size     : 500000
Best Rate GB/s  :   7.03
Avg  Rate GB/s  :   6.68
Avg time        :   0.00
Min time        :   0.00
Max time        :   0.00
Product Result  : 2000000.000000

 

-------------------------------------------------------------------------------------------------------------------------------------------------

 

2) About the most efficient way of: 

 

  _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

 

Could it use some cast or convert intrinsic instead?

 

-------------------------------------------------------------------------------------------------------------------------------------------------

 

Thank you so much

 

 

0 Kudos
emmanuel_attia
Beginner
2,966 Views

Oh yes, the right solution would be:

_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0));

Thanks for the improvement

0 Kudos
Anonymous18
Beginner
2,966 Views

Hi Emmanuel,

I compile with:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

but with your instruction I now get the following error:

vec_avx.c: In function ‘inner_prod_vec’:
vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’

 

My final code is:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
  int i;
  float total;
  __m256 v1, v2, v3, acc;
  acc = _mm256_setzero_ps();
  for (i=0; i<(ARRAY_SZ-8); i+=8){
    v1  = _mm256_loadu_ps(a+i);
    v2  = _mm256_loadu_ps(b+i);
    v3  = _mm256_mul_ps(v1, v2);
    acc = _mm256_add_ps(acc, v3);
  }
  acc = _mm256_hadd_ps(acc,acc);
  acc = _mm256_hadd_ps(acc,acc);
  _mm_store_ss(&total, _mm256_extractf128_si256(acc, 0)); //////ERROR/////////////////////////////////////////////////////////////////
  for (; i<ARRAY_SZ; i++)
    total += a[i] * b[i];
  return total;
}

Thank you

0 Kudos
Thomas_W_Intel
Employee
2,966 Views

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

There is also the possibility to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.
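A small sketch of the extraction route, for illustration only. The helper name lane_first_element is hypothetical and the target attribute is a GCC/Clang-specific convenience. I use _mm_cvtss_f32 to read the scalar here because _mm_extract_ps is declared as returning an int holding the bit pattern of the selected element, so its result would still need to be reinterpreted as a float.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the thread): returns the first element
   of either the low (high == 0) or the high (high != 0) 128-bit lane. */
__attribute__((target("avx")))   /* GCC/Clang only; redundant when building with -mavx */
float lane_first_element(const float *src, int high)
{
    __m256 acc = _mm256_loadu_ps(src);

    /* The lane index must be a compile-time constant, hence the branch. */
    __m128 lane = high ? _mm256_extractf128_ps(acc, 1)
                       : _mm256_extractf128_ps(acc, 0);

    /* _mm_cvtss_f32 returns element 0 of the __m128 as a float. */
    return _mm_cvtss_f32(lane);
}
```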

 

Best regards

Thomas

0 Kudos
bronxzv
New Contributor II
2,966 Views

lex wrote:

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized with the value 2 (I am attaching the result I get when I run my algorithm without vectorization):

It looks like there is a missing horizontal add in your code.
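To spell this out: _mm256_hadd_ps adds within each 128-bit lane separately, so after two rounds the low lane holds the sum of elements 0..3 and the high lane the sum of elements 4..7, and the two halves still have to be added together. That seems consistent with the reported number: with 500000 elements all equal to 2, the vector loop runs 62499 times, the low lane alone sums to 999984, and the scalar tail adds 8 * 4 = 32, which gives 1000016. Below is a minimal sketch of the complete reduction. The function name inner_prod_avx and the explicit length parameter n are my own assumptions (the original uses DATATYPE and ARRAY_SZ), and the target attribute is a GCC/Clang-specific convenience.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch of an AVX dot product with a complete horizontal reduction.
   Name and signature are illustrative, not the original poster's. */
__attribute__((target("avx")))   /* GCC/Clang only; redundant when building with -mavx */
float inner_prod_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 products per iteration. */
    for (; i + 8 <= n; i += 8) {
        __m256 v1 = _mm256_loadu_ps(a + i);
        __m256 v2 = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v1, v2));
    }

    /* Two hadd rounds reduce each 128-bit lane to its own 4-element sum. */
    acc = _mm256_hadd_ps(acc, acc);
    acc = _mm256_hadd_ps(acc, acc);

    /* The missing step: add the low-lane sum and the high-lane sum. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    float total = _mm_cvtss_f32(_mm_add_ss(lo, hi));

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        total += a[i] * b[i];
    return total;
}
```

Note that the loop condition i + 8 <= n also lets the vector loop handle the last full block of 8, which the original i < (ARRAY_SZ-8) bound leaves to the scalar tail.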

0 Kudos
emmanuel_attia
Beginner
2,966 Views

Thomas Willhalm (Intel) wrote:

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

There is also the possibility to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

 

Best regards

Thomas

Yes, sorry about my too quick answer.

As for extract_ps vs store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (for instance, it doesn't force you to put a float variable on the stack when you only need it as a return value).

0 Kudos