Intel® ISA Extensions

AVX _mm256_store_ps

Anonymous18 — Sun, 01 Jun 2014 19:21:06 GMT

I want to run the following code using the AVX instruction set.

It compiles without any problem, but it generates an error when I run it:

./vec_avx.x

"Segmentation fault (core dumped)"

Reviewing the code, the problem seems to be in this statement:

_mm256_store_ps(&total,acc); //Error

Could someone point out what the problem might be?

Thank you

P.S.:

I compile with the following command:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

And the main code is as follows:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float *total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);

    _mm256_store_ps(&total,acc); // ERROR

    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];
    return total;
}

Bernard — Mon, 02 Jun 2014 05:18:25 GMT

Probably float *total is not aligned on a 32-byte boundary. Did you try zero-filling the total array?

Thomas_W_Intel — Mon, 02 Jun 2014 08:20:08 GMT

I don't see where you allocate memory for "total". Is this omitted in your code snippet?

Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:

_mm256_store_ps(total,acc);

Alternatively, you can use a

__m256 total_m256

and then store your intermediate result to this variable:

_mm256_store_ps((float *)&total_m256, acc);

Kind regards

Thomas
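
As a minimal sketch of the aligned-store suggestion above, the accumulator can be written to a 32-byte-aligned array of 8 floats and then summed in scalar code. The function name dot_store_all_lanes, the size parameter n (standing in for ARRAY_SZ), and the lanes buffer are illustrative assumptions, not names from the original program. The target attribute is a GCC/Clang extension, used here only so the sketch also builds without -mavx.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: dot product that stores all 8 accumulator lanes.
   The target attribute (GCC/Clang extension) enables AVX for this function;
   it is redundant when compiling with -mavx. */
__attribute__((target("avx")))
float dot_store_all_lanes(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Process full groups of 8 floats with unaligned loads. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    /* _mm256_store_ps writes 8 floats (32 bytes) and expects the
       destination to be 32-byte aligned, hence the aligned array. */
    _Alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);

    /* Sum the 8 partial results in scalar code. */
    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        total += a[i] * b[i];

    return total;
}
```

Summing all 8 stored lanes this way needs no horizontal add at all, at the cost of a round trip through memory.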

Bernard — Mon, 02 Jun 2014 10:30:58 GMT

If you look at the disassembly, the float *total pointer will probably be declared but not initialized. IIRC, initialization could be done by loading &total[0] with LEA REG,ADDR and filling it with 0.0, for example.

emmanuel_attia — Mon, 02 Jun 2014 12:22:00 GMT

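This may not be only an alignment issue. Note that _mm256_store_ps is documented as requiring a 32-byte-aligned address (it maps to VMOVAPS); some compilers may emit VMOVUPS for it, which works for both aligned and unaligned addresses, but that should not be relied on.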

The problem is that your total should be of type "float" (instead of float *, because it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation), and the _mm256_store_ps should be replaced with a store scalar instruction (I don't know if there is one) or something like:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

This is not very efficient, but it saves you from the access violation (I'll let you find a more efficient way to write your algorithm).

Best regards

bronxzv — Mon, 02 Jun 2014 14:19:42 GMT

emmanuel.attia wrote:
should be replaced with a store scalar instruction (i don't know if there is one)

_mm_store_ss (float* mem_addr, __m128 a) is the one to use IMO

float total = 0.0f;
_mm_store_ss(&total, _mm256_castps256_ps128(acc));

Anonymous18 — Mon, 02 Jun 2014 14:31:56 GMT

Hi Emmanuel. Thank you for your help.

Here is my final algorithm:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps(); // acc = |0|0|0|0|0|0|0|0|
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);

    _mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];

    return total;
}

--------------------------------------------------------------------------------------------------------------------------------

I have two further questions:

1) About the end result I get (1000016.000000):

Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 19.93
Avg Rate GB/s : 18.85
Median Rate GB/s: 18.74
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 1000016.000000

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value 2 (below is the result when I run my algorithm without vectorization):

Kernel name : inner_prod
Array datatype : float
# of runs : 1000
Arrays size : 500000
Best Rate GB/s : 7.03
Avg Rate GB/s : 6.68
Avg time : 0.00
Min time : 0.00
Max time : 0.00
Product Result : 2000000.000000

-------------------------------------------------------------------------------------------------------------------------------------------------

2) About a more efficient way to write:

_mm256_maskstore_ps(&total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);

Could it use some cast or convert intrinsic instead?

-------------------------------------------------------------------------------------------------------------------------------------------------

Thank you so much

emmanuel_attia — Mon, 02 Jun 2014 15:21:00 GMT

Oh yes, the right solution would be:

_mm_store_ss(&total, _mm256_extractf128_si256(acc, 0));

Thanks for the improvement

Anonymous18 — Mon, 02 Jun 2014 19:46:31 GMT

Hi Emmanuel,

I compile with:

gcc -O3 vec_avx.c -mavx -o vec_avx.x

but with your instruction I now get the following error:

vec_avx.c: In function ‘inner_prod_vec’:
vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’

My final code is:

DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)
{
    int i;
    float total;
    __m256 v1, v2, v3, acc;
    acc = _mm256_setzero_ps();
    for (i=0; i<(ARRAY_SZ-8); i+=8){
        v1 = _mm256_loadu_ps(a+i);
        v2 = _mm256_loadu_ps(b+i);
        v3 = _mm256_mul_ps(v1, v2);
        acc = _mm256_add_ps(acc, v3);
    }
    acc = _mm256_hadd_ps(acc,acc);
    acc = _mm256_hadd_ps(acc,acc);
    _mm_store_ss(&total, _mm256_extractf128_si256(acc, 0)); // ERROR
    for (; i<ARRAY_SZ; i++)
        total += a[i] * b[i];
    return total;
}

Thank you

Thomas_W_Intel — Mon, 02 Jun 2014 21:20:48 GMT

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

There is also the possibility of using _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

Best regards

Thomas
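
The options mentioned above for getting the lowest float out of a __m256 might look like the following sketch. The helper names are hypothetical, and each helper takes a pointer to 8 floats only to keep the example self-contained. Note that _mm_extract_ps is declared as returning an int that holds the bit pattern of the selected element, so the second helper swaps in _mm_cvtss_f32, which returns the low element as a float.

```c
#include <immintrin.h>

/* Hypothetical helper: element 0 of the 8 floats at p, via _mm_store_ss.
   The target attribute is a GCC/Clang extension, redundant with -mavx. */
__attribute__((target("avx")))
float low_float_store_ss(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    float out;
    /* _mm_store_ss takes a __m128, so narrow the __m256 first.
       _mm256_extractf128_ps(v, 0) selects the low 128-bit half. */
    _mm_store_ss(&out, _mm256_extractf128_ps(v, 0));
    return out;
}

/* Hypothetical helper: same value, returned without a memory store. */
__attribute__((target("avx")))
float low_float_cvtss(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    /* _mm256_castps256_ps128 reinterprets the low half as a __m128;
       _mm_cvtss_f32 then returns its lowest element as a float. */
    return _mm_cvtss_f32(_mm256_castps256_ps128(v));
}
```

Both helpers return the same value; which one generates better code depends on the compiler.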

bronxzv — Wed, 04 Jun 2014 18:11:51 GMT

lex wrote:

But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value 2 (below is the result when I run my algorithm without vectorization):

It looks like there is a missing horizontal add in your code.
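
To make the missing step concrete: _mm256_hadd_ps adds pairs within each 128-bit half separately, so after two of them the low half holds only the sum of lanes 0 to 3 and the high half the sum of lanes 4 to 7. Assuming every element of both arrays is 2 and ARRAY_SZ is 500000, the vector loop covers 499992 elements, the low half collects 249996 products of 4 (999984), and the scalar tail adds 8 more products (32), which gives the reported 1000016. The sketch below adds the two halves before extracting the result. The name inner_prod_avx and the size parameter n are illustrative assumptions.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical corrected version of the dot product.
   The target attribute is a GCC/Clang extension, redundant with -mavx. */
__attribute__((target("avx")))
float inner_prod_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Full groups of 8 floats; i + 8 <= n also covers the last full group. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    /* Two horizontal adds reduce each 128-bit half on its own. */
    acc = _mm256_hadd_ps(acc, acc);
    acc = _mm256_hadd_ps(acc, acc);

    /* The extra step: add the low-half sum and the high-half sum. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    float total = _mm_cvtss_f32(_mm_add_ss(lo, hi));

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        total += a[i] * b[i];

    return total;
}
```

With a[i] = i + 1 and b[i] = 1 for 24 elements, the function should return 300, the sum of 1 through 24, whereas a version without the cross-half add would miss the contribution of lanes 4 to 7.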

emmanuel_attia — Mon, 16 Jun 2014 16:49:29 GMT

Thomas Willhalm (Intel) wrote:

You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.

There is also the possibility of using _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.

Best regards

Thomas

Yes, sorry about my overly quick answer.

As for extract_ps vs store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (it doesn't force you to put a float variable on the stack when you only need it as a return value, for instance).