
George · ‎04-14-2015

Hello ,

I can't find a way to cast a __m256i variable to integer!

Any ideas?

Thanks!

TimP · ‎04-14-2015

GCC supports the non-portable __int128 data type. For Intel compilers you must write out what you mean, e.g. a struct of smaller ints.

George · ‎04-14-2015

OK, I am using the Intel compiler.

And I am referring to an int value.

I have: __m256i T;

and I want to do: *( V + i ) = (int) T;

where: int * V = malloc ( N * sizeof(*V) );

Thanks!

TimP · ‎04-14-2015

Since it is equivalent to a struct of 8 ints, you could select one of them.

George · ‎04-15-2015

Can you elaborate a little more on this?

I found only this:

typedef struct __declspec(align(32)) { int i[8]; } __m256i;

So, in my case how can I cast "T" to an int?

Because something like:

(int) T.i[ 0 ]

doesn't work.

George · ‎04-15-2015

Oh! Ok , I got it!

I did :

int * a = (int*) &T;

and then I used 'a[ 0 ] '

Now, will all 8 ints have the same value?

jimdempseyatthecove · ‎04-15-2015

Compile for Debug and place a breakpoint on the statement following "int * a = (int*)&T;"

Then examine T and enter "a" into the Memory window. You should see the same values (assuming you view them as 4-byte integers).

I suspect you are assuming that T contains certain values; if a[] shows something different, that suggests a[] is at the wrong address.

Please note that in a release build the __m256i T may live in a register, which has no address. With "int * a = (int*)&T;" present in the code, the compiler will provide locations in memory for a copy of T. However, in a release build the copy of T to those locations is only performed at places in the code where a copy operation appears to be necessary.
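A way to avoid depending on where the compiler keeps T is to copy the lanes out explicitly. The following is a minimal sketch, not code from this thread: the helper names get_lane, get_lane0 and demo_lane are hypothetical, and the USE_AVX macro assumes GCC or Clang (with the Intel compiler you would pass an AVX option such as -xAVX instead).

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC/Clang. This enables AVX per function, so no global -mavx
   flag is needed. Other compilers are assumed to get an AVX option instead. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Hypothetical helper: copy all 8 lanes to memory, then pick one. */
USE_AVX static int get_lane(__m256i T, int lane)
{
    int tmp[8];
    assert(lane >= 0 && lane < 8);
    _mm256_storeu_si256((__m256i *)tmp, T);  /* explicit vector -> memory copy */
    return tmp[lane];
}

/* Hypothetical helper: lane 0 only, without a round trip through memory. */
USE_AVX static int get_lane0(__m256i T)
{
    return _mm_cvtsi128_si32(_mm256_castsi256_si128(T));
}

/* Builds a test vector {10, 20, ..., 80} and returns the requested lane. */
USE_AVX static int demo_lane(int lane)
{
    __m256i T = _mm256_setr_epi32(10, 20, 30, 40, 50, 60, 70, 80);
    return lane == 0 ? get_lane0(T) : get_lane(T, lane);
}
```

With this, something like V[x] = get_lane0(T); stores a single int rather than all eight lanes.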

Please write up a simple example program that illustrates your problem. In the process you may discover your programming error. Something like:

__m256i T;
int* a = (int*)&T;

for(int i = 0; i < 8; ++i)
  a[i] = i+1;

T = _mm256_add_epi32(T, T);

for(int i = 0; i < 8; ++i)
  printf("a[%d] = %d\n", i, a[i]);

By the way, this is a MIC forum. Why aren't you using __m512i?

Jim Dempsey

George · ‎04-15-2015

OK, thanks for the info, but maybe I was not clear enough.

I just want to cast a __m256i variable to an integer variable.

As I wrote above, I have: __m256i T;

and I want to do:

*( V + x ) = (int) T;  // just cast T to an integer

where:

int * V =  malloc ( N * sizeof(*V) );

Now, the problem is that V is filled in one loop and T in another:

__m256i T;
int * a = (int*) &T;

for ( int x = 0; x < h; x++ )
{
    for ( int i = 0; i < N; i+= 8 )
    {
        // fill T here
    }

    *( V + x ) =   *a ????
}

What should I put in the right side where '*a' is now?

Thanks!

"By the way, this is a MIC forum. Why aren't you using __m512i?"

Sorry about that, but I couldn't find the proper forum. If you can point me to it, please do!

Regarding __m512: OK, I am just using 256 for now.

jimdempseyatthecove · ‎04-15-2015

It might be beneficial to unzip the intrinsic_samples archive distributed with the compiler.

Start with intrin_dot_sample

Jim Dempsey

George · ‎04-16-2015

I've checked the examples, thanks.

But in my case they aren't helpful.

Anyway, I wanted to replace this:

_mm256_storeu_si256( (__m256i*) &V[ x ] , T );

with:

*( V + x ) =   *a ????

jimdempseyatthecove · ‎04-17-2015

Your first statement is vector to memory

Your second statement is memory to vector.

int a[] = {1,2,3,4,5,6,7,8};
int b[] = {0,0,0,0,0,0,0,0};
__m256i Va = _mm256_loadu_si256((const __m256i*)a); // Va = a[0:8]; (CEAN notation [start:length])
__m256i T = _mm256_add_epi32(Va, Va); // T = Va + Va
_mm256_storeu_si256((__m256i*)b, T); // b[0:8] = T

If you want to manipulate a single element inside the vector, and your CPU has AVX2, you have a selection of mask operations (masked load and store, for example; masked arithmetic such as a mask add requires AVX-512). On an older CPU, to manipulate a single element, load a vector containing 0's where you want no operation and -1's where you want the operation, then use an _mm256_and... operation on the operand to force the addends that are not of interest to 0, then add (or subtract).
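The 0 / -1 mask technique can be sketched as follows. This is an illustration under assumptions, not tested on the original poster's setup: the helper name add_to_one_lane is hypothetical, the mask is built with a compare rather than loaded from memory, and the 256-bit integer AND and ADD used here need AVX2.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC/Clang. This enables AVX2 per function; other compilers
   are assumed to be given an AVX2 option on the command line. */
#if defined(__GNUC__)
#define USE_AVX2 __attribute__((target("avx2")))
#else
#define USE_AVX2
#endif

/* Hypothetical helper: add `delta` to one lane of the 8 ints at `data`. */
USE_AVX2 static void add_to_one_lane(int *data, int lane, int delta)
{
    assert(lane >= 0 && lane < 8);

    const __m256i lanes = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    /* -1 (all bits set) in the selected lane, 0 in every other lane */
    __m256i mask = _mm256_cmpeq_epi32(lanes, _mm256_set1_epi32(lane));

    /* AND forces the addend to 0 everywhere except the selected lane */
    __m256i addend = _mm256_and_si256(_mm256_set1_epi32(delta), mask);

    __m256i v = _mm256_loadu_si256((const __m256i *)data);
    _mm256_storeu_si256((__m256i *)data, _mm256_add_epi32(v, addend));
}
```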

Jim Dempsey

George · ‎04-18-2015

The 'V' as I wrote is not a vector :

int * V =  malloc ( N * sizeof(*V) );

but anyway, thank you for the tips.

I will not use something like this; I prefer to have the same types.

Thanks!

jimdempseyatthecove · ‎04-18-2015

I suggest you write a very small (10 to 20 line) sample function of what you are trying to do, and explain where you are having the problem, and/or insert as comment into sample function "// here I want to ...."

This function has to have code and data sufficient enough for others to understand the problem you are having.

The issue you are most likely having is that you haven't yet reached the "ah ha" (eureka) moment of discovering a fundamental principle of vector programming. We can help you, but only if we have sufficient information to go on.

Jim Dempsey

George · ‎04-20-2015

Hello, I am providing the full code (sorry, but the headers alone take 10 lines :) )

Inside the "i" loop in the function, I want to compute X[ i ] - x, where x is the integer index.

That's why I first use set1_epi32 and castsi256_ps to cast x to a float, in order to do the subtraction with sub_ps. Then I square the results and add them. Then I find the minimum and assign V to T. Once the i loop is closed, I write the result:

_mm256_storeu_si256( (__m256i*) &Vor[ x + y * W ] , T );

Here is my problem, because Vor is unaligned. I don't know how to deal with this in order to get better results.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <time.h>
#include <sys/time.h>
#include <float.h>
#include <math.h>
#include <assert.h>
#include <immintrin.h>


void Vfunction( const int  N,const int  W,const int  H,float  * restrict X,float  * restrict Y,int  * restrict V,int * restrict Vor )
{

	__m256 Xd , Yd , XdSquared ,YdSquared ,D,xIdx,yIdx, LoadX,LoadY, TempD, Min;
	__m256i T, xxIdx,yyIdx,LoadV;
		
	int x , y;
	#pragma omp parallel for default( none ) shared( X , Y ,V , Vor ,H , W ,N ) private ( x ,y,xxIdx,xIdx,yyIdx,yIdx,LoadX,LoadY,LoadV,TempD,D,Min,Xd,Yd,XdSquared,YdSquared,T ) collapse(2)	
	for ( y = 0; y < H; y++ )
	{ 
		for ( x = 0; x < W; x++ )
		{				
			D = _mm256_set1_ps( FLT_MAX );
				
			__assume_aligned( X , 32 );
			__assume_aligned( Y , 32 );
			__assume_aligned( V , 32 );
				
			for ( int i = 0; i < N; i+= 8 )
			{		
				yyIdx = _mm256_set1_epi32( y );
				yIdx  = _mm256_castsi256_ps( yyIdx );
					
				xxIdx = _mm256_set1_epi32( x );
				xIdx  = _mm256_castsi256_ps( xxIdx );
					
				LoadX = _mm256_load_ps( &X[ i ] );
				LoadY = _mm256_load_ps( &Y[ i ] );

				Xd = _mm256_sub_ps( LoadX , xIdx );
				Yd = _mm256_sub_ps( LoadY , yIdx );
					
				XdSquared = _mm256_mul_ps( Xd , Xd );
				YdSquared = _mm256_mul_ps( Yd , Yd );
					
				TempD = _mm256_add_ps( XdSquared , YdSquared );
						
				Min = _mm256_min_ps( TempD , D );
				D = Min;
				LoadV = _mm256_load_si256( (__m256i*) &V[ i ] );
				T = LoadV;										
			} /* i */
			
			//write result    
			_mm256_storeu_si256( (__m256i*) &Vor[ x + y * W ] , T );
				
			} /* x */
		} /* y */
}

int main()
{
	    
	const int Nx = 32 , Ny = 32 ,w = 256 , h = 256;
	const int N = Nx * Ny;
	const int TotalNb = w * h;

	float * X = (float*) _mm_malloc( Nx * Ny * sizeof (*X) ,32 );
	float * Y = (float*) _mm_malloc( Nx * Ny * sizeof (*Y) ,32 );
	int * V = (int*) _mm_malloc( N * sizeof (*V) ,32 );
	//int * Vor = (int*) _mm_malloc ( TotalNb * N * sizeof(*Vor) ,32 ); //aligned 
	int * Vor = (int*) malloc ( TotalNb * N * sizeof(*Vor) ); // not aligned

	srand( (unsigned int) time(NULL) );

	for ( int i = 0; i < Nx * Ny; i++ )
	{
		X[ i ] = ( ( (float) rand() / (float) ( RAND_MAX ) ) * w );
		Y[ i ] = ( ( (float) rand() / (float) ( RAND_MAX ) ) * h );
	}

	for ( int i = 0; i < N; i++ )
	{
		V[ i ] = i;
	}
		
	Vfunction( N , w,h,X , Y ,V ,Vor );
		
	//free memory
	_mm_free( X );
	_mm_free( Y );
	_mm_free( V );
	 free( Vor );
	
	return 0;
}

Thanks!

jimdempseyatthecove · ‎04-20-2015

The cast functions are not conversion functions - no code is generated.

And in your inner loop, you have loop invariant code. I suspect your attempt at vectorization is not as you intend. Please provide a scalar (non-vector) version of working code.
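To illustrate the point about cast functions versus conversion functions: _mm256_castsi256_ps only reinterprets the bits, while _mm256_cvtepi32_ps converts the integer values to floats. The sketch below is a small demonstration under assumptions (hypothetical helper names; the USE_AVX macro assumes GCC or Clang).

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC/Clang. This enables AVX per function; other compilers
   are assumed to be given an AVX option on the command line. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Reinterpret the bit pattern of x as a float (no instruction generated). */
USE_AVX static float cast_lane0(int x)
{
    __m256 f = _mm256_castsi256_ps(_mm256_set1_epi32(x));
    return _mm_cvtss_f32(_mm256_castps256_ps128(f));
}

/* Convert the integer value x to the float with the same numeric value. */
USE_AVX static float convert_lane0(int x)
{
    __m256 f = _mm256_cvtepi32_ps(_mm256_set1_epi32(x));
    return _mm_cvtss_f32(_mm256_castps256_ps128(f));
}

/* Usage: the cast keeps the bit pattern, the conversion keeps the value. */
static void demo_cast_vs_convert(void)
{
    assert(convert_lane0(3) == 3.0f);
    assert(cast_lane0(0x40400000) == 3.0f); /* 0x40400000 is the IEEE-754 pattern of 3.0f */
    assert(cast_lane0(3) != 3.0f);          /* bit pattern 3 is a tiny denormal, not 3.0 */
}
```

So for X[ i ] - x, the index would need _mm256_cvtepi32_ps (or _mm256_set1_ps( (float)x )) rather than _mm256_castsi256_ps.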

Jim Dempsey

George · ‎04-20-2015

Ok, here is the scalar.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <time.h>
#include <sys/time.h>
#include <float.h>
#include <math.h>
#include <assert.h>

void Vfunction( int N, int W, int H, float * X, float * Y, int * V, int * Vor )
{
    float Xd , Yd , TempD , D;
    int T;
    int x , y;

    #pragma omp parallel for default( none ) shared( X , Y , V , Vor , H , W , N ) private ( x , y , Xd , Yd , TempD , D , T ) collapse(2)
    for ( y = 0; y < H; y++ )
    {
        for ( x = 0; x < W; x++ )
        {
            D = FLT_MAX;
            for ( int i = 0; i < N; i++ )
            {
                Xd = X[ i ] - x;
                Yd = Y[ i ] - y;
                TempD = Xd * Xd + Yd * Yd;

                if ( TempD < D )
                {
                    D = TempD;
                    T = V[ i ];
                }
            } /* i */

            //write result
            *( Vor + ( x + y * W ) ) = T;
        } /* x */
    } /* y */
}
							 
int main()
{
    const int N = 80;
    const int Nx = 32 , Ny = 32 , w = 256 , h = 256;
    const int TotalNb = w * h;

    float * X = malloc( Nx * Ny * sizeof (*X) );
    float * Y = malloc( Nx * Ny * sizeof (*Y) );
    int * V   = malloc( N * sizeof (*V) );
    int * Vor = malloc ( TotalNb * N * sizeof(*Vor) );

    srand( (unsigned int) time(NULL) );

    for ( int i = 0; i < Nx * Ny; i++ )
    {
        X[ i ] = ( ( (float) rand() / (float) ( RAND_MAX ) ) * w );
        Y[ i ] = ( ( (float) rand() / (float) ( RAND_MAX ) ) * h );
    }

    for ( int i = 0; i < N; i++ ) V[ i ] = i;

    Vfunction( N , w , h , X , Y , V , Vor );

    //free memory
    free( X );
    free( Y );
    free( V );
    free( Vor );

    return 0;
}

Thank you!

jimdempseyatthecove · ‎04-20-2015

void Vfunction( int N, int W, int H, float * X, float * Y,int * V,int * Vor )
{
  assert((N % (sizeof(D) / sizeof(float))) == 0); // assure N is multiple of vector size
  int x , y;
    
  #pragma omp parallel for default( none ) shared( X, Y, V, Vor, H, W, N ) private ( x, y )
  for ( y = 0; y < H; y++ )
  {
    __m256 yVec = _mm256_set1_ps( (float)y );
    for ( x = 0; x < W; x++ )
    {
      __m256 xVec = _mm256_set1_ps( (float)x );
      float D = FLT_MAX;
      int T;
      for ( int i = 0; i < N; i += 8 )
      {
        __m256 Xd = _mm256_sub_ps(_mm256_loadu_ps(&X[ i ]), xVec); // Xd = X[ i ] - x;
        __m256 Yd = _mm256_sub_ps(_mm256_loadu_ps(&Y[ i ]), yVec); // Yd = Y[ i ] - y;
        float TempD[8]; // you might want to align this to 32 byte address
        _mm256_storeu_ps(TempD, _mm256_add_ps(
                           _mm256_mul_ps(Xd,Xd),  // Xd * Xd
                           _mm256_mul_ps(Yd,Yd))); //  + Yd * Yd
        // The following is not vectorizable, use loop over vector
        // thus the reason for storing into TempD[8]
        for(int j=0; j < 8; ++j)
        {
          if ( TempD[j] < D )
          {
            D = TempD[j];
            T = V[ i + j ];
          }
        }
      } /* i */
      //write result
      *( Vor + ( x + y * W ) ) = T;
    } /* x */       
  } /* y */
}

The above is untested code

Jim Dempsey

George · ‎04-21-2015

First of all, thank you very much for your help! I appreciate it.

Now, I have a few questions.

1) In the first assert statement, do you mean this?

assert( ( N % sizeof(Vor) / sizeof(float) ) == 0 );

because you have

N % sizeof( D )

2) Checking the report , I can see that no loop is vectorized.

At loop :

for ( y = 0; y < H; y++ )

it gives me :

remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)

remark #25451: Advice: Loop Interchange, if possible, might help loopnest. Suggested Permutation: ( 1 2 3 4 ) --> ( 1 3 4 2 ) 

remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details

remark #15346: vector dependence: assumed FLOW dependence between TempD line 49 and TempD line 58

line 49:

_mm256_store_ps(TempD, _mm256_add_ps(.......

line 58:

D = TempD;

The same goes for the second loop

for ( x = 0; x < W; x++ )

For this loop

for ( int i = 0; i < N; i += 8 )

loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details

remark #15346: vector dependence: assumed FLOW dependence between TempD line 49 and TempD line 58

Finally , the

for(int j=0; j < 8; ++j)

was completely unrolled.

3) Regarding the last loop

for(int j=0; j < 8; ++j)

can't we use a vectorized approach?

Using

_mm256_min_ps

as I tried?

And then store the result using :

_mm256_storeu_si256

4) I tried to use the

#pragma offload target (mic)

but it gives me:

error #13393: *MIC* Opcode unsupported on target architecture: shufps
error #13393: *MIC* Opcode unsupported on target architecture: movaps
error #13393: *MIC* Opcode unsupported on target architecture: insertf128
error #13393: *MIC* Opcode unsupported on target architecture: shufps
error #13393: *MIC* Opcode unsupported on target architecture: movaps
error #13393: *MIC* Opcode unsupported on target architecture: insertf128
catastrophic error: *MIC* Function contains unsupported data types or intrinsics on target architecture.

in lines :

__m256 yVec = _mm256_set1_ps( (float)y );

__m256 xVec = _mm256_set1_ps( (float)x );

If you can help me with these, I'd appreciate it.

Thank you very much!

jimdempseyatthecove · ‎04-21-2015

1) That was a cut (old code) and paste (new code) error. Use ((N% (sizeof(__m256) / sizeof(float)))==0).

2) You only want the inner most loop vectorized. The outer loops effectively have manipulation of the loop control variables (plus one store). The inner loop reports to the effect "there is nothing for me to vectorize". This is due to you already vectorizing the loop to some extent by hand. While you could attempt to pack the output stores into a vector, then issue one write, the number of clock cycles required will exceed the latencies of storing the results individually (note, 15 out of 16 of the stores will be into a cache line sitting in L1).

3) The problem is you are scanning for min across the width of the vector while incrementally (on new min) back filling a second vector. Note, you are not seeking min for the whole vector.

4) If you intend to offload to MIC, then change the algorithm to use __m512 types (and vector widths of 16 floats). MIC is more sensitive to requiring aligned data. Therefore align X, Y, xVec, yVec and TempD at 64-byte boundaries (and use the non-"u" form of load and store).
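For the 64-byte alignment mentioned in point 4, a minimal allocation sketch follows. It uses C11 aligned_alloc as a portable stand-in, which is an assumption on my part; with the Intel headers, _mm_malloc( bytes, 64 ) paired with _mm_free, as in the code earlier in the thread, does the same job. The helper name alloc_floats_64 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for n floats on a 64-byte boundary.
   Returns NULL on failure; release the memory with free(). */
static float *alloc_floats_64(size_t n)
{
    size_t bytes = n * sizeof(float);

    /* aligned_alloc expects the size to be a multiple of the alignment */
    size_t padded = (bytes + 63) / 64 * 64;
    if (padded == 0)
        padded = 64;

    float *p = (float *)aligned_alloc(64, padded);

    /* sanity check: a non-NULL result must sit on a 64-byte boundary */
    assert(p == NULL || ((uintptr_t)p % 64) == 0);
    return p;
}
```

Buffers allocated this way can be used with the aligned (non-"u") load and store forms.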

Jim Dempsey

George · ‎04-21-2015

OK, so this is the most vectorized version of the code possible?

I can't understand this.

We want this loop to be vectorized:

 for ( int i = 0; i < N; i += 8 )
{
        __m256 Xd = _mm256_sub_ps(_mm256_loadu_ps(&X[ i ]), xVec); // Xd = X[ i ] - x;
        __m256 Yd = _mm256_sub_ps(_mm256_loadu_ps(&Y[ i ]), yVec); // Yd = Y[ i ] - y;
        float TempD[8]; // you might want to align this to 32 byte address

        _mm256_storeu_ps(TempD, _mm256_add_ps(
                      _mm256_mul_ps(Xd,Xd),  // Xd * Xd
	              _mm256_mul_ps(Yd,Yd))); //  + Yd * Yd

and using the above intrinsics, I was expecting it to be vectorized, but it is not.

You say:

"This is due to you already vectorizing the loop to some extent by hand."

Yes, I am vectorizing the whole code by hand.

Shouldn't the report still say that the loop is vectorized? (OK, by hand, not auto.)

Finally, the MIC version of the code performs a lot worse than this code!

Thank you very much for your help.

jimdempseyatthecove · ‎04-21-2015

There is a distinction to be made between the compiler taking scalar code (one variable at a time) and converting it to vector code, versus taking user-supplied vector code and converting it to... what exactly? Vector-vector code?

A scalar loop such as:

for(int i=0; i < N; ++i)
  C[i] = sqrt(A[i]*A[i] + B[i]*B[i]);

The above code is written in scalar form, but the compiler vectorizes it using vector instructions (the equivalent of the _mm.... intrinsics). It will report that the loop was vectorized. If you were to rewrite the loop using vector intrinsics, the compiler would not claim it converted the loop from scalar to vector.

Your loop (#20) is already vectorized. meaning no conversion took place. The compiler cannot convert your vector code into ?bigger? vector code. Therefore it reports to the effect: I could not (further) convert your code to substitute vector format for scalar format.

The compiler report does NOT mean: vectors were not used in the loop. It only means scalar code was not vectorized (seeing that you already vectorized the code).

I think the remainder of the loop can be vectorized. Start with a vector of FLT_MAX (an __m512, say D) and determine whether any floats in the new vector are less than D. If not, do nothing. Otherwise, use ...gmin() to obtain the min value as a float, then use .._extload_... to broadcast it to all lanes of the D vector (the new min). Next, use cmpeq to produce a __mask16 holding the location(s), and save the mask and the index i. These can be used after the for(i loop to produce the index into V[].

This should give you enough of a hint to get you going on the vectorization.
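For readers on AVX2 rather than MIC, one possible way to vectorize the min search is sketched below. Note that it is a different technique from the gmin/broadcast/mask approach described above: it keeps a per-lane running minimum and a per-lane index, updates both with a blend, and reduces across the 8 lanes once at the end. The function name nearest_index is hypothetical, the USE_AVX2 macro assumes GCC or Clang, and n is assumed to be a positive multiple of 8.

```c
#include <assert.h>
#include <float.h>
#include <immintrin.h>

/* Assumption: GCC/Clang. This enables AVX2 per function; other compilers
   are assumed to be given an AVX2 option on the command line. */
#if defined(__GNUC__)
#define USE_AVX2 __attribute__((target("avx2")))
#else
#define USE_AVX2
#endif

/* Hypothetical helper: returns V[i] for the point (X[i], Y[i]) closest to
   (x, y); on ties the smallest i wins, as in the scalar loop. */
USE_AVX2 static int nearest_index(const float *X, const float *Y, const int *V,
                                  int n, float x, float y)
{
    assert(n > 0 && n % 8 == 0);

    const __m256 xv = _mm256_set1_ps(x);
    const __m256 yv = _mm256_set1_ps(y);
    const __m256i step = _mm256_set1_epi32(8);

    __m256  best   = _mm256_set1_ps(FLT_MAX);  /* per-lane minimum distance */
    __m256i best_i = _mm256_setzero_si256();   /* per-lane index of that minimum */
    __m256i idx    = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    for (int i = 0; i < n; i += 8) {
        __m256 xd = _mm256_sub_ps(_mm256_loadu_ps(X + i), xv);
        __m256 yd = _mm256_sub_ps(_mm256_loadu_ps(Y + i), yv);
        __m256 d  = _mm256_add_ps(_mm256_mul_ps(xd, xd), _mm256_mul_ps(yd, yd));

        /* all-ones in lanes where d < best, zero elsewhere */
        __m256 lt = _mm256_cmp_ps(d, best, _CMP_LT_OQ);

        /* take the new distance and the new index only in those lanes */
        best   = _mm256_blendv_ps(best, d, lt);
        best_i = _mm256_castps_si256(_mm256_blendv_ps(
                     _mm256_castsi256_ps(best_i), _mm256_castsi256_ps(idx), lt));

        idx = _mm256_add_epi32(idx, step);
    }

    /* horizontal reduction over the 8 lanes, done once per query point */
    float bd[8];
    int bi[8];
    _mm256_storeu_ps(bd, best);
    _mm256_storeu_si256((__m256i *)bi, best_i);

    int k = 0;
    for (int j = 1; j < 8; ++j)
        if (bd[j] < bd[k] || (bd[j] == bd[k] && bi[j] < bi[k]))
            k = j;

    return V[bi[k]];
}
```

Inside the x/y loops this would be called as Vor[ x + y * W ] = nearest_index( X, Y, V, N, (float)x, (float)y ); whether it actually beats the store-and-scan version would need to be measured.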

Jim Dempsey

cast __m256i to int