_mm256_add_ps crashes program

George · ‎04-08-2015

Hello,

In my code I am using something like this:

int x , y;

float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );
    
__m256  * SIMDTempD = (__m256*) TempD;
__m256  * theX = (__m256*) X;
__m256  * theY = (__m256*) Y;
__m256i * theV = (__m256i*) V;
__m256i * theVoronoi = (__m256i*) Vor;

__m256 Xd ,Yd ,XdSquared ,YdSquared;

and then in a loop:

__m256i tempx = _mm256_set1_epi32( x );
__m256  xIdx  = _mm256_castsi256_ps( tempx );

__m256i tempy = _mm256_set1_epi32( y );
__m256  yIdx  = _mm256_castsi256_ps( tempy );

Xd = _mm256_sub_ps( theX[ i ] , xIdx );
Yd = _mm256_sub_ps( theY[ i ] , yIdx );

XdSquared = _mm256_mul_ps( Xd , Xd );
YdSquared = _mm256_mul_ps( Yd , Yd );

SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

__m256 theMin = _mm256_min_ps( SIMDTempD[ i ] , D );

When I run the code, it gives:

*** glibc detected *** .. double free or corruption

If I comment out the line:

SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

then the code runs without a problem!

Am I missing something here?

George · ‎04-08-2015

Instead of using SIMDTempD, I even tried:

float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );

for ( int i = 0; i < N; i++ )

      TempD[ i ] = 0;

__m256 now = _mm256_load_ps( &TempD[ i ] );
now =  _mm256_add_ps( XdSquared , YdSquared );

(instead of

SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

)

but I am getting a segmentation fault.

jimdempseyatthecove · ‎04-08-2015

George,

You allocate TempD as N floats.

Your vector size is 8 floats.

In your #1 code the loop control for i is not shown, but in #2 you index by float, not by vector stride (in this case, not with i += 8).

Your #2 may have failed when i exceeded N-8 and the 8-float access ran off the end of the array. The fault would occur only if you were lucky enough (or unlucky enough) for the buffer overrun to cross the end of the mapped pages.
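
A minimal sketch of the stride point above, not George's actual loop: the helper name sum_squares_avx and the parameter names are made up for illustration. It walks the float arrays eight elements at a time and finishes any leftover elements with scalar code, so no 256-bit access reaches past element n - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* GCC/Clang: lets the function use AVX even without -mavx.
   Other compilers: enable AVX with the appropriate build option. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: out[i] = a[i]*a[i] + b[i]*b[i] for i in [0, n). */
AVX_FN void sum_squares_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && out != NULL);

    /* Vector part: i advances by 8 floats (one __m256) per iteration.
       The condition i + 8 <= n keeps the last 8-wide access in bounds. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned load: no alignment assumed */
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 r  = _mm256_add_ps(_mm256_mul_ps(va, va), _mm256_mul_ps(vb, vb));
        _mm256_storeu_ps(out + i, r);
    }

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++)
        out[i] = a[i] * a[i] + b[i] * b[i];
}
```

If the arrays come from _mm_malloc with an alignment of 32 or 64 and i is always a multiple of 8, the aligned _mm256_load_ps and _mm256_store_ps can be used instead. At any other index the aligned forms can fault even when the access is in bounds.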

Jim Dempsey

George · ‎04-08-2015

Hello Jim and thanks for helping.

I am new to vector/intrinsics programming, so I don't fully understand all the concepts yet.

The allocation I am using for theX,theY... is like:

const int N = 80;
const int Nx = 32 , Ny = 32 ,width = 256 , height = 256;
const int TotalN = width * height;

float * X = (float*) _mm_malloc( Nx * Ny * sizeof (*X) ,64 );
float * Y = (float*) _mm_malloc( Nx * Ny * sizeof (*Y) ,64 );
int * V = (int*) _mm_malloc( NbOfPoints * sizeof (*V) ,64 );

int * Vor = (int*) _mm_malloc ( TotalN * N * sizeof(*Vor) ,64 );

and the loops I am using which have the intrinsics are:

for ( y = 0; y < height; y++ )
{
      for ( x = 0; x < width; x++ )
      {
                __m256 D = _mm256_set1_ps( FLT_MAX );
                
                for ( int i = 0; i < N; i++ )
                {
                    __m256i tempx = _mm256_set1_epi32( x );

             ..........

I would appreciate it if you could explain how I can program what I want using intrinsics.

Thank you!

George · ‎04-08-2015

And lastly, I have one more question.

After finding the minimum (see post 1):

__m256 theMin = _mm256_min_ps( SIMDTempD[ i ] , D );

I am storing:

D = theMin;
theThreshold = theV[ i ];  /* theThreshold is declared as __m256i */

and then store theThreshold to output array:

_mm256_store_si256( &theVor[ x + y * width ] , theThreshold );

Is this right? Do I have to do something else?

I remind you that, in post 1, I am doing:

__m256i * theVor = (__m256i*) Vor;

where Vor is my output int array.

Can I do something like:

 theVor[ x + y * width ] = theThreshold;

instead?

Thanks!
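
A note on index units for the question above, as a sketch only: the helper names below are hypothetical, and out stands in for an int output array such as Vor. Indexing through a __m256i pointer counts in whole vectors of eight ints, while indexing through an int pointer counts in single ints.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* GCC/Clang: lets the functions use AVX even without -mavx.
   Other compilers: enable AVX with the appropriate build option. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: writes value to out[k] .. out[k+7]; k counts ints. */
AVX_FN void store8_at_int_index(int *out, size_t k, int value)
{
    assert(out != NULL);
    _mm256_storeu_si256((__m256i *)(out + k), _mm256_set1_epi32(value));
}

/* Hypothetical helper: writes value to out[8*j] .. out[8*j+7]; j counts vectors. */
AVX_FN void store8_at_vector_index(int *out, size_t j, int value)
{
    assert(out != NULL);
    _mm256_storeu_si256((__m256i *)out + j, _mm256_set1_epi32(value));
}
```

With a __m256i pointer such as theVor, theVor[ j ] therefore covers ints 8*j through 8*j+7, not int j. Plain assignment through such a pointer is generally compiled as an aligned 32-byte store, so it carries the same alignment requirement as _mm256_store_si256; the sketch uses the unaligned store only so that it makes no alignment assumption.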

jimdempseyatthecove · ‎04-08-2015

If x and y are used to compute an index into X and/or Y then how does Nx*Ny relate to width and/or height?

Except for _mm256_set1_..., the loads/stores, and a few others, the _mm256_... vector intrinsics require __m256 (vector-wide) data.
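
As a small illustration of getting a scalar into vector-wide form (a sketch with made-up helper names, not taken from George's code): _mm256_set1_epi32 broadcasts an integer into all eight lanes, and _mm256_cvtepi32_ps then converts those integers to floats by value. _mm256_castsi256_ps, which appears in the first post, only reinterprets the bits and performs no conversion, so an integer 3 does not become 3.0f.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: lets the functions use AVX even without -mavx.
   Other compilers: enable AVX with the appropriate build option. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: broadcast x, convert int -> float by value, return lane 0. */
AVX_FN float lane0_converted(int x)
{
    float lanes[8];
    __m256 v = _mm256_cvtepi32_ps(_mm256_set1_epi32(x));
    _mm256_storeu_ps(lanes, v);
    return lanes[0];
}

/* Hypothetical helper: broadcast x, reinterpret the same bits as float, return lane 0. */
AVX_FN float lane0_reinterpreted(int x)
{
    float lanes[8];
    __m256 v = _mm256_castsi256_ps(_mm256_set1_epi32(x));
    _mm256_storeu_ps(lanes, v);
    return lanes[0];
}

/* Checks both behaviours for one sample value. */
void check_broadcast(void)
{
    assert(lane0_converted(3) == 3.0f);      /* value preserved */
    assert(lane0_reinterpreted(3) != 3.0f);  /* bit pattern 0x00000003 is not 3.0f */
}
```

When the scalar is already available as a float, _mm256_set1_ps( (float) x ) gives the same result as the convert path in one step.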

I suggest you take a non-intrinsic but vectorizable version of some arbitrary small function and compile it to assembler output with the source included. Then look at the output for ideas of what to do. Next, copy the small function body into a differently named function and start adding intrinsics (a few statements at a time), comparing the outputs of running both functions. Keep converting more statements to intrinsics, checking the results at each step.
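
One way to follow this workflow, sketched with hypothetical names (dist2_scalar, dist2_avx, max_abs_diff): keep a plain C reference, write the intrinsic version alongside it, and compare the outputs on the same input. The computation used here, the squared distance from a point (px, py) to each of n sites, is only a guess at the kind of work done in the first post.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* GCC/Clang: lets the function use AVX even without -mavx.
   Other compilers: enable AVX with the appropriate build option. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Reference version: plain C, easy to check by hand. */
void dist2_scalar(const float *X, const float *Y, float px, float py,
                  float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float dx = X[i] - px, dy = Y[i] - py;
        out[i] = dx * dx + dy * dy;
    }
}

/* Intrinsic version; n is assumed to be a multiple of 8 (checked below). */
AVX_FN void dist2_avx(const float *X, const float *Y, float px, float py,
                      float *out, size_t n)
{
    assert(n % 8 == 0);
    const __m256 vpx = _mm256_set1_ps(px);  /* broadcast px to all 8 lanes */
    const __m256 vpy = _mm256_set1_ps(py);
    for (size_t i = 0; i < n; i += 8) {
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(X + i), vpx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(Y + i), vpy);
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_mul_ps(dx, dx), _mm256_mul_ps(dy, dy)));
    }
}

/* Largest absolute difference between two result arrays. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Running both functions on the same X and Y and checking that max_abs_diff stays at or near zero gives a quick regression test after each converted statement. With GCC, for example, compiling with -S -fverbose-asm writes annotated assembly for the scalar function to compare against.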

Jim Dempsey