
_mm256_add_ps crashes program

George
Beginner

Hello,

I am using in my code something like:
 

int x , y;

float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );
    
__m256  * SIMDTempD = (__m256*) TempD;
__m256  * theX = (__m256*) X;
__m256  * theY = (__m256*) Y;
__m256i * theV = (__m256i*) V;
__m256i * theVoronoi = (__m256i*) Vor;

__m256 Xd ,Yd ,XdSquared ,YdSquared;

 

and then in a loop:
 

__m256i tempx = _mm256_set1_epi32( x );
__m256    xIdx  = _mm256_castsi256_ps( tempx );
                    
__m256i tempy = _mm256_set1_epi32( y );
 __m256    yIdx  = _mm256_castsi256_ps( tempy );
                    
Xd = _mm256_sub_ps( theX[ i ] , xIdx );
Yd = _mm256_sub_ps( theY[ i ] , yIdx );
                    
XdSquared = _mm256_mul_ps( Xd , Xd );
YdSquared = _mm256_mul_ps( Yd , Yd );
                    
SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

                   
__m256 theMin = _mm256_min_ps( SIMDTempD[ i ] , D );

 

When I run the code, it gives:

*** glibc detected *** .. double free or corruption 

 

If I comment out the line:

SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

then the code runs without a problem!

Am I missing something here?

 

George
Beginner

I even tried the following instead of using SIMDTempD:
 

float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );

for ( int i = 0; i < N; i++ )

      TempD[ i ] = 0;

__m256 now = _mm256_load_ps( &TempD[ i ] );
now =  _mm256_add_ps( XdSquared , YdSquared );

(instead of

SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );

)

but I get a segmentation fault.

 

 

jimdempseyatthecove
Honored Contributor III

George,

You allocate TempD to hold N floats.

Your vector size is 8 floats.

Your #1 code does not show the loop control for i, but #2 shows it indexing by float rather than by vector stride (that is, not by i += 8).

Your #2 may have failed when i exceeded N-8 and the 8-float load ran off the end of the array. The fault occurs only if you are lucky enough (or unlucky enough) to have the buffer overrun cross into an unmapped page.

Jim Dempsey
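
To make the stride point concrete: with N = 80 floats, TempD holds only 10 __m256 vectors, so if the loop in post #1 also runs i from 0 to N-1, SIMDTempD[ i ] writes well past the end of the allocation, which would be consistent with the glibc heap-corruption message. Below is a minimal sketch of a loop that advances by 8 floats per iteration and finishes with a scalar tail. The function name squared_distances and its parameters are hypothetical, not taken from George's code. The target attribute assumes GCC or Clang on an AVX-capable x86 CPU, and unaligned loads/stores are used so the sketch works on any buffer.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper (not from the original post):
   out[i] = (X[i] - px)^2 + (Y[i] - py)^2 for i in [0, n). */
__attribute__((target("avx")))
void squared_distances(const float *X, const float *Y,
                       float px, float py, float *out, int n)
{
    assert(n >= 0);

    /* Broadcast the query point into all 8 lanes as float values. */
    const __m256 vx = _mm256_set1_ps(px);
    const __m256 vy = _mm256_set1_ps(py);

    int i = 0;

    /* Vector part: i counts floats and advances by 8 per iteration,
       so each load/store touches floats i .. i+7 and stays in bounds. */
    for (; i + 8 <= n; i += 8) {
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(X + i), vx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(Y + i), vy);
        __m256 d  = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                                  _mm256_mul_ps(dy, dy));
        _mm256_storeu_ps(out + i, d);
    }

    /* Scalar tail: handles the last n % 8 elements one at a time. */
    for (; i < n; i++) {
        float dx = X[i] - px;
        float dy = Y[i] - py;
        out[i] = dx * dx + dy * dy;
    }
}
```

With buffers from _mm_malloc( ..., 32 ) or stricter, the aligned _mm256_load_ps / _mm256_store_ps variants can be used in the vector part. Note that the sketch takes the point as floats: _mm256_castsi256_ps only reinterprets the bits of an integer vector, whereas _mm256_cvtepi32_ps converts integer values to floats.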

George
Beginner

Hello Jim and thanks for helping.

I am new to vector/intrinsics programming, so I don't fully understand all the concepts.

The allocations I am using for theX, theY, ... look like this:


 

const int N = 80;
const int Nx = 32 , Ny = 32 ,width = 256 , height = 256;
const int TotalN = width * height;

float * X = (float*) _mm_malloc( Nx * Ny * sizeof (*X) ,64 );
float * Y = (float*) _mm_malloc( Nx * Ny * sizeof (*Y) ,64 );
int * V = (int*) _mm_malloc( NbOfPoints * sizeof (*V) ,64 );

int * Vor = (int*) _mm_malloc ( TotalN * N * sizeof(*Vor) ,64 );

and the loops that contain the intrinsics are:

for ( y = 0; y < height; y++ )
{
      for ( x = 0; x < width; x++ )
      {
                __m256 D = _mm256_set1_ps( FLT_MAX );
                
                for ( int i = 0; i < N; i++ )
                {
                    __m256i tempx = _mm256_set1_epi32( x );

             ..........

 

I would appreciate it if you could explain to me how I can program what I want using the intrinsics.

Thank you!

 

 

George
Beginner

And lastly, I want to ask one more thing.

After finding the minimum (see post 1):

__m256 theMin = _mm256_min_ps( SIMDTempD[ i ] , D );

I am storing:

D = theMin;
theThreshold = theV[ i ];  // theThreshold is declared as: __m256i theThreshold;

and then I store theThreshold to the output array:

_mm256_store_si256( &theVor[ x + y * width ] , theThreshold );

 

Is this right? Do I have to do something else?

As a reminder, in post 1 I am doing:

__m256i * theVor = (__m256i*) Vor;

where Vor is my output float array.

Can I do something like:

 theVor[ x + y * width ] = theThreshold;

instead?

 

Thanks!

jimdempseyatthecove
Honored Contributor III

If x and y are used to compute an index into X and/or Y, then how does Nx*Ny relate to width and/or height?

Except for _mm256_set1_..., the loads/stores, and a few others, the _mm256_... vector intrinsics require __m256 (vector-wide) operands.

I suggest you take a non-intrinsic but vectorizable version of some arbitrary small function and compile it with assembler output, with the source included. Then look at the output for ideas of what to do. Next, copy the small function body into a differently named function and start adding intrinsics, a few statements at a time, comparing the outputs of running both functions. Keep converting more statements to intrinsics, checking the results at each step.

Jim Dempsey
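
As an illustration of the side-by-side workflow described above, here is a minimal sketch that keeps a plain scalar reference next to an intrinsic version so their results can be compared on the same inputs. Both function names (min_sq_dist_ref, min_sq_dist_avx) and the problem shape (smallest squared distance from one point to n points) are assumptions made for illustration, not George's actual kernel. The target attribute assumes GCC or Clang on an AVX-capable x86 CPU, and n is assumed to be a multiple of 8.

```c
#include <assert.h>
#include <float.h>
#include <immintrin.h>

/* Scalar reference (hypothetical): smallest squared distance
   from the point (px, py) to the n points (X[i], Y[i]). */
float min_sq_dist_ref(const float *X, const float *Y,
                      float px, float py, int n)
{
    float best = FLT_MAX;
    for (int i = 0; i < n; i++) {
        float dx = X[i] - px;
        float dy = Y[i] - py;
        float d  = dx * dx + dy * dy;
        if (d < best) best = d;
    }
    return best;
}

/* Intrinsic version of the same function; n must be a multiple of 8. */
__attribute__((target("avx")))
float min_sq_dist_avx(const float *X, const float *Y,
                      float px, float py, int n)
{
    assert(n >= 0 && n % 8 == 0);

    const __m256 vx = _mm256_set1_ps(px);
    const __m256 vy = _mm256_set1_ps(py);
    __m256 best = _mm256_set1_ps(FLT_MAX);   /* per-lane running minimum */

    for (int i = 0; i < n; i += 8) {         /* stride of 8 floats */
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(X + i), vx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(Y + i), vy);
        __m256 d  = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                                  _mm256_mul_ps(dy, dy));
        best = _mm256_min_ps(best, d);
    }

    /* Reduce the 8 per-lane minima to a single float. */
    float lanes[8];
    _mm256_storeu_ps(lanes, best);
    float m = lanes[0];
    for (int k = 1; k < 8; k++) {
        if (lanes[k] < m) m = lanes[k];
    }
    return m;
}
```

Calling both functions on the same arrays and comparing the return values gives a quick check after each statement is converted. Compiling the scalar version with assembler output (for example -S with GCC or Clang) shows what the compiler generates for it.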

 
