Hello,
I am using something like this in my code:
int x , y;
float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );
__m256 * SIMDTempD = (__m256*) TempD;
__m256 * theX = (__m256*) X;
__m256 * theY = (__m256*) Y;
__m256i * theV = (__m256i*) V;
__m256i * theVoronoi = (__m256i*) Vor;
__m256 Xd ,Yd ,XdSquared ,YdSquared;
and then in a loop:
__m256i tempx = _mm256_set1_epi32( x );
__m256 xIdx = _mm256_castsi256_ps( tempx );
__m256i tempy = _mm256_set1_epi32( y );
__m256 yIdx = _mm256_castsi256_ps( tempy );
Xd = _mm256_sub_ps( theX[ i ] , xIdx );
Yd = _mm256_sub_ps( theY[ i ] , yIdx );
distXSquared = _mm256_mul_ps( Xd , Xd );
distYSquared = _mm256_mul_ps( Yd , Yd );
SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );
__m256 theMin = _mm256_min_ps( SIMDTempDistance[ i ] , D );
When I run the code, it gives:
*** glibc detected *** .. double free or corruption
If I comment out the line:
SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );
then the code runs without a problem!
Am I missing something here?
I even tried this, instead of using SIMDTempD:
float * TempD = (float*) _mm_malloc( N * sizeof(*TempD) ,64 );
for ( int i = 0; i < N; i++ ) TempD[ i ] = 0;
__m256 now = _mm256_load_ps( &TempD[ i ] );
now = _mm256_add_ps( XdSquared , YdSquared );
(instead of
SIMDTempD[ i ] = _mm256_add_ps( XdSquared , YdSquared );
)
but I get a segmentation fault.
George,
You allocate TempD to hold N floats.
Your vector size is 8 floats.
Your #1 code does not show the loop control for i, but #2 does, and there i indexes by float rather than by vector stride (in this case, it does not advance by i+=8).
Your #2 may have failed once i exceeded N-8 and the 8-float load ran off the end of the array. The fault would occur only if you were lucky enough (or unlucky enough) for the buffer overrun to cross the boundary of the mapped pages.
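A minimal sketch of the stride point, not George's actual code: the hypothetical helper `squared_distances` walks n floats in steps of 8, so the last 8-float store ends exactly at out[n-1]. It assumes n is a multiple of 8, that all buffers are 32-byte aligned (a 64-byte `_mm_malloc` alignment satisfies this), and that the compiler is GCC or Clang (the `target("avx")` attribute stands in for compiling with `-mavx`).

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper, not from the thread: for each of n points, writes
   (X[i]-px)^2 + (Y[i]-py)^2 into out[i].
   Assumes n is a multiple of 8 and X, Y, out are 32-byte aligned. */
__attribute__((target("avx")))   /* GCC/Clang only; or compile with -mavx */
void squared_distances(const float *X, const float *Y, float px, float py,
                       float *out, size_t n)
{
    assert(n % 8 == 0);                      /* no tail handling in this sketch */
    const __m256 vx = _mm256_set1_ps(px);    /* broadcast px to all 8 lanes */
    const __m256 vy = _mm256_set1_ps(py);

    for (size_t i = 0; i < n; i += 8) {      /* one __m256 covers 8 floats */
        __m256 dx = _mm256_sub_ps(_mm256_load_ps(&X[i]), vx);
        __m256 dy = _mm256_sub_ps(_mm256_load_ps(&Y[i]), vy);
        __m256 d2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                                  _mm256_mul_ps(dy, dy));
        _mm256_store_ps(&out[i], d2);        /* touches out[i] .. out[i+7] only */
    }
}
```

If the buffers are instead viewed through `__m256 *` pointers, as in post #1, the equivalent loop bound is n/8 vector elements rather than n.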
Jim Dempsey
Hello Jim, and thanks for helping.
I am new to vector/intrinsics programming, so I don't fully understand all the concepts yet.
The allocation I am using for theX, theY... is like:
const int N = 80;
const int Nx = 32 , Ny = 32 ,width = 256 , height = 256;
const int TotalN = width * height;
float * X = (float*) _mm_malloc( Nx * Ny * sizeof (*X) ,64 );
float * Y = (float*) _mm_malloc( Nx * Ny * sizeof (*Y) ,64 );
int * V = (int*) _mm_malloc( NbOfPoints * sizeof (*V) ,64 );
int * Vor = (int*) _mm_malloc ( TotalN * N * sizeof(*Vor) ,64 );
and the loops I am using which have the intrinsics are:
for ( y = 0; y < height; y++ ) {
    for ( x = 0; x < width; x++ ) {
        __m256 D = _mm256_set1_ps( FLT_MAX );
        for ( int i = 0; i < N; i++ ) {
            __m256i tempx = _mm256_set1_epi32( x );
            ..........
I would appreciate it if you could explain how I can program what I want using the intrinsics.
Thank you!
And lastly, I have one more question.
After finding the minimum (see post 1):
__m256 theMin = _mm256_min_ps( SIMDTempDistance[ i ] , D );
I am storing:
D = theMin;
theThreshold = theV[ i ];
(theThreshold is declared as __m256i theThreshold;)
and then I store theThreshold to the output array:
_mm256_store_si256( &theVor[ x + y * width ] , theThreshold );
Is this right? Do I have to do something else?
As a reminder, in post 1 I am doing:
__m256i * theVor = (__m256i*) Vor;
where Vor is my output float array.
Can I do something like:
theVor[ x + y * width ] = theThreshold;
instead?
Thanks!
George wrote:
The allocation I am using for theX, theY... is like:
const int N = 80;
const int Nx = 32 , Ny = 32 ,width = 256 , height = 256;
const int TotalN = width * height;
float * X = (float*) _mm_malloc( Nx * Ny * sizeof (*X) ,64 );
float * Y = (float*) _mm_malloc( Nx * Ny * sizeof (*Y) ,64 );
int * V = (int*) _mm_malloc( NbOfPoints * sizeof (*V) ,64 );
int * Vor = (int*) _mm_malloc ( TotalN * N * sizeof(*Vor) ,64 );
and the loops I am using which have the intrinsics are:
for ( y = 0; y < height; y++ ) {
    for ( x = 0; x < width; x++ ) {
        __m256 D = _mm256_set1_ps( FLT_MAX );
        for ( int i = 0; i < N; i++ ) {
            __m256i tempx = _mm256_set1_epi32( x );
            ..........
I would appreciate it if you could explain how I can program what I want using the intrinsics.
Thank you!
If x and y are used to compute an index into X and/or Y then how does Nx*Ny relate to width and/or height?
Apart from _mm256_set1_..., the loads/stores, and a few others, the _mm256_... vector intrinsics require __m256 (vector-wide) operands.
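One related detail from post #1, shown here as a hedged sketch: `_mm256_castsi256_ps` only reinterprets the 256 bits as floats, while `_mm256_cvtepi32_ps` converts the integer values numerically. The two helper functions below are hypothetical names used for illustration; the `target("avx")` attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: broadcast x, reinterpret the bits as float,
   and return lane 0. No numeric conversion takes place. */
__attribute__((target("avx")))
float lane0_reinterpreted(int x)
{
    __m256i vi = _mm256_set1_epi32(x);
    __m256  vf = _mm256_castsi256_ps(vi);   /* same bits, viewed as floats */
    float out[8];
    _mm256_storeu_ps(out, vf);
    return out[0];
}

/* Hypothetical helper: broadcast x, convert int -> float numerically,
   and return lane 0. */
__attribute__((target("avx")))
float lane0_converted(int x)
{
    __m256i vi = _mm256_set1_epi32(x);
    __m256  vf = _mm256_cvtepi32_ps(vi);    /* 3 becomes 3.0f */
    float out[8];
    _mm256_storeu_ps(out, vf);
    return out[0];
}
```

With x = 3, the converted lane holds 3.0f, while the reinterpreted lane holds the tiny denormal whose bit pattern is 0x00000003. For a loop index, `_mm256_set1_ps( (float) x )` reaches the converted result in one step.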
I suggest you take a non-intrinsic version of some small, arbitrary function that the compiler vectorizes, and compile it with assembler output and the source included. Then look at the output for ideas of what to do. Next, copy the body of the small function into a differently named function and start adding intrinsics a few statements at a time, comparing the outputs of the two functions as you go. Keep converting more statements to intrinsics, checking the results at each step.
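A minimal sketch of that workflow, assuming a running-minimum update as the small function: `min_update_ref` is the plain C reference, `min_update_avx` is the copy converted to intrinsics, and `same_floats` compares their outputs. All three names are hypothetical; the sketch assumes n is a multiple of 8 and GCC or Clang for the `target("avx")` attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Reference version: plain C, easy to check by eye. */
void min_update_ref(float *best, const float *cand, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cand[i] < best[i])
            best[i] = cand[i];
}

/* Converted copy: same result for non-NaN inputs, 8 floats per step.
   Unaligned loads/stores are used so any float buffer works. */
__attribute__((target("avx")))
void min_update_avx(float *best, const float *cand, size_t n)
{
    assert(n % 8 == 0);                      /* no tail handling in this sketch */
    for (size_t i = 0; i < n; i += 8) {
        __m256 b = _mm256_loadu_ps(&best[i]);
        __m256 c = _mm256_loadu_ps(&cand[i]);
        _mm256_storeu_ps(&best[i], _mm256_min_ps(c, b));
    }
}

/* Returns 1 if both arrays hold identical values, 0 otherwise. */
int same_floats(const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Running both functions on copies of the same input and checking `same_floats` after each converted statement keeps a mistake confined to the most recent change.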
Jim Dempsey
