Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Alignment problem

unrue
Beginner
742 Views

Dear Intel Developers,

I'm using Intel icc version 15.0.1 on a C program. I'm trying to align a structure of arrays, and the same structure is passed to a computational kernel that uses intrinsics. I'm not sure I'm doing the allocations correctly:

struct traces_32 {
    float32* r;
    float32* i;
};

typedef struct traces_32 traces32;

.....


traces32* traces = (traces32*)_mm_malloc(*ntr * sizeof(traces32), 16);

for (i = 0; i < *ntr; i++) {
      traces[i].r = (float32 *)_mm_malloc( (nsamples_padded) * sizeof(float32), 16);
      traces[i].i = (float32 *)_mm_malloc( (nsamples_padded) * sizeof(float32), 16);
  }

Is this the right way to do it? The code dies in the computational kernel on _mm_load_ps when traces is involved. If I use _mm_loadu_ps and malloc instead of _mm_malloc, the kernel works well, so it seems to be an alignment problem. Could you help me? Thanks.

0 Kudos
8 Replies
jimdempseyatthecove
Honored Contributor III
742 Views

traces need not be aligned, since it is an array of structures that each contain two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly when copying one traces32 structure to another). Can you show more of the code? Also, it sometimes helps in a debug build to insert asserts to ensure that what you are going to use is in fact what you think you are going to use.

Jim Dempsey
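
One way to apply the assert suggestion is to check each pointer just before it reaches an aligned load. This is a minimal sketch, assuming float32 is a typedef for float (the thread never shows its definition); is_aligned16 and check_traces are hypothetical helper names, not part of any Intel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free */

typedef float float32;   /* assumption: float32 is plain float */

typedef struct traces_32 {
    float32 *r;
    float32 *i;
} traces32;

/* Returns 1 if p sits on a 16-byte boundary, as _mm_load_ps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Debug-build check: every real/imaginary array must be 16-byte aligned. */
static void check_traces(const traces32 *traces, int ntr)
{
    for (int k = 0; k < ntr; k++) {
        assert(is_aligned16(traces[k].r));
        assert(is_aligned16(traces[k].i));
    }
}
```

The alignment must also hold for the address actually passed to _mm_load_ps: an aligned base pointer plus an offset that is not a multiple of four floats is misaligned again.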

0 Kudos
unrue
Beginner
742 Views

jimdempseyatthecove wrote:

traces need not be aligned, since it is an array of structures that each contain two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly when copying one traces32 structure to another). Can you show more of the code? Also, it sometimes helps in a debug build to insert asserts to ensure that what you are going to use is in fact what you think you are going to use.

Jim Dempsey

 

Hi Jim, the original code worked as is:

 traces = (complex32 **)malloc( *ntr * sizeof(complex32 *));
  for (i = 0; i < *ntr; i++) 
      traces[i] = (complex32 *)malloc( *nsamples * sizeof(complex32));


for( n... {
   for(j ... {
       sample_r = traces.r
       sample_i = traces.i

       }
   }

 

That layout is very hard to vectorize, because each element is a complex structure. So I changed the code as shown in my first post, in order to have contiguous elements for the real and imaginary parts. My new usage is:

for( n... {
    for(j....{
        sample_r = traces.r
        sample_i = traces.i
    }
}
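
For reference, the structure-of-arrays layout described above could be allocated and traversed along these lines. This is only a sketch: float32 is assumed to be float, nsamples_padded is assumed to be a multiple of 4, and alloc_traces, free_traces and sum_power are hypothetical names standing in for the real kernel, which is not shown in the thread.

```c
#include <stddef.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free, SSE intrinsics */

typedef float float32;   /* assumption: float32 is plain float */

typedef struct traces_32 { float32 *r; float32 *i; } traces32;

/* Allocate ntr traces with 16-byte aligned r/i arrays.
   Returns NULL if the outer allocation fails; the caller should
   still check the individual r/i pointers. */
static traces32 *alloc_traces(int ntr, int nsamples_padded)
{
    traces32 *t = (traces32 *)_mm_malloc((size_t)ntr * sizeof(traces32), 16);
    if (!t) return NULL;
    for (int k = 0; k < ntr; k++) {
        t[k].r = (float32 *)_mm_malloc((size_t)nsamples_padded * sizeof(float32), 16);
        t[k].i = (float32 *)_mm_malloc((size_t)nsamples_padded * sizeof(float32), 16);
    }
    return t;
}

/* Release everything obtained from alloc_traces. */
static void free_traces(traces32 *t, int ntr)
{
    if (!t) return;
    for (int k = 0; k < ntr; k++) {
        if (t[k].r) _mm_free(t[k].r);
        if (t[k].i) _mm_free(t[k].i);
    }
    _mm_free(t);
}

/* Sum of r*r + i*i over one trace, four samples per iteration.
   j advances by 4 floats (16 bytes), so every load stays aligned. */
static float sum_power(const traces32 *t, int nsamples_padded)
{
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < nsamples_padded; j += 4) {
        __m128 re = _mm_load_ps(t->r + j);
        __m128 im = _mm_load_ps(t->i + j);
        acc = _mm_add_ps(acc, _mm_add_ps(_mm_mul_ps(re, re), _mm_mul_ps(im, im)));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);   /* horizontal sum of the four lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With this layout the sample index is applied to the r and i arrays of one trace, so consecutive real (or imaginary) values are contiguous in memory.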

 

0 Kudos
TimP
Honored Contributor III
742 Views

A structure-of-arrays organization may be required to take advantage of AVX (256-bit) and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.

0 Kudos
JWong19
Beginner
742 Views

How did it die? Do you have a screen capture as illustration?

Could you show the corresponding disassembly and register values?

0 Kudos
unrue
Beginner
742 Views

Tim P. wrote:

A structure-of-arrays organization may be required to take advantage of AVX (256-bit) and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.

 

Hi Tim. Could you explain this point in more detail? So a structure of arrays is not always the best solution? And apart from this question, is my alignment right?

0 Kudos
TimP
Honored Contributor III
742 Views

I'm agreeing that you may have chosen a reasonable method to support AVX optimization, but I don't see that it would have an advantage on a non-AVX CPU.  So I'm guessing you are motivated by AVX, although you didn't show enough to evaluate that question.

0 Kudos
unrue
Beginner
742 Views

Hi Tim, I'm developing both SSE and AVX versions in order to get the best performance, so I would like to know whether my alignment is correct. I still don't understand whether the alignment of my traces structure, using _mm_malloc as in the first post, is right or not.
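
One detail matters when the same buffers feed both an SSE and an AVX kernel: _mm_load_ps requires 16-byte alignment, while _mm256_load_ps requires 32-byte alignment, so 16-byte allocations are not sufficient for aligned AVX loads. Below is a hedged sketch of an allocation helper that covers both cases; pad_to, alloc_samples and SIMD_ALIGN are hypothetical names, and float32 is assumed to be float.

```c
#include <stddef.h>
#include <immintrin.h>   /* _mm_malloc, _mm_free */

typedef float float32;   /* assumption: float32 is plain float */

#define SIMD_ALIGN 32    /* 32 bytes satisfies both SSE (16) and AVX (32) aligned loads */

/* Round n up to the next multiple of lanes (lanes must be > 0). */
static int pad_to(int n, int lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}

/* Allocate a zero-filled, 32-byte aligned sample array whose length is
   padded to a multiple of 8 floats (one AVX register).
   Returns NULL on failure; release with _mm_free. */
static float32 *alloc_samples(int nsamples, int *nsamples_padded)
{
    int n = pad_to(nsamples, 8);
    float32 *p = (float32 *)_mm_malloc((size_t)n * sizeof(float32), SIMD_ALIGN);
    if (p) {
        for (int j = 0; j < n; j++) p[j] = 0.0f;
        *nsamples_padded = n;
    }
    return p;
}
```

Padding to a multiple of 8 also keeps the length a multiple of 4, so the same array can be walked by either kernel without a scalar remainder loop.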

0 Kudos
jimdempseyatthecove
Honored Contributor III
742 Views

>>Structure of arrays is not ever the best solution so?

The above is a generalized statement. TimP was referring to the special case of complex numbers. A complex number is a two-element structure with specific operational characteristics that make it somewhat compatible with AVX manipulations. See http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX near the bottom of the page.

That article illustrates vectorization of complex multiply.

As to whether SOA or AOS is better for vectorization, that depends on your application.

Jim Dempsey
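
To make the comparison concrete, here is a sketch of a complex multiply on the structure-of-arrays layout from the first post, using only SSE. It is not taken from the linked article; cmul_soa is a hypothetical name, and all arrays are assumed to be 16-byte aligned with n a multiple of 4.

```c
#include <immintrin.h>   /* SSE intrinsics */

/* c = a * b for n complex numbers stored as separate real/imaginary arrays:
   (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
static void cmul_soa(const float *ar, const float *ai,
                     const float *br, const float *bi,
                     float *cr, float *ci, int n)
{
    for (int j = 0; j < n; j += 4) {
        __m128 xr = _mm_load_ps(ar + j);
        __m128 xi = _mm_load_ps(ai + j);
        __m128 yr = _mm_load_ps(br + j);
        __m128 yi = _mm_load_ps(bi + j);
        /* real part: ar*br - ai*bi */
        _mm_store_ps(cr + j, _mm_sub_ps(_mm_mul_ps(xr, yr), _mm_mul_ps(xi, yi)));
        /* imaginary part: ar*bi + ai*br */
        _mm_store_ps(ci + j, _mm_add_ps(_mm_mul_ps(xr, yi), _mm_mul_ps(xi, yr)));
    }
}
```

With interleaved (array-of-structures) complex data, the same product needs shuffles to separate the real and imaginary lanes, which is the trade-off discussed in this thread.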

0 Kudos