Alignment problem

unrue · ‎02-03-2016

Dear Intel Developers,

I'm using Intel icc version 15.0.1 on a C program. I'm trying to align a structure of arrays, and the same structure is passed to a computational kernel that uses intrinsics. I'm not sure I'm doing the allocations correctly:

struct traces_32 {
    float32* r;
    float32* i;
};

typedef struct traces_32 traces32;

.....


traces32* traces = (traces32*)_mm_malloc(*ntr * sizeof(traces32), 16);

for (i = 0; i < *ntr; i++) {
    traces[i].r = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), 16);
    traces[i].i = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), 16);
}

Is this right? The code dies in the computational kernel on _mm_load_ps when traces is involved. If I use _mm_loadu_ps and malloc instead of _mm_malloc, the kernel works well, so it seems to be an alignment problem. Could you help me? Thanks.

jimdempseyatthecove · ‎02-03-2016

traces itself need not be aligned, since it is an array of structures each containing two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly for copying one traces32 structure to another). Can you show more of the code? Also, in a debug build it helps at times to insert asserts to ensure that what you are going to use is in fact what you think you are going to use.

Jim Dempsey
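
The assert suggestion could look like the sketch below. It assumes float32 is a typedef for float; is_aligned, alloc_traces and free_traces are hypothetical helper names, not taken from the original code. The pointers that _mm_load_ps actually receives are each trace's r and i arrays, so those are the ones checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free */

typedef float float32;  /* assumption: float32 is a plain 32-bit float */

typedef struct traces_32 {
    float32 *r;  /* real parts, one contiguous array per trace */
    float32 *i;  /* imaginary parts, one contiguous array per trace */
} traces32;

/* Hypothetical helper: nonzero if p is aligned to `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: allocate ntr traces of nsamples_padded floats each. */
static traces32 *alloc_traces(int ntr, size_t nsamples_padded, size_t alignment)
{
    /* The array of structs only holds pointers; plain malloc would also do. */
    traces32 *traces = (traces32 *)_mm_malloc((size_t)ntr * sizeof(traces32), alignment);
    if (traces == NULL)
        return NULL;

    for (int t = 0; t < ntr; t++) {
        traces[t].r = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), alignment);
        traces[t].i = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), alignment);
        /* Check exactly the pointers the SIMD kernel will load from. */
        assert(traces[t].r != NULL && is_aligned(traces[t].r, alignment));
        assert(traces[t].i != NULL && is_aligned(traces[t].i, alignment));
    }
    return traces;
}

/* Memory from _mm_malloc must be released with _mm_free, not free. */
static void free_traces(traces32 *traces, int ntr)
{
    for (int t = 0; t < ntr; t++) {
        _mm_free(traces[t].r);
        _mm_free(traces[t].i);
    }
    _mm_free(traces);
}
```

Placing the same is_aligned assert immediately before each _mm_load_ps in the kernel would show whether the faulting address is really one of these arrays or an offset into one that is not a multiple of 16 bytes.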

unrue · ‎02-04-2016

jimdempseyatthecove wrote:

traces itself need not be aligned, since it is an array of structures each containing two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly for copying one traces32 structure to another). Can you show more of the code? Also, in a debug build it helps at times to insert asserts to ensure that what you are going to use is in fact what you think you are going to use.

Jim Dempsey

Hi Jim, the original code worked as is:

traces = (complex32 **)malloc(*ntr * sizeof(complex32 *));
for (i = 0; i < *ntr; i++)
    traces[i] = (complex32 *)malloc(*nsamples * sizeof(complex32));


for( n... {
   for(j ... {
       sample_r = traces.r
       sample_i = traces.i

       }
   }

This is very hard to vectorize, because each element is a complex structure. So I changed the code as posted, in order to have contiguous elements for the real and imaginary parts. My new usage is:

for( n... {
    for(j....{
        sample_r = traces.r
        sample_i = traces.i
    }
}
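
A sketch of the two layouts being compared, under the assumption that the old access was of the form traces[n][j].r and the new one traces[n].r[j] (the index expressions are not visible in the post). The complex32 layout and the function names aos_to_soa and power_soa_sse are hypothetical. With separate re and im arrays, four consecutive samples can be fetched with a single aligned load each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_mul_ps, _mm_add_ps, _mm_store_ps */

/* Assumed AoS element: real and imaginary parts interleaved in memory. */
typedef struct { float r; float i; } complex32;

/* Split interleaved samples into separate, contiguous re[] and im[] arrays. */
static void aos_to_soa(const complex32 *in, float *re, float *im, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        re[j] = in[j].r;
        im[j] = in[j].i;
    }
}

/* out[j] = re[j]^2 + im[j]^2, four samples per iteration. */
static void power_soa_sse(const float *re, const float *im, float *out, size_t n)
{
    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses. */
    assert(n % 4 == 0);
    assert((uintptr_t)re % 16 == 0);
    assert((uintptr_t)im % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    for (size_t j = 0; j < n; j += 4) {
        __m128 vr = _mm_load_ps(re + j);
        __m128 vi = _mm_load_ps(im + j);
        __m128 p  = _mm_add_ps(_mm_mul_ps(vr, vr), _mm_mul_ps(vi, vi));
        _mm_store_ps(out + j, p);
    }
}
```

The per-sample power is only a stand-in for the real kernel, which is not shown in the thread.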

TimP · ‎02-04-2016

A structure-of-arrays organization may be required to take advantage of AVX-256 and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.

JWong19 · ‎02-04-2016

How did it die? Is there any screen capture as illustration?

Could you show the corresponding disassembly and register values?

unrue · ‎02-04-2016

Tim P. wrote:

A structure-of-arrays organization may be required to take advantage of AVX-256 and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.

Hi Tim. Could you explain this point further? So structure of arrays is not always the best solution? And apart from this question, is my alignment right?

TimP · ‎02-04-2016

I agree that you may have chosen a reasonable method to support AVX optimization, but I don't see that it would have an advantage on a non-AVX CPU. So I'm guessing you are motivated by AVX, although you didn't show enough to evaluate that question.

unrue · ‎02-05-2016

Hi Tim, I'm developing both SSE and AVX versions in order to get the best performance, so I would like to know whether my alignment is correct. I still don't understand whether the alignment of the traces structure, using _mm_malloc as in the first post, is right or not.
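
For reference when supporting both instruction sets: _mm_load_ps needs a 16-byte aligned address, while the AVX counterpart _mm256_load_ps needs 32 bytes, and a 32-byte aligned buffer satisfies both. The sketch below shows one way the padded sample count could be derived so that every trace is a whole number of vectors. The names SSE_ALIGN, AVX_ALIGN and pad_samples are hypothetical; how nsamples_padded is actually computed is not shown in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Vector widths in bytes: 16 for SSE (__m128), 32 for AVX (__m256). */
enum { SSE_ALIGN = 16, AVX_ALIGN = 32 };

/* Hypothetical helper: round nsamples up to a whole number of float vectors. */
static size_t pad_samples(size_t nsamples, size_t vec_bytes)
{
    size_t lanes = vec_bytes / sizeof(float);  /* 4 for SSE, 8 for AVX */
    assert(lanes > 0);
    return (nsamples + lanes - 1) / lanes * lanes;
}
```

The same vec_bytes value can then be passed as the alignment argument of _mm_malloc, so the allocation size and alignment stay consistent for whichever version is built.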

jimdempseyatthecove · ‎02-05-2016

>>So structure of arrays is not always the best solution?

The above is a generalized statement. TimP was referring to the special case of complex numbers. A complex number is a two-element structure with specific operational characteristics that make it somewhat compatible with AVX manipulations. See http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX near the bottom of the page.

That article illustrates vectorization of complex multiply.

As to whether SoA or AoS is better for vectorization, that depends on your application.

Jim Dempsey
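
To make the complex multiply case concrete, here is a minimal SoA sketch using plain SSE. It is not the code from the linked article; cmul_soa_sse and its parameter names are hypothetical. With separate real and imaginary arrays, the product (ar + i*ai) * (br + i*bi) needs no shuffles, only multiplies, one subtract and one add per four samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_mul_ps, _mm_add_ps, _mm_sub_ps */

/* Elementwise complex product over n samples in SoA layout:
   cr = ar*br - ai*bi,  ci = ar*bi + ai*br.
   Assumes every array is 16-byte aligned and n is a multiple of 4. */
static void cmul_soa_sse(const float *ar, const float *ai,
                         const float *br, const float *bi,
                         float *cr, float *ci, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)ar % 16 == 0 && (uintptr_t)ai % 16 == 0);
    assert((uintptr_t)br % 16 == 0 && (uintptr_t)bi % 16 == 0);
    assert((uintptr_t)cr % 16 == 0 && (uintptr_t)ci % 16 == 0);

    for (size_t j = 0; j < n; j += 4) {
        __m128 xr = _mm_load_ps(ar + j);
        __m128 xi = _mm_load_ps(ai + j);
        __m128 yr = _mm_load_ps(br + j);
        __m128 yi = _mm_load_ps(bi + j);

        /* Real part: ar*br - ai*bi */
        __m128 re = _mm_sub_ps(_mm_mul_ps(xr, yr), _mm_mul_ps(xi, yi));
        /* Imaginary part: ar*bi + ai*br */
        __m128 im = _mm_add_ps(_mm_mul_ps(xr, yi), _mm_mul_ps(xi, yr));

        _mm_store_ps(cr + j, re);
        _mm_store_ps(ci + j, im);
    }
}
```

An interleaved (AoS) version has to rearrange lanes before and after the arithmetic, which is the trade-off behind the SoA versus AoS question.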