Dear Intel Developers,
I'm using Intel icc 15.0.1 on a C program. I'm trying to align a structure of arrays, and the same structure is passed to a computational kernel that uses intrinsics. I'm not sure I'm doing the allocations correctly:
struct traces_32 {
    float32* r;
    float32* i;
};
typedef struct traces_32 traces32;
.....
traces32* traces = (traces32*)_mm_malloc(*ntr * sizeof(traces32), 16);
for (i = 0; i < *ntr; i++) {
    traces[i].r = (float32 *)_mm_malloc((nsamples_padded) * sizeof(float32), 16);
    traces[i].i = (float32 *)_mm_malloc((nsamples_padded) * sizeof(float32), 16);
}
Is this the right way to do it? The code dies in the computational kernel on _mm_load_ps when traces is involved. If I use _mm_loadu_ps and malloc instead of _mm_malloc, the kernel works well, so it seems to be an alignment problem. Could you help me? Thanks.
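For reference, a minimal sketch of the allocation pattern described in the question. It assumes float32 is a typedef for float; the helper names alloc_traces and free_traces and the TRACE_ALIGN constant are hypothetical and not taken from the original code. It uses 32-byte alignment, which satisfies both _mm_load_ps (16 bytes) and _mm256_load_ps (32 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free */

typedef float float32;  /* assumption: float32 is a plain 32-bit float */

typedef struct traces_32 {
    float32 *r;  /* real parts, contiguous */
    float32 *i;  /* imaginary parts, contiguous */
} traces32;

/* Assumption: 32 bytes covers both SSE (16) and AVX (32) aligned loads. */
#define TRACE_ALIGN 32

/* Release everything obtained from alloc_traces (hypothetical helper). */
void free_traces(traces32 *traces, size_t ntr)
{
    if (traces == NULL) return;
    for (size_t t = 0; t < ntr; t++) {
        if (traces[t].r != NULL) _mm_free(traces[t].r);
        if (traces[t].i != NULL) _mm_free(traces[t].i);
    }
    _mm_free(traces);
}

/* Allocate ntr traces, each with two aligned arrays of nsamples_padded
   floats. Returns NULL on failure (hypothetical helper). */
traces32 *alloc_traces(size_t ntr, size_t nsamples_padded)
{
    /* The outer array only holds pointers, so it does not need SIMD
       alignment; _mm_malloc is used here only for symmetry with _mm_free. */
    traces32 *traces = (traces32 *)_mm_malloc(ntr * sizeof(traces32), TRACE_ALIGN);
    if (traces == NULL) return NULL;

    /* Clear the pointers first so a partial failure can be cleaned up. */
    for (size_t t = 0; t < ntr; t++) {
        traces[t].r = NULL;
        traces[t].i = NULL;
    }
    for (size_t t = 0; t < ntr; t++) {
        traces[t].r = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), TRACE_ALIGN);
        traces[t].i = (float32 *)_mm_malloc(nsamples_padded * sizeof(float32), TRACE_ALIGN);
        if (traces[t].r == NULL || traces[t].i == NULL) {
            free_traces(traces, ntr);
            return NULL;
        }
    }
    return traces;
}
```

Aligned loads also need the element offset to be a multiple of the vector width: with a 16-byte aligned array, &traces[t].r[j] is only 16-byte aligned when j is a multiple of 4.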
- Tags:
- Parallel Computing
traces itself need not be aligned, since it is an array of structures each containing two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly for copying one traces32 structure to another). Can you show more of the code? Also, it helps at times in a debug build to insert asserts to make sure that what you are going to use is in fact what you think you are going to use.
Jim Dempsey
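One way the suggested asserts could look, as a sketch only: is_aligned and sum_aligned are hypothetical names, and the summation kernel is a stand-in for the real one, which was not posted. The asserts check both the base pointer and every address actually handed to _mm_load_ps, and they compile away when NDEBUG is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Nonzero if p is aligned to `alignment` bytes (alignment must be a
   power of two). Hypothetical helper. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Stand-in kernel: sum n floats using aligned SSE loads.
   Assumes n is a multiple of 4 and x is 16-byte aligned. */
float sum_aligned(const float *x, size_t n)
{
    assert(n % 4 == 0);
    assert(is_aligned(x, 16));

    __m128 acc = _mm_setzero_ps();
    for (size_t j = 0; j < n; j += 4) {
        /* Check the exact address passed to the aligned load. */
        assert(is_aligned(x + j, 16));
        acc = _mm_add_ps(acc, _mm_load_ps(x + j));
    }

    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

If an assert of this kind fires inside the kernel, the failing address shows whether the base pointer or the loop offset is the misaligned part.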
jimdempseyatthecove wrote:
traces itself need not be aligned, since it is an array of structures each containing two float pointers. Typically you will not use SIMD instructions to manipulate these (except possibly for copying one traces32 structure to another). Can you show more of the code? Also, it helps at times in a debug build to insert asserts to make sure that what you are going to use is in fact what you think you are going to use.
Jim Dempsey
Hi Jim, the original code worked as is:
traces = (complex32 **)malloc(*ntr * sizeof(complex32 *));
for (i = 0; i < *ntr; i++)
    traces[i] = (complex32 *)malloc(*nsamples * sizeof(complex32));
for( n... {
    for(j ... {
        sample_r = traces.r;
        sample_i = traces.i;
    }
}
It vectorizes very poorly, because each element is a complex structure. So I changed the code as shown in my first post, in order to have contiguous elements for the real and imaginary parts. My new usage is:
for( n... {
    for(j....{
        sample_r = traces.r;
        sample_i = traces.i;
    }
}
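The layout change described here, from an array of complex structures to separate real and imaginary arrays, can be sketched as a plain copy. This assumes complex32 is a struct with float members r and i, as the member accesses in the post suggest; the function name deinterleave is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef float float32;  /* assumption: float32 is a plain 32-bit float */

/* Assumption: layout of the original element type, inferred from the
   .r and .i accesses in the post. */
typedef struct {
    float32 r;
    float32 i;
} complex32;

/* Copy n interleaved complex samples (AoS) into two contiguous
   arrays (SoA), one for real parts and one for imaginary parts. */
void deinterleave(const complex32 *src, float32 *re, float32 *im, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        re[j] = src[j].r;
        im[j] = src[j].i;
    }
}
```

After this step the real parts of consecutive samples sit next to each other in memory, so a vector load picks up several samples at once.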
A structure-of-arrays organization may be required to take advantage of AVX-256 and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.
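To illustrate the SSE3 point, here is a sketch of a complex multiply on interleaved (re, im) data using the SSE3 intrinsics _mm_moveldup_ps, _mm_movehdup_ps and _mm_addsub_ps. The function name cmul_interleaved and the SSE3_TARGET macro are hypothetical; the target attribute is an assumption that lets GCC or Clang build the function without -msse3, and other compilers may need their own SSE3 option instead. Unaligned loads are used so the sketch does not depend on how the buffers were allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <pmmintrin.h>  /* SSE3 intrinsics */

/* Assumption: GCC/Clang function-level target attribute. */
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

/* out[k] = a[k] * b[k] for n complex numbers stored interleaved as
   re0, im0, re1, im1, ... ; n must be even (two complex per vector). */
SSE3_TARGET
void cmul_interleaved(const float *a, const float *b, float *out, size_t n)
{
    for (size_t k = 0; k < n; k += 2) {
        __m128 x = _mm_loadu_ps(a + 2 * k);  /* [a0, b0, a1, b1] */
        __m128 y = _mm_loadu_ps(b + 2 * k);  /* [c0, d0, c1, d1] */

        __m128 y_re = _mm_moveldup_ps(y);    /* [c0, c0, c1, c1] */
        __m128 y_im = _mm_movehdup_ps(y);    /* [d0, d0, d1, d1] */
        __m128 x_sw = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 3, 0, 1)); /* [b0, a0, b1, a1] */

        /* Even lanes subtract, odd lanes add:
           [a*c - b*d, b*c + a*d, ...] */
        __m128 prod = _mm_addsub_ps(_mm_mul_ps(x, y_re), _mm_mul_ps(x_sw, y_im));
        _mm_storeu_ps(out + 2 * k, prod);
    }
}
```

With 128-bit registers this handles two complex numbers per iteration without first splitting real and imaginary parts.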
How did it die? Do you have a screen capture to illustrate?
Could you show the corresponding disassembly and register values?
Tim P. wrote:
A structure-of-arrays organization may be required to take advantage of AVX-256 and AVX-512, whereas SSE3 has satisfactory SIMD support for the complex data type.
Hi Tim. Could you explain this point in more detail? So a structure of arrays is not always the best solution? And apart from this question, is my alignment right?
I agree that you may have chosen a reasonable method to support AVX optimization, but I don't see that it would have an advantage on a non-AVX CPU. So I'm guessing you are motivated by AVX, although you didn't show enough to evaluate that question.
Hi Tim, I'm developing both SSE and AVX versions in order to get the best performance, so I would like to know whether my alignment is correct. I still don't understand whether the alignment of my traces structure, using _mm_malloc as in the first post, is right or not.
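When one code base serves both SSE and AVX kernels, the two settings that differ are the required alignment (16 bytes for _mm_load_ps, 32 bytes for _mm256_load_ps) and the number of floats per vector (4 and 8). A small sketch of how the padded length could be derived; the names SSE_ALIGN, AVX_ALIGN, lanes_for and pad_samples are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte alignment required by the aligned load intrinsics. */
enum { SSE_ALIGN = 16, AVX_ALIGN = 32 };

/* Number of floats that fit in one vector of the given byte width. */
size_t lanes_for(size_t align_bytes)
{
    return align_bytes / sizeof(float);
}

/* Round n up to the next multiple of lanes (lanes must be > 0), so a
   kernel can process whole vectors without a scalar remainder loop. */
size_t pad_samples(size_t n, size_t lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}
```

For example, padding to lanes_for(AVX_ALIGN) and allocating with 32-byte alignment gives buffers that both the SSE and the AVX kernel can read with aligned loads, as long as each kernel steps through them in whole vectors from the start.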
>>So a structure of arrays is not always the best solution?
The above is a generalized statement. TimP was referring to the special case of complex numbers. A complex number is a two-element structure with specific operational characteristics that make it somewhat compatible with AVX manipulations. See http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX near the bottom of the page.
That article illustrates vectorization of complex multiply.
As to whether SoA or AoS is better for vectorization, that depends on your application.
Jim Dempsey
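For comparison with the interleaved case, a sketch of the same complex multiply on a structure-of-arrays layout using only SSE. The function name cmul_soa is hypothetical; the code assumes n is a multiple of 4 and that all six arrays are 16-byte aligned, since it uses aligned loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* SoA complex multiply, four complex numbers per iteration:
   (cr + i*ci) = (ar + i*ai) * (br + i*bi).
   Assumes n % 4 == 0 and every pointer is 16-byte aligned. */
void cmul_soa(const float *ar, const float *ai,
              const float *br, const float *bi,
              float *cr, float *ci, size_t n)
{
    for (size_t j = 0; j < n; j += 4) {
        __m128 xr = _mm_load_ps(ar + j);
        __m128 xi = _mm_load_ps(ai + j);
        __m128 yr = _mm_load_ps(br + j);
        __m128 yi = _mm_load_ps(bi + j);

        /* real = ar*br - ai*bi, imag = ar*bi + ai*br */
        __m128 re = _mm_sub_ps(_mm_mul_ps(xr, yr), _mm_mul_ps(xi, yi));
        __m128 im = _mm_add_ps(_mm_mul_ps(xr, yi), _mm_mul_ps(xi, yr));

        _mm_store_ps(cr + j, re);
        _mm_store_ps(ci + j, im);
    }
}
```

No shuffles are needed in this layout, and widening it to 8 floats per iteration for AVX is a mechanical change, at the cost of keeping real and imaginary parts in separate arrays.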