Data alignment problem

Nick_L_1 · ‎06-16-2015

Hi there

I am trying to offload some computation to the MIC using "#pragma offload", sending data addressed by a pointer p. How can I ensure that the data is aligned on the MIC after it has been received? Does "__assume(p, 64)" work? I am using intrinsics to load data into the vector register file, which requires aligned data.

Another problem: I am trying to run many threads for the calculation using "#pragma omp parallel for", and some arrays inside the loop must be thread-private as well as 64-byte aligned.

I am using "_mm_malloc()" inside the loop to ensure both, but this leads to repeated and unnecessary allocation.

Thanks.

Frances_R_Intel · ‎06-16-2015

Could you possibly post a small sample code? Thanks.

Nick_L_1 · ‎06-17-2015

In the main function:

.......
double * p;
p = (double * )malloc(sizeof(double)*1024);
#pragma offload target(mic:0) in(p:length(128))
foo(p);
.......

The data addressed by p is transferred to the MIC, and the function foo is defined like this:

__attribute__((target(mic))) void foo( double * p)
{
#ifdef __MIC__
    ......
    long long iter;
#pragma omp parallel for private(iter)
    for(iter = 0 ; iter < N ; iter ++)
    {
        __m512d _A, _B;
        double * p1;
        p1 = (double * )_mm_malloc(sizeof(double)*1024, 512); //p1 has to be thread-private
        ......
        _A = _mm512_load_pd((void*)p); //p has to be aligned
        _B = _mm512_load_pd((void*)p1); //p1 has to be aligned
        ......
        /* Calculations */
        ......
        _mm_free(p1);
    }
#endif
}

Thus p1 is allocated again on every iteration of the loop so that it is both thread-private and aligned.

James_C_Intel2 · ‎06-17-2015

At the very least you should structure that more like this (which allocates once per thread, rather than once per iteration):

#pragma omp parallel
{
    long long iter;     // Though does it *really* need to be 64 bits!? How many iterations do you have?
                        // 64bit indexes are likely inefficient.
    double * p1 = (double *) _mm_malloc (sizeof(double)*1024, 512);

#pragma omp for
    for (iter=0; iter<N; iter++)
    {
        __m512d _A;
       ... etc ...
    }

    _mm_free (p1);
}
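
The once-per-thread pattern above can be sketched as a self-contained function. This is only a sketch under assumptions: it swaps in C11 aligned_alloc for _mm_malloc so that it builds without Intel headers, it uses a 64-byte alignment (the size of one 512-bit vector), and the names is_aligned and sum_with_scratch, as well as the loop body, are hypothetical stand-ins for the real calculation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64      /* bytes: one 512-bit vector */
#define SCRATCH_LEN 1024  /* doubles per thread, as in the thread above */

/* Returns 1 if ptr is a multiple of ALIGNMENT bytes, else 0. */
static int is_aligned(const void *ptr) {
    return ((uintptr_t)ptr % ALIGNMENT) == 0;
}

/* Hypothetical kernel: adds up 0 .. n-1, staging each value in a
   thread-private, 64-byte-aligned scratch buffer. */
static double sum_with_scratch(long n) {
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        /* One allocation per thread, not per iteration. The size is a
           multiple of ALIGNMENT, as aligned_alloc requires. */
        double *scratch = aligned_alloc(ALIGNMENT, SCRATCH_LEN * sizeof(double));
        assert(scratch != NULL && is_aligned(scratch));

        #pragma omp for
        for (long i = 0; i < n; i++) {
            scratch[i % SCRATCH_LEN] = (double)i;  /* stand-in for real work */
            total += scratch[i % SCRATCH_LEN];
        }

        free(scratch);  /* released once, when the thread leaves the region */
    }
    return total;
}
```

Calling sum_with_scratch(1000) gives 499500 with or without OpenMP enabled; without it the pragmas are ignored and a single buffer is allocated.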

Nick_L_1 · ‎06-17-2015

I really do have that many iterations. Restructuring the code helps, thanks.
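
The first question, how to keep p aligned once it is on the coprocessor, is not answered in the thread. What follows is a hedged sketch, not a tested recipe: it assumes the Intel compiler's offload clauses accept an align modifier alongside length, and that the compiler's alignment hint is spelled __assume_aligned (rather than the "__assume(p, 64)" form asked about). The names sum_aligned and offload_sum and both macros are hypothetical; on a compiler without offload support the pragma is ignored and the code runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

#define N 128         /* elements transferred, as in length(128) above */
#define ALIGNMENT 64  /* bytes required by aligned 512-bit loads */

/* Assumption: __INTEL_OFFLOAD is defined when offload is enabled. */
#if defined(__INTEL_OFFLOAD)
#define OFFLOAD_FN __attribute__((target(mic)))
#else
#define OFFLOAD_FN
#endif

/* Assumption: __assume_aligned is only available with the Intel compiler. */
#if defined(__INTEL_COMPILER)
#define ASSUME_ALIGNED(ptr) __assume_aligned(ptr, ALIGNMENT)
#else
#define ASSUME_ALIGNED(ptr) ((void)0)
#endif

/* Hypothetical stand-in for foo(): sums n doubles from an aligned buffer. */
OFFLOAD_FN double sum_aligned(const double *p, int n) {
    ASSUME_ALIGNED(p);  /* a promise to the compiler; it does not align anything */
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += p[i];
    }
    return s;
}

double offload_sum(void) {
    /* Align the host buffer too, instead of using plain malloc. */
    double *p = aligned_alloc(ALIGNMENT, N * sizeof(double));
    assert(p != NULL);
    for (int i = 0; i < N; i++) {
        p[i] = (double)i;
    }

    double result = 0.0;
    /* align(...) requests a minimum alignment for the target-side copy of p. */
    #pragma offload target(mic:0) in(p : length(N) align(ALIGNMENT)) out(result)
    result = sum_aligned(p, N);

    free(p);
    return result;
}
```

The hint only tells the compiler what to expect, so the alignment itself has to come from the allocation on the host and from the align modifier on the target.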