
Data alignment problem

Nick_L_1
Beginner
762 Views

Hi there

I am trying to offload some computation to the MIC using an offload pragma, sending data addressed by a pointer p. How can I ensure that the data is aligned on the MIC after it has been received? Does "__assume(p, 64)" work? I am using intrinsics to load the data into the vector register file, which requires the data to be aligned.

Another problem: I am trying to activate many threads for the calculation using "#pragma omp parallel for", and some arrays inside the loop must be thread-private while also being 64-byte aligned.

I am using "_mm_malloc()" inside the loop to ensure both, but this leads to repeated and unnecessary allocation.

Thanks.

 

Frances_R_Intel
Employee

Could you possibly post a small sample code? Thanks.

Nick_L_1
Beginner

In the main function:

.......         
double * p;
p = (double * )malloc(sizeof(double)*1024);
#pragma offload target(mic:0) in(p:length(128))
          foo(p);
.......
 

The data addressed by p is transferred to the MIC, and the function foo is defined like this:

__attribute__((target(mic)))void foo( double * p)
{
#ifdef __MIC__
......
long long iter;
#pragma omp parallel for private(iter)
        for(iter = 0 ; iter < N ; iter ++)
        {
                __m512d _A, _B;
                double * p1;
                p1 = (double * )_mm_malloc(sizeof(double)*1024, 512);      //p1 has to be thread-private
                ......
                _A = _mm512_load_pd((void*)p);                                                //p has to be aligned
                _B = _mm512_load_pd((void*)p1);                                             //p1 has to be aligned
                ......
                /* Calculations */
                ......
                _mm_free(p1);
        }
#endif
}
 

Thus p1 is allocated repeatedly inside the loop to make sure it is thread-private, and p1 also has to be aligned.
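
Editorial sketch (not from the original posters): the thread does not settle whether the offload runtime preserves the alignment of p on the coprocessor, so one cautious option is to test the address before issuing an aligned load. The helper names is_aligned and alloc_doubles_aligned64 below are hypothetical, and C11 aligned_alloc stands in for the _mm_malloc call used in the post so that the snippet builds with any C11 compiler. It may also be worth checking the Intel compiler documentation for an align modifier on the offload in clause; that is a pointer for further reading, not something confirmed in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 when ptr lies on an `alignment`-byte boundary. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Hypothetical helper: allocate n doubles on a 64-byte boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up to the next multiple of 64. */
static double *alloc_doubles_aligned64(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 63) / 64 * 64;
    double *buf = (double *)aligned_alloc(64, rounded);
    /* Verify the guarantee before any aligned vector load relies on it. */
    assert(buf == NULL || is_aligned(buf, 64));
    return buf;
}
```

A buffer obtained this way is released with free(). The same is_aligned check can be applied to p inside foo before the aligned load is used.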

 

James_C_Intel2
Employee

At the very least, you should structure that more like this (which allocates once per thread, rather than once per iteration):

#pragma omp parallel
{
    long long iter;     // Though does it *really* need to be 64 bits!? How many iterations do you have?
                        // 64bit indexes are likely inefficient.
    double * p1 = (double *) _mm_malloc (sizeof(double)*1024, 512);

#pragma omp for
    for (iter=0; iter<N; iter++)
    {
        __m512d _A;
       ... etc ...
    }

    _mm_free (p1);
}
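
Editorial sketch (not from the original posters): the same once-per-thread pattern, written as a complete function that compiles without the MIC intrinsics. The function name process_all, the macro SCRATCH_LEN, and the loop body are hypothetical placeholders for the real calculation, and C11 aligned_alloc is used in place of _mm_malloc.

```c
#include <assert.h>
#include <stdlib.h>

#define SCRATCH_LEN 1024  /* buffer length taken from the example in this thread */

/* Hypothetical workload: each iteration fills the thread's scratch buffer
 * with its index and adds the first and last elements to a running total. */
static double process_all(long n)
{
    double total = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        /* Declared inside the parallel region, so each thread gets its own
         * pointer and its own 64-byte-aligned buffer, allocated only once.
         * SCRATCH_LEN * sizeof(double) is 8192, a multiple of 64. */
        double *scratch = (double *)aligned_alloc(64, SCRATCH_LEN * sizeof(double));
        assert(scratch != NULL);

        #pragma omp for
        for (long i = 0; i < n; i++) {
            for (int k = 0; k < SCRATCH_LEN; k++)
                scratch[k] = (double)i;
            total += scratch[0] + scratch[SCRATCH_LEN - 1];
        }

        /* Freed once per thread, after the work-shared loop has finished. */
        free(scratch);
    }
    return total;
}
```

When built with an OpenMP flag (for example -fopenmp with GCC) the loop is shared among the threads; without it the pragmas are ignored and the function runs serially with a single buffer.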

 

Nick_L_1
Beginner

I really do have that many iterations. Restructuring the code helps, thanks.
