Software Archive
Read-only legacy content

Simple offloaded code, enormous time consumption

Flavio_F_
Beginner
277 Views

Dear all,

I recently started using Xeon Phi cards for parallel programming, so I am still a newbie in this field.

I wrote this code as a simple example to start understanding this fascinating world, but I was surprised when I looked at the execution times.

When I run the code on the host, the execution time is 0.08 s. When I run it after adding the pragma offload and pragma omp parallel for, the execution time increases to 9 s!

I compiled both versions with -O3 optimization.

Is there something I am missing?

 

Thanks for your help,

Flavio

 

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define ALLOC alloc_if(1) free_if(0)
#define RETAIN alloc_if(0) free_if(0)
#define FREE alloc_if(0) free_if(1)

#define LD long double

#define MAX 100000

int main(int argc, char **argv)
{
    int i, j;
    LD *M = NULL;
    __declspec(target(mic)) int cycles = 240;

    printf("array length: %d\n", cycles);

    //start time
    char            fmt[64], buf1[64], buf2[64];
    struct timeval  tv;
    struct tm       *tm;
    gettimeofday(&tv, NULL);
    if ((tm = localtime(&tv.tv_sec)) != NULL) {
        strftime(fmt, sizeof fmt, "((%H*1440)+(%M*60)+%S,%%06u)", tm);
        snprintf(buf1, sizeof buf1, fmt, tv.tv_usec);
    }

    //array creation
    M = (LD*)calloc(cycles, sizeof(LD));

    //allocating space on MIC
    #pragma offload target(mic) in(M:length(cycles) ALLOC)
    {}

    for (i = 0; i < MAX; i++) {
        #pragma offload target(mic) inout(M:length(cycles) RETAIN) \
                                    in(cycles)
        {
            #pragma omp parallel for private(j)
            #pragma ivdep
            for (j = 0; j < cycles; j++)
                M[j] += 1;  //increment the element, not the pointer
        } //offload
    } //for

    //freeing space on MIC
    #pragma offload target(mic) nocopy(M:length(0) FREE)
    {}

    printf("number of cycles: %LG\n", M[0]);

    //end time
    gettimeofday(&tv, NULL);
    if ((tm = localtime(&tv.tv_sec)) != NULL) {
        strftime(fmt, sizeof fmt, "=((%H*1440)+(%M*60)+%S,%%06u)", tm);
        snprintf(buf2, sizeof buf2, fmt, tv.tv_usec);
        printf("%s-%s\n", buf2, buf1);
    }

    free(M);
    return 0;
} //main

 

3 Replies
jimdempseyatthecove
Honored Contributor III

Some issues:

1) Any section of code you want to consider for parallelization should have a sufficient amount of work per thread that the gain from running in parallel at least exceeds the overhead of setting up the parallel region. In your sample program above, assuming you are running on 60 cores with 4 threads per core, your inner loop of 240 cycles reduces the work per thread to a single increment of one array element. Further, this division (one element per thread) results in cache line evictions as adjacent threads write to adjacent elements (false sharing).

2) long double on your host processor may be 8 bytes and supported in hardware, whereas long double on MIC may be 16 bytes, not supported by hardware (requires software emulation).

3) Depending on the compiler's optimization whims, the "M += 1" may take the integer 1 and then promote it to long double (which on MIC may mean first a promotion to double, then a promotion of the double to long double, if long double is 16 bytes).

I suggest you perform two experiments:

a) Define LD as "double", set cycles to 2400000, and set MAX to 10 (in other words, reduce the outer loop by a factor of 10,000 and increase the inner loop by a factor of 10,000)

b) Define LD as "long double", set cycles to 2400000, set MAX to 10, and add printf("sizeof(LD) = %zu\n", sizeof(LD)); inside the offload region on the line immediately before your "#pragma omp..."

Then post your results back here.

Jim Dempsey

Flavio_F_
Beginner

Hi Jim,

thank you very much for your instant reply and for your help.

I have performed the experiments you suggested, and here are the results:

a) - serial CPU: 0.024275 s

array length: 2400000

number of cycles: 7.87677E-1759

- offload (240 threads): 0.668659 s

array length: 2400000

number of cycles: 1.02798E-2264

 

 

b) array length: 2400000
number of cycles: 10

0.700117 s

sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16
sizeof(LD) = 16

 

Kind regards,

Flavio

 

 

jimdempseyatthecove
Honored Contributor III

What is the size of LD on the CPU side?

In looking at the execution time for doubles, the offload was ~27.5 times slower.

I think I know what is going on. Your timed interval includes the very first offload. This offload will include the time to copy the MIC version of the application into the MIC and establish the initial OpenMP thread pool.

I suggest you place a loop around the timed interval, say 3 iterations, and observe the execution times to see what the overhead is for the app copy + OpenMP pool initialization. Also do this for the host version of the program so that you can exclude the host OpenMP pool initialization.

Jim Dempsey
