Sample code runs slowly in offload mode when compiled with -O0

yu__frank · ‎12-16-2016

I copied an example, 'helloglops3offload', from the book 'Intel Xeon Phi Coprocessor High-performance Programming'.

When I compile it with -O3, the test takes 2.6 seconds to complete, but when I change the optimization option to -O0, it takes 2670 seconds.

Is this a bug?

MPSS: 3.6.1

icc: 2017

OS: CentOS 6.7

James_C_Intel2 · ‎12-16-2016

Umm, wouldn't you expect the code to run more slowly when you tell the compiler not to optimize?

My view here is that you should be impressed that the compiler can improve the code by a factor of 100x, not that when you tell it to produce slow code it does.

(This is like the chap who goes to the doctor and says "When I poke a stick in my eye it hurts", to which the doctor replies "Well, don't do that, then.")

yu__frank · ‎12-18-2016

The difference between -O0 and -O3 is too large: a thousand times (1000x).

So you think this is correct?

OK, I just did not think the difference between -O0 and -O3 would be so great.

Thanks.
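
For reference, the factor implied by the two timings in the first post (2670 s at -O0, 2.6 s at -O3) can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of the book's example.

```c
#include <stdio.h>

/* Hypothetical helper: how many times slower the slow run is. */
double slowdown_factor(double slow_secs, double fast_secs)
{
    return slow_secs / fast_secs;
}

/* Print the factor for the two timings reported in this thread. */
void print_reported_slowdown(void)
{
    printf("-O0 is about %.0fx slower than -O3\n",
           slowdown_factor(2670.0, 2.6));
}
```

The quotient is a little over 1000, so the gap is roughly three orders of magnitude.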

James_C_Intel2 · ‎12-19-2016

You made the measurements and know what you changed and precisely how you did them. If you're confident in your technique and that the only change was the compiler flag, it's hard to argue that that isn't the cause.
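
One way to gain that confidence is to time the same inner loop on the host, with no offload and no OpenMP, at both optimization levels. The following is a minimal sketch under those assumptions, not the book's code: the names wall_seconds, add_kernel and time_kernel are hypothetical, and the iteration count is a parameter so it can be kept small.

```c
#include <stddef.h>
#include <sys/time.h>

#define LOOP_COUNT 128

/* Wall-clock time in seconds, using gettimeofday like the original test. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* The inner kernel from the test: add fb into fa, repeated iters times. */
void add_kernel(float *fa, const float *fb, long iters)
{
    for (long j = 0; j < iters; j++) {
        for (int k = 0; k < LOOP_COUNT; k++) {
            fa[k] = fa[k] + fb[k];
        }
    }
}

/* Run the kernel once on small local arrays and return elapsed seconds.
 * The first element of the result is handed back through result0 so the
 * computed values are actually used by the caller. */
double time_kernel(long iters, float *result0)
{
    float fa[LOOP_COUNT];
    float fb[LOOP_COUNT];

    for (int k = 0; k < LOOP_COUNT; k++) {
        fa[k] = 0.0f;
        fb[k] = 1.0f;
    }

    double t0 = wall_seconds();
    add_kernel(fa, fb, iters);
    double t1 = wall_seconds();

    *result0 = fa[0];
    return t1 - t0;
}
```

Calling time_kernel from a small main that prints the returned time, then building that file once with -O0 and once with -O3, shows how much of the gap comes from the compiler flag alone rather than from the offload or threading setup.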

yu__frank · ‎12-19-2016

The full code of the test:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>

double dtime()
{
    double tseconds = 0.0;
    struct timeval mytime;
    gettimeofday(&mytime, (struct timezone*)0);
    tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
    return tseconds;
}

#define FLOPS_ARRAY_SIZE (1024*1024)
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128

#define FLOPSPERCALC 2

__declspec (target(mic)) float fa[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));
__declspec (target(mic)) float fb[FLOPS_ARRAY_SIZE] __attribute__((aligned(64)));

int main(int argc, char *argv[])
{
    int i,j,k;
    int numthreads = 2;
    double tstart, tstop, ttime;
    double gflops = 0.0;
    float a = 1.1;

#pragma offload target (mic)
#pragma omp parallel
#pragma omp master
    numthreads = omp_get_num_threads();

    printf("Initializing\r\n");

#pragma omp parallel for
    for(i=0; i<FLOPS_ARRAY_SIZE; i++)
    {
        fa[i] = (float)i + 0.1;
        fb[i] = (float)i + 0.2;
    }
    printf("Starting Compute on %d threads\r\n", numthreads);

    tstart = dtime();


#pragma offload target (mic)
#pragma omp parallel for private(j,k)
    for(i=0; i<numthreads; i++)
    {
        int offset = i*LOOP_COUNT;
    
        for(j=0; j<MAXFLOPS_ITERS; j++)
        {
            for(k=0; k<LOOP_COUNT; k++)
            {
                fa[k+offset] = fa[k+offset] + fb[k+offset];
            }
        }
    }


    tstop = dtime();
    gflops = (double)(1.0e-9 * numthreads * LOOP_COUNT * MAXFLOPS_ITERS * FLOPSPERCALC);

    ttime = tstop - tstart;

    if((ttime) > 0.0)
    {
        printf("GFlops = %10.3lf, Secs = %10.3lf, GFlops per sec = %10.3lf\r\n", gflops, ttime, gflops/ttime);
    }
    return (0);
}

The only change is the optimization option:

icc -qopenmp -O0 test.cpp

and

icc -qopenmp -O3 test.cpp