Intel® oneAPI Math Kernel Library

Reduce overhead for repeated offload convolutions on Xeon Phi

John_F_1
Beginner

I have a series of arrays that I need to convolve. They are not all the same size, but their sizes have an upper bound of ~50. The offload looks something like this (simplified):

[cpp]
#pragma offload target(mic)
    {
#pragma omp parallel for
        for (long j=0; j<(long)1e9; ++j)
        {
            VSLConvTaskPtr task;
            float x[30]={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30};
            float y[30]={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30};
            float z[59]={0};
            MKL_INT xshape=30, yshape=30, zshape=59;
            int status;

            int mode = VSL_CONV_MODE_AUTO;

            status = vslsConvNewTask1D(&task,mode,xshape,yshape,zshape);
            CheckVslError(status);

            status = vslsConvExec1D(task,x,1,y,1,z,1);
            CheckVslError(status);

            status = vslConvDeleteTask(&task);
            CheckVslError(status);
        }
    }
[/cpp]

When I profile this, over half the time is spent in mkl_serv_malloc, mkl_serv_free, and mkl_conv_newtask. Note that in the actual code, xshape, yshape, and zshape are not constant (though they are bounded). Is there any way to structure this so that the per-iteration allocation overhead is eliminated?

Zhang_Z_Intel
Employee

You can consider using a feature called "data persistence". The offload compiler allows you to control memory allocation on the coprocessor for offloaded computation; this is done via pragma clauses. See the example below to get an idea, and read the compiler's reference guide for more details.

[cpp]

__declspec(target(mic)) static float *x, *y, *z;

/* Offload the 1st convolution. Allocate all memory space you need, and do not free it */
#pragma offload target(mic) \
    in(x:length(MAXLEN) alloc_if(1) free_if(0)) \
    in(y:length(MAXLEN) alloc_if(1) free_if(0)) \
    out(z:length(MAXLEN) alloc_if(1) free_if(0))
    {
        // convolution to be offloaded
    }

/* Offload as many convolutions as needed. Reuse the memory space */
for (int i = 1; i < repeat-1; ++i)
{
#pragma offload target(mic) \
    in(x:length(MAXLEN) alloc_if(0) free_if(0)) \
    in(y:length(MAXLEN) alloc_if(0) free_if(0)) \
    out(z:length(MAXLEN) alloc_if(0) free_if(0))
    {
        // convolution to be offloaded
    }
}

/* Offload the last convolution. Free memory when done */
#pragma offload target(mic) \
    in(x:length(MAXLEN) alloc_if(0) free_if(1)) \
    in(y:length(MAXLEN) alloc_if(0) free_if(1)) \
    out(z:length(MAXLEN) alloc_if(0) free_if(1))
    {
        // convolution to be offloaded
    }

[/cpp]

John_F_1
Beginner

Hi Zhang,

The profile is of the sample code that I posted. There is only one offload region, the loop is inside the offload region, and I do not call any MKL memory facilities. In other words, there is no memory allocation as a result of the offload itself; this overhead is happening inside the convolution routines. Is there anything I can do about that?

Dmitry_B_Intel
Employee

John,

If preallocated thread-local buffers could be passed to the convolution routines, that would partially help the situation. Unfortunately, the MKL convolution routines do not support such a usage model.

One thing you could do is create the tasks in advance, thereby moving the preparation stage off the computation path. You may need an array of tasks, one per size. With this alternative it is important to keep the tasks thread-private!
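
A rough sketch of that idea, assuming an upper bound MAXLEN of 50 and equal-length inputs as in your sample (error checks omitted for brevity):

[cpp]
#include <mkl.h>

#define MAXLEN 50   /* assumed upper bound on the input sizes */

#pragma offload target(mic)
    {
#pragma omp parallel
        {
            /* Each thread creates its own tasks once, one per possible size. */
            VSLConvTaskPtr tasks[MAXLEN + 1] = {0};
            for (MKL_INT n = 1; n <= MAXLEN; ++n)
                vslsConvNewTask1D(&tasks[n], VSL_CONV_MODE_AUTO, n, n, 2*n - 1);

#pragma omp for
            for (long j = 0; j < (long)1e9; ++j)
            {
                MKL_INT n = 30;   /* size for this iteration; varies but never exceeds MAXLEN */
                float x[MAXLEN], y[MAXLEN], z[2*MAXLEN - 1];
                /* ... fill the first n elements of x and y ... */
                vslsConvExec1D(tasks[n], x, 1, y, 1, z, 1);
            }

            /* Delete the tasks once, after all convolutions are done. */
            for (MKL_INT n = 1; n <= MAXLEN; ++n)
                vslConvDeleteTask(&tasks[n]);
        }
    }
[/cpp]

This moves the vslsConvNewTask1D/vslConvDeleteTask calls out of the timed loop; whether the remaining mkl_serv_malloc/free inside the execution routine also goes away is something you would need to verify with the profiler.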

Another option is to allocate space for the largest size and zero-fill the smaller data, so that every convolution runs at the largest size. That will increase the flop count, but may work faster anyway.
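
A sketch of that second option (same assumed MAXLEN, error checks again omitted): each thread creates a single task at the maximum size and zero-pads every input to it, so only one task per thread is ever created:

[cpp]
/* Inside the offloaded parallel region, per thread: */
VSLConvTaskPtr task;
vslsConvNewTask1D(&task, VSL_CONV_MODE_AUTO, MAXLEN, MAXLEN, 2*MAXLEN - 1);

#pragma omp for
for (long j = 0; j < (long)1e9; ++j)
{
    /* Zero-initialize, then copy the actual (shorter) inputs into the front of x and y. */
    float x[MAXLEN] = {0}, y[MAXLEN] = {0}, z[2*MAXLEN - 1];
    vslsConvExec1D(task, x, 1, y, 1, z, 1);
    /* Only the first xlen + ylen - 1 elements of z are meaningful. */
}

vslConvDeleteTask(&task);
[/cpp]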

Thanks
Dima

John_F_1
Beginner

Hi Dima,

Thanks for the answer and suggestions. I actually already zero-fill anyway, so I will try running the larger convolution. I assumed that it would be slower, but perhaps with the wider vector unit on the Phi it doesn't make any difference.
