
KNC problem with performance - offload model

H__Kamil
Beginner

Hi everyone,

I apologize if my problem has already been described elsewhere. I am trying to use the offload model to utilize all available resources of my platform (2 x CPU and 2 x KNC cards) to solve one problem jointly. In my work I focus on offload mode and the hStreams library. However, I have stumbled on a performance problem that I cannot resolve: data transfers to the coprocessors affect the performance of the CPU computations.

In my application, at the beginning I transfer nearly 7 GB of input data to the coprocessors (this data is reused throughout the whole application and deallocated at the end). The transfers are executed sequentially, first to mic0 and then to mic1 (in both the offload and the hStreams versions). After the data transfers finish, I run part of the computations using the CPUs only, and then computations that utilize all available resources. I noticed that the CPU performance is about 20% lower than I expect. Furthermore, I noticed that this difference is caused by the data transfers to the coprocessors. For example, when I remove the data transfers, the CPU computations take 69~70 seconds in total. When I transfer the data to the cards, the same code takes 90~96 seconds. The structure of my offload application is shown in the code below.
 

//Data initialization
/.../

#pragma offload target(mic : 0) \
        in(... : length(lWez) alloc_if(1) free_if(0)) \
        ....
{ }

#pragma offload target(mic : 1) \
        in(... : length(lWez) alloc_if(1) free_if(0)) \
        ....
{ }


//Parallel computations performed by CPUs
Time t1;
t1.start();
#pragma omp parallel
{
   /.../
}
t1.stop();

//Hybrid computations
/.../


#pragma offload target(mic : 0) \
        out(... : length(lWez) alloc_if(0) free_if(1)) \
        ....
{ }

#pragma offload target(mic : 1) \
        out(... : length(lWez) alloc_if(0) free_if(1)) \
        ....
{ }

 

I copy 20 one-dimensional arrays to the coprocessors. The times given above (69~70 seconds and 90~96 seconds) were measured for the CPU parallel region. I noticed that the CPU computations in the hybrid region are also slower. The code is compiled with the icpc compiler 17.0.1 using the -O3 optimization flag. The platform uses MPSS version 3.7.2. I know that it sounds strange, but I do not know what the problem is.
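
The `Time` class used in the listing is not shown in this post. As a rough sketch only, assuming it does nothing more than wrap a monotonic wall clock, an equivalent timer could look like the code below. The names `Timer` and `slowdown_percent` are hypothetical and are not taken from the application; the second helper simply expresses the gap between two timings as a relative increase.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the `Time` class from the listing (an assumption:
// the real class is not shown). It records wall-clock time with a monotonic
// clock so the result is not affected by system clock adjustments.
class Timer {
    using Clock = std::chrono::steady_clock;

public:
    void start() { begin_ = Clock::now(); }
    void stop() { end_ = Clock::now(); }

    // Elapsed time between the last start() and stop() calls, in seconds.
    double seconds() const {
        return std::chrono::duration<double>(end_ - begin_).count();
    }

private:
    Clock::time_point begin_{};
    Clock::time_point end_{};
};

// Relative increase of measured_s over baseline_s, in percent.
// Hypothetical helper, used only to compare two timings of the same region.
inline double slowdown_percent(double baseline_s, double measured_s) {
    assert(baseline_s > 0.0);
    return (measured_s - baseline_s) / baseline_s * 100.0;
}
```

Wrapping only the CPU parallel region in `start()` / `stop()`, once with and once without the preceding transfers, gives the two numbers to compare. With the timings quoted above, `slowdown_percent(70.0, 90.0)` is about 28.6, so the run time grows by somewhat more than 20% even in the most favorable pairing.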

 
