Hi everyone,
I apologize if this problem has already been described elsewhere. I am trying to use the offload model so that all the resources of my platform (2 x CPU and 2 x KNC cards) work jointly on one problem. My work focuses on offload mode and the hStreams library. However, I have run into a performance problem that I cannot resolve: transferring data to the coprocessors affects the performance of the CPU computations.
In my application, at the beginning I transfer nearly 7 GB of input data to the coprocessors (this data is reused throughout the application and deallocated at the end). The transfers are executed sequentially, first to mic0 and then to mic1 (in both the offload and the hStreams versions). After the transfers finish, I run part of the computations on the CPUs only, and then computations that use all available resources. I noticed that CPU performance is about 20% lower than I expect. Furthermore, I noticed that this difference is caused by the data transfers to the coprocessors. For example, when I remove the data transfers, the CPU computations take 69~70 seconds in total. When I transfer data to the cards, the same code takes 90~96 seconds. The structure of my offload application is shown in the code below.
//Data initialization
/.../

//Transfer input data to mic0 and keep it allocated
#pragma offload target(mic : 0) \
    in(... : length(lWez) alloc_if(1) free_if(0)) \
    ....
{ }

//Transfer input data to mic1 and keep it allocated
#pragma offload target(mic : 1) \
    in(... : length(lWez) alloc_if(1) free_if(0)) \
    ....
{ }

//Parallel computations performed by CPUs
Time t1;
t1.start();
#pragma omp parallel
{
    /.../
}
t1.stop();

//Hybrid computations
/.../

//Copy results back from mic0 and free the buffers
#pragma offload target(mic : 0) \
    out(... : length(lWez) alloc_if(0) free_if(1)) \
    ....
{ }

//Copy results back from mic1 and free the buffers
#pragma offload target(mic : 1) \
    out(... : length(lWez) alloc_if(0) free_if(1)) \
    ....
{ }
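For readers who want to reproduce the measurement, here is a minimal sketch of how the CPU-only region could be timed. The `Time` class in the post is not shown, so the class below is a hypothetical stand-in built on `std::chrono::steady_clock`, and `cpu_only_work` is a placeholder invented for illustration, not the actual computation.

```cpp
#include <chrono>

// Hypothetical stand-in for the Time class used in the post (the real one is not shown).
class Time {
public:
    void start() { begin_ = std::chrono::steady_clock::now(); }
    void stop()  { end_ = std::chrono::steady_clock::now(); }

    // Elapsed wall-clock time between start() and stop(), in seconds.
    double seconds() const {
        return std::chrono::duration<double>(end_ - begin_).count();
    }

private:
    std::chrono::steady_clock::time_point begin_{};
    std::chrono::steady_clock::time_point end_{};
};

// Hypothetical placeholder for the CPU-only computation.
// The pragma is ignored if OpenMP is not enabled, and the loop then runs serially.
double cpu_only_work(int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += 0.5 * i;
    }
    return sum;
}
```

Calling `t1.start()`, then `cpu_only_work(...)`, then `t1.stop()` once with the transfers present and once with them removed gives two values of `t1.seconds()` that can be compared directly.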
I copy 20 one-dimensional arrays to the coprocessors. The timings above (69~70 seconds and 90~96 seconds) were measured for the CPU parallel region. I noticed that the CPU computations in the hybrid region are also slower. The code is compiled with the icpc compiler 17.0.1 and the -O3 optimization flag. The platform uses MPSS version 3.7.2. I know it sounds strange, but I do not know what the problem is.
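To put the two timing ranges on a common scale, the slowdown can be expressed either as extra run time or as lost throughput. The sketch below shows both conversions; the function names are made up for illustration.

```cpp
// Extra run time relative to the baseline, in percent.
// Example: 70 s -> 90 s is about 28.6% longer.
double time_increase_percent(double baseline_s, double measured_s) {
    return (measured_s / baseline_s - 1.0) * 100.0;
}

// Throughput lost relative to the baseline, in percent.
// Example: 70 s -> 90 s is about 22.2% less work per second.
double throughput_loss_percent(double baseline_s, double measured_s) {
    return (1.0 - baseline_s / measured_s) * 100.0;
}
```

With the figures quoted in the post, the best case (70 s vs 90 s) and the worst case (69 s vs 96 s) correspond to roughly 29% to 39% longer run time, or equivalently about 22% to 28% lower throughput.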