Offload model - strange problem with performance of computation

H__Kamil · 03-31-2016

Hi,

I wrote an offload application that utilizes all available resources of a single node (2 x CPU + 2 x MIC). In my application I use asynchronous offload pragmas for data transfers and computations on the coprocessors. Since I would like to achieve good performance, the data transfers and the computations on the coprocessors are overlapped with the computation performed by the host processor.

When I run my application several times, I obtain different computation times. Since I want to overlap coprocessor management with the computation performed by the CPU, the total computation time should be similar to the time of the computation performed by the CPU.

Examples of times:

367 seconds,
515 seconds,
369 seconds,
505 seconds,
509 seconds.

I know that the first and the third times are equal to the time of the computation performed by the CPU. I don't expect the time to be the same for all tests. But why is there such a large divergence? What can the problem be? Are the coprocessors unstable? Or maybe it is a problem with the PCIe bus?
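A quick arithmetic check on the five timings above suggests two distinct groups rather than random scatter. The sketch below uses Python only for convenience; the 400-second split threshold is an assumption chosen by eye and does not come from the application.

```python
# Run times reported above, in seconds.
times = [367, 515, 369, 505, 509]

# Assumption: 400 s is an eyeballed threshold separating the two groups.
THRESHOLD = 400

def split_runs(samples, threshold):
    """Split samples into (fast, slow) lists around the threshold."""
    fast = [t for t in samples if t < threshold]
    slow = [t for t in samples if t >= threshold]
    return fast, slow

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

fast, slow = split_runs(times, THRESHOLD)
slowdown = mean(slow) / mean(fast)

print(f"fast runs: {fast}, mean {mean(fast):.1f} s")
print(f"slow runs: {slow}, mean {mean(slow):.1f} s")
print(f"slow/fast ratio: {slowdown:.2f}")
```

The slow runs take roughly 38% longer than the fast ones, while the runs within each group differ by at most 10 seconds. That looks more like a step between two states than like measurement noise.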

Rajiv_D_Intel · 03-31-2016

Some CPU threads are needed to drive the MIC activity. If you are using full CPU resources then those threads will be starved and that will lead to increased MIC offload time.

H__Kamil · 03-31-2016

Can you explain it more clearly? I don't understand what you mean. :)

Rajiv_D_Intel · 03-31-2016

2 to 4 CPU threads are used for programming the MIC DMAs, doing I/O proxy work, and other activity related to offloading to the MIC. So, for the CPU computations, don't use all the threads; use (available threads - 2), for instance. See if the performance of the offload improves.
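The "available threads - 2" rule above can be sketched as a small helper. This is only an illustration: the function name `host_compute_threads` and the floor of one compute thread are assumptions of the sketch, and `os.cpu_count()` reports logical CPUs, which may not match the thread count the application actually uses.

```python
import os

def host_compute_threads(total_threads, reserved=2):
    """Threads left for host computation after reserving some for offload servicing.

    reserved=2 follows the "available threads - 2" suggestion above;
    the floor of one compute thread is an assumption of this sketch.
    """
    if total_threads < 1:
        raise ValueError("total_threads must be at least 1")
    return max(1, total_threads - reserved)

# os.cpu_count() may return None if the count cannot be determined.
available = os.cpu_count() or 1
workers = host_compute_threads(available)

# OMP_NUM_THREADS is the standard OpenMP environment variable; it must be
# set before the OpenMP program starts, so print a line for a launch script.
print(f"export OMP_NUM_THREADS={workers}")
```

Since the post mentions 2 to 4 service threads, trying `reserved=4` as well may be worthwhile.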

H__Kamil · 03-31-2016

OK. Perfect. :) I will check it and let you know. :)

H__Kamil · 04-03-2016

I changed the number of threads used by the CPU for computation, and it didn't help. So I still don't know what the problem is. In my application the offload pragmas are called inside a parallel region (the master thread is responsible for calling the asynchronous pragmas). After that, the master thread joins the rest of the threads and performs computation.

Maybe the source of my problem is an improper configuration of the platform?
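One way to narrow this down is to time the host part and the final wait separately, so it becomes visible whether the offload side ends up on the critical path in the slow runs. The sketch below is a language-neutral analogue in Python, not the Intel offload API: a worker thread stands in for the coprocessor, and `fake_host`, `fake_offload`, and the sleep durations are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_overlapped(host_work, offload_work):
    """Start offload_work asynchronously, run host_work, then wait for the offload.

    Returns (host_result, offload_result, host_seconds, total_seconds).
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        pending = pool.submit(offload_work)   # stand-in for an asynchronous offload
        host_result = host_work()             # host computes while the offload is in flight
        host_seconds = time.perf_counter() - start
        offload_result = pending.result()     # stand-in for waiting on offload completion
        total_seconds = time.perf_counter() - start
    return host_result, offload_result, host_seconds, total_seconds

def fake_host():
    time.sleep(0.05)   # hypothetical host-side work
    return "host done"

def fake_offload():
    time.sleep(0.10)   # hypothetical coprocessor work, deliberately slower
    return "offload done"

host_result, offload_result, host_s, total_s = run_overlapped(fake_host, fake_offload)
print(f"host part: {host_s:.2f} s, total: {total_s:.2f} s, wait: {total_s - host_s:.2f} s")
```

If the wait is close to zero in the fast runs and large in the slow runs, the offloaded work (or its transfers) is what stretches the total time.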

jimdempseyatthecove · 04-03-2016

Are you using a NUMA configuration for your host's main RAM (separate nodes), or interleaved memory?

If NUMA, are you pinning threads on the host?

If you are pinning threads on the host, are you scheduling them in a NUMA-friendly manner?
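For readers unsure what pinning to a NUMA node involves, here is a minimal Linux-only sketch. It assumes the sysfs file `/sys/devices/system/node/node<N>/cpulist` is present and contains only plain ids and ranges; the helper names are made up for this example, and it pins a whole process rather than individual OpenMP threads.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-7,16-23" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def node_cpus(node):
    """CPU ids of one NUMA node, read from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        return parse_cpulist(handle.read())

def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node; returns the CPU list."""
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)   # pid 0 means the calling process
    return cpus

# Only the parser is exercised here; pin_to_node changes process affinity.
print(parse_cpulist("0-7,16-23"))
```

In an OpenMP program the same intent is usually expressed through the runtime instead, for example with the standard `OMP_PROC_BIND` and `OMP_PLACES` environment variables or, with the Intel runtime, `KMP_AFFINITY`.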

Another thing that can cause these symptoms is if the outermost parallel region has slightly more tasks (iterations) to run than threads, and the individual tasks have different run times. This results in some threads running fewer tasks than others and, due to (random) timing, the work performed by individual threads varies greatly. To mitigate this, if possible, schedule your longest-running tasks first/early.

Jim Dempsey
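The load-imbalance effect described in the post above can be reproduced with a small simulation. It is a sketch under stated assumptions: each idle thread simply takes the next task in list order (similar in spirit to dynamic scheduling with a chunk size of one), and the task durations are invented for illustration.

```python
import heapq

def makespan(task_times, num_threads):
    """Finish time when each idle thread takes the next task in list order."""
    finish = [0.0] * num_threads          # min-heap of per-thread finish times
    heapq.heapify(finish)
    for duration in task_times:
        earliest = heapq.heappop(finish)  # thread that becomes idle first
        heapq.heappush(finish, earliest + duration)
    return max(finish)

# Hypothetical workload: 5 tasks on 4 threads, one task much longer than the rest.
tasks = [10, 10, 10, 10, 40]
threads = 4

as_given = makespan(tasks, threads)
longest_first = makespan(sorted(tasks, reverse=True), threads)

print(f"longest task last:  {as_given:.0f}")   # prints 50
print(f"longest task first: {longest_first:.0f}")   # prints 40
```

With the long task last, one thread starts it only after finishing a short task while the other three sit idle, so the region takes 50 time units. Starting the long task first lets the short tasks fill the remaining threads, and the region finishes in 40.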