I wrote an offload application that utilizes all available resources of a single node (2 x CPU + 2 x MIC). In my application I use asynchronous offload pragmas for data transfers and computations on the coprocessors. Since I would like to achieve good computational performance, the data transfers and the computations on the coprocessors are overlapped with the computation performed by the host processor.
When I run my application several times, I obtain different computation times. Since I want to overlap the coprocessor management with the computation performed by the CPU, the total computation time should be similar to the time of the computation performed by the CPU alone.
Examples of times:
I know that the first and the third times are equal to the time of the computation performed by the CPU. I don't expect the times to be identical across all tests, but why does such a large divergence occur? What can the problem be? Are the coprocessors unstable? Or maybe it is a problem with the PCIe bus?
Some CPU threads are needed to drive the MIC activity. If you are using the full CPU resources, those threads will be starved, and that will lead to increased MIC offload time.
2 to 4 CPU threads are used for programming the MIC DMAs, I/O proxying, and other activity related to offloading to the MIC. So don't use all of the threads for the CPU computations; use, for instance, the number of available threads minus 2, and see if the offload performance improves.
I changed the number of threads used by the CPU for computation, and it didn't help, so I still don't know what the problem is. In my application the offload pragmas are called inside a parallel region (the master thread is responsible for issuing the asynchronous pragmas). After that, the master thread joins the rest of the threads and performs computation.
Maybe the source of my problem is an improper configuration of the platform?
Are you using a NUMA configuration for your host's main RAM (separate nodes), or interleaved memory?
If NUMA, are you pinning threads on host?
If pinning threads on host, are you scheduling threads in a NUMA friendly manner?
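On a NUMA host, unpinned threads migrating between sockets mid-run can easily produce run-to-run variation of this size. A minimal pinning sketch with the Intel OpenMP runtime (the thread count of 16 is an assumption for illustration; `KMP_AFFINITY` and `numactl` are the standard knobs):

```shell
# Pin OpenMP threads so they do not migrate between sockets.
export OMP_NUM_THREADS=16                       # assumed host core count
export KMP_AFFINITY="granularity=fine,compact"  # Intel runtime pinning

# Alternatively, bind the whole process (threads and memory)
# to one NUMA node with numactl:
#   numactl --cpunodebind=0 --membind=0 ./offload_app
```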
Another thing that can cause these symptoms is if the outermost parallel region has slightly more tasks (iterations) to run than threads, and the individual tasks have different run times. This results in some threads running fewer tasks than others, and due to (random) timing the work performed by the individual threads varies greatly. To mitigate this, if possible, schedule your longest-running tasks first/early.