Offload model - strange problem with performance of computation

H__Kamil
Beginner
200 Views

Hi,

I wrote an offload application that utilizes all available resources of a single node (2 x CPU + 2 x MIC). In my application I use asynchronous offload pragmas for data transfers and computations on the coprocessors. Since I would like to achieve good performance, the data transfers and the computations on the coprocessors are overlapped with the computation performed by the host processor.

When I run my application several times, I obtain different computation times. Since I overlap coprocessor management with the computation performed by the CPU, the total computation time should be similar to the time of the computation performed by the CPU alone.

Examples of times:

367 seconds,
515 seconds,
369 seconds,
505 seconds,
509 seconds.

I know that the first and third times are equal to the time of the computation performed by the CPU. I don't expect the time to be the same in every test, but why is there such a large divergence? What could be the problem? Are the coprocessors unstable? Or maybe it is a problem with the PCIe bus?
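
The expectation described above can be written as a small model. This is only a sketch: the function names and any offload-path durations passed to them are hypothetical, since the thread reports total times only. With ideal overlap, the wall time is set by whichever path is slower, so a run that takes much longer than the host computation would suggest that the offload path (transfers plus coprocessor work) became the critical path in that run.

```cpp
#include <algorithm>

// Idealised model, not a measurement: with perfect overlap the wall time
// is determined by the slower of the two concurrent paths
// (host computation vs. offload transfers + coprocessor computation).
double overlapped_wall_time(double host_seconds, double offload_seconds) {
    return std::max(host_seconds, offload_seconds);
}

// For comparison: with no overlap at all, the two paths simply add up.
double serial_wall_time(double host_seconds, double offload_seconds) {
    return host_seconds + offload_seconds;
}
```

For instance, with the host at 367 seconds and a hypothetical offload path of 300 seconds, the overlapped model gives 367 seconds. If the offload path stretched to a hypothetical 510 seconds, the same model would give 510 seconds.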

Rajiv_D_Intel
Employee

Some CPU threads are needed to drive the MIC activity. If you are using full CPU resources then those threads will be starved and that will lead to increased MIC offload time.

H__Kamil
Beginner

Can you explain that more clearly? I don't understand what you mean. :)

Rajiv_D_Intel
Employee

Between 2 and 4 CPU threads are used for programming the MIC DMAs, doing I/O proxy, and other activity related to offloading to MIC. So don't use all the threads for the CPU computations; use the number of available threads minus 2, for instance. See if the performance of the offload improves.
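
A minimal sketch of that suggestion follows. The helper name `host_compute_threads` and its parameters are hypothetical, and the default of 2 reserved threads is just the example figure given above, not a tuned value.

```cpp
#include <algorithm>

// Hypothetical helper: number of host threads to use for computation
// after leaving some logical CPUs free for the offload runtime
// (DMA programming, I/O proxy, and so on).
// Always returns at least 1 so the host can still make progress.
int host_compute_threads(int logical_cpus, int reserved_for_offload = 2) {
    return std::max(1, logical_cpus - reserved_for_offload);
}
```

In an OpenMP host code, the result could be passed to `omp_set_num_threads` or set through the `OMP_NUM_THREADS` environment variable before the parallel region starts.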

H__Kamil
Beginner

OK, perfect. :) I will check it and let you know. :)

H__Kamil
Beginner

I changed the number of threads used by the CPU for computation, and it didn't help, so I still don't know what the problem is. In my application the offload pragmas are called inside a parallel region (the master thread is responsible for calling the asynchronous pragmas). After that, the master thread joins the rest of the threads and performs computation.

Maybe the source of my problem is an improper platform configuration?

jimdempseyatthecove
Black Belt

Are you using a NUMA configuration for your host's main RAM (separate nodes), or interleaved memory?

If NUMA, are you pinning threads on the host?

If pinning threads on host, are you scheduling threads in a NUMA friendly manner?

Another thing that can cause these symptoms is an outermost parallel region that has slightly more tasks (iterations) to run than threads, where the individual tasks have different run times. Some threads then run fewer tasks than others and, due to (random) timing, the work performed by individual threads varies greatly. To mitigate this, if possible, schedule your longest-running tasks first.
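
The effect can be illustrated with a small simulation. This is a sketch under stated assumptions: it models dynamic scheduling, where each task goes to whichever thread becomes free first, and the function names and task durations are hypothetical, not taken from the application discussed here.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Greedy list scheduling: each task, in the order given, is handed to the
// thread that becomes free first. Returns the time at which the last
// thread finishes. Assumes num_threads > 0.
double makespan(const std::vector<double>& task_seconds, int num_threads) {
    std::vector<double> busy_until(num_threads, 0.0);
    for (double t : task_seconds) {
        auto least_loaded =
            std::min_element(busy_until.begin(), busy_until.end());
        *least_loaded += t;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}

// Same scheduling rule, but with the longest tasks issued first.
double makespan_longest_first(std::vector<double> task_seconds,
                              int num_threads) {
    std::sort(task_seconds.begin(), task_seconds.end(),
              std::greater<double>());
    return makespan(task_seconds, num_threads);
}
```

With 4 threads and five tasks of 1, 1, 1, 1 and 4 time units, issuing the long task last gives a finish time of 5 units, while issuing it first gives 4 units. When the arrival order varies from run to run, the total time can vary in the same way.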

Jim Dempsey
