Dear Knowledgeable,
Could anyone please comment on the case outlined below:
If 6 CPU threads concurrently execute synchronous MIC offload sections that contain parallel for loops with 20 iterations each, what MIC core utilization should be expected while the CPU threads are waiting for their offload sections to complete?
In case this is relevant: memory in/out transfers are minimal, all offload sections work on data previously uploaded to the MIC, the offload sections do not access the same memory, and the code is running on Windows.
Thank you in advance.
I guess you want to run 6 host threads (or, in the more often documented case, 6 MPI processes), each offloading 20 threads to the coprocessor, with the work divided evenly among the 20 parallel loop iterations.
You would arrange for each group of 20 coprocessor threads to be spread across a distinct group of cores by its own affinity assignment, e.g.
MIC_OMP_NUM_THREADS=20
MIC_KMP_AFFINITY=balanced
MIC_KMP_PLACE_THREADS=10c,2t,0o
...
MIC_KMP_PLACE_THREADS=10c,2t,40o
...
Then you should be able to approach the situation where each of the 60 cores in use is nearly 50% utilized on the micsmc graph (though the VPUs could be 90% utilized in the ideal case).
If you don't have 60 cores available, you could assign the 20 threads to 5 cores and hope to exceed 90% reported usage of a total of 30 cores.
It might be simpler to use the team concept, with teams of 20 threads, but I don't know whether there is a facility to spread those threads evenly.
TBB should have equivalent but possibly more complex facilities.
Is the host running a single application with 6 threads, or six applications with one thread each?
In the case of single application, is the application one process (e.g. OpenMP, TBB, Cilk Plus) or six processes (MPI)?
In the case of single application, are the six threads working on different areas of a problem or six slices of the same problem?
In the case of six independent applications, each could have a different set of environment variables, as TimP suggests.
From your description "offload sections that contain parallel for loops with 20 iterations each" I assume it is six slices of the same problem. In this case a sketch of what you might try is:
#pragma omp parallel
{
  // six host threads, one slice each
  int nHostThreads = omp_get_num_threads();
  int iHostThread = omp_get_thread_num();
  printf("Begin HOST %d %d\n", nHostThreads, iHostThread);
  // each host thread issues:
  #pragma offload target(mic) ...
  { // each host thread's offload section on MIC
    #pragma omp parallel num_threads(omp_get_max_threads() / nHostThreads)
    {
      int nMICThreads = omp_get_num_threads();
      int iMICThread = omp_get_thread_num();
      printf("Begin MIC %d %d HOST %d %d\n", nMICThreads, iMICThread, nHostThreads, iHostThread);
      // possible action based on nHostThreads and/or iHostThread
      // use only a portion of the available target
      #pragma omp for
      for (...) {
        ... // run for an interval long enough for the other HOST threads to enter their MIC parallel regions
      } // for
      printf("End MIC %d %d HOST %d %d\n", nMICThreads, iMICThread, nHostThreads, iHostThread);
    } // end MIC's parallel
  } // end offload
  ...
  printf("End HOST %d %d\n", nHostThreads, iHostThread);
} // end HOST's parallel
*** Caution: I have not tried the above sketch. The printfs will trace what happens.
Jim Dempsey
Here are some clarifications:
I am running a single Windows process. The process works on a large set of data. The data is divided into independent segments that are processed in parallel by multiple threads (N_threads = N_cpu_cores), so the CPU cores are fully saturated with work.
I am trying to accelerate processing by offloading some of the work to the MIC (the entire application is complex and cannot easily be transformed into a native MIC executable, which would be ideal). I extracted several portions of the top-level algorithm into MIC-offloadable pieces. No single piece can fully saturate the MIC, so my strategy is to add some extra CPU threads, each of which runs the same top-level algorithm but offloads portions of it to the MIC.
Since the extra threads will be blocked for a good portion of their lifespan (while waiting for their offload sections to complete), the CPU-side overhead of adding them is acceptable. The time it takes to execute the offloaded sections on the MIC is better than or comparable to running the same portions on the CPU, so the overall expectation is faster processing of the job.
The problem I am facing is that it looks like the offload calls made concurrently by multiple threads get 'serialized' at some level (below my code) resulting in a major bottleneck. I know MIC supports concurrent offloads, but it doesn't seem to be happening in my case. I may be missing some configuration settings, offload parameter or something else along those lines.
So, logically, my code is similar to something like this:
#pragma omp parallel for
for (int i = 0; i < 10; i++)
{
    step1();
    #pragma offload target(mic) // contains a parallel for with 10 iterations
    step2();
    step3();
    #pragma offload target(mic) // contains a parallel for with 10 iterations
    step4();
    step5();
}
Tim's comment confirmed that my expectation is reasonable and provided some good pointers, but I am still facing the 'offload call serialization' problem.
Any advice?
One more detail:
The offload sections do run on MIC concurrently (I confirmed that by adding tracing), except they seem to be somehow bound to the same thread(s), if that makes sense. If I run a single offload section, its execution time is T. If the same section is submitted concurrently, the execution time of each run is noticeably greater than T (the more concurrent submissions, the greater the performance hit), and the core utilization reported by micsmc is as low as for a single submission.
Is MIC core/thread assignment somehow bound to the COI channel associated with the calling process? If yes, is there a way to change that?
Thank you.
Serialization is expected in data transfer (which you said we could ignore) and in memory allocation, which it seems should already have happened in your scenario. Several seconds of effective parallel execution will be needed to overcome that.
Jim and I both suggested that you must take specific actions to pin your offloaded threads evenly across distinct groups of cores.
The COI and MPSS activity should be pinned to the last core (thread context 0 and the three highest-numbered contexts) without any action on your part. Using the affinity tools to bind your threads to the other cores should avoid conflicts there; offloading doesn't use those thread contexts unless you specifically override its normal setting.
I ran the sketch code on a 5110P here, adding printouts for getpid() and pthread_self(). All host threads' offloads used the same pid and different tids (as expected). The code ran concurrently, though it seemed somewhat "in batches". After adding a stall loop, I saw interleaved outputs.
Host had 12 threads (6 cores with HT)
As to whether these were "pinned" to the first 20 logical processors, I cannot say, as I did not include that information in my diagnostic report.
Jim Dempsey