Dear Knowledgeable,
Could anyone please comment on the case outlined below:
If 6 CPU threads concurrently execute synchronous MIC offload sections that contain parallel for loops with 20 iterations each, what MIC core utilization should be expected while the CPU threads are waiting for their offload sections to complete?
In case this is relevant: memory in/out transfers are minimal, all offload sections work on data previously uploaded to the MIC, the offload sections do not access the same memory, and the code is running on Windows.
Thank you in advance.
I guess you want to run 6 host threads (or, in the more often documented case, 6 MPI processes), each offloading 20 threads to the coprocessor, with the work divided evenly among the 20 parallel loop iterations.
You would arrange for each group of 20 coprocessor threads to be spread across a distinct group of cores by its own affinity assignment, e.g.
MIC_OMP_NUM_THREADS=20
MIC_KMP_AFFINITY=balanced
MIC_KMP_PLACE_THREADS=10c,2t,0o
...
MIC_KMP_PLACE_THREADS=10c,2t,40o
...
Then you should be able to approach the situation where each of the 60 cores in use is nearly 50% utilized on the micsmc graph (though the VPUs could be 90% utilized in the ideal case).
If you don't have 60 cores available, you could assign the 20 threads to 5 cores and hope to exceed 90% reported usage of a total of 30 cores.
It might be simpler to use the team concept, with teams of 20 threads, but I don't know whether there is a facility to spread those threads evenly.
TBB should have equivalent but possibly more complex facilities.
Is the host running a single application with 6 threads, or six applications with one thread each?
In the case of single application, is the application one process (e.g. OpenMP, TBB, Cilk Plus) or six processes (MPI)?
In the case of single application, are the six threads working on different areas of a problem or six slices of the same problem?
In the case of six independent applications, each could have a different set of environment variables, as TimP suggests.
From your description "offload sections that contain parallel for loops with 20 iterations each" I assume it is six slices of the same problem. In this case a sketch of what you might try is:
#pragma omp parallel
{
  // six host threads, one slice each
  int nHostThreads = omp_get_num_threads();
  int iHostThread = omp_get_thread_num();
  printf("Begin HOST %d %d\n", nHostThreads, iHostThread);
  // each host thread issues:
  #pragma offload target(mic) ...
  { // each host thread's offload section on MIC
    #pragma omp parallel num_threads(omp_get_max_threads() / nHostThreads)
    {
      int nMICThreads = omp_get_num_threads();
      int iMICThread = omp_get_thread_num();
      printf("Begin MIC %d %d HOST %d %d\n", nMICThreads, iMICThread, nHostThreads, iHostThread);
      // possible action based on nHostThreads and/or iHostThread
      // use only a portion of the available target
      #pragma omp for
      for (...) {
        ... // run for an interval long enough for the other HOST threads to enter their MIC parallel regions
      } // for
      printf("End MIC %d %d HOST %d %d\n", nMICThreads, iMICThread, nHostThreads, iHostThread);
    } // end MIC's parallel
  } // end offload
  ...
  printf("End HOST %d %d\n", nHostThreads, iHostThread);
} // end HOST's parallel
*** Caution: I have not tried the above sketch. The printfs will trace what happens.
Jim Dempsey
Here are some clarifications:
I am running a single Windows process. The process works on a large set of data. The data is divided into independent segments that are processed in parallel by multiple threads (N_threads = N_cpu_cores), so the CPU cores are fully saturated with work.
I am trying to accelerate processing by offloading some of the work to the MIC (the entire application is complex and cannot easily be transformed into a native MIC executable, which would be ideal). I extracted several portions of the top-level algorithm into MIC-offloadable pieces. No single piece can fully saturate the MIC, so my strategy is to add some extra CPU threads, each of which runs the same top-level algorithm but offloads portions of it to the MIC.
Since the extra threads will be blocked for a good portion of their lifespan (while waiting for their offload sections to complete), the CPU-side overhead of adding them is acceptable. The time it takes to execute the offloaded sections on the MIC is better than or comparable to running the same portions on the CPU, so the overall expectation is faster processing of the job.
The problem I am facing is that it looks like the offload calls made concurrently by multiple threads get 'serialized' at some level (below my code) resulting in a major bottleneck. I know MIC supports concurrent offloads, but it doesn't seem to be happening in my case. I may be missing some configuration settings, offload parameter or something else along those lines.
So, logically, my code is similar to something like this:
#pragma omp parallel for
for (int i = 0; i < 10; i++)
{
    step1();
    #pragma offload target(mic) // contains a parallel for with 10 iterations
    step2();
    step3();
    #pragma offload target(mic) // contains a parallel for with 10 iterations
    step4();
    step5();
}
Tim's comment confirmed that my expectation is reasonable and provided some good pointers, but I am still facing the 'offload call serialization' problem.
Any advice?
One more detail:
The offload sections do run on MIC concurrently (I confirmed that by adding tracing), except they seem to be somehow bound to the same thread(s), if that makes sense. If I run a single offload section, its execution time is T. If the same section is submitted concurrently, the execution time of each run is noticeably greater than T (the more concurrent submissions, the greater the performance hit), and the core utilization reported by micsmc is as low as for a single submission.
Is MIC core/thread assignment somehow bound to the COI channel associated with the calling process? If yes, is there a way to change that?
Thank you.
Serialization is expected in data transfer (which you said we could ignore) and in memory allocation, which it seems should already have happened in your scenario. Several seconds of effective parallel execution will be needed to overcome that.
Jim and I both suggested that you must take specific actions to pin your offloaded threads evenly across distinct groups of cores.
The COI and MPSS activity should be pinned to the last core (thread context 0 and the three highest-numbered contexts) without any action on your part. Using the affinity tools to bind your threads to the other cores should avoid conflicts there; offloading doesn't use those thread contexts unless you specifically override its normal setting.
I ran the sketch code on a 5110P here, adding printouts for getpid() and pthread_self(). All host threads' offloads used the same pid and different tids (as expected). The code ran concurrently, though it seemed somewhat "in batches". After adding a stall loop, I saw interleaved outputs.
Host had 12 threads (6 cores with HT)
As to whether these were "pinned" to the first 20 logical processors, I cannot say, as I did not include that information in my diagnostic report.
Jim Dempsey