Strange Offload Behaviour - Most Offloads acting as Startup Offloads

Amlesh_K_ · 12-10-2016

Hi,

I am trying to offload computation to two Xeon Phi coprocessors using code similar to the following:

    !$omp parallel do num_threads(16) ....
    do i = 1,n

        some computations

        if (some condition true only once) then
            offload to phi 1
        else if (a different condition true only once) then
            offload to phi 2
        end if

    end do

The above code is executed for several timesteps (with two offloads per timestep). In all the offloading I have done until now, only the first offload (to each Phi) included startup overheads, and the subsequent offloads took similar time (for similar regular computation). Previously, the details of thread placement were printed only in the first offload (e.g., OMP: Info #242: KMP_AFFINITY: pid 55645 thread 1919 bound to OS proc set {240}).

In the above code, I see that most of the offloads (not all) include the above-mentioned overheads, and the details of thread placement are printed for them (i.e., most of them act like startup offloads).

Any hints as to why this might be happening?

Thanks,

Amlesh

Amlesh_K_ · 12-23-2016

Please note that the thread placement message generally used to be printed only once (in the first offload to each MIC), with a message like: OMP: Info #242: KMP_AFFINITY: pid 55645 thread 239 bound to OS proc set {240}

The thread ID would go up to 239 at most (i.e., 240 threads). In the above case, however, the message is not only printed multiple times (i.e., for almost all offloads), but also with thread IDs up to 1919, for example: OMP: Info #242: KMP_AFFINITY: pid 55645 thread 1919 bound to OS proc set {240}

Can someone please point me towards the most likely causes of this issue?

Thanks,

Amlesh

Rajiv_D_Intel · 12-23-2016

You say "the thread placement message used to be printed only once" and that you are now seeing different behavior.

What did you change between the two scenarios?

Can you create a minimal test-case that demonstrates the problem?

The offload library will create a new process on each MIC device only once.
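A minimal test case of the kind requested above might start from something like the following sketch. It is not the original poster's code: the module and function names (repro_kernels, team_size), the loop sizes, and the two "true only once" conditions are all hypothetical, and it assumes the Intel Fortran offload directives (!dir$ offload begin / !dir$ end offload). With a compiler that does not support offload, those lines are plain comments and everything runs on the host, so the reported team size will simply be the host's.

```fortran
module repro_kernels
  implicit none
contains
  ! Runs one parallel region on coprocessor `dev` and returns its team size.
  ! Assumption: Intel offload directives; elsewhere they are just comments
  ! and the region runs on the host.
  integer function team_size(dev) result(n)
    integer, intent(in) :: dev
    n = 0
    !dir$ offload begin target(mic:dev) inout(n)
    !$omp parallel reduction(+:n)
    n = n + 1                     ! every thread in the team contributes 1
    !$omp end parallel
    !dir$ end offload
  end function team_size
end module repro_kernels

program offload_repro
  use repro_kernels
  implicit none
  integer, parameter :: n = 64, nsteps = 3   ! hypothetical sizes
  integer :: step, i

  do step = 1, nsteps
    !$omp parallel do num_threads(16)
    do i = 1, n
      if (i == 1) then                       ! true once per timestep
        print '(a,i0,a,i0)', 'step ', step, ' mic0 team size: ', team_size(0)
      else if (i == n) then                  ! true once per timestep
        print '(a,i0,a,i0)', 'step ', step, ' mic1 team size: ', team_size(1)
      end if
    end do
    !$omp end parallel do
  end do
end program offload_repro
```

If the affinity messages and the startup cost reappear on every timestep with a skeleton like this, the problem is reproducible without the real computation. If they do not, the difference between this and the full code is the place to look.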

Rajiv_D_Intel · 12-23-2016

That you are seeing thread numbers as high as 1919 (that is, 1920 threads) suggests that you have somehow enabled nested parallelism on the MIC and are running eight parallel teams of 240 threads simultaneously.
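The arithmetic behind that hint can be checked directly. Thread IDs start at 0, so a highest ID of 1919 means at least 1920 threads were created, and 1920 / 240 = 8 teams' worth. The sketch below does that calculation and, when built with OpenMP, also reports the nesting limit via omp_get_max_active_levels(). The names offload_diag and teams_implied are hypothetical helpers, and the team size of 240 is taken from the messages quoted in this thread.

```fortran
module offload_diag
  implicit none
contains
  ! Number of teams of size team_size needed to account for the highest
  ! thread id seen. Ids start at 0, so id 1919 means at least 1920 threads.
  pure integer function teams_implied(max_thread_id, team_size) result(n)
    integer, intent(in) :: max_thread_id, team_size
    n = (max_thread_id + 1 + team_size - 1) / team_size   ! ceiling division
  end function teams_implied
end module offload_diag

program check_thread_ids
  !$ use omp_lib
  use offload_diag
  implicit none

  print '(a,i0)', 'teams implied by thread id 239:  ', teams_implied(239, 240)
  print '(a,i0)', 'teams implied by thread id 1919: ', teams_implied(1919, 240)

  ! Only compiled when OpenMP is enabled. This reports the setting of the
  ! process it runs in; run on the host, it shows the host-side value.
  !$ print '(a,i0)', 'max active parallel levels here: ', omp_get_max_active_levels()
end program check_thread_ids
```

The two calls give 1 and 8 teams respectively. A nesting limit above 1 in the process that runs the offloaded regions would be consistent with the nested-parallelism explanation, although it does not prove it on its own.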