
Command line tool to find data races and deadlocks on Xeon Phi

shiva_rama_krishna_b

Hi,

I have written a multithreaded program in which each thread makes an offload call. I observed that my program hangs (when I check top on the MIC, offload_main has been killed, but the program on the CPU has not terminated). Does that mean it has entered a deadlock? Is there any tool that gives insight into these kinds of issues?

Thanks

sivaramakrishna

3 Replies
McCalpinJohn
Honored Contributor III

A deadlock would not kill the process on the MIC -- it would just sit there spinning forever.

Things that are likely to kill the offload task on the MIC are out-of-memory errors, out-of-range memory access errors, illegal instructions, etc.

shiva_rama_krishna_b

Hi John,

What is happening here is that both the CPU threads and the Xeon Phi threads are sleeping (I noticed this in top).

I have also seen that the memory allocated on the Xeon Phi is 6 GB (1.5 GB is free). I think that if the problem were due to out-of-memory errors, it should occur every time I run the program.

If it were due to out-of-range memory errors, it should throw a segmentation fault. (I have seen the MIC throw a segmentation fault when I access memory out of range.)

Similarly, when I write illegal instructions in the code (vectorized instructions), the MIC also throws an illegal instruction error.

Can the MIC kill a thread that does something wrong (such as out-of-memory errors, illegal instructions, ...) without notifying us?

Please correct me if my understanding is wrong.

 

Thanks

sivaramakrishna

jimdempseyatthecove
Honored Contributor III

Not knowing anything about your application, we can only guess as to what is going on.

Any memory access fault not trapped by the application, whether on the host or on the MIC, would terminate the process. GP faults (illegal instructions, stack-related issues) are, from my understanding, not trappable and would also terminate the process. Therefore, this leaves you with a deadlock-like coding error.

From your description (threads sleeping), it sounds like the host is in an "are you (MIC) done yet?" wait, and the MIC is in a "what do I do next?" wait.

The usual culprit for that is issuing an asynchronous offload to the MIC with signal(N) and then later performing wait(M), where N and M are different.
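For comparison, the matched case can be sketched as follows. This is a minimal sketch, assuming the Intel compiler's `#pragma offload` and `#pragma offload_wait` syntax. The function name `double_on_mic` and the use of a local variable's address as the tag are illustrative choices, not taken from the original program. A compiler without offload support typically ignores the pragmas and runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: doubles n floats with one asynchronous offload to
 * coprocessor 0, then blocks until that same offload has completed. */
void double_on_mic(float *data, size_t n)
{
    assert(data != NULL);

    /* Only the address of `tag` matters; it identifies this offload. */
    char tag = 0;

    /* Start the offload and return to the host immediately. */
    #pragma offload target(mic:0) signal(&tag) inout(data : length(n))
    {
        for (size_t i = 0; i < n; ++i)
            data[i] *= 2.0f;
    }

    /* Independent host work could overlap with the offload here. */

    /* Wait on the SAME tag that was signaled. Waiting on any other
     * address is the N != M mistake: nothing ever signals it, so the
     * host thread would block here indefinitely. */
    #pragma offload_wait target(mic:0) wait(&tag)
}
```

Keeping the signal and the wait in the same function, with the tag as a local variable, makes it harder for the two to drift apart.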

I do not know offhand (maybe someone here can answer) what happens if you issue multiple offloads with the same N in signal(N) while a prior offload with signal(N) is still pending. The documentation does not define this behavior. If you have this coding error, use a different N for each offload.
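One way to avoid relying on that undefined case is to give every in-flight offload its own tag. The following is a minimal sketch, again assuming the Intel compiler's `#pragma offload` syntax. `NCHUNKS`, `scale_chunks_on_mic`, and the per-chunk tag array are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NCHUNKS 4  /* hypothetical number of concurrent offloads */

/* Hypothetical helper: scales NCHUNKS consecutive chunks of `data`, each
 * chunk_len floats long, using one asynchronous offload per chunk. */
void scale_chunks_on_mic(float *data, size_t chunk_len, float factor)
{
    assert(data != NULL);

    /* One tag per pending offload, so no tag is reused while in flight. */
    char tags[NCHUNKS];

    for (int c = 0; c < NCHUNKS; ++c) {
        float *chunk = data + (size_t)c * chunk_len;

        #pragma offload target(mic:0) signal(&tags[c]) inout(chunk : length(chunk_len))
        {
            for (size_t i = 0; i < chunk_len; ++i)
                chunk[i] *= factor;
        }
    }

    /* Wait for each offload on exactly the tag it was started with. */
    for (int c = 0; c < NCHUNKS; ++c) {
        #pragma offload_wait target(mic:0) wait(&tags[c])
    }
}
```

In a multithreaded host program, the same idea applies per thread: each thread can use the address of one of its own local variables as the tag, so that no two threads signal or wait on the same value.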

Jim Dempsey
