Christof_Soeger
Beginner

offload got stuck

Hello,

My program randomly gets stuck while offloading. It runs on the host and uses all available mics (4 in our case) to offload parts of the computation to them using signals. It periodically checks whether a mic has finished its computation and then sends it new work. Sometimes it runs for a long time without problems, but most of the time it gets stuck after a few hours or even earlier. When this happens, the program simply does not continue.
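The dispatch pattern described above can be sketched on the host alone. This is only an analogy in standard C++, not the Intel offload API: `std::async` stands in for an asynchronous offload with a signal, and `process_chunk` and `dispatch_all` are hypothetical names that do not appear in the real program.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the work one coprocessor does on a chunk.
// In the real program this would be an asynchronous offload.
long process_chunk(long chunk) {
    return chunk * 2;
}

// Distribute all chunks over num_devices slots. Each slot has at most one
// task in flight; the host polls the slots and refills the finished ones.
long dispatch_all(std::queue<long> chunks, std::size_t num_devices) {
    assert(num_devices > 0);
    std::vector<std::future<long>> in_flight(num_devices);
    long total = 0;
    std::size_t busy = 0;

    while (!chunks.empty() || busy > 0) {
        for (auto& slot : in_flight) {
            if (slot.valid()) {
                // Non-blocking check, analogous to testing an offload signal.
                if (slot.wait_for(std::chrono::milliseconds(0)) !=
                    std::future_status::ready) {
                    continue;  // still computing, look at the next slot
                }
                total += slot.get();  // collect the result, slot is free again
                --busy;
            }
            if (!chunks.empty()) {
                long next = chunks.front();
                chunks.pop();
                slot = std::async(std::launch::async, process_chunk, next);
                ++busy;
            }
        }
        // Avoid spinning at full speed between polling rounds.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return total;
}
```

In this sketch the step that corresponds to the hang is the refill of a slot that has just reported completion: the previous task is done, and the next transfer to the same device is what never returns.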

The position in the code is consistently the same and the backtrace at this position looks like this:

(gdb) bt
#0  0x00007ffff417ff4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff417bd1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007ffff417bc3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007ffff1ed99ba in ?? () from /opt/intel/mic/coi/host-linux-release/lib/libcoi_host.so.0
#4  0x00007ffff1ecb0b9 in COIBufferCopy ()
   from /opt/intel/mic/coi/host-linux-release/lib/libcoi_host.so.0
#5  0x00007ffff48c640a in OffloadDescriptor::receive_pointer_data(bool, bool, void*) ()
   from /opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64/liboffload.so.5
#6  0x00007ffff48bfa10 in OffloadDescriptor::offload_wrap(char const*, bool, VarDesc*, VarDesc2*, int, void const**, int, void const**, int, void const*, OffloadFlags) ()
   from /opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64/liboffload.so.5
#7  0x00007ffff48bf059 in OffloadDescriptor::offload(char const*, bool, VarDesc*, VarDesc2*, int, void const**, int, void const**, int, void const*, OffloadFlags) ()
   from /opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64/liboffload.so.5
#8  0x00007ffff48d27ee in __offload_offload1 ()
   from /opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64/liboffload.so.5
#9  0x00007ffff5927c8b in libnormaliz::OffloadHandler<long long>::transfer_pyramids_inner (this=0x1,
    data=0x0, size=0) at /home/math/csoeger/normaliz/source/libnormaliz/offload_handler.cpp:381
#10 0x00007ffff5927742 in libnormaliz::MicOffloader<long long>::offload_pyramids (this=0x4fc04e8,
    fc=..., level=0) at /home/math/csoeger/normaliz/source/libnormaliz/offload_handler.cpp:368

...

and continues further up into the program.

Btw, I introduced this ..._inner method in #9 to work around a problem that existed earlier, see

https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/610384

So it is waiting inside the offload library code for some lock. I cannot tell why, and it does not continue even after waiting for days. Here is the output of the program with OFFLOAD_REPORT=3; it also contains some debugging output:

all pyramids on level 0 done!
**************************************************
level 0 pyramids remaining: 197116
level 1 pyramids remaining: 4661
**************************************************
[Offload] [MIC 3] [Tag 358] [State]           Target->host copyout data   0
From place 1 level 2
r0t r1t r2t transfer inner on mic 3   size: 837161
[Offload] [MIC 3] [File]                    /home/math/csoeger/normaliz/source/libnormaliz/offload_handler.cpp
[Offload] [MIC 3] [Line]                    381
[Offload] [MIC 3] [Tag]                     Tag 363
[Offload] [HOST]  [Tag 363] [State]           Start target
[Offload] [HOST]  [Tag 363] [State]           Setup target entry: __offload_entry_offload_handler_cpp_381transfer__6c2ddfccd55b664500c5674d94d1bc2b
[Offload] [HOST]  [Tag 363] [Signal]          signal : none
[Offload] [HOST]  [Tag 363] [Signal]          waits  : none
[Offload] [HOST]  [Tag 363] [State]           Gather copyin data: base=0x5360b10 length=3348644
[Offload] [HOST]  [Tag 363] [State]           Create target buffer: size=3351476 offset=2832
[Offload] [HOST]  [Tag 363] [State]           Gather copyin data: base=0x1a66360 length=24
[Offload] [HOST]  [Tag 363] [State]           Create target buffer: size=888 offset=864
[Offload] [HOST]  [Tag 363] [State]           Host->target pointer data 3348668
[Offload] [HOST]  [Tag 363] [State]           Host->target copyin data 8
[Offload] [HOST]  [Tag 363] [State]           Execute task on target
[Offload] [MIC 2] [Tag 362] [State]           Target->host copyout data   0
[Offload] [MIC 0] [Tag 360] [State]           Target->host copyout data   0
[Offload] [MIC 1] [Tag 344] [State]           Target->host copyout data   0

The line

[Offload] [MIC 3] [Tag 358] [State]           Target->host copyout data   0

indicates that mic3 has completed its previous task. The next time the program checks, it notices this and tries to send new data, but that transfer does not complete. The last 3 lines only appear later, reporting that the other mics completed their tasks.

For comparison, here is the successful offload just before the one that got stuck:

r0t r1t transfer inner on mic 2   size: 856152
[Offload] [MIC 2] [File]                    /home/math/csoeger/normaliz/source/libnormaliz/offload_handler.cpp
[Offload] [MIC 2] [Line]                    381
[Offload] [MIC 2] [Tag]                     Tag 361
[Offload] [HOST]  [Tag 361] [State]           Start target
[Offload] [HOST]  [Tag 361] [State]           Setup target entry: __offload_entry_offload_handler_cpp_381transfer__6c2ddfccd55b664500c5674d94d1bc2b      
[Offload] [HOST]  [Tag 361] [Signal]          signal : none
[Offload] [HOST]  [Tag 361] [Signal]          waits  : none
[Offload] [HOST]  [Tag 361] [State]           Gather copyin data: base=0x5360e40 length=3424608
[Offload] [HOST]  [Tag 361] [State]           Create target buffer: size=3428256 offset=3648
[Offload] [HOST]  [Tag 361] [State]           Gather copyin data: base=0x1a4b780 length=24
[Offload] [HOST]  [Tag 361] [State]           Create target buffer: size=1944 offset=1920
[Offload] [HOST]  [Tag 361] [State]           Host->target pointer data 3424632
[Offload] [HOST]  [Tag 361] [State]           Host->target copyin data 8
[Offload] [HOST]  [Tag 361] [State]           Execute task on target
[Offload] [HOST]  [Tag 361] [State]           Target->host pointer data 24
[Offload] [MIC 2] [Tag 361] [State]           Start target entry: __offload_entry_offload_handler_cpp_381transfer__6c2ddfccd55b664500c5674d94d1bc2b
[Offload] [MIC 2] [Tag 361] [Var]             size  IN
[Offload] [MIC 2] [Tag 361] [Var]             data  IN
[Offload] [MIC 2] [Tag 361] [Var]             this  INOUT
on mic: size = 856152   data = 0x7fd6780a1e40
[Offload] [MIC 2] [Tag 361] [State]           Target->host copyout data   0
[Offload] [HOST]  [Tag 361] [CPU Time]        0.026411(seconds)
[Offload] [MIC 2] [Tag 361] [CPU->MIC Data]   3424640 (bytes)
[Offload] [MIC 2] [Tag 361] [MIC Time]        0.032101(seconds)
[Offload] [MIC 2] [Tag 361] [MIC->CPU Data]   24 (bytes)

mic 2: transfered 24691 pyramids. avg. key size:33.6747
r3t mic 2: evaluate_pyramids

Here you can see that the line

[Offload] [HOST]  [Tag 361] [State]           Target->host pointer data 24

should come next, which matches the backtrace sitting in OffloadDescriptor::receive_pointer_data(bool, bool, void*) ()
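For long runs, the comparison above can be automated. The following is a rough sketch, assuming that a completed synchronous offload always ends with a `[CPU Time]` line on the host, as in the successful excerpt; `tag_of` and `unfinished_tags` are hypothetical helper names. It reports every tag that reached "Execute task on target" but never logged a CPU time.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Extract N from a "[Tag N]" field. Returns -1 if the line has no such field
// (for example the header line "[Tag]   Tag 363", which has no number inside
// the brackets).
int tag_of(const std::string& line) {
    const std::string key = "[Tag ";
    std::size_t pos = line.find(key);
    if (pos == std::string::npos) return -1;
    int tag = 0;
    bool any_digit = false;
    for (std::size_t i = pos + key.size();
         i < line.size() && std::isdigit(static_cast<unsigned char>(line[i]));
         ++i) {
        tag = tag * 10 + (line[i] - '0');
        any_digit = true;
    }
    return any_digit ? tag : -1;
}

// Tags that started executing on the target but never reported a CPU time.
// Assumption: "[CPU Time]" marks the end of a completed offload.
std::set<int> unfinished_tags(const std::vector<std::string>& log) {
    std::set<int> started, finished;
    for (const std::string& line : log) {
        int tag = tag_of(line);
        if (tag < 0) continue;
        if (line.find("Execute task on target") != std::string::npos)
            started.insert(tag);
        if (line.find("[CPU Time]") != std::string::npos)
            finished.insert(tag);
    }
    std::set<int> result;
    for (int tag : started)
        if (finished.count(tag) == 0) result.insert(tag);
    return result;
}
```

Applied to the two excerpts above, this would flag Tag 363 and not Tag 361. Offloads that are still legitimately running when the log is captured would be flagged as well, so the result is a list of candidates, not proof of a hang.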

I checked that the offloaded process was still running on mic3, and also the reserved memory was below 20%.

Does anybody have an idea what could be happening here, or how to debug this further?

3 Replies
Rajiv_D_Intel
Employee

Which compiler version are you using? If it is 16.0, there was a problem with asynchronous OUTs which may be occurring here.

Christof_Soeger
Beginner

Thanks for your reply. We have now updated to
icpc (ICC) 16.0.3 20160415
 

But now an old problem has reappeared, one that I also discussed in this thread: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/490294

In the first offload of this type I now always get

offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

Here is a code snippet:

template<typename Integer>
void OffloadHandler<Integer>::transfer_bools()
{
  cout << "mic " << mic_nr<< ": transfer_bools" << endl;
  Full_Cone<Integer>& foo_loc = local_fc_ref;  // prevents segfault
  //TODO segfaults should be resolved in intel compiler version 2015
  bool is_computed_pointed = local_fc_ref.isComputed(ConeProperty::IsPointed);
  #pragma offload target(mic:mic_nr) in(mic_nr)
  {
    bool foo = offload_fc_ptr->inhomogeneous;  // prevents segfault
    offload_fc_ptr->inhomogeneous      = foo_loc.inhomogeneous;
    offload_fc_ptr->do_Hilbert_basis   = foo_loc.do_Hilbert_basis;
    offload_fc_ptr->do_h_vector        = foo_loc.do_h_vector;
    offload_fc_ptr->keep_triangulation = foo_loc.keep_triangulation;
    offload_fc_ptr->do_multiplicity    = foo_loc.do_multiplicity;

The comments sound odd, but those lines really did prevent the segfault. Now even this no longer works. Do you have any advice?

Also, is there a list of known issues with offloads? I cannot even find a list of fixed issues. Such a list would be very helpful, since debugging this kind of problem has taken a lot of time.

 

EDIT: this new problem is in the second offload of every run, so I cannot say whether or not the original problem was solved.

93 Views

It seems that we have found a workaround for the problem in Christof's very last posting.

But now we are back to the problem that started this thread. I have not yet tried to produce a new log file, but there is no doubt that we are back at the same problem.

Winfried
