I'm encountering a non-deterministic crash (it happens roughly once every 50 executions) that has me scratching my head. The crash seems to happen near the start of the application, and the dialog box that pops up in Windows reports the following:
Problem Event Name: APPCRASH
Application Name: M21Driver.exe
Application Version: 0.0.0.0
Application Timestamp: 5357af5b
Fault Module Name: coi_host.dll
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 533c374b
Exception Code: c0000005
Exception Offset: 00000000000182a6
OS Version: 6.1.7601.2.1.0.274.10
Locale ID: 1033
Additional Information 1: 4f29
Additional Information 2: 4f294221479b9e035dee5100eb4b064e
Additional Information 3: 666a
Additional Information 4: 666ae0416fc870b83fed8f7dde00c38b
The offload_main process on the MIC is still alive, and I've attached GDB to it and found it in this state:
#0 0x00007f52ae4ab9d0 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007f52afa392d7 in ?? () from /usr/lib64/libcoi_device.so.0
#2 0x00007f52afa3936e in ?? () from /usr/lib64/libcoi_device.so.0
#3 0x00007f52afa0b240 in COIProcessWaitForShutdown () from /usr/lib64/libcoi_device.so.0
#4 0x00007f52af391706 in __offload_target_main () from /tmp/coi_procs/1/44736/liboffload.so.5
#5 0x00000000004009e5 in main ()
On the host side, I attached a remote debugger. (I do not have Visual Studio + Intel on the machine with the MIC, so I am attaching the remote debugger from my desktop, which means I have no way of debugging the offload regions!) I found the code in this state:
Not Flagged > 8304 0 Main Thread Main Thread coi_host.dll!000007fee4db82a6 Normal
Not Flagged 7420 0 Worker Thread libiomp5md.dll!__kmp_launch_monitor() libiomp5md.dll!__kmp_launch_monitor Highest
Not Flagged 8620 0 Worker Thread libiomp5md.dll!__kmp_launch_monitor() libiomp5md.dll!__kmp_suspend Normal
Not Flagged 8552 0 Worker Thread libiomp5md.dll!__kmp_launch_monitor() libiomp5md.dll!__kmp_suspend Normal
Not Flagged 6864 0 Worker Thread libiomp5md.dll!__kmp_launch_monitor() libiomp5md.dll!__kmp_suspend Normal
Not Flagged 6280 0 Worker Thread ntdll.dll thread ntdll.dll!0000000077c5186a Normal
Not Flagged 9024 0 Worker Thread coi_host.dll thread uSCIF.dll!000007feeb452019 Normal
Not Flagged 8932 0 Worker Thread coi_host.dll thread uSCIF.dll!000007feeb452019 Normal
Not Flagged 7324 0 Worker Thread coi_host.dll thread coi_host.dll!000007fee4e12d84 Normal
Not Flagged 9064 0 Worker Thread msvcr110.dll!_threadstartex coi_host.dll!000007fee4e64430 Normal
Not Flagged 7548 0 Worker Thread msvcr110.dll!_threadstartex mswsock.dll!000007fefcf1c699 Normal
Not Flagged 8896 0 Worker Thread coi_host.dll thread uSCIF.dll!000007feeb452019 Normal
Not Flagged 6148 0 Worker Thread coi_host.dll thread coi_host.dll!000007fee4e4135f Normal
Not Flagged 8576 0 Worker Thread msvcr110.dll!_threadstartex zlibwapi.dll!000007feea625543 Normal
Not Flagged 6140 0 Worker Thread msvcr110.dll!_threadstartex coi_host.dll!000007fee4e64430 Normal
Not Flagged 6080 0 Worker Thread coi_host.dll thread uSCIF.dll!000007feeb452019 Normal
You can see that I have several threads that offload, as well as some threads doing other work. I am pretty confident that the threads doing the work on the CPU are not causing the crash: when I run without offloads through the Inspector, it doesn't show anything interesting. I've trimmed down the code being executed to the point that it does almost nothing besides a few offloads.

This makes me wonder: is it allowed to offload from multiple threads concurrently in the same process? Is there anything else that can be gleaned from the program's state at the time coi_host.dll goes belly up?

I am using MPSS 3.2.1 and Composer XE 2013 SP1 (w_ccompxe_2013_sp1.2.176). Any ideas or suggestions?
We were able to extract the call stack from the thread that throws an exception:
Child-SP          RetAddr           Call Site
00000000`002bbc58 000007fe`fd3b1430 ntdll!NtWaitForMultipleObjects+0xa
00000000`002bbc60 00000000`77422ce3 KERNELBASE!GetCurrentProcess+0x40
00000000`002bbd60 00000000`77499105 kernel32!WaitForMultipleObjectsEx+0xb3
00000000`002bbdf0 00000000`77499287 kernel32!WinExec+0x3b5
00000000`002bbe90 00000000`774992df kernel32!WinExec+0x537
00000000`002bbec0 00000000`774994fc kernel32!WinExec+0x58f
00000000`002bbef0 00000000`775b3398 kernel32!UnhandledExceptionFilter+0x1fc
00000000`002bbfd0 00000000`775385c8 ntdll!MD5Final+0x1de8
00000000`002bc000 00000000`77549d2d ntdll!_C_specific_handler+0x9c
00000000`002bc070 00000000`775391cf ntdll!RtlDecodePointer+0xad
00000000`002bc0a0 00000000`77571248 ntdll!RtlUnwindEx+0xbbf
00000000`002bc780 000007fe`eb7b9d3d ntdll!KiUserExceptionDispatcher+0x2e
00000000`002bceb0 000007fe`eb7bfefc coi_host!fragcount_node::IncNumCompleted+0x11fd
00000000`002bcfc0 000007fe`eb7c0292 coi_host!TaskScheduler::Complete+0x56c
00000000`002bd010 000007fe`eb79e2be coi_host!TaskScheduler::RunReady+0x12
00000000`002bd040 000007fe`eb7b2b55 coi_host+0xe2be
00000000`002bd190 00000001`80011f40 coi_host!COIBufferCopy+0x215
00000000`002bd240 00000001`8000d80e LIBOFFLOAD!_dbg_target_so_unloaded+0x4a30
00000000`002bd300 00000001`8002ce0e LIBOFFLOAD!_dbg_target_so_unloaded+0x2fe
00000000`002bd400 00000001`3ff89d3c LIBOFFLOAD!_offload_offload1+0x5e
00000000`002bd470 00000001`3ff86f2e M21Driver!PreProcess+0xb2c
It appears that the unhandled exception is thrown at fragcount_node::IncNumCompleted+0x11fd. This makes me question even more whether offloading from multiple threads is OK.
Child-SP          RetAddr           Call Site
00000000`8bc5e718 000007fe`fd4d1430 ntdll!NtWaitForMultipleObjects+0xa
00000000`8bc5e720 00000000`77412ce3 KERNELBASE!GetCurrentProcess+0x40
00000000`8bc5e820 00000000`77489105 kernel32!WaitForMultipleObjectsEx+0xb3
00000000`8bc5e8b0 00000000`77489287 kernel32!WinExec+0x3b5
00000000`8bc5e950 00000000`774892df kernel32!WinExec+0x537
00000000`8bc5e980 00000000`774894fc kernel32!WinExec+0x58f
00000000`8bc5e9b0 00000000`775a3398 kernel32!UnhandledExceptionFilter+0x1fc
00000000`8bc5ea90 00000000`775285c8 ntdll!MD5Final+0x1de8
00000000`8bc5eac0 00000000`77539d2d ntdll!_C_specific_handler+0x9c
00000000`8bc5eb30 00000000`775291cf ntdll!RtlDecodePointer+0xad
00000000`8bc5eb60 00000000`77561248 ntdll!RtlUnwindEx+0xbbf
00000000`8bc5f240 000007fe`ed4382a6 ntdll!KiUserExceptionDispatcher+0x2e
00000000`8bc5f950 000007fe`ed4486fc coi_host+0x182a6
00000000`8bc5f980 000007fe`ed44ad55 coi_host!fragcount_node::AllFragCompleted+0x186c
00000000`8bc5fa20 000007fe`ed44fefc coi_host!fragcount_node::IncNumCompleted+0x2215
00000000`8bc5fb10 000007fe`ed450292 coi_host!TaskScheduler::Complete+0x56c
00000000`8bc5fb60 000007fe`ed4930b0 coi_host!TaskScheduler::RunReady+0x12
00000000`8bc5fb90 000007fe`ed492da8 coi_host!COIPipelineSetCPUMask+0x9dc0
00000000`8bc5fc50 000007fe`ed494ef9 coi_host!COIPipelineSetCPUMask+0x9ab8
00000000`8bc5fd10 00000000`7740652d coi_host!COIPipelineSetCPUMask+0xbc09
00000000`8bc5fd40 00000000`7753c541 kernel32!BaseThreadInitThunk+0xd
00000000`8bc5fd70 00000000`00000000 ntdll!RtlUserThreadStart+0x21

As you can see, the entire call stack comes from coi_host.dll. How can I figure out what is happening? This crashes once out of 150 executions. Furthermore, I can see that offload_main is still executing on the MIC when it crashes, and when I attach the debugger my offload regions are all spinning, waiting for data from the host that will never arrive because the host has crashed. Is there an issue with coi_host!fragcount_node::IncNumCompleted+0x2215?
Pardon the delayed response, John. Thank you for your extra effort to debug this and for the details you have already provided.
In consulting with our offload compiler developers: from the offload aspect, if you can collect the OFFLOAD_REPORT=3 output (or use the API mentioned below), then we can analyze the offload activity to see whether that sheds any additional light.
There is an _Offload_report() API that you can use if you feel there are certain points of offload that are suspect. That would trim down the amount of data produced versus using the OFFLOAD_REPORT environment variable for all offload activity in the application. (More information about the Offload Report is in the UG.)
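For anyone following along, here is a minimal sketch of how the report could be limited to one suspect offload. It assumes the Intel compiler's offload.h declares _Offload_report(int) together with the OFFLOAD_REPORT_ON and OFFLOAD_REPORT_OFF constants, and that __INTEL_OFFLOAD is defined when offload is enabled; please check both against the header shipped with your compiler. My understanding is that the OFFLOAD_REPORT environment variable still selects the report level. The names ScopedOffloadReport, set_offload_report, g_last_report_request and suspect_offload_with_report are hypothetical helpers, not part of any Intel library.

```cpp
#include <cassert>

#ifdef __INTEL_OFFLOAD
#include <offload.h>  // assumed to declare _Offload_report and OFFLOAD_REPORT_ON/OFF
#endif

// Hypothetical bookkeeping: remembers the last request (-1 = none, 0 = off, 1 = on)
// so the helper can be exercised even when built without the Intel compiler.
static int g_last_report_request = -1;

// Hypothetical wrapper around the report toggle.
inline void set_offload_report(bool on) {
#ifdef __INTEL_OFFLOAD
    _Offload_report(on ? OFFLOAD_REPORT_ON : OFFLOAD_REPORT_OFF);
#endif
    g_last_report_request = on ? 1 : 0;
}

// Turns reporting on for the lifetime of the object and off again afterwards.
class ScopedOffloadReport {
public:
    ScopedOffloadReport() { set_offload_report(true); }
    ~ScopedOffloadReport() { set_offload_report(false); }
    ScopedOffloadReport(const ScopedOffloadReport&) = delete;
    ScopedOffloadReport& operator=(const ScopedOffloadReport&) = delete;
};

// Usage: only the suspect offload sits inside the reporting scope.
void suspect_offload_with_report() {
    ScopedOffloadReport report;
    assert(g_last_report_request == 1);
    // The suspect "#pragma offload target(mic : ...)" region would go here.
}
```

The scope guard is only a convenience so that reporting is switched off again even on an early return; calling the API directly before and after the offload would do the same job.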
Thanks for your response. It would appear that the offload that fails is one that reuses a buffer in a dedicated thread. Here's what I'm doing:
#pragma offload_transfer target (mic : m_deviceId) \
    if (m_useMic) \
    nocopy(buffer1[0:4421000] : alloc_if(1) free_if(0)) \
    nocopy(buffer2[0:4421000] : alloc_if(1) free_if(0)) \
    nocopy(buffer3[0:4421000] : alloc_if(1) free_if(0)) \
    nocopy(buffer4[0:4421000] : alloc_if(1) free_if(0))
And then later on in the same thread I re-use these buffers multiple times:
#pragma offload target (mic : m_deviceId) \
    if (m_useMic) \
    in(vector1_data[0:vector1_size] : alloc_if(0) free_if(0) into(buffer1[0:vector1_size])) \
    in(vector2_data[0:vector2_size] : alloc_if(0) free_if(0) into(buffer2[0:vector2_size])) \
    in(vector3_data[0:vector3_size] : alloc_if(0) free_if(0) into(buffer3[0:vector3_size])) \
    in(vector4_data[0:vector4_size] : alloc_if(0) free_if(0) into(buffer4[0:vector4_size]))
{
    //assign vectors on the MIC
}
However, the exception is always thrown from coi_host.dll, and always from the same function: fragcount_node::IncNumCompleted.
I should mention that I am running multiple copies of this thread, and they all have their own private buffers. At the time of the exception, at least two such threads are in the "assign" offload.
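One way to test whether concurrency is the trigger is to serialize the offloads temporarily and see whether the crash frequency changes. The following is a minimal sketch of that experiment, not a fix and not something the offload runtime requires; g_offload_mutex, serialized_offload and example_usage are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical process-wide lock; it is not part of the Intel offload runtime.
static std::mutex g_offload_mutex;

// Runs `body` while holding the lock, so at most one host thread at a time
// is inside an offload region wrapped this way.
template <typename Fn>
void serialized_offload(Fn body) {
    std::lock_guard<std::mutex> lock(g_offload_mutex);
    body();
}

// Usage example with a stand-in body; a real body would hold the offload pragma.
void example_usage() {
    int assigned = 0;
    serialized_offload([&] {
        // The "#pragma offload target (mic : m_deviceId) ..." region would go here.
        ++assigned;
    });
    assert(assigned == 1);
}
```

Each worker thread would wrap its "assign" offload in serialized_offload. If the crash goes away over a few hundred runs, that points at concurrent offloads as the trigger; if it still occurs, concurrency between these offloads is less likely to be the whole story. The lock is not recursive, so the body must not call serialized_offload again.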
Thank you for the additional details. I have re-sent a private reply, which did not reach you last week, requesting some additional information. I'll also ask the developers whether collecting OFFLOAD_REPORT details for at least those two offloads would really be helpful at this point.
To formally close this thread: the underlying root cause was related to some internal clean-up within COI. The changes that address the issue (tracked under the internal Intel Tracking ID: HSD# 4868965) appear in the MPSS 3.3 (July 14 2014) release.