Dear MIC architecture experts,
Offload errors are usually reported to stdout like this:
"offload error: cannot find data associated with pointer variable xxxxx"
and the application automatically quits.
Some errors are due to bad programming, so it makes sense to abort the execution as soon as possible when debugging.
However, some errors are just due to bad data or configuration passed to the offload code (bad MIC_LD_LIBRARY_PATH, etc.) and should not require
the program to exit, merely to skip the offload parts, revert to host processing, and report the error.
This is especially important when using offload instructions in a dll:
the program that loaded the dll will exit as well when such errors occur (which appears like a crash to the user of the program).
Is there a way to gather offload errors without forcing the program to exit?
The offload dll could then report this error and guide the user to correctly use the offload dll...
Maybe this could be done with low-level APIs like COI or SCIF,
by checking the state of the variables, etc. ourselves before executing the high-level offload code?
Or is there a high-level way to report these errors without exiting the program?
At the program level, I thought that it may be possible to catch the termination signal and stop the termination procedure
but this method does not seem very safe (can we recover the state?) or robust...
Any help is much appreciated!
Since you are talking DLL, I will assume you are on Windows.
What you can do is not declare the library entry points as dynamically linked. Instead, call the Windows LoadLibrary API yourself to load the library explicitly; on failure (e.g. library not found) it returns NULL rather than terminating the program. In that case, return your own error code and take the alternate (non-offload) route. If LoadLibrary succeeds, then for every entry point you wish to use, you must call GetProcAddress to obtain its address (and save it for future calls). Yes, this is cumbersome.
Note that every dependency library your offload will eventually call has to be loaded in the same manner. Otherwise the error will occur partway into the offload, at which point it may not be possible to unwind back to the host and restart the code there. This is not an easy task (well, until you have written the code).
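For illustration, here is a minimal sketch of the explicit-loading pattern with a host fallback. Everything named here is an assumption for the example: the entry point signature `sum_fn`, the helpers `lib_open`, `lib_symbol`, `init_offload`, `host_sum`, and `sum`, and whatever library and entry point names you pass in. None of them come from the Intel offload runtime. The non-Windows branch uses dlopen/dlsym only so that the same sketch also builds on Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
typedef HMODULE lib_handle;
static lib_handle lib_open(const char *name) { return LoadLibraryA(name); }
static void *lib_symbol(lib_handle h, const char *sym) {
    return (void *)GetProcAddress(h, sym);
}
#else
#include <dlfcn.h>
typedef void *lib_handle;
static lib_handle lib_open(const char *name) { return dlopen(name, RTLD_NOW); }
static void *lib_symbol(lib_handle h, const char *sym) { return dlsym(h, sym); }
#endif

/* Hypothetical signature of the entry point exported by the offload library. */
typedef int (*sum_fn)(const int *values, int count);

/* Stays NULL until the library and its entry point have been found. */
static sum_fn g_offload_sum = NULL;

/* Host-only fallback, used whenever the offload entry point is unavailable. */
static int host_sum(const int *values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) total += values[i];
    return total;
}

/* Try to load the library and resolve one entry point.
   Returns 0 on success, 1 if the library cannot be loaded, 2 if the entry
   point is missing. A readable message is written to err on failure. */
int init_offload(const char *lib_name, const char *entry,
                 char *err, size_t err_len) {
    assert(err != NULL && err_len > 0);
    err[0] = '\0';

    lib_handle h = lib_open(lib_name);
    if (h == NULL) {
        snprintf(err, err_len, "cannot load '%s'; using host code", lib_name);
        return 1;
    }
    void *p = lib_symbol(h, entry);
    if (p == NULL) {
        /* The library stays loaded here; a real program might unload it. */
        snprintf(err, err_len, "'%s' not found in '%s'; using host code",
                 entry, lib_name);
        return 2;
    }
    g_offload_sum = (sum_fn)p; /* saved for all future calls */
    return 0;
}

/* Public entry: uses the offload version if available, else the host one. */
int sum(const int *values, int count) {
    return g_offload_sum != NULL ? g_offload_sum(values, count)
                                 : host_sum(values, count);
}
```

The caller runs `init_offload` once at startup, shows `err` to the user if it returns nonzero, and then calls `sum` as usual; the program keeps running on the host path instead of exiting.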
Thank you for your input!
I have done something partially similar in order to check the library dependencies on the host side.
I did not check the statically linked DLLs, only the dynamically linked ones as a first step (offload.dll, coi_host.dll, etc.): before any #pragma offload directive, I try to load each of them with LoadLibrary, report an error if one is not found, and then unload them so that the runtime can load them itself afterwards (although I guess I could also have kept them loaded and simply unloaded them at the end of the program).
I will see whether I go all the way and dynamically load all the DLLs myself, as you suggest. Thanks for the idea!
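Roughly, the pre-check described above could look like the sketch below. The helper names `can_load` and `check_dependencies` are made up for this example, and the list of libraries to probe is whatever your build actually depends on. The dlopen branch is there only so the sketch also compiles outside Windows.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
/* Returns 1 if the library can be loaded, 0 otherwise; unloads it again. */
static int can_load(const char *name) {
    HMODULE h = LoadLibraryA(name);
    if (h == NULL) return 0;
    FreeLibrary(h);
    return 1;
}
#else
#include <dlfcn.h>
static int can_load(const char *name) {
    void *h = dlopen(name, RTLD_LAZY);
    if (h == NULL) return 0;
    dlclose(h);
    return 1;
}
#endif

/* Probe every library in names[0..count-1] before the first offload.
   Pointers to the names that could not be loaded are stored in missing,
   which must have room for count entries. Returns how many are missing. */
int check_dependencies(const char **names, int count, const char **missing) {
    assert(names != NULL && missing != NULL);
    int n_missing = 0;
    for (int i = 0; i < count; ++i) {
        if (!can_load(names[i])) {
            missing[n_missing++] = names[i];
        }
    }
    return n_missing;
}
```

With `names` set to something like `{"offload.dll", "coi_host.dll"}`, a nonzero return value means the offload path should be skipped and the entries in `missing` reported to the user.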
I guess this could also be done on the target side using dlopen but it would add an extra layer of complexity:
- how to remove entry points of the MIC binary embedded inside the host binary?
- how to manage pointers to functions on the MIC side? (file-scope native int64_t pointers that we cast to pointers to function in MIC code?)
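To make the second point concrete, here is a small POSIX-only sketch of keeping a dlsym result in a file-scope integer and casting it back to a function pointer before the call. It is plain host-style C that I have not tried inside offloaded code; `kernel_fn`, `g_kernel_addr`, `resolve_kernel`, and `call_kernel` are hypothetical names, and the cast from an integer to a function pointer is implementation-defined (it works with the usual Linux compilers).

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature of the function we want to call on the target. */
typedef int (*kernel_fn)(int);

/* File-scope integer holding the function address; 0 means "not resolved". */
static intptr_t g_kernel_addr = 0;

/* Look up sym in lib_name and remember its address.
   Returns 0 on success, -1 if the library cannot be opened,
   -2 if the symbol is not found. */
int resolve_kernel(const char *lib_name, const char *sym) {
    void *h = dlopen(lib_name, RTLD_NOW);
    if (h == NULL) return -1;
    void *p = dlsym(h, sym);
    if (p == NULL) {
        dlclose(h);
        return -2;
    }
    /* The handle is kept open on purpose so the address stays valid. */
    g_kernel_addr = (intptr_t)p;
    return 0;
}

/* Call the resolved function; returns -1 if nothing was resolved yet. */
int call_kernel(int x, int *out) {
    assert(out != NULL);
    if (g_kernel_addr == 0) return -1;
    kernel_fn f = (kernel_fn)g_kernel_addr; /* integer back to function pointer */
    *out = f(x);
    return 0;
}
```

For a quick check without any extra library, passing NULL as `lib_name` makes dlopen return a handle for the running program, so a libc function with a matching signature such as `abs` can stand in for the real kernel.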
My other concern is other potential errors (beyond "library not found"), such as machine configuration problems. The user could fix these easily if an error were reported, but they require a developer's intervention when the application just exits without a message (as happens when the offload DLL is called by a UI program).