Re: Cannot create DLL which offloads part of the code to Intel Xe GPU

dtroncho · ‎05-16-2022

Hello,

I have a DLL composed of three .cpp files. I have moved the "offloadable" code (basically matrix multiplications) into one of those three files.

I have downloaded and installed the latest version of the Intel oneAPI HPC Toolkit.

If I try to compile and link them all into a DLL with this command (executed in the Intel oneAPI Command Prompt for Intel 64 for Visual Studio 2022):

icx /LD /out:MatMul.dll DLLMAIN.CPP MAIN.CPP OFFLOADABLE.CPP /c /nologo /Qiopenmp /Qopenmp-targets=spir64

I obtain:

clang-cl: error: The use of '-LD' is not supported with '/Qiopenmp /Qopenmp-targets=spir64'.

If I compile these three files separately, with these commands (same console):

icx DLLMAIN.CPP /c /nologo /Qiopenmp

icx MAIN.CPP /c /Qiopenmp

icx OFFLOADABLE.CPP /c /nologo /Qiopenmp /Qopenmp-targets=spir64

The corresponding .obj files are created correctly, but when I try to link them with a source.def file, which defines what this DLL must export, with:

icx MAIN.OBJ DLLMAIN.OBJ OFFLOADABLE.OBJ /LD -o MatMul.dll /DEF:source.def /Qiopenmp /Qopenmp-targets=spir64

Response is:

a-3292ef.obj : warning LNK4078: multiple '__CLANG_OFFLOAD_BUNDLE__openmp-s' sections found with different attributes (40500040)

The DLL is created, but when I try to use it from a .NET program that declares one of its functions, an error pops up saying:

"Error in MyInitialize: Unable to load DLL 'MatMul.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)."

Can you please help me? If this can be solved, parts of DLLs could also be offloaded to Intel hardware, which would open up many possibilities for businesses.

I look forward to your help, thanks in advance.

David.
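For readers decoding the number in the error quoted in the post above: 0x8007007E is an HRESULT that wraps a Win32 error code. The following is a minimal sketch of that decoding, assuming the standard HRESULT layout (severity bit, 11-bit facility, 16-bit code). The helper names are hypothetical and are not Windows SDK functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of the Windows SDK) that split an HRESULT
// into its fields: bit 31 = severity, bits 16-26 = facility, bits 0-15 = code.
constexpr bool hresultFailed(std::uint32_t hr) { return (hr >> 31) != 0; }
constexpr std::uint32_t hresultFacility(std::uint32_t hr) { return (hr >> 16) & 0x7FFu; }
constexpr std::uint32_t hresultCode(std::uint32_t hr) { return hr & 0xFFFFu; }

// Compile-time check against the value reported in the thread.
static_assert(hresultFacility(0x8007007Eu) == 7u, "facility 7 is FACILITY_WIN32");
static_assert(hresultCode(0x8007007Eu) == 126u, "code 126 is ERROR_MOD_NOT_FOUND");
```

Facility 7 is the Win32 facility and code 126 is ERROR_MOD_NOT_FOUND, which Windows reports when either the DLL itself or one of the DLLs it depends on cannot be located. The message therefore does not by itself say which file is missing.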

SantoshY_Intel · ‎05-16-2022

Hi,

Thank you for posting in Intel Communities.

Could you please provide us with the sample reproducer code (DLLMAIN.CPP, MAIN.CPP, OFFLOADABLE.CPP and the source.def file) so that we can investigate your issue further?

Thanks & Regards,

Santosh

dtroncho · ‎05-17-2022

Hello Santosh,

Thank you for your reply.

Please find attached the file GPU_CPU_Example.zip (compressed with the default Windows compressor), which includes:

Dll1 subfolder. It is a very simple Visual C++ DLL project which is usable from .NET. It has been developed with Microsoft Visual Studio Community 2022 (64-bit) - Version 17.3.0 Preview 1.0.
Dll1_VB_Test subfolder. You probably do not need to change this project. It is a very simple Visual Basic .NET Console project which uses the previous DLL. Developed with Microsoft Visual Studio Community 2019 (64-bit), because I don't have VB.NET on my VS2022, but it is very simple and should open with Visual Studio 2022.

The Dll1_VB_Test is just a project to test the DLL.

If you open a CMD console and run \GPU_CPU_Example\Dll1_VB_Test\bin\Release\net5.0\Dll1_VB_Test.exe as I sent it, it should run fine, but only on the CPU.

If you open the Visual C++ DLL project Dll1 with Visual Studio 2022, change the parameter Enable OpenMP Offloading to Generate x86 + SPIR64 fat binary (/Qopenmp-targets:spir64), and then follow these steps:

Rebuild the Dll1 project.
Copy the Dll1.dll from \GPU_CPU_Example\Dll1\x64\Release to \GPU_CPU_Example\Dll1_VB_Test\bin\Release\net5.0
Open a CMD console, CD to \GPU_CPU_Example\Dll1_VB_Test\bin\Release\net5.0
Run: \GPU_CPU_Example\Dll1_VB_Test\bin\Release\net5.0\Dll1_VB_Test.exe

The program will return:

Unhandled exception. System.DllNotFoundException: Unable to load DLL 'Dll1.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
at Dll1_VB_Test.Module1.CPULoad()
at Dll1_VB_Test.Program.Main(String[] args) in C:\temp\Dll1_VB_Test\Program.vb:line 10

I have tried building the GPU code of Dll1 as an EXE file, and it ran on the GPU. So what should I do to make this DLL run its OpenMP "target" part on the GPU?

I have also tried compiling the .cpp files separately, passing /Qopenmp-targets:spir64 only for offloadable.cpp, and then linking them with icx, but the same error occurs in the tester.

If you need any other info, please ask.

I look forward to your feedback and thanks in advance.

Best regards,

David.

SantoshY_Intel · ‎05-17-2022

Hi,

We can see that you are using an unsupported version of Visual Studio. Could you please try with any of the supported versions of Visual Studio and let us know if you still face the same issue?

To check the supported versions of Visual Studio please refer to the below link:

https://www.intel.com/content/www/us/en/developer/articles/reference-implementation/intel-compilers-compatibility-with-microsoft-visual-studio-and-xcode.html

Thanks & Regards,

Santosh

dtroncho · ‎05-17-2022

Hi,

Thanks for your response.

I have uninstalled and installed Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.2.0.

After installing the indicated MVS2022, I reinstalled Intel oneAPI 2022.2 Base Toolkit and, later, the HPC Toolkit 2022.2.

I repeated the steps indicated above and obtained the same results.

If you need any other info, please ask.

I look forward to your help.

Best regards and thanks in advance,

David.

SantoshY_Intel · ‎05-17-2022

Hi,

Could you please confirm whether you are using VS2022 version 17.0.2 or 17.2.0?

VS2022 17.0.2 is a supported version, whereas VS2022 17.2.0 is NOT yet supported. Refer to the screenshot below:

Thanks & Regards,

Santosh

dtroncho · ‎05-17-2022

Hi,

Yes, for the response of the previous message I used VS2022 version 17.2.0, but I assumed forward compatibility. As far as I know, I cannot downgrade my VS2022 to 17.0.2.

Anyway, I have also tried with Microsoft Visual Studio Community 2019 Version 16.11.14 and, unfortunately, got the same result.

I have provided you with the same example that I am running, so you can try it on a supported version. Most probably you will see the same result that I described above, and any solution you find will most probably apply to the other versions of Visual Studio as well.

We need to solve this problem, as it will allow DLLs developed with VS and Intel oneAPI to offload code to Intel GPUs.

I look forward to your help and thanks in advance,

David.

SantoshY_Intel · ‎05-19-2022

Hi,

We generated a dynamic-link library (DLL) and created a C++ application that uses the DLL to offload code to Intel GPUs successfully, using the Intel C++ compiler 2022.

We tried it with Visual Studio version 17.0.0 and followed the steps from the link below:

https://docs.microsoft.com/en-us/cpp/build/walkthrough-creating-and-using-a-dynamic-link-library-cpp?view=msvc-170

Please find the attachments below:

DLL1: This project will generate a Dynamic Linking Library.

usingDLL: It is a sample C++ project which uses the DLL and offloads the code on Intel GPUs.

We also tried running your .NET application from the command line, and it worked fine at our end, as shown in the screenshot below. Before running the program (Module1.exe), copy the Dll1.dll file to the directory that contains Module1.exe.

Thanks & Regards,

Santosh

dtroncho · ‎05-19-2022

Hello Santosh,

Thank you very much for your response and your effort.

Unfortunately, your code does not appear to offload to my Intel Xe GPU. So that you have all the info, the laptop where I am running your code is:

O.S.: Windows 10 PRO

CPU/GPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz

RAM: 16,0 GB (15,7 GB usable)

Let me explain the steps that I have followed (I did not change any other thing from your example):

I downloaded and extracted Downloads.zip.
I opened the project usingDLL (with Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.2.0) and changed your local paths to mine.
I rebuilt it as is. It rebuilds fine.
I opened a CMD session, changed to the path and executed \UsingDLL\x64\Debug\UsingDLL.exe
An error pops up: "... execution cannot continue... omptarget.dll not found...".
I searched my computer for omptarget.dll and found it here: C:\Program Files (x86)\Intel\oneAPI\compiler\2022.1.0\windows\bin
I copied omptarget.dll from that path to: \UsingDLL\x64\Debug
I opened a CMD session and executed UsingDLL.exe again: it works! But, is it off-loading to GPU? Let us check.
I opened project Dll1 with VS2022 and made a very small change to validate where the offloadable code is being executed. I added a module variable (a) that is changed inside the code which should run on the GPU, but that variable is not included in the map(tofrom) clause (omp directive), so any change to it cannot come back to the CPU. I also added a new public function, executedOnGPU, which returns true if the variable a remains unchanged, indicating that the variable was changed on the GPU.
I rebuilt Dll1.dll and copied it to \UsingDLL\x64\Debug.
I opened project UsingDll1 with VS2022. I needed to add the include directory of the Dll1 project to the project, plus a call to executedOnGPU to report where the code was executed. With that, it rebuilds fine.
I opened a CMD session and executed UsingDLL.exe, which returns:

C:\Users\david\Downloads\Downloads\usingDll\usingDll\x64\Debug>usingDll.exe
3 3 3
3 3 3
3 3 3
Executed on CPU

I have then compressed the example and uploaded it here.

I look forward to your feedback on what I should do to really execute this example on the GPU.

Thank you very much in advance.

Best regards,

David.

dtroncho · ‎05-19-2022

Hi,

For your convenience, I have changed all paths to be relative in the usingDll project, so that you can just download, open and try. Uploaded again.

I look forward to your help. Thanks in advance.

David.

SantoshY_Intel · ‎05-20-2022

Hi,

>>" it works! But, is it off-loading to GPU? Let us check"

To check whether the code is offloading to GPU or CPU, we need to set LIBOMPTARGET_DEBUG to 1.

set LIBOMPTARGET_DEBUG=1

Now, run the executable (usingDLL.exe). It will print debug information from which we can tell whether GPU offloading took place.

I tried using this LIBOMPTARGET_DEBUG flag and was able to get the offloading information as shown below:

For your reference, I am attaching the complete debug log below.

Thanks & Regards,

Santosh

dtroncho · ‎05-20-2022

Hello Santosh,

Thank you for your email and effort.

My problem persists because the only way I have been able to off-load and run code on my Intel Xe GPU is by compiling the offloadable.cpp file as an .OBJ and then linking that offloadable.obj to build the executable file, both with icx and options /Qiopenmp /Qopenmp-targets=spir64

In other words: I am still not able to compile a Visual C++ dll which eventually runs on an Intel GPU.

As I mentioned, I have confirmed that this example runs on the GPU because:

A variable created outside of the OMP section, and changed inside it, does not show that change outside the OMP section.
Every time I run it, I can see 100% activity on my GPU in the Windows Task Manager. And if I increase the iterations, the 100% GPU activity lasts proportionally longer.
It even blocks my whole Windows for a while, I assume when transferring the data CPU<->GPU. By the way, I find that worrying.

On the other hand, unfortunately, I do not think that your example really runs on the GPU. Please look at the debug output dumped by my successful example below and compare it with yours. Your debug output finishes like this:

Libomptarget --> Done registering entries!
3 3 3
3 3 3
3 3 3
Libomptarget --> Deinit target library!

But look at what mine does between those two Libomptarget messages:

Libomptarget --> Done registering entries!
Libomptarget --> Entering target region with entry point 0x00007ff7c4958e70 and device Id 0
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target LEVEL0 RTL --> Initialize requires flags to 1
Target LEVEL0 RTL --> Allocated a host memory object 0x0000022cba0a0000
Target LEVEL0 RTL --> Initialized host memory pool for device 0x0000000000000000: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Allocated a shared memory object 0x0000022cba0a0000
Target LEVEL0 RTL --> Initialized shared memory pool for device 0x0000022cb98047a8: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Allocated a device memory object 0xffffb80200010000
Target LEVEL0 RTL --> Initialized device memory pool for device 0x0000022cb98047a8: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Created a command queue 0x0000022cb98dff48 (Ordinal: 0, Index: 0) for device 0.
Target LEVEL0 RTL --> Created a command list 0x0000022cb9d4cd98 (Ordinal: 0) for device 0.
Target LEVEL0 RTL --> Initialized Level0 device 0
Libomptarget --> Device 0 is ready to use.
Target LEVEL0 RTL --> Device 0: Loading binary from 0x00007ff7c498c000
Target LEVEL0 RTL --> Expecting to have 1 entries defined
Target LEVEL0 RTL --> Base L0 module compilation options: -cl-std=CL2.0
Target LEVEL0 RTL --> Created module from image #0.
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_eeec4b03_1bfa1__Z7Compute_l9_kernel_info' of unknown size on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 80 bytes).
Target LEVEL0 RTL --> Created a command list 0x0000022cb98e43c8 (Ordinal: 1) for device 0.
Target LEVEL0 RTL --> Created a command queue 0x0000022cb7647858 (Ordinal: 1, Index: 0) for device 0.
Target LEVEL0 RTL --> Kernel 0: Entry = 0x00007ff7c4958e70, Name = __omp_offloading_eeec4b03_1bfa1__Z7Compute_l9, NumArgs = 6, Handle = 0x0000022cc0f72cf0
Target LEVEL0 RTL --> Looking up device global variable '__omp_spirv_program_data' of size 48 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 48 bytes).
Libomptarget --> Entry 0: Base=0x000000c977367f08, Begin=0x000000c977367f08, Size=8, Type=0x23, Name=unknown
Libomptarget --> Entry 1: Base=0x000000c977367ef8, Begin=0x000000c977367ef8, Size=8, Type=0x21, Name=unknown
Libomptarget --> Entry 2: Base=0x000000c977367f00, Begin=0x000000c977367f00, Size=8, Type=0x21, Name=unknown
Libomptarget --> Entry 3: Base=0x00007ff7c4976d80, Begin=0x00007ff7c4976d80, Size=4, Type=0x21, Name=unknown
Libomptarget --> Entry 4: Base=0x0000000000000000, Begin=0x0000000000000000, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 5: Base=0x0000000000000059, Begin=0x0000000000000059, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 6: Base=0x000000c977367f80, Begin=0x000000c977367f80, Size=32, Type=0x800, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f08, Size=8)...
Target LEVEL0 RTL --> Ptr 0x000000c977367f08 is not a device accessible memory pointer.
Target LEVEL0 RTL --> Allocated a shared memory object 0x0000022cba0b0000
Target LEVEL0 RTL --> New block allocation for shared memory pool: base = 0x0000022cba0b0000, size = 65536, pool size = 65536
Target LEVEL0 RTL --> Allocated target memory 0x0000022cba0b0000 (Base: 0x0000022cba0b0000, Size: 8) from memory pool for host ptr 0x000000c977367f08
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c977367f08, TgtPtrBegin=0x0000022cba0b0000, Size=8, DynRefCount=1, HoldRefCount=0, Name=unknown
Libomptarget --> Moving 8 bytes (hst:0x000000c977367f08) -> (tgt:0x0000022cba0b0000)
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000000c977367f08) -> (tgt:0x0000022cba0b0000)
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367ef8, Size=8)...
Target LEVEL0 RTL --> Ptr 0x000000c977367ef8 is not a device accessible memory pointer.
Target LEVEL0 RTL --> Allocated target memory 0x0000022cba0b0020 (Base: 0x0000022cba0b0020, Size: 8) from memory pool for host ptr 0x000000c977367ef8
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c977367ef8, TgtPtrBegin=0x0000022cba0b0020, Size=8, DynRefCount=1, HoldRefCount=0, Name=unknown
Libomptarget --> Moving 8 bytes (hst:0x000000c977367ef8) -> (tgt:0x0000022cba0b0020)
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000000c977367ef8) -> (tgt:0x0000022cba0b0020)
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0020 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f00, Size=8)...
Target LEVEL0 RTL --> Ptr 0x000000c977367f00 is not a device accessible memory pointer.
Target LEVEL0 RTL --> Allocated target memory 0x0000022cba0b0040 (Base: 0x0000022cba0b0040, Size: 8) from memory pool for host ptr 0x000000c977367f00
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c977367f00, TgtPtrBegin=0x0000022cba0b0040, Size=8, DynRefCount=1, HoldRefCount=0, Name=unknown
Libomptarget --> Moving 8 bytes (hst:0x000000c977367f00) -> (tgt:0x0000022cba0b0040)
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000000c977367f00) -> (tgt:0x0000022cba0b0040)
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0040 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ff7c4976d80, Size=4)...
Target LEVEL0 RTL --> Ptr 0x00007ff7c4976d80 is not a device accessible memory pointer.
Target LEVEL0 RTL --> Allocated target memory 0x0000022cba0b0060 (Base: 0x0000022cba0b0060, Size: 4) from memory pool for host ptr 0x00007ff7c4976d80
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ff7c4976d80, TgtPtrBegin=0x0000022cba0b0060, Size=4, DynRefCount=1, HoldRefCount=0, Name=unknown
Libomptarget --> Moving 4 bytes (hst:0x00007ff7c4976d80) -> (tgt:0x0000022cba0b0060)
Target LEVEL0 RTL --> Copied 4 bytes (hst:0x00007ff7c4976d80) -> (tgt:0x0000022cba0b0060)
Libomptarget --> There are 4 bytes allocated at target address 0x0000022cba0b0060 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f08, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367f08, TgtPtrBegin=0x0000022cba0b0000, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x0000022cba0b0000, Offset: 0) from host pointer 0x000000c977367f08
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367ef8, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367ef8, TgtPtrBegin=0x0000022cba0b0020, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x0000022cba0b0020, Offset: 0) from host pointer 0x000000c977367ef8
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f00, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367f00, TgtPtrBegin=0x0000022cba0b0040, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x0000022cba0b0040, Offset: 0) from host pointer 0x000000c977367f00
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ff7c4976d80, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ff7c4976d80, TgtPtrBegin=0x0000022cba0b0060, Size=4, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x0000022cba0b0060, Offset: 0) from host pointer 0x00007ff7c4976d80
Libomptarget --> Forwarding first-private value 0x0000000000000000 to the target construct
Libomptarget --> Forwarding first-private value 0x0000000000000059 to the target construct
Libomptarget --> Launching target execution __omp_offloading_eeec4b03_1bfa1__Z7Compute_l9 with pointer 0x0000022cbc17f9c0 (index=0).
Target LEVEL0 RTL --> Executing a kernel 0x0000022cbc17f9c0...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 32
Target LEVEL0 RTL --> Preferred group size is multiple of 64
Target LEVEL0 RTL --> Max group size is set to 80 (thread_limit clause)
Target LEVEL0 RTL --> Level 0: Lb = 0, Ub = 89, Stride = 1
Target LEVEL0 RTL --> Group sizes = {80, 1, 1}
Target LEVEL0 RTL --> Group counts = {2, 1, 1}
Target LEVEL0 RTL --> Created a command list 0x0000022cbc102f98 (Ordinal: 0) for device 0.
Target LEVEL0 RTL --> Created a command queue 0x0000022cb7647928 (Ordinal: 0, Index: 0) for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0x0000022cba0b0000) was set successfully.
Target LEVEL0 RTL --> Kernel Pointer argument 1 (value: 0x0000022cba0b0020) was set successfully.
Target LEVEL0 RTL --> Kernel Pointer argument 2 (value: 0x0000022cba0b0040) was set successfully.
Target LEVEL0 RTL --> Kernel Pointer argument 3 (value: 0x0000022cba0b0060) was set successfully.
Target LEVEL0 RTL --> Kernel Scalar argument 4 (value: 0x0000000000000000) was set successfully.
Target LEVEL0 RTL --> Kernel Scalar argument 5 (value: 0x0000000000000059) was set successfully.
Target LEVEL0 RTL --> Setting indirect access flags 0x0000000000000004
Target LEVEL0 RTL --> Executed a kernel 0x0000022cbc17f9c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ff7c4976d80, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ff7c4976d80, TgtPtrBegin=0x0000022cba0b0060, Size=4, DynRefCount=1 (deferred final decrement), HoldRefCount=0
Libomptarget --> There are 4 bytes allocated at target address 0x0000022cba0b0060 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f00, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367f00, TgtPtrBegin=0x0000022cba0b0040, Size=8, DynRefCount=1 (deferred final decrement), HoldRefCount=0
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0040 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367ef8, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367ef8, TgtPtrBegin=0x0000022cba0b0020, Size=8, DynRefCount=1 (deferred final decrement), HoldRefCount=0
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0020 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f08, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c977367f08, TgtPtrBegin=0x0000022cba0b0000, Size=8, DynRefCount=1 (deferred final decrement), HoldRefCount=0
Libomptarget --> There are 8 bytes allocated at target address 0x0000022cba0b0000 - is last
Libomptarget --> Moving 8 bytes (tgt:0x0000022cba0b0000) -> (hst:0x000000c977367f08)
Target LEVEL0 RTL --> Copied 8 bytes (tgt:0x0000022cba0b0000) -> (hst:0x000000c977367f08)
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ff7c4976d80, Size=4)...
Libomptarget --> Deleting tgt data 0x0000022cba0b0060 of size 4
Target LEVEL0 RTL --> Returned device memory 0x0000022cba0b0060 to memory pool
Libomptarget --> Removing map entry with HstPtrBegin=0x00007ff7c4976d80, TgtPtrBegin=0x0000022cba0b0060, Size=4, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f00, Size=8)...
Libomptarget --> Deleting tgt data 0x0000022cba0b0040 of size 8
Target LEVEL0 RTL --> Returned device memory 0x0000022cba0b0040 to memory pool
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c977367f00, TgtPtrBegin=0x0000022cba0b0040, Size=8, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367ef8, Size=8)...
Libomptarget --> Deleting tgt data 0x0000022cba0b0020 of size 8
Target LEVEL0 RTL --> Returned device memory 0x0000022cba0b0020 to memory pool
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c977367ef8, TgtPtrBegin=0x0000022cba0b0020, Size=8, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c977367f08, Size=8)...
Libomptarget --> Deleting tgt data 0x0000022cba0b0000 of size 8
Target LEVEL0 RTL --> Returned device memory 0x0000022cba0b0000 to memory pool
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c977367f08, TgtPtrBegin=0x0000022cba0b0000, Size=8, Name=unknown
Libomptarget --> Unloading target library!
Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image.
Libomptarget --> Image 0x00007ff7c498c000 is compatible with RTL 0x00007ffa3d8b0000!
Libomptarget --> Unregistered image 0x00007ff7c498c000 from RTL 0x00007ffa3d8b0000!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x00007ff7c497d000
Target LEVEL0 RTL --> Memory usage for host memory, device 0:
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 0, 0
Target LEVEL0 RTL --> -- Allocated: 0, 0
Target LEVEL0 RTL --> -- Freed : 0, 0
Target LEVEL0 RTL --> -- InUse : 0, 0
Target LEVEL0 RTL --> -- PeakUse : 0, 0
Target LEVEL0 RTL --> -- NumAllocs: 0, 0
Target LEVEL0 RTL --> Memory usage for shared memory, device 0:
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 65536, 28
Target LEVEL0 RTL --> -- Allocated: 65536, 128
Target LEVEL0 RTL --> -- Freed : 65536, 128
Target LEVEL0 RTL --> -- InUse : 0, 0
Target LEVEL0 RTL --> -- PeakUse : 65536, 128
Target LEVEL0 RTL --> -- NumAllocs: 1, 4
Target LEVEL0 RTL --> Memory usage for device memory, device 0:
Target LEVEL0 RTL --> -- Allocator: Native, Pool
Target LEVEL0 RTL --> -- Requested: 0, 0
Target LEVEL0 RTL --> -- Allocated: 0, 0
Target LEVEL0 RTL --> -- Freed : 0, 0
Target LEVEL0 RTL --> -- InUse : 0, 0
Target LEVEL0 RTL --> -- PeakUse : 0, 0
Target LEVEL0 RTL --> -- NumAllocs: 0, 0
Target LEVEL0 RTL --> Closed RTL successfully
Target LEVEL0 RTL --> Deinit Level0 plugin!
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
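The difference between the two logs can also be checked mechanically. The following is a minimal sketch, assuming the LIBOMPTARGET_DEBUG=1 output has already been captured into a string. The function name is hypothetical, and the marker text is simply copied from the log above, so it may differ in other runtime versions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines of a captured LIBOMPTARGET_DEBUG=1 log
// that report a kernel launch. The marker string is taken from the log above
// and is an assumption about the log format, not a documented interface.
int countKernelLaunches(const std::string& log) {
    const std::string marker = "Launching target execution";
    std::istringstream lines(log);
    std::string line;
    int launches = 0;
    while (std::getline(lines, line)) {
        if (line.find(marker) != std::string::npos) {
            ++launches;
        }
    }
    return launches;
}
```

A log that goes straight from "Done registering entries!" to "Deinit target library!" gives a count of zero, while the longer log above gives a count of one.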

To sum up, I still need help to get an example of a DLL which can really run on the GPU, and do so stably.

I look forward to your feedback and thanks in advance,

David.

dtroncho · ‎05-20-2022

Hi,

Maybe I almost have it. I have attached an example. If you run \GPU_CPU_Example\Dll1_VB_Test\bin\Release\net5.0\Dll1_VB_Test.exe, it will use a VC++ DLL which tries to execute on the GPU but returns this error:

Libomptarget --> Host ptr 0x00007ffa3b4c3410 does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
unknown:0:12: Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Do you know how to fix that? It looks like we need to solve it in order to, maybe, succeed.

I have uploaded the DLL and the program which uses it.

I look forward to your feedback and thanks in advance.

David.

dtroncho · ‎05-25-2022

Hello, any news on this ticket?

I look forward to your help and thanks in advance.

SantoshY_Intel · ‎05-26-2022

Hi,

We were able to reproduce your issue at our end using the steps given by you. We are working on your issue internally with the developers and will get back to you soon.

Thanks & Regards,

Santosh

Klaus-Dieter_O_Intel · ‎06-03-2022

Please try using explicit array sections for the offload mapping and not just pointers, for example:

#include <iostream>
#include <omp.h>

// MAX_TEST is the matrix dimension defined elsewhere in the project.
void Compute(int A[MAX_TEST][MAX_TEST], int B[MAX_TEST][MAX_TEST], int C[MAX_TEST][MAX_TEST])
{
    int a = 0;           // mapped "to" only, so device-side changes never reach the host
    bool is_cpu = true;  // set from inside the target region

#ifdef USE_POINTER
    #pragma omp target teams distribute parallel for map(to: A, B, a) map(tofrom: C) map(from: is_cpu)
#else
    #pragma omp target teams distribute parallel for map(to: A[0:MAX_TEST][0:MAX_TEST], B[0:MAX_TEST][0:MAX_TEST], a) map(tofrom: C[0:MAX_TEST][0:MAX_TEST]) map(from: is_cpu)
#endif
    for (int i = 0; i < MAX_TEST; i++) {
        a = a + 1;
        if (i == 0) is_cpu = omp_is_initial_device();
        for (int j = 0; j < MAX_TEST; j++) {
            for (int k = 0; k < MAX_TEST; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    // Check 1: the host copy of a is unchanged if the loop ran on the device.
    if (a == 0) {
        std::cout << "Offloaded on GPU " << std::endl;
    }
    else
        std::cout << "Executed on CPU " << std::endl;

    // Check 2: the value reported by omp_is_initial_device().
    if (! is_cpu) {
        std::cout << "Offloaded on GPU " << std::endl;
    }
    else
        std::cout << "Executed on CPU " << std::endl;
}

And you can use omp_is_initial_device() to check whether the code is executed on the GPU or the host, see https://www.openmp.org/spec-html/5.1/openmpsu166.html.
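The omp_is_initial_device() check can also be isolated into a small helper, separate from the matrix code. This is a minimal sketch under the assumption that a single empty target region is enough to probe where target regions run. The names targetRegionRanOnDevice and executionLabel are hypothetical, and the guard on _OPENMP is only there so the code still builds when OpenMP is disabled.

```cpp
#include <cassert>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns true when a target region executed on an
// offload device, false when it ran on the host or OpenMP is not enabled.
bool targetRegionRanOnDevice() {
    int onHost = 1;  // default: assume host execution
#ifdef _OPENMP
    #pragma omp target map(from: onHost)
    {
        onHost = omp_is_initial_device();
    }
#endif
    return onHost == 0;
}

// Hypothetical helper: maps the result to the messages used in this thread.
const char* executionLabel(bool ranOnDevice) {
    return ranOnDevice ? "Offloaded on GPU" : "Executed on CPU";
}
```

A caller would print executionLabel(targetRegionRanOnDevice()). If the runtime silently falls back to the host, the helper reports "Executed on CPU" rather than failing.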

dtroncho · ‎06-07-2022

Thank you for the feedback, however, it does not resolve the problem because:

I did what you mentioned, recompiled and tried, and I obtained the same result:

Libomptarget error: Host ptr 0x00007fff8615344d does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
unknown:0:15: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
In the particular example that we have shared in this post, the matrices are of size [MAX_TEST][MAX_TEST], but in the real case the matrices are created at run time by allocating memory (malloc) and then sent to the GPU via the omp shared directive, so their dimensions and sizes cannot be typed into the omp shared directive. If I try it like this: #pragma omp parallel default(none) shared(float *A, float *B, ...), I obtain an error from the Intel compiler.
I think the real problem was indicated in the first message of this thread, which probably means that the Intel compiler currently cannot create a DLL which offloads to a GPU: clang-cl: error: The use of '-LD' is not supported with '/Qiopenmp /Qopenmp-targets=spir64'.
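On the point about matrices allocated at run time: in OpenMP, the shared clause controls data sharing between host threads of a parallel region and takes variable names only, while data movement for a target region is described by map. Array sections in a map clause may use run-time expressions for their length, so a malloc'ed matrix can be mapped through its pointer. The following is a minimal sketch, assuming square row-major float matrices stored in one flat allocation. The function name computeFlat and this layout are illustrative assumptions, not the code from the attached project, and the sketch is not claimed to resolve the DLL error discussed here.

```cpp
#include <cassert>

// Hypothetical sketch: C += A * B for n x n row-major matrices whose size is
// only known at run time. The array sections take their length from n.
void computeFlat(const float* A, const float* B, float* C, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: A[0:n*n], B[0:n*n]) map(tofrom: C[0:n*n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = C[i * n + j];
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Without offloading enabled, the pragma is ignored and the loop runs on the host, so the function gives the same numerical result either way.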

We need your help to be able to publish DLLs which can be partially offloaded to intel's GPUs.

I look forward to your feedback and thanks in advance.

dtroncho · ‎06-13-2022

Hello! Any news? Thanks in advance.

Klaus-Dieter_O_Intel · ‎07-27-2022

The developers are working on a solution.

dtroncho · ‎07-28-2022

Hello Klaus,

Thanks for the update. That's very good news: it will make it possible to reuse and distribute DLLs which offload (via OpenMP) to Intel's GPUs.

I look forward to your updates.

Best regards,

David.

dtroncho · ‎08-22-2022

Hello Klaus,

I hope this message finds you very well.

Do you have any update or estimated date for this to be solved?

Thanks in advance.