Unable to allocate memory with KMPC - Page 2

jimdempseyatthecove · ‎10-18-2023

Trying out offloading in Fortran.

module myDPCPPlib
    interface
        subroutine DPCPPinit() bind(C, name="DPCPPinit")
        end subroutine DPCPPinit
    end interface
end module myDPCPPlib
    
program TestGPU
    use myDPCPPlib
    use omp_lib
    implicit none
    ! Variables
    integer i,j
    integer :: nRows
    real, allocatable :: arrayShared(:,:)
    real :: sum(4)
    ! Body of TestGPU
    print *,omp_get_num_devices()
    call DPCPPinit()
    print *,omp_get_num_devices()
    print *
    nRows = 100000
    !$omp allocate allocator(omp_target_shared_mem_alloc)
    allocate(arrayShared(4,nRows))
    do j=1,size(arrayShared, dim=2)
        do i=1,4
            arrayShared(i,j) = i*j
        end do
    end do
    
    !$omp target teams distribute parallel do map(from:nRows, arrayShared) reduction(+:sum)
    do j=1,nRows
        sum = sum + arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
end program TestGPU

extern "C" {
    void DPCPPinit() {
        // The default device selector will select the most performant device.
        auto d_selector{ default_selector_v };
        try {
            queue q(d_selector, exception_handler);

            // Print out the device information used for the kernel code.
            std::cout << "Running on device: "
                << q.get_device().get_info<info::device::name>() << "\n";
        }
        catch (exception const& e) {
            std::cout << "An exception is caught.\n";
            std::terminate();
        }
    }
}

(other code elided from library)

Output:

           1
Running on device: Intel(R) Graphics [0x9bca]
           1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
  0.0000000E+00  0.0000000E+00  0.0000000E+00  0.0000000E+00
Press any key to continue . . .

The first test, at line 18 of the program, checks whether I can get the number of devices before any GPU interaction. The test succeeds: 1 device located.

The second test, a call into the library to obtain the device information, succeeds.

The third test retries getting the number of devices, and this too succeeds.

Now, to the crux of the matter. When I reach

    !$omp allocate allocator(omp_target_shared_mem_alloc)
    allocate(arrayShared(4,nRows))

I get the warning (786) message.

The program continues and runs to completion, but the results are all 0.0.
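
Separately from the allocation warning, two details of the offload loop in the listing above may contribute to the zeros. First, `sum` is never initialized on the host, and a reduction combines its partial results into the variable's original value. Second, `map(from: ...)` does not copy host data to the device on entry, so if the array ends up in ordinary host memory (as the fallback to _mm_malloc suggests), the device would read an uninitialized buffer. The following is only a sketch of the loop with those two points changed. It uses a plain host allocation for illustration, and the program name is hypothetical:

```fortran
program ReductionSketch
    implicit none
    integer :: i, j
    integer :: nRows
    real, allocatable :: arrayShared(:,:)
    real :: sum(4)

    nRows = 100000
    ! Plain host allocation, used here only for illustration
    allocate(arrayShared(4,nRows))
    do j = 1, nRows
        do i = 1, 4
            arrayShared(i,j) = i*j
        end do
    end do

    ! The reduction adds into the original value, so set it before the loop
    sum = 0.0

    ! map(to:) copies the host array to the device on entry
    !$omp target teams distribute parallel do map(to: arrayShared) reduction(+:sum)
    do j = 1, nRows
        sum = sum + arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do

    print *, sum
end program ReductionSketch
```

Built without offload enabled, the directive is ignored and the loop runs on the host, which gives a reference result to compare against the offloaded run.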

Additional information:

The build issues the following warning:

C:\Users\jim\source\repos\TestGPU\TestGPU\TestGPU.f90(36): warning #9127: The executable ALLOCATE directive associated with an ALLOCATE statement is deprecated.
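
Warning #9127 concerns only the deprecated directive form. OpenMP 5.2 replaces the executable ALLOCATE directive with the `allocators` construct, where the allocator is named in an `allocate` clause. The sketch below assumes the compiler in use supports that construct and the `omp_target_shared_mem_alloc` allocator from the original program. The program name is hypothetical, and this change by itself is not claimed to resolve warning (786):

```fortran
program AllocatorsSketch
    use omp_lib
    implicit none
    integer :: nRows
    real, allocatable :: arrayShared(:,:)

    nRows = 100000

    ! OpenMP 5.2 form: the directive applies to the ALLOCATE statement
    ! that immediately follows it, and the clause lists the variable.
    !$omp allocators allocate(allocator(omp_target_shared_mem_alloc): arrayShared)
    allocate(arrayShared(4,nRows))

    print *, shape(arrayShared)
    deallocate(arrayShared)
end program AllocatorsSketch
```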

I suspect that warning (786) indicates that the shared memory allocation fell back to a local (host) memory allocation.

Note that with integrated graphics, the host and the GPU use the same physical memory.

Any hints would be appreciated.

Jim Dempsey

Barbara_P_Intel · ‎10-20-2023

I ran @TobiasK's version on my laptop successfully.

@jimdempseyatthecove, do you know about the env var LIBOMPTARGET_PLUGIN_PROFILE? I set it to T and got this profile info:

======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Iris(R) Xe Graphics, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_80ff4b02_b27ec6_MAIN___l27
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     722.94    722.94    722.94    722.94      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       1.89      0.13      0.00      0.58      0.00      0.00      0.00      0.00     14.00
DataRead (Device to Host) :       0.05      0.05      0.05      0.05      0.00      0.00      0.00      0.00      1.00
DataWrite (Host to Device):       0.17      0.06      0.05      0.06      0.00      0.00      0.00      0.00      3.00
Kernel 0                  :       2.29      2.29      2.29      2.29      0.09      0.09      0.09      0.09      1.00
Linking                   :       0.06      0.06      0.06      0.06      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       6.91      6.91      6.91      6.91      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

I think we can infer that USM worked (fingers crossed). In my previous experience on Linux, using MAP to transfer data in and out shows device time for DataRead/DataWrite. I don't see that here.

jimdempseyatthecove · ‎10-20-2023

>> I think we can infer that USM worked (fingers crossed).

No, you'd have to compare that report from running @TobiasK's method with the same report from running my method.

The absence of the MemMove in my run's diagnostics, and its presence in the run of @TobiasK's code,

together with a runtime of 5.25791359998402 seconds for the second loop with offload in my code

versus 7.36363490001531 seconds for the second loop with offload in @TobiasK's code,

strongly indicates that even though @TobiasK's code allocated USM, the generated code treated the allocation as if it were located in host memory (as evidenced by the memmove of 1600000 bytes of memory).
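
The 1600000-byte figure is consistent with one full copy of a 4 x 100000 array of default reals, which is the shape used in the program at the top of the thread. That the other version uses the same shape is an assumption here. A minimal sketch of the arithmetic, with hypothetical names:

```fortran
program ByteCountSketch
    implicit none
    integer, parameter :: nRows = 100000
    real :: element          ! default real, as in the posted program
    integer :: bytesPerElement, totalBytes

    ! storage_size returns bits, so divide by 8 to get bytes
    bytesPerElement = storage_size(element) / 8
    totalBytes = 4 * nRows * bytesPerElement
    print *, totalBytes
end program ByteCountSketch
```

With a 4-byte default real, this works out to 4 * 100000 * 4 = 1600000 bytes.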

BTW I am unable to get any additional output with LIBOMPTARGET_PLUGIN_PROFILE=T

Jim Dempsey