Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Unable to allocate memory with KMPC

jimdempseyatthecove
Honored Contributor III

Trying out offloading in Fortran.

module myDPCPPlib
    interface
        subroutine DPCPPinit() bind(C, name="DPCPPinit")
        end subroutine DPCPPinit
    end interface
end module myDPCPPlib
    
program TestGPU
    use myDPCPPlib
    use omp_lib
    implicit none
    ! Variables
    integer i,j
    integer :: nRows
    real, allocatable :: arrayShared(:,:)
    real :: sum(4)
    ! Body of TestGPU
    print *,omp_get_num_devices()
    call DPCPPinit()
    print *,omp_get_num_devices()
    print *
    nRows = 100000
    !$omp allocate allocator(omp_target_shared_mem_alloc)
    allocate(arrayShared(4,nRows))
    do j=1,size(arrayShared, dim=2)
        do i=1,4
            arrayShared(i,j) = i*j
        end do
    end do
    
    !$omp target teams distribute parallel do map(from:nRows, arrayShared) reduction(+:sum)
    do j=1,nRows
        sum = sum + arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
end program TestGPU

C++ (DPC++) library routine:

extern "C" {
    void DPCPPinit() {
        // The default device selector will select the most performant device.
        auto d_selector{ default_selector_v };
        try {
            queue q(d_selector, exception_handler);

            // Print out the device information used for the kernel code.
            std::cout << "Running on device: "
                << q.get_device().get_info<info::device::name>() << "\n";
        }
        catch (exception const& e) {
            std::cout << "An exception is caught.\n";
            std::terminate();
        }
    }
}

(other code elided from library)

Output:

           1
Running on device: Intel(R) Graphics [0x9bca]
           1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
  0.0000000E+00  0.0000000E+00  0.0000000E+00  0.0000000E+00
Press any key to continue . . .

The first test, at line 18 of the program, checks whether I can get the number of devices before any GPU interaction. The test succeeds: 1 device located.

The second test, to call into the library to obtain the device information, succeeds.

The third test retries the query for the number of devices, and this too succeeds.

 

Now, to the crux of the matter. When I reach

    !$omp allocate allocator(omp_target_shared_mem_alloc)
    allocate(arrayShared(4,nRows))

I get the warning (786) message.

The program continues.

The code runs to completion, but the results are all 0.0's.

Additional information:

The build issues:

C:\Users\jim\source\repos\TestGPU\TestGPU\TestGPU.f90(36): warning #9127: The executable ALLOCATE directive associated with an ALLOCATE statement is deprecated.

I suspect the warning indicates that the shared memory allocation fell back to a local memory allocation.

Note, with the integrated graphics, the Host and GPU memory are the same.

 

Any hints would be appreciated.

 

Jim Dempsey

 

 

jimdempseyatthecove
Honored Contributor III

Ok, I see that the OpenMP 4 spec has

!$omp allocate[(list)] [allocator(...)]

but OpenMP 5 does not.

 

What is the recommended way to obtain shared memory (USM)?

I know I can call my own DPC++ library to allocate the memory and return a C_PTR, then use C_F_POINTER to convert the pointer. But this is not clean. There should be an !$omp... directive that does this without the glue code.
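
For readers unfamiliar with the glue step mentioned above, here is a minimal sketch using only the standard ISO_C_BINDING module. It wraps raw storage in a Fortran array pointer with C_F_POINTER. The storage comes from the C library's malloc purely as a stand-in so the example runs with any Fortran compiler; in the scenario discussed in this thread the C_PTR would instead come from a shared-memory (USM) allocation. The names glue_sketch, c_malloc, c_free, and wrap_buffer are hypothetical.

```fortran
module glue_sketch
    use, intrinsic :: iso_c_binding, only: c_ptr, c_size_t, c_float, &
                                           c_sizeof, c_associated, c_f_pointer
    implicit none
    interface
        ! Stand-in allocator (assumption): plain C malloc/free, NOT a USM allocator.
        function c_malloc(nbytes) bind(C, name="malloc") result(p)
            import :: c_ptr, c_size_t
            integer(c_size_t), value :: nbytes
            type(c_ptr) :: p
        end function c_malloc
        subroutine c_free(p) bind(C, name="free")
            import :: c_ptr
            type(c_ptr), value :: p
        end subroutine c_free
    end interface
contains
    ! Allocate room for a 4 x nRows array and attach a Fortran pointer to it.
    subroutine wrap_buffer(nRows, raw, array)
        integer, intent(in) :: nRows
        type(c_ptr), intent(out) :: raw
        real(c_float), pointer, intent(out) :: array(:,:)
        integer(c_size_t) :: nbytes

        nbytes = 4_c_size_t * int(nRows, c_size_t) * c_sizeof(1.0_c_float)
        raw = c_malloc(nbytes)
        if (.not. c_associated(raw)) error stop "allocation failed"
        ! Build the array descriptor; no data is copied.
        call c_f_pointer(raw, array, [4, nRows])
    end subroutine wrap_buffer
end module glue_sketch

program glue_demo
    use glue_sketch
    implicit none
    type(c_ptr) :: raw
    real(c_float), pointer :: a(:,:)

    call wrap_buffer(10, raw, a)
    a = 1.0
    print *, shape(a), sum(a)
    nullify(a)
    call c_free(raw)
end program glue_demo
```

Note that the descriptor built by C_F_POINTER lives wherever the pointer variable lives (here, ordinary host memory); only the data block comes from the allocator.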

 

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

I've worked out the allocation using the OpenMP 5.0 API.

But the behavior is odd.

[Screenshot: jimdempseyatthecove_0-1697650534950.png]

The allocation returns a C_PTR,

pointer validated as not NULL,

call to C_F_POINTER works.

associated indicates true

print of array shows all 0.0's (debug build zeroed array)

host do loop filling in array works

print out of array shows values (Host access works)

sum is correct

sum set to 999.

!$omp target teams distribute parallel do map(from:nRows) reduction(+:sum)

loop runs (presumably on GPU)

returned sum was not modified

I also tried

!$omp target teams distribute parallel do map(from:nRows) map(tofrom:sum) reduction(+:sum)

with no luck.

An additional point of interest is in the Locals window

arrayShared appears twice as undefined

Yet I can print out the contents.

 

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Setting environment variable LIBOMPTARGET_DEBUG=1

output at break point on first statement of program

Libomptarget --> Init target library!
Libomptarget --> Callback to __tgt_register_ptask_services with handlers 0x00007ffa930a1300 0x00007ffa9309ff40
Libomptarget --> Initialized OMPT
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --> Init Level0 plugin!
Libomptarget --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Graphics [0x9bca]
Target LEVEL0 RTL --> Found 1 root devices, 1 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> Root Device Information
Target LEVEL0 RTL --> Device 0
Target LEVEL0 RTL --> -- Name                         : Intel(R) Graphics [0x9bca]
Target LEVEL0 RTL --> -- PCI ID                       : 0x9bca
Target LEVEL0 RTL --> -- Number of total EUs          : 24
Target LEVEL0 RTL --> -- Number of threads per EU     : 7
Target LEVEL0 RTL --> -- EU SIMD width                : 8
Target LEVEL0 RTL --> -- Number of EUs per subslice   : 8
Target LEVEL0 RTL --> -- Number of subslices per slice: 3
Target LEVEL0 RTL --> -- Number of slices             : 1
Target LEVEL0 RTL --> -- Local memory size (bytes)    : 65536
Target LEVEL0 RTL --> -- Global memory size (bytes)   : 3353759744
Target LEVEL0 RTL --> -- Cache size (bytes)           : 524288
Target LEVEL0 RTL --> -- Max clock frequency (MHz)    : 1150
Target LEVEL0 RTL --> Driver API version is 10003
Target LEVEL0 RTL --> Interop property IDs, Names, Descriptions
Target LEVEL0 RTL --> -- 0, device_num_eus, intptr_t, total number of EUs
Target LEVEL0 RTL --> -- 1, device_num_threads_per_eu, intptr_t, number of threads per EU
Target LEVEL0 RTL --> -- 2, device_eu_simd_width, intptr_t, physical EU simd width
Target LEVEL0 RTL --> -- 3, device_num_eus_per_subslice, intptr_t, number of EUs per sub-slice
Target LEVEL0 RTL --> -- 4, device_num_subslices_per_slice, intptr_t, number of sub-slices per slice
Target LEVEL0 RTL --> -- 5, device_num_slices, intptr_t, number of slices
Target LEVEL0 RTL --> -- 6, device_local_mem_size, intptr_t, local memory size in bytes
Target LEVEL0 RTL --> -- 7, device_global_mem_size, intptr_t, global memory size in bytes
Target LEVEL0 RTL --> -- 8, device_global_mem_cache_size, intptr_t, global memory cache size in bytes
Target LEVEL0 RTL --> -- 9, device_max_clock_frequency, intptr_t, max clock frequency in MHz
Target LEVEL0 RTL --> -- 10, is_imm_cmd_list, intptr_t, Using immediate command list
Target LEVEL0 RTL --> Found driver extensions:
Target LEVEL0 RTL --> -- ZE_extension_float_atomics
Target LEVEL0 RTL --> -- ZE_experimental_relaxed_allocation_limits
Target LEVEL0 RTL --> -- ZE_experimental_module_program
Target LEVEL0 RTL --> -- ZE_experimental_scheduling_hints
Target LEVEL0 RTL --> -- ZE_experimental_global_offset
Target LEVEL0 RTL --> -- ZE_extension_pci_properties
Target LEVEL0 RTL --> -- ZE_extension_memory_compression_hints
Target LEVEL0 RTL --> -- ZE_extension_memory_free_policies
Target LEVEL0 RTL --> Returning 1 top-level devices
Libomptarget --> Registering RTL omptarget.rtl.level0.dll supporting 1 devices!
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_base
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_managed
Libomptarget --> Optional interface: __tgt_rtl_data_realloc
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc
Libomptarget --> Optional interface: __tgt_rtl_register_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_unregister_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_get_context_handle
Libomptarget --> Optional interface: __tgt_rtl_init_ompt
Libomptarget --> Optional interface: __tgt_rtl_requires_mapping
Libomptarget --> Optional interface: __tgt_rtl_push_subdevice
Libomptarget --> Optional interface: __tgt_rtl_pop_subdevice
Libomptarget --> Optional interface: __tgt_rtl_add_build_options
Libomptarget --> Optional interface: __tgt_rtl_is_supported_device
Libomptarget --> Optional interface: __tgt_rtl_create_interop
Libomptarget --> Optional interface: __tgt_rtl_release_interop
Libomptarget --> Optional interface: __tgt_rtl_use_interop
Libomptarget --> Optional interface: __tgt_rtl_get_num_interop_properties
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_value
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_info
Libomptarget --> Optional interface: __tgt_rtl_get_interop_rc_desc
Libomptarget --> Optional interface: __tgt_rtl_get_num_sub_devices
Libomptarget --> Optional interface: __tgt_rtl_is_accessible_addr_range
Libomptarget --> Optional interface: __tgt_rtl_notify_indirect_access
Libomptarget --> Optional interface: __tgt_rtl_is_private_arg_on_host
Libomptarget --> Optional interface: __tgt_rtl_command_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_command_batch_end
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_end
Libomptarget --> Optional interface: __tgt_rtl_set_function_ptr_map
Libomptarget --> Optional interface: __tgt_rtl_alloc_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_free_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region
Libomptarget --> Optional interface: __tgt_rtl_get_device_info
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc_shared
Libomptarget --> Optional interface: __tgt_rtl_prefetch_shared_mem
Libomptarget --> Optional interface: __tgt_rtl_get_device_from_ptr
Libomptarget --> Optional interface: __tgt_rtl_flush_queue
Libomptarget --> Optional interface: __tgt_rtl_sync_barrier
Libomptarget --> Optional interface: __tgt_rtl_async_barrier
Libomptarget --> Optional interface: __tgt_rtl_memcpy_rect_3d
Target LEVEL0 RTL --> Initialized OMPT
Libomptarget --> Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget --> Unable to load library 'omptarget.rtl.x86_64.dll': omptarget.rtl.x86_64.dll: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> RTLs loaded!
Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image.
Libomptarget --> Image 0x00007ff761178000 is compatible with RTL omptarget.rtl.level0.dll!
Libomptarget --> RTL 0x00000296eb5bd4b0 has index 0!
Libomptarget --> Registering image 0x00007ff761178000 with RTL omptarget.rtl.level0.dll!
Libomptarget --> Done registering entries!

Looks fine so far.

Output following call to omp_aligned_alloc

Libomptarget --> Call to llvm_omp_target_alloc_shared for device 0 requesting 80 bytes
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target LEVEL0 RTL --> Initialize requires flags to 8
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200010000
Target LEVEL0 RTL --> Initialized device memory pool for device 0x00000296ed819978: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Allocated a shared memory 0x00000296f1950000
Target LEVEL0 RTL --> Initialized shared memory pool for device 0x00000296ed819978: AllocUnit = 65536, AllocMax = 8388608, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Initialized reduction scratch pool for device 0x00000296ed819978: AllocMin = 65536, AllocMax = 268435456, PoolSizeMax = 8589934592
Target LEVEL0 RTL --> Initialized zero-initialized reduction counter pool for device 0x00000296ed819978: AllocMin = 64, AllocMax = 64, PoolSizeMax = 1048576
Target LEVEL0 RTL --> Allocated a host memory 0x00000296f1950000
Target LEVEL0 RTL --> Initialized host memory pool for device 0x00000296ed819978: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Created a command queue 0x00000296ed816c58 (Ordinal: 0, Index: 0) for device 0.
Target LEVEL0 RTL --> Initialized Level0 device 0
Libomptarget --> Device 0 is ready to use.
Target LEVEL0 RTL --> Allocated a shared memory 0x00000296f1950000
Target LEVEL0 RTL --> New block allocation for shared memory pool: base = 0x00000296f1950000, size = 65536, pool size = 65536
Libomptarget --> llvm_omp_target_alloc_shared returns device ptr 0x00000296f1950000

Indicating shared memory allocated.

Output following  !$omp target teams distribute parallel do map(from:nRows) reduction(+:sum)

Libomptarget --> Entering target region for device 0 with entry point 0x00007ff761165b66
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 1
Libomptarget --> Device 0 is ready to use.
Target LEVEL0 RTL --> Device 0: Loading binary from 0x00007ff761178000
Target LEVEL0 RTL --> Expecting to have 1 entries defined
Target LEVEL0 RTL --> Base L0 module compilation options: -cl-std=CL2.0
Target LEVEL0 RTL --> Found a single section in the image
Target LEVEL0 RTL --> Created module from image #0.
Target LEVEL0 RTL --> Module link is not required
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table_size' of size 8 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 8 bytes).
Target LEVEL0 RTL --> Created a command list 0x00000296ed83fcc8 (Ordinal: 0) for device 0.
Target LEVEL0 RTL --> Warning: number of entries in host and device offload tables mismatch (1 != 3).
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table' of size 120 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 120 bytes).
Target LEVEL0 RTL --> Device offload table loaded:
Target LEVEL0 RTL -->   0:      _ZL14name_val_table_e602d2b9dbc73bc7d12981ced14f3ce5
Target LEVEL0 RTL -->   1:      _ZL7pone_ld_3d4ae508d8dbf78737978824de0e0216
Target LEVEL0 RTL -->   2:      __omp_offloading_8c89ed45_a473_MAIN___l62
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_8c89ed45_a473_MAIN___l62_kernel_info' of unknown size on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 80 bytes).
Target LEVEL0 RTL --> Kernel 0: Entry = 0x00007ff761165b66, Name = __omp_offloading_8c89ed45_a473_MAIN___l62, NumArgs = 6, Handle = 0x00000296f43056b0
Target LEVEL0 RTL --> Looking up device global variable '__omp_spirv_program_data' of size 64 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 64 bytes).
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200010000
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200310000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80200310000, size = 65536, pool size = 65536
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200320000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80200320000, size = 65536, pool size = 131072
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200330000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80200330000, size = 65536, pool size = 196608
Libomptarget --> Entry  0: Base=0x000000f0a7aff52c, Begin=0x000000f0a7aff52c, Size=4, Type=0x22, Name=TESTGPU$NROWS
Libomptarget --> Entry  1: Base=0x000000f0a7aff540, Begin=0x000000f0a7aff540, Size=16, Type=0x223, Name=TESTGPU$SUM
Libomptarget --> Entry  2: Base=0x000000f0a7aff4c0, Begin=0x000000f0a7aff4c0, Size=96, Type=0x20, Name=TESTGPU$ARRAYSHARED
Libomptarget --> Entry  3: Base=0x000000f0a7aff4c0, Begin=0x00000296f1950000, Size=80, Type=0x3000000000213, Name=TESTGPU$ARRAYSHARED_addr_a0
Libomptarget --> Entry  4: Base=0x000000f0a7aff4c0, Begin=0x000000f0a7aff4c8, Size=88, Type=0x3000000000001, Name=TESTGPU$ARRAYSHARED_dv_len
Libomptarget --> Entry  5: Base=0x00007ff76116f1c0, Begin=0x00007ff76116f1c0, Size=4096, Type=0x4a0, Name=unknown
Libomptarget --> Entry  6: Base=0x00007ff76116f1c4, Begin=0x00007ff76116f1c4, Size=4, Type=0x40a0, Name=unknown
Libomptarget --> Entry  7: Base=0x0000000000000006, Begin=0x0000000000000006, Size=0, Type=0x120, Name=unknown
Libomptarget --> loop trip count is 0.
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff52c, Size=4)...
Target LEVEL0 RTL --> Ptr 0x000000f0a7aff52c requires mapping
Target LEVEL0 RTL --> Allocated a shared memory 0x00000296f1e10000
Target LEVEL0 RTL --> New block allocation for shared memory pool: base = 0x00000296f1e10000, size = 65536, pool size = 131072
Libomptarget --> Creating new map entry with HstPtrBase=0x000000f0a7aff52c, HstPtrBegin=0x000000f0a7aff52c, TgtPtrBegin=0x00000296f1e10000, Size=4, DynRefCount=1, HoldRefCount=0, Name=TESTGPU$NROWS
Libomptarget --> There are 4 bytes allocated at target address 0x00000296f1e10000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff540, Size=16)...
Target LEVEL0 RTL --> Ptr 0x000000f0a7aff540 requires mapping
Libomptarget --> Creating new map entry with HstPtrBase=0x000000f0a7aff540, HstPtrBegin=0x000000f0a7aff540, TgtPtrBegin=0x00000296f1e10040, Size=16, DynRefCount=1, HoldRefCount=0, Name=TESTGPU$SUM
Libomptarget --> Moving 16 bytes (hst:0x000000f0a7aff540) -> (tgt:0x00000296f1e10040)
Target LEVEL0 RTL --> Copied 16 bytes (hst:0x000000f0a7aff540) -> (tgt:0x00000296f1e10040)
Libomptarget --> There are 16 bytes allocated at target address 0x00000296f1e10040 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c0, Size=96)...
Target LEVEL0 RTL --> Ptr 0x000000f0a7aff4c0 requires mapping
Libomptarget --> Creating new map entry with HstPtrBase=0x000000f0a7aff4c0, HstPtrBegin=0x000000f0a7aff4c0, TgtPtrBegin=0x00000296f1950080, Size=96, DynRefCount=1, HoldRefCount=0, Name=TESTGPU$ARRAYSHARED
Libomptarget --> There are 96 bytes allocated at target address 0x00000296f1950080 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c0, Size=8)...
Libomptarget --> Mapping exists (implicit) with HstPtrBegin=0x000000f0a7aff4c0, TgtPtrBegin=0x00000296f1950080, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0x00000296f1950080 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000296f1950000, Size=80)...
Target LEVEL0 RTL --> Ptr 0x00000296f1950000 does not require mapping
Libomptarget --> Return HstPtrBegin 0x00000296f1950000 Size=80 for device-accessible memory
Libomptarget --> There are 80 bytes allocated at target address 0x00000296f1950000 - is not new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff4c8, TgtPtrBegin=0x00000296f1950088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=TESTGPU$ARRAYSHARED_dv_len
Libomptarget --> Moving 88 bytes (hst:0x000000f0a7aff4c8) -> (tgt:0x00000296f1950088)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x000000f0a7aff4c8) -> (tgt:0x00000296f1950088)
Libomptarget --> There are 88 bytes allocated at target address 0x00000296f1950088 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff52c, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff52c, TgtPtrBegin=0x00000296f1e10000, Size=4, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x00000296f1e10000, Offset: 0) from host pointer 0x000000f0a7aff52c
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff540, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff540, TgtPtrBegin=0x00000296f1e10040, Size=16, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x00000296f1e10040, Offset: 0) from host pointer 0x000000f0a7aff540
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff4c0, TgtPtrBegin=0x00000296f1950080, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0x00000296f1950080, Offset: 0) from host pointer 0x000000f0a7aff4c0
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200340000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80200340000, size = 524288, pool size = 524288
Target LEVEL0 RTL --> Allocated 4096 bytes from scratch pool
Libomptarget --> Allocated 4096 bytes of target memory at 0xffffb80200340000 for private array 0x00007ff76116f1c0 - pushing target argument 0xffffb80200340000
Target LEVEL0 RTL --> Allocated a device memory 0xffffb802003c0000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb802003c0000, size = 65536, pool size = 65536
Target LEVEL0 RTL --> Allocated 4 bytes from zero-initialized pool
Libomptarget --> Allocated 4 bytes of target memory at 0xffffb802003c0000 for private array 0x00007ff76116f1c4 - pushing target argument 0xffffb802003c0000
Libomptarget --> Forwarding first-private value 0x0000000000000006 to the target construct
Libomptarget --> Launching target execution __omp_offloading_8c89ed45_a473_MAIN___l62 with pointer 0x00000296f44a1f80 (index=0).
Target LEVEL0 RTL --> Executing a kernel 0x00000296f44a1f80...
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred team size is multiple of 16
Target LEVEL0 RTL --> Capping maximum team size to 1024 due to kernel constraints (reduction).
Target LEVEL0 RTL --> Capping maximum thread groups count to 1024 due to kernel constraints (reduction).
Target LEVEL0 RTL --> Team sizes = {224, 1, 1}
Target LEVEL0 RTL --> Number of teams = {12, 1, 1}
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0x00000296f1e10000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 1 (value: 0x00000296f1e10040) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 2 (value: 0x00000296f1950080) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 3 (value: 0xffffb80200340000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 4 (value: 0xffffb802003c0000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Scalar argument 5 (value: 0x0000000000000006) was set successfully for device 0.
Target LEVEL0 RTL --> Setting indirect access flags 0x0000000000000006
Target LEVEL0 RTL --> Submitted kernel 0x00000296f43056b0 to device 0
Target LEVEL0 RTL --> Executed kernel entry 0x00000296f44a1f80 on device 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff4c8, TgtPtrBegin=0x00000296f1950088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0x00000296f1950088 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00000296f1950000, Size=80)...
Target LEVEL0 RTL --> Ptr 0x00000296f1950000 does not require mapping
Libomptarget --> Get HstPtrBegin 0x00000296f1950000 Size=80 for device-accessible memory
Libomptarget --> There are 80 bytes allocated at target address 0x00000296f1950000 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff4c0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff4c0, TgtPtrBegin=0x00000296f1950080, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0x00000296f1950080 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff540, Size=16)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff540, TgtPtrBegin=0x00000296f1e10040, Size=16, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 16 bytes allocated at target address 0x00000296f1e10040 - is last
Libomptarget --> Moving 16 bytes (tgt:0x00000296f1e10040) -> (hst:0x000000f0a7aff540)
Target LEVEL0 RTL --> Copied 16 bytes (tgt:0x00000296f1e10040) -> (hst:0x000000f0a7aff540)
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000f0a7aff52c, Size=4)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000f0a7aff52c, TgtPtrBegin=0x00000296f1e10000, Size=4, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4 bytes allocated at target address 0x00000296f1e10000 - is last
Libomptarget --> Moving 4 bytes (tgt:0x00000296f1e10000) -> (hst:0x000000f0a7aff52c)
Target LEVEL0 RTL --> Copied 4 bytes (tgt:0x00000296f1e10000) -> (hst:0x000000f0a7aff52c)
Libomptarget --> Removing map entry with HstPtrBegin=0x000000f0a7aff4c0, TgtPtrBegin=0x00000296f1950080, Size=96, Name=TESTGPU$ARRAYSHARED
Libomptarget --> Deleting tgt data 0x00000296f1950080 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x000000f0a7aff540, TgtPtrBegin=0x00000296f1e10040, Size=16, Name=TESTGPU$SUM
Libomptarget --> Deleting tgt data 0x00000296f1e10040 of size 16
Libomptarget --> Removing map entry with HstPtrBegin=0x000000f0a7aff52c, TgtPtrBegin=0x00000296f1e10000, Size=4, Name=TESTGPU$NROWS
Libomptarget --> Deleting tgt data 0x00000296f1e10000 of size 4

Does this help?

 

Jim Dempsey

 

JohnNichols
Valued Contributor III

Can I say no, and I think you need a real expert, not me. 

jimdempseyatthecove
Honored Contributor III

Let me sketch what I would like to do.

 

From a Fortran program I want to use shared memory to hold an allocatable array.

To accomplish this, it appears that I need to declare the unallocated array as a pointer.

 

real, pointer :: array(:)

 

On the host, I can call omp_aligned_alloc to obtain a block of shared memory, use C_F_POINTER to construct an array descriptor for the pointer, and access the data. However, I am having trouble accessing it from the GPU.

The problem I am encountering is that the pointer itself resides in host memory (not in shared memory).

I would prefer the pointer itself (meaning the storage holding the array descriptor, together with its contents) to be stored in shared memory. That way it only needs to be initialized (via C_F_POINTER) once for both the CPU and the GPU to access the shared memory data.

I cannot use map(to:array), map(from:array), or map(tofrom:array), as this would copy the data that the pointer refers to.

So, how do I accomplish this?

Note that I cannot pass the void* obtained from the aligned allocation into the offload region: I cannot use a block/end block to declare an alternate pointer and then call C_F_POINTER, as this call is not available in an offload region.

 

A little help by way of a working example would be appreciated.

 

The objective is to have the same code run with or without the GPU and not have a bunch of hacks surrounding the original code.

 

Jim Dempsey 

Barbara_P_Intel
Employee

I haven't experimented with USM so I can't help there. BUT...

>> The objective is to have the same code run with or without the GPU and not have a bunch of hacks surrounding the original code.

There is an environment variable that you can set to use the GPU or not: 

  • OMP_TARGET_OFFLOAD = mandatory | disabled | default

Here's a description from the Developer Guide and Reference (DGR):

Controls the program behavior when offloading a target region.

Possible values:

  • MANDATORY: Program execution is terminated if a device construct or device memory routine is encountered and the device is not available or is not supported.

  • DISABLED: Disables target offloading to devices and execution occurs on the host.

  • DEFAULT: Target offloading is enabled if the device is available and supported.

Default: DEFAULT
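
To see the effect of this variable, a small probe along the following lines may help. It is only a sketch: it reports whether a target region actually executed on the host or on a device, using the standard OpenMP routine omp_is_initial_device. The names offload_probe and target_ran_on_host are hypothetical. The `!$` conditional-compilation sentinel keeps the code buildable without OpenMP, in which case it always reports the host.

```fortran
module offload_probe
    implicit none
contains
    ! Returns .true. if the target region below executed on the host.
    function target_ran_on_host() result(on_host)
        !$ use omp_lib, only: omp_is_initial_device
        logical :: on_host
        logical :: flag

        flag = .true.   ! stays .true. when built without OpenMP
        !$omp target map(tofrom: flag)
        !$ flag = omp_is_initial_device()
        !$omp end target
        on_host = flag
    end function target_ran_on_host
end module offload_probe

program where_did_it_run
    use offload_probe
    implicit none

    if (target_ran_on_host()) then
        print *, "target region ran on the host"
    else
        print *, "target region ran on a device"
    end if
end program where_did_it_run
```

Running the same executable with OMP_TARGET_OFFLOAD set to DISABLED and then to MANDATORY should show the two cases, assuming a supported device is present.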

 

jimdempseyatthecove
Honored Contributor III

Thanks Barbara,

The diagnostic dump of the run indicated (to me) that the offload ran, but the results were not placed into the shared memory.

Where they went, I haven't a clue.

Also note that reduction(+:sum) either did not initialize sum to zeros on entry to the offload region, or did not return the zeroed sum, regardless of whether the do loop executed.

No memory access error either.

 

If you can ask around, the question is: how do you place the following in USM?

 

!$omp ??? place this in USM

real, pointer :: array(:)

!$omp end ???

 

Such that I can use the omp_aligned_alloc(...) to allocate a buffer in shared memory and then create the pointer/descriptor in shared memory.

.AND.

have the pointer array persist across multiple offloads.

While I could construct a pointer to a UDT containing the pointer, then fill it in with a pointer obtained from omp_aligned_alloc for shared memory, I cannot figure out how to make this pointer visible in the offload section.

I'd like not to have to hack in a solution.

Jim Dempsey

TobiasK
Moderator

@jimdempseyatthecove 

The reason your results are incorrect is that you specify map(from:nRows), which means the loop do j=1,nRows is effectively do j=1,<uninitialized>.
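
A minimal sketch of the corrected mapping, under the assumption that the array is an ordinary host array (no USM involved): inputs are mapped with map(to:), and the reduction variable is initialized on the host and mapped tofrom. The names row_sum_sketch, row_sums, and acc are hypothetical; acc replaces the original variable sum to avoid shadowing the intrinsic. Without OpenMP the directives are plain comments and the loop runs on the host.

```fortran
module row_sum_sketch
    implicit none
contains
    ! Sum each of the 4 rows of a(4, nRows) over all columns.
    function row_sums(a) result(total)
        real, intent(in) :: a(:,:)
        real :: total(4)
        real :: acc(4)
        integer :: j, nRows

        nRows = size(a, dim=2)
        acc = 0.0   ! the reduction combines into this initial value

        ! map(to:) copies the inputs to the device; map(from:) would
        ! leave nRows undefined there, giving a meaningless trip count.
        !$omp target teams distribute parallel do &
        !$omp&   map(to: nRows, a) map(tofrom: acc) reduction(+: acc)
        do j = 1, nRows
            acc = acc + a(1:4, j)
        end do
        !$omp end target teams distribute parallel do

        total = acc
    end function row_sums
end module row_sum_sketch

program map_fix_demo
    use row_sum_sketch
    implicit none
    real, allocatable :: a(:,:)
    integer :: i, j

    allocate(a(4, 1000))
    do j = 1, 1000
        do i = 1, 4
            a(i, j) = real(i)
        end do
    end do
    print *, row_sums(a)
end program map_fix_demo
```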

I don't like the idea behind USM, which is why I have never used it. I will ask some developers whether there is another way without c_ptrs, or whether we have to wait until OpenMP 6 and keep using the deprecated !$omp allocate allocator() until then.

TobiasK
Moderator

@jimdempseyatthecove 

Ah, I learned something new today.

 

!$omp allocate allocator(omp_target_shared_mem_alloc)
allocate(x(N))

 

should be replaced by:

 

!$omp allocators allocate(allocator(omp_target_shared_mem_alloc):x)
allocate(x(N))

 

It's just the spelling that is deprecated, not the functionality.

Barbara_P_Intel
Employee

I did some googling with omp_target_shared_mem_alloc and found this in the Intel GPU Optimization Guide. That may be of interest.

As @TobiasK pointed out, use the updated syntax, i.e. !$omp allocators. Tobias said he'll get the document updated.

jimdempseyatthecove
Honored Contributor III

Thanks, looking at it now

 

Jim

jimdempseyatthecove
Honored Contributor III

@TobiasK @Barbara_P_Intel 

Earlier, when I compiled with

!$omp allocate allocator(omp_target_shared_mem_alloc)
allocate( f1n(1:nvir,1:nvir) )

ifx stated this was deprecated. OpenMP 4.0 lists this, OpenMP 5.0 does not.

In any event, I want not only the contents of the allocation in USM but also the array descriptor f1n.

This will permit both the host and the GPU to use "f1n" without some hack code to get the descriptor onto the GPU (or onto the host, as the case may be).

 

I hope this explains the problem.

 

The allocation listed above, would be appropriate for

 

   Call SomeDPCPPsubroutine(C_LOC(f1n(1,1)), nvir, nvir)

 

I need the array descriptor itself (and the memory allocated to it) to be in USM so that Fortran language offloads can access the data.

 

Jim Dempsey

0 Kudos
TobiasK
Moderator
2,413 Views

@jimdempseyatthecove 
Only the syntax is deprecated, not the functionality, so you can ignore the warning or switch to the new syntax:

 

!$omp allocate allocator(omp_target_shared_mem_alloc)
allocate( f1n(1:nvir,1:nvir) )

 

to:

 

!$omp allocators allocate(allocator(omp_target_shared_mem_alloc):f1n)
allocate( f1n(1:nvir,1:nvir) )

 

Or are you saying that even that does not fix your problem?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,392 Views
program
  real,allocatable :: fn1(:,:) ! there are no decorations on this
...
  !$omp allocate allocator(omp_target_shared_mem_alloc)
  allocate(fn1(1:nvir,1:nvir)) ! USM memory allocation inserted into fn1

! Note, the array descriptor fn1 is residing in local memory

consider

module foo
type bar_t
  real, allocatable :: array(:)
end type bar_t
type(bar_t) :: abar
end module foo
...
!$omp allocate allocator(omp_target_shared_mem_alloc)
allocate(abar%array(1234)) ! abar%array -> USM
...
!$omp task...
   abar%array(I) = expression
! where is abar located??
! where is the array descriptor array contained within abar located?

When I do build with !$omp allocate...

I do get the correct result

But I get warning 9127 that this feature is deprecated (possibly being removed in a future release)

And I get the Unable to allocate memory with KMPC, _mm_malloc used instead

And, worse, with LIBOMPTARGET_DEBUG=1 no offload debug trace output is emitted. IOW it compiled (executed) host code for the offload region.

 

With the hack I posted earlier, and LIBOMPTARGET_DEBUG=1, I do not receive the Unable to allocate memory... message, and I receive the offload debug trace output.

 

My conclusion is that !$omp allocate allocator(omp_target_shared_mem_alloc) is not working.

 

Jim Dempsey

0 Kudos
TobiasK
Moderator
2,372 Views

@jimdempseyatthecove 

could you try this:

program TestGPU
    use omp_lib
    implicit none
    ! Variables
    integer i,j
    integer :: nRows
    type d
        real, allocatable :: arrayShared(:,:)
    end type 
    type(d) :: test    
    !$omp allocate(test) allocator(omp_target_shared_mem_alloc)
    real :: sum(4)
    ! Body of TestGPU
    print *,omp_get_num_devices()

    print *
    nRows = 100000
    !$omp allocators allocate(allocator(omp_target_shared_mem_alloc):test%arrayshared)
    allocate(test%arrayShared(4,nRows))
    do j=1,size(test%arrayShared, dim=2)
        do i=1,4
            test%arrayShared(i,j) = i*j
        end do
    end do
    
    sum=0
    !$omp target teams distribute parallel do reduction(+:sum)
    do j=1,nRows
        sum = sum + test%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
end program TestGPU
0 Kudos
jimdempseyatthecove
Honored Contributor III
2,369 Views

Thanks, will do.

I wasn't aware of the ":xxx" feature.

 

Jim

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,365 Views

Initial test has problems

[Attached screenshot: jimdempseyatthecove_0-1697813025333.png]

And:

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
           1

forrtl: severe (151): allocatable array is already allocated
Image              PC                Routine            Line        Source
TestGPU.exe        00007FF6582913C0  TESTGPU                   105  TestGPU.f90
TestGPU.exe        00007FF65829259B  Unknown               Unknown  Unknown
TestGPU.exe        00007FF658292B69  Unknown               Unknown  Unknown
TestGPU.exe        00007FF658292A8E  Unknown               Unknown  Unknown
TestGPU.exe        00007FF65829294E  Unknown               Unknown  Unknown
TestGPU.exe        00007FF658292BDE  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFAD3127344  Unknown               Unknown  Unknown
ntdll.dll          00007FFAD51026B1  Unknown               Unknown  Unknown

If I comment out the !$omp allocate(test)...

I get a result, but notice that LIBOMPTARGET_DEBUG=1 is not producing a trace. IOW, the dispatch to the target was not performed.

           1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10

What do you see on your side?

IOW, the question is not whether a summation was produced, but whether the summation occurred on the GPU.

 

Jim

0 Kudos
TobiasK
Moderator
2,348 Views

Which iGPU do you use?

On my Laptop everything works fine and I don't see any copying of data in the debug output of LIBOMPTARGET_DEBUG=4 and LIBOMPTARGET_INFO=4.

@Barbara_P_Intel will have another look at it.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,333 Views

Core i7-10710U has integrated graphics.

>>On my Laptop everything works fine and I don't see any copying of data in the debug output of LIBOMPTARGET_DEBUG=4 and LIBOMPTARGET_INFO=4.

This would indicate that there is no dialog with the GPU

 

On my system, at the first statement of the program (omp_get_num_devices()), the GPU reports:

Libomptarget (pid:19564) --> Init target library!
Libomptarget (pid:19564) --> Callback to __tgt_register_ptask_services with handlers 0x00007ffa6cad1300 0x00007ffa6cacff40
Libomptarget (pid:19564) --> Initialized OMPT
Libomptarget (pid:19564) --> Loading RTLs...
Libomptarget (pid:19564) --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL (pid:19564) --> Init Level0 plugin!
Libomptarget (pid:19564) --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL (pid:19564) --> Looking for Level0 devices...
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeInit ( ZE_INIT_FLAG_GPU_ONLY )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeInit (
Target LEVEL0 RTL (pid:19564) --> flags = 1
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGet ( &NumDrivers, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGet (
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff83c
Target LEVEL0 RTL (pid:19564) --> phDrivers = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGet ( &NumDrivers, &Driver )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGet (
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff83c
Target LEVEL0 RTL (pid:19564) --> phDrivers = 0x000002ba19503f18
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGet ( Driver, &NumFoundDevices, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGet (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff838
Target LEVEL0 RTL (pid:19564) --> phDevices = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGet ( Driver, &NumFoundDevices, RootDevices.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGet (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff838
Target LEVEL0 RTL (pid:19564) --> phDevices = 0x000002ba1b5fb610
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetSubDevices ( RootDevices[I], &NumSub, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetSubDevices (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff808
Target LEVEL0 RTL (pid:19564) --> phSubdevices = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff774
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, Properties.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff774
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x000002ba1b68bd90
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetProperties ( Device, &properties )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pDeviceProperties = 0x000000c403fff600
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetComputeProperties ( Device, &computeProperties )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetComputeProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pComputeProperties = 0x000000c403fff5a0
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetMemoryProperties ( Device, &Count, &MemoryProperties )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetMemoryProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff430
Target LEVEL0 RTL (pid:19564) --> pMemProperties = 0x000000c403fff478
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCacheProperties ( Device, &Count, &CacheProperties )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCacheProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff430
Target LEVEL0 RTL (pid:19564) --> pCacheProperties = 0x000000c403fff440
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, Properties.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x000002ba1b68c750
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, Properties.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x000002ba1b68c0c0
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDeviceGetCommandQueueGroupProperties ( Device, &Count, Properties.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDeviceGetCommandQueueGroupProperties (
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff3a4
Target LEVEL0 RTL (pid:19564) --> pCommandQueueGroupProperties = 0x000002ba1b68bf70
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> Found a GPU device, Name = Intel(R) Graphics [0x9bca]
Target LEVEL0 RTL (pid:19564) --> Found 1 root devices, 1 total devices.
Target LEVEL0 RTL (pid:19564) --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL (pid:19564) --> -- 0
Target LEVEL0 RTL (pid:19564) --> Root Device Information
Target LEVEL0 RTL (pid:19564) --> Device 0
Target LEVEL0 RTL (pid:19564) --> -- Name : Intel(R) Graphics [0x9bca]
Target LEVEL0 RTL (pid:19564) --> -- PCI ID : 0x9bca
Target LEVEL0 RTL (pid:19564) --> -- Number of total EUs : 24
Target LEVEL0 RTL (pid:19564) --> -- Number of threads per EU : 7
Target LEVEL0 RTL (pid:19564) --> -- EU SIMD width : 8
Target LEVEL0 RTL (pid:19564) --> -- Number of EUs per subslice : 8
Target LEVEL0 RTL (pid:19564) --> -- Number of subslices per slice: 3
Target LEVEL0 RTL (pid:19564) --> -- Number of slices : 1
Target LEVEL0 RTL (pid:19564) --> -- Local memory size (bytes) : 65536
Target LEVEL0 RTL (pid:19564) --> -- Global memory size (bytes) : 3353759744
Target LEVEL0 RTL (pid:19564) --> -- Cache size (bytes) : 524288
Target LEVEL0 RTL (pid:19564) --> -- Max clock frequency (MHz) : 1150
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeContextCreate ( Driver, &contextDesc, &context )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeContextCreate (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> desc = 0x000000c403fff780
Target LEVEL0 RTL (pid:19564) --> phContext = 0x000000c403fff778
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetApiVersion ( Driver, &DriverAPIVersion )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetApiVersion (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> version = 0x000002ba19503f28
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> Driver API version is 10003
Target LEVEL0 RTL (pid:19564) --> Interop property IDs, Names, Descriptions
Target LEVEL0 RTL (pid:19564) --> -- 0, device_num_eus, intptr_t, total number of EUs
Target LEVEL0 RTL (pid:19564) --> -- 1, device_num_threads_per_eu, intptr_t, number of threads per EU
Target LEVEL0 RTL (pid:19564) --> -- 2, device_eu_simd_width, intptr_t, physical EU simd width
Target LEVEL0 RTL (pid:19564) --> -- 3, device_num_eus_per_subslice, intptr_t, number of EUs per sub-slice
Target LEVEL0 RTL (pid:19564) --> -- 4, device_num_subslices_per_slice, intptr_t, number of sub-slices per slice
Target LEVEL0 RTL (pid:19564) --> -- 5, device_num_slices, intptr_t, number of slices
Target LEVEL0 RTL (pid:19564) --> -- 6, device_local_mem_size, intptr_t, local memory size in bytes
Target LEVEL0 RTL (pid:19564) --> -- 7, device_global_mem_size, intptr_t, global memory size in bytes
Target LEVEL0 RTL (pid:19564) --> -- 8, device_global_mem_cache_size, intptr_t, global memory cache size in bytes
Target LEVEL0 RTL (pid:19564) --> -- 9, device_max_clock_frequency, intptr_t, max clock frequency in MHz
Target LEVEL0 RTL (pid:19564) --> -- 10, is_imm_cmd_list, intptr_t, Using immediate command list
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionProperties ( Driver, &NumExtensions, nullptr )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionProperties (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff8a0
Target LEVEL0 RTL (pid:19564) --> pExtensionProperties = 0x0000000000000000
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionProperties ( Driver, &NumExtensions, Extensions.data() )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionProperties (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> pCount = 0x000000c403fff8a0
Target LEVEL0 RTL (pid:19564) --> pExtensionProperties = 0x000002ba1b64cb40
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> Found driver extensions:
Target LEVEL0 RTL (pid:19564) --> -- ZE_extension_float_atomics
Target LEVEL0 RTL (pid:19564) --> -- ZE_experimental_relaxed_allocation_limits
Target LEVEL0 RTL (pid:19564) --> -- ZE_experimental_module_program
Target LEVEL0 RTL (pid:19564) --> -- ZE_experimental_scheduling_hints
Target LEVEL0 RTL (pid:19564) --> -- ZE_experimental_global_offset
Target LEVEL0 RTL (pid:19564) --> -- ZE_extension_pci_properties
Target LEVEL0 RTL (pid:19564) --> -- ZE_extension_memory_compression_hints
Target LEVEL0 RTL (pid:19564) --> -- ZE_extension_memory_free_policies
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionFunctionAddress ( Driver, "zeGitsIndirectAllocationOffsets", &GitsIndirectAllocationOffsets )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionFunctionAddress (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> name = 0x00007ffa7da3475a
Target LEVEL0 RTL (pid:19564) --> ppFunctionAddress = 0x000002ba19504208
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionFunctionAddress ( Driver, "zexDriverImportExternalPointer", &RegisterHostPointer )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionFunctionAddress (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> name = 0x00007ffa7da347bd
Target LEVEL0 RTL (pid:19564) --> ppFunctionAddress = 0x000002ba19504210
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionFunctionAddress ( Driver, "zexDriverReleaseImportedPointer", &UnRegisterHostPointer )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionFunctionAddress (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> name = 0x00007ffa7da34822
Target LEVEL0 RTL (pid:19564) --> ppFunctionAddress = 0x000002ba19504218
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLER: zeDriverGetExtensionFunctionAddress ( Driver, "zexDriverGetHostPointerBaseAddress", &GetHostPointerBaseAddress )
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeDriverGetExtensionFunctionAddress (
Target LEVEL0 RTL (pid:19564) --> hDriver = 0x000002ba19512510
Target LEVEL0 RTL (pid:19564) --> name = 0x00007ffa7da3488f
Target LEVEL0 RTL (pid:19564) --> ppFunctionAddress = 0x000002ba19504220
Target LEVEL0 RTL (pid:19564) --> )
Target LEVEL0 RTL (pid:19564) --> Returning 1 top-level devices
Libomptarget (pid:19564) --> Registering RTL omptarget.rtl.level0.dll supporting 1 devices!
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_data_alloc_base
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_data_alloc_managed
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_data_realloc
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_data_aligned_alloc
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_register_host_pointer
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_unregister_host_pointer
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_context_handle
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_init_ompt
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_requires_mapping
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_push_subdevice
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_pop_subdevice
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_add_build_options
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_is_supported_device
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_create_interop
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_release_interop
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_use_interop
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_num_interop_properties
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_interop_property_value
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_interop_property_info
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_interop_rc_desc
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_num_sub_devices
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_is_accessible_addr_range
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_notify_indirect_access
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_is_private_arg_on_host
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_command_batch_begin
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_command_batch_end
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_kernel_batch_begin
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_kernel_batch_end
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_set_function_ptr_map
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_alloc_per_hw_thread_scratch
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_free_per_hw_thread_scratch
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_run_target_team_nd_region
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_device_info
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_data_aligned_alloc_shared
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_prefetch_shared_mem
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_get_device_from_ptr
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_flush_queue
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_sync_barrier
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_async_barrier
Libomptarget (pid:19564) --> Optional interface: __tgt_rtl_memcpy_rect_3d
Target LEVEL0 RTL --> Initialized OMPT
Libomptarget (pid:19564) --> Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget (pid:19564) --> Unable to load library 'omptarget.rtl.x86_64.dll': omptarget.rtl.x86_64.dll: Can't open: The specified module could not be found. (0x7E)!
Libomptarget (pid:19564) --> RTLs loaded!
Target LEVEL0 RTL (pid:19564) --> Target binary is a valid oneAPI OpenMP image.
Libomptarget (pid:19564) --> Image 0x00007ff71abce000 is compatible with RTL omptarget.rtl.level0.dll!
Libomptarget (pid:19564) --> RTL 0x000002ba194a54e0 has index 0!
Libomptarget (pid:19564) --> Registering image 0x00007ff71abce000 with RTL omptarget.rtl.level0.dll!
Libomptarget (pid:19564) --> Done registering entries!

 

After allocate:

Libomptarget (pid:19564) --> Call to omp_get_num_devices returning 1
1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead

 

end of program (after offload and print of sum)

 

An exceptionally long output from entering, running, and exiting the offload.

Of particular interest/concern are:

...
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeMemAllocDevice (
Target LEVEL0 RTL (pid:19564) --> hContext = 0x000002ba1b6637b0
Target LEVEL0 RTL (pid:19564) --> device_desc = 0x000000c403ffec90
Target LEVEL0 RTL (pid:19564) --> size = 3145728
Target LEVEL0 RTL (pid:19564) --> alignment = 0
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pptr = 0x000000c403ffeca8
Target LEVEL0 RTL (pid:19564) --> )

...
Libomptarget (pid:19564) --> Entry 2: Base=0x000000c403fff9b0, Begin=0x000002ba1d0d6050, Size=1600000, Type=0x2000000000213, Name=ARRAYSHARED_addr_a0

This is the shared memory address and size of the array
...
Another buffer of 8MB is allocated
Target LEVEL0 RTL (pid:19564) --> ZE_CALLEE: zeMemAllocShared (
Target LEVEL0 RTL (pid:19564) --> hContext = 0x000002ba1b6637b0
Target LEVEL0 RTL (pid:19564) --> device_desc = 0x000000c403ffe7c0
Target LEVEL0 RTL (pid:19564) --> host_desc = 0x000000c403ffe7a0
Target LEVEL0 RTL (pid:19564) --> size = 8388608
Target LEVEL0 RTL (pid:19564) --> alignment = 0
Target LEVEL0 RTL (pid:19564) --> hDevice = 0x000002ba1b61f978
Target LEVEL0 RTL (pid:19564) --> pptr = 0x000000c403ffe7d8
Target LEVEL0 RTL (pid:19564) --> )
...
Libomptarget (pid:19564) --> Moving 1600000 bytes (hst:0x000002ba1d0d6050) -> (tgt:0x000002ba1d730000)

Now it is copying the USM memory into the device.

IOW, this is as if the array were stored in host memory.

...
Libomptarget (pid:19564) --> Moving 1600000 bytes (hst:0x000002ba1d0d6050) -> (tgt:0x000002ba1d730000)

 

So @TobiasK's recommended code, while producing the expected sum, treats the USM allocation as if it were host memory.

 

Now to run my hack and follow the activity on the GPU (I changed the nRows of my code to 100000)

Libomptarget (pid:28636) --> Done registering entries!
Libomptarget (pid:28636) --> Call to llvm_omp_target_alloc_shared for device 0 requesting 1600000 bytes

 

Allocation of USM, good

 

No reference to "Moving 1600000", good

 

I can only conclude that the hack method is the only one that works.
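To make this check repeatable, the saved trace can be scanned for the "Moving ... bytes" lines instead of reading it by eye. A minimal sketch, assuming the LIBOMPTARGET_DEBUG output has been redirected into a text file; the file name offload_trace.txt and the program name are hypothetical:

```fortran
program count_moves
    implicit none
    ! Hypothetical file holding the redirected LIBOMPTARGET_DEBUG output
    character(len=*), parameter :: trace_file = 'offload_trace.txt'
    character(len=1024) :: line
    integer :: u, ios, n_moves

    n_moves = 0
    open(newunit=u, file=trace_file, status='old', action='read', iostat=ios)
    if (ios /= 0) then
        print '(A)', 'cannot open ' // trace_file
        stop 1
    end if

    do
        read(u, '(A)', iostat=ios) line
        if (ios /= 0) exit          ! end of file or read error
        ! Host<->device copies appear as "Libomptarget ... --> Moving N bytes ..."
        if (index(line, 'Libomptarget') > 0 .and. index(line, '--> Moving ') > 0) then
            n_moves = n_moves + 1
            print '(A)', trim(line)
        end if
    end do
    close(u)

    print '(A,I0)', 'Moving lines found: ', n_moves
end program count_moves
```

A count of zero for the array in question would match what the hack method shows above; any "Moving 1600000 bytes" line means the array is still being copied.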

 

I will get some timing to also confirm.

 

Back after lunch.

 

Jim

 

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,327 Views

Use if(.true.) for hack method, .false. for @TobiasK method.

This is indirect proof that the USM memory is being copied into (and out of) the offload region.

(I did not examine the diagnostic dump for copy out of the offload region)

!  TestGPU.f90 
    
!dir$ if(.false.)
program TestGPU
    use omp_lib
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    !$omp requires UNIFIED_SHARED_MEMORY
    ! Variables
    integer i,j
    integer, parameter :: nCols = 4
    integer :: nRows
    type(C_PTR) :: blob
    integer(C_INTPTR_T) :: x
    type boink
        real, pointer :: arrayShared(:,:)
    end type boink
    type(boink) :: theBoink
    
    real, pointer :: hack(:,:)
    real :: sum(ncols)
    real(8) :: t1, t2
    nRows = 100000
    blob = omp_aligned_alloc (64, nRows*sizeof(sum), omp_target_shared_mem_alloc)
    call C_F_POINTER(blob, hack, [nCols,nRows])
    theBoink%arrayShared => hack
    do j=1,size(theBoink%arrayShared, dim=2)
        do i=1,4
            theBoink%arrayShared(i,j) = i*j
        end do
    end do
    sum = 0.0
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    print *,sum
    t1 = omp_get_wtime()
    sum = 0.0
    !$omp target teams distribute parallel do reduction(+:sum) map(theBoink,sum)
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
    t2 = omp_get_wtime()
    print *, "first time", t2-t1
    t1 = omp_get_wtime()
    do i=1,10000
    sum = 0.0
    !$omp target teams distribute parallel do reduction(+:sum) map(theBoink,sum)
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    end do
    print *,sum
    t2 = omp_get_wtime()
    print *, "second time", t2-t1
end program TestGPU
!dir$ else
program TestGPU
    use omp_lib
    implicit none
    ! Variables
    integer i,j
    integer :: nRows
    real(8) :: t1, t2
    type d
        real, allocatable :: arrayShared(:,:)
    end type 
    type(d) :: test    
    !!$omp allocate(test) allocator(omp_target_shared_mem_alloc)
    real :: sum(4)
    ! Body of TestGPU
    print *,omp_get_num_devices()

    print *
    nRows = 100000
    !$omp allocators allocate(allocator(omp_target_shared_mem_alloc):test%arrayshared)
    allocate(test%arrayShared(4,nRows))
    do j=1,size(test%arrayShared, dim=2)
        do i=1,4
            test%arrayShared(i,j) = i*j
        end do
    end do
    
    t1 = omp_get_wtime()
    sum=0
    !$omp target teams distribute parallel do reduction(+:sum)
    do j=1,nRows
        sum = sum + test%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
    t2 = omp_get_wtime()
    print *, "first time", t2-t1
    t1 = omp_get_wtime()
    do i=1,10000
    sum=0
    !$omp target teams distribute parallel do reduction(+:sum)
    do j=1,nRows
        sum = sum + test%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    end do
    print *,sum
    t2 = omp_get_wtime()
    print *, "second time", t2-t1
end program TestGPU
!dir$ endif

Loop count around offload = 1000

Hack method:
  4.9999903E+09  9.9999805E+09  1.5000218E+10  1.9999961E+10
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 first time   1.08818829999655
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 second time  0.566158799978439
Other method:
           1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 first time   1.04598009999609
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 second time  0.735391399997752
Press any key to continue . . .

Loop count = 10000

Hack method:
  4.9999903E+09  9.9999805E+09  1.5000218E+10  1.9999961E+10
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 first time   1.06300289998762
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 second time   5.25791359998402
Press any key to continue . . .
Other method:
           1

forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 first time   1.02533200001926
  5.0000502E+09  1.0000100E+10  1.5000150E+10  2.0000201E+10
 second time   7.36363490001531
Press any key to continue . . .
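
For the 10000-iteration run, dividing the "second time" figures by the loop count gives roughly 0.53 ms per offload for the hack method versus 0.74 ms for the other method. A small sketch of that arithmetic, using only the timings printed above (the program name and variable names are illustrative):

```fortran
program per_offload_cost
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n_offloads = 10000
    ! "second time" values copied from the 10000-iteration output above
    real(dp), parameter :: t_hack  = 5.25791359998402_dp
    real(dp), parameter :: t_other = 7.36363490001531_dp
    real(dp) :: ms_hack, ms_other

    ! Convert total seconds to milliseconds per offload region
    ms_hack  = 1000.0_dp * t_hack  / n_offloads
    ms_other = 1000.0_dp * t_other / n_offloads

    print '(A,F8.4,A)', 'hack method : ', ms_hack,  ' ms per offload'
    print '(A,F8.4,A)', 'other method: ', ms_other, ' ms per offload'
    print '(A,F8.4,A)', 'difference  : ', ms_other - ms_hack, ' ms per offload'
    print '(A,F6.3)',   'ratio       : ', t_other / t_hack
end program per_offload_cost
```

The extra cost per offload is consistent with the 1600000-byte copy seen in the trace, although the timed region also includes launch overhead and the print of sum, so this is only an indirect measure.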

 

0 Kudos