Intel® Fortran Compiler

Offload USM (Unified Shared Memory) Hack, Reproducer

jimdempseyatthecove
Honored Contributor III

I think I have a hack to use USM in Fortran.

There may be a better way to do this, but I am at a loss for finding an official way.

 

Some GPUs support USM. The objective is for the virtual address space of the CPU and the virtual address space of the GPU to map the same addresses, such that when the host dereferences a USM address and the GPU dereferences the same address, they access the same data, whether it resides in host RAM or GPU RAM. The drivers may migrate the data over the PCIe bus, or access the variable directly over the PCIe bus, should the data not reside in the accessor's local memory.
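To make that concrete, here is a minimal sketch of the plain USM pattern (not the hack yet): one buffer obtained from the shared allocator, written by the host and updated inside a target region. It assumes the same Intel allocator extension (omp_target_shared_mem_alloc) used in the reproducer below, plus omp_alloc/omp_free from omp_lib; treat it as an illustration rather than a guaranteed-portable recipe.

program UsmSketch
    use omp_lib
    use, intrinsic :: ISO_C_BINDING
    implicit none
    !$omp requires unified_shared_memory
    integer, parameter :: n = 1000
    type(C_PTR) :: p
    real, pointer :: buf(:)
    integer :: i

    ! One allocation from the (Intel extension) shared allocator; the same
    ! pointer is dereferenced on the host and inside the target region.
    p = omp_alloc(int(n, C_SIZE_T) * C_SIZEOF(0.0), omp_target_shared_mem_alloc)
    call C_F_POINTER(p, buf, [n])

    buf = 1.0                          ! host writes
    !$omp target teams distribute parallel do
    do i = 1, n
        buf(i) = buf(i) + 1.0          ! device updates the same storage
    end do
    !$omp end target teams distribute parallel do
    print *, buf(1), buf(n)            ! host sees the device's updates

    call omp_free(p, omp_target_shared_mem_alloc)
end program UsmSketch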

 

The major benefit of this is less code to change when porting an app to use a GPU and, more importantly, having the same set of source code (with !$omp directives) run without a GPU (or with one that does not support USM).

 

The problem to overcome is to construct a way such that an entire array can reside in USM .AND. not be transferred in whole as you enter and leave an offload region.

I have been unable to locate an OpenMP 5.0 way of doing this (OpenMP 4.0 seemed to have this ability, but that directive has been removed from 5.0).
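For reference, OpenMP 5.0 does define a requires unified_shared_memory directive (it also appears in the reproducer below). A sketch of what the hack-free version would presumably look like, if an implementation honored that requirement for ordinary allocatables, is:

program Usm50Sketch
    use omp_lib
    implicit none
    !$omp requires unified_shared_memory
    integer, parameter :: n = 500
    real, allocatable :: a(:)
    integer :: i

    allocate(a(n))
    a = 1.0
    ! No map clauses: under unified_shared_memory the device is expected to
    ! dereference the same storage the host allocated.
    !$omp target teams distribute parallel do
    do i = 1, n
        a(i) = a(i) * 2.0
    end do
    !$omp end target teams distribute parallel do
    print *, a(1), a(n)
end program Usm50Sketch

The hack below sidesteps the question of whether the compiler actually places ordinary allocatables in USM by allocating the storage explicitly from the shared allocator.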

 

Now the hack:

program TestGPU
    use myDPCPPlib
    use omp_lib
    USE, INTRINSIC :: ISO_C_BINDING
    implicit none
    !$omp requires UNIFIED_SHARED_MEMORY
    ! Variables
    integer i,j
    integer, parameter :: nCols = 4
    integer :: nRows
    type(C_PTR) :: blob
    integer(C_INTPTR_T) :: x
    type boink
        real, pointer :: arrayShared(:,:)
    end type boink
    type(boink) :: theBoink
    
    real, pointer :: hack(:,:)
    real :: sum(ncols)
    
    nRows = 500
    ! Allocate the whole array from the (Intel extension) device-shared
    ! allocator, then wrap the raw C pointer in a Fortran array pointer.
    blob = omp_aligned_alloc (64, nRows*sizeof(sum), omp_target_shared_mem_alloc)
    call C_F_POINTER(blob, hack, [nCols,nRows])
    theBoink%arrayShared => hack
    do j=1,size(theBoink%arrayShared, dim=2)
        do i=1,4
            theBoink%arrayShared(i,j) = i*j
        end do
    end do
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    print *,sum
    ! Sum again, this time on the device; theBoink carries the USM-backed
    ! pointer into the target region.
    !$omp target teams distribute parallel do map(theBoink,sum) reduction(+:sum)
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
end program TestGPU

Output:

   125250.0       250500.0       375750.0       501000.0
   250500.0       501000.0       751500.0       1002000.

The problem is that the reduction(+:sum) does not implicitly zero sum.
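(Each element of the second output line is exactly double the corresponding element of the first, e.g. 125250.0 + 125250.0 = 250500.0, which is consistent with the device reduction adding its result into the value sum already holds after the host loop.)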

 

I can manually zero sum outside the offload region, but I am worried that if the reduction isn't performing the zeroing, it may also not be protecting against a race condition (especially if writes are involved).

 

Can anyone shed light on this?

 

Jim Dempsey

 

 

 

TobiasK
Moderator

@jimdempseyatthecove the summation works as expected; however, it is unsafe that sum() is not initialized to 0 before your host summation (unless you do so via compiler options). OpenMP reductions do not zero the initial value of the reduction variable; at least that's my understanding of the standard.
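For concreteness, here are the relevant lines of the posted reproducer with the one change this implies: the reduction's private copies start from the identity, but the combined result is added back into whatever the original sum already holds, so zero it on the host first.

    sum = 0.0   ! reset the original reduction variable; the reduction only
                ! initializes its private copies (omp_priv), not this one
    !$omp target teams distribute parallel do map(theBoink,sum) reduction(+:sum)
    do j=1,nRows
        sum = sum + theBoink%arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum   ! now matches the host result instead of doubling it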

jimdempseyatthecove
Honored Contributor III

Oooo, I read the (+:var) initialization wrong (it is omp_priv = 0, not omp_in = 0).

Thanks for pointing this out.

 

Do you know of an easier way to have a module variable in USM (in particular, an array descriptor or pointer)?

Something similar to what is done with !$omp threadprivate, though with USM.

 

Jim Dempsey
