Trying out offloading in Fortran.
module myDPCPPlib
    interface
        subroutine DPCPPinit() bind(C, name="DPCPPinit")
        end subroutine DPCPPinit
    end interface
end module myDPCPPlib
program TestGPU
    use myDPCPPlib
    use omp_lib
    implicit none
    ! Variables
    integer i,j
    integer :: nRows
    real, allocatable :: arrayShared(:,:)
    real :: sum(4)
    ! Body of TestGPU
    print *,omp_get_num_devices()
    call DPCPPinit()
    print *,omp_get_num_devices()
    print *
    nRows = 100000
    !$omp allocate allocator(omp_target_shared_mem_alloc)
    allocate(arrayShared(4,nRows))
    do j=1,size(arrayShared, dim=2)
        do i=1,4
            arrayShared(i,j) = i*j
        end do
    end do
    !$omp target teams distribute parallel do map(from:nRows, arrayShared) reduction(+:sum)
    do j=1,nRows
        sum = sum + arrayShared(:,j)
    end do
    !$omp end target teams distribute parallel do
    print *,sum
end program TestGPU
extern "C" {
    void DPCPPinit() {
        // The default device selector will select the most performant device.
        auto d_selector{ default_selector_v };
        try {
            queue q(d_selector, exception_handler);
            // Print out the device information used for the kernel code.
            std::cout << "Running on device: "
                      << q.get_device().get_info<info::device::name>() << "\n";
        }
        catch (exception const& e) {
            std::cout << "An exception is caught.\n";
            std::terminate();
        }
    }
}
(other code elided from library)
Output:
1
Running on device: Intel(R) Graphics [0x9bca]
1
forrtl: warning (786): Unable to allocate memory with KMPC, _mm_malloc used instead
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
Press any key to continue . . .
First test: line 18 of the program checks whether I can get the number of devices before any GPU interaction. The test succeeds: 1 device located.
The second test, calling into the library to obtain the device information, succeeds.
The third test retries getting the number of devices, and this too succeeds.
Now, to the crux of the matter. When I reach
!$omp allocate allocator(omp_target_shared_mem_alloc)
allocate(arrayShared(4,nRows))
I get the warning (786) message.
The program continues.
The code runs to completion, but the results are all 0.0's.
Additional information:
The build issues:
C:\Users\jim\source\repos\TestGPU\TestGPU\TestGPU.f90(36): warning #9127: The executable ALLOCATE directive associated with an ALLOCATE statement is deprecated.
I suspect that the warning indicates that the shared memory allocation failed and fell back to a local (host) memory allocation.
Note, with the integrated graphics, the Host and GPU memory are the same.
Any hints would be appreciated.
Jim Dempsey
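A minimal sketch of one thing to try, offered as an assumption and not as a confirmed diagnosis: separately from the allocation warning, two details of the posted loop could by themselves give wrong results. `map(from: ...)` only copies data back from the device after the region, so the device copy of the array never receives the host values, and `sum` is never given a starting value before the reduction. The sketch below uses an ordinary `allocate` with explicit mapping (it does not exercise `omp_target_shared_mem_alloc`), and the names `reduction_sketch`, `a`, and `colSum` are illustrative; `colSum` is used so the accumulator does not shadow the intrinsic `sum`.

```fortran
program reduction_sketch
   use omp_lib
   implicit none
   integer :: i, j
   integer, parameter :: nRows = 100000
   real, allocatable :: a(:,:)   ! illustrative stand-in for arrayShared
   real :: colSum(4)             ! illustrative stand-in for sum(4)

   ! Plain host allocation; the data reaches the device through map clauses.
   allocate(a(4, nRows))
   do j = 1, nRows
      do i = 1, 4
         a(i, j) = real(i) * real(j)
      end do
   end do

   ! The reduction combines into the original variable,
   ! so it needs a defined starting value.
   colSum = 0.0

   ! map(to:) copies the host values to the device before the loop runs.
   ! map(from:) would only copy back afterwards and leave the device input undefined.
   !$omp target teams distribute parallel do map(to: a) map(tofrom: colSum) reduction(+: colSum)
   do j = 1, nRows
      colSum = colSum + a(:, j)
   end do
   !$omp end target teams distribute parallel do

   print *, colSum
end program reduction_sketch
```

If this variant prints nonzero sums, the zeros in the original run may come from the mapping and the uninitialized accumulator rather than from the allocator fallback; if it still prints zeros, the allocation path remains the main suspect.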
I ran @TobiasK's version on my laptop successfully.
@jimdempseyatthecove, do you know about the env var LIBOMPTARGET_PLUGIN_PROFILE? I set it to T and got this profile info:
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Iris(R) Xe Graphics, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0 : __omp_offloading_80ff4b02_b27ec6_MAIN___l27
----------------------------------------------------------------------------------------------------------------------
                          :          Host Time (msec)                   Device Time (msec)
Name                      :    Total  Average      Min      Max    Total  Average      Min      Max    Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :   722.94   722.94   722.94   722.94     0.00     0.00     0.00     0.00     1.00
DataAlloc                 :     1.89     0.13     0.00     0.58     0.00     0.00     0.00     0.00    14.00
DataRead (Device to Host) :     0.05     0.05     0.05     0.05     0.00     0.00     0.00     0.00     1.00
DataWrite (Host to Device):     0.17     0.06     0.05     0.06     0.00     0.00     0.00     0.00     3.00
Kernel 0                  :     2.29     2.29     2.29     2.29     0.09     0.09     0.09     0.09     1.00
Linking                   :     0.06     0.06     0.06     0.06     0.00     0.00     0.00     0.00     1.00
OffloadEntriesInit        :     6.91     6.91     6.91     6.91     0.00     0.00     0.00     0.00     1.00
======================================================================================================================
I think we can infer that USM worked (fingers crossed). In my previous experience on Linux, when using MAP to transfer data in and out, DataRead/DataWrite showed device time. I don't see that here.
>> I think we can infer that USM worked (fingers crossed).
No, you'd have to compare the report from running @TobiasK's method with the report from running my method.
The MemMove is absent from the diagnostics of my run but present in the run of @TobiasK's code. Together with the runtimes for the second loop with offload,
5.25791359998402 seconds for my code
versus 7.36363490001531 seconds for @TobiasK's code,
this strongly indicates that even though @TobiasK's code allocated USM, the generated code treated the allocation as if it were located in host memory (as evidenced by the memmove of 1600000 bytes of memory).
BTW I am unable to get any additional output with LIBOMPTARGET_PLUGIN_PROFILE=T
Jim Dempsey