Intel® Fortran Compiler

Internal compiler error (C0000005) when compiling OpenMP program

lxander

I am getting the error "xfortcom: Fatal: There has been an internal compiler error (C0000005)" when compiling the following code with the IFX Intel® Fortran Compiler (Version 2023.0.0 Build 20221201).

  program GPUTests

  use omp_lib
  
  implicit none
  
  ! Declare variables
  integer, parameter :: n = 10000
  ! double precision, ALLOCATABLE :: a(:,:), b(:,:), c(:,:)
  integer :: i,j,k
  integer :: devices
  double precision, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)
  
  allocate(a_d(n,n))
  allocate(b_d(n,n))
  allocate(c_d(n,n))
  
  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)
  
  ! Set up OpenMP parallel region
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)

  ! Perform multiplication on GPU
  !$omp target map(to: a_d,b_d) map(from:c_d)
  !$omp  do
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$omp end  do
  !$omp end target

  ! End parallel region
  !$OMP END PARALLEL

  ! Print result
  print *, c_d(1,1)

  end program GPUTests

If I comment out "!$omp target map(to: a_d,b_d) map(from:c_d)" and "!$omp end target" at lines 28 and 38, the code compiles and runs (but, of course, the calculation is done on the CPU instead of the GPU). I'm wondering if I have my OpenMP target directives wrong. Or perhaps I have something wrong in my build options?

Build Log
   

Build started: Project: GPUTests, Configuration: Debug|x64

Output
   
Deleting intermediate files and output files for project 'GPUTests', configuration 'Debug|x64'.
Compiling with Intel® Fortran Compiler 2023.0.0 [Intel(R) 64]...
ifx /nologo /debug:full /Od /Qopenmp-targets:spir64 /Qopenmp /Qiopenmp /warn:interfaces /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc170.pdb" /libs:dll /threads /dbglibs /c /Qlocation,link,"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\HostX64\x64" /Qm64 "C:\My Files\Repos\Arc\GPUTests\GPUTests.f90"
xfortcom: Fatal: There has been an internal compiler error (C0000005).
compilation aborted for C:\My Files\Repos\Arc\GPUTests\GPUTests.f90 (code 1)


GPUTests - 1 error(s), 0 warning(s)
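
On the question of whether the target directives are wrong, two things stand out in the listing above. The "!$omp do" sits directly inside "!$omp target" with no teams or parallel construct inside the target region, and c_d is mapped with map(from:) although the loop reads c_d(i,j) before ever writing it. An internal compiler error is a compiler defect whatever the source looks like, so the following is only a minimal sketch of a more conventional way to write the same offload. It has not been verified to avoid the C0000005 error with ifx 2023.0.0. The program name, the kind parameter wp, the accumulator acc, the shorter array names, and the size n = 512 are assumptions made for illustration.

```fortran
program matmul_offload_sketch
  use omp_lib
  implicit none

  ! Assumption: single precision, because some GPUs lack native double support.
  ! Change to kind(1.0d0) if the target device supports double precision.
  integer, parameter :: wp = kind(1.0)
  integer, parameter :: n = 512          ! hypothetical size, kept small for a quick test
  integer :: i, j, k
  real(wp) :: acc
  real(wp), allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_seed()
  call random_number(a)
  call random_number(b)

  ! One combined construct: offload the region, create teams, and share the
  ! collapsed (j,i) iterations among the teams and their threads.
  ! There is no enclosing host PARALLEL region.
  !$omp target teams distribute parallel do collapse(2) private(k, acc) &
  !$omp&   map(to: a, b) map(from: c)
  do j = 1, n
    do i = 1, n
      acc = 0.0_wp
      do k = 1, n
        acc = acc + a(i,k) * b(k,j)
      end do
      c(i,j) = acc   ! every element is written, so map(from:) is sufficient
    end do
  end do
  !$omp end target teams distribute parallel do

  print *, c(1,1)
end program matmul_offload_sketch
```

Accumulating into a private scalar means the device never reads c, which is what makes map(from:) safe here. If the loop is written as c(i,j) = c(i,j) + ..., the array needs to be initialized on the host and mapped with map(tofrom:).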

 

14 Replies
lxander

I made a small change to the OpenMP directives around the loop, changing "target" to "target teams distribute", and changed the array type to REAL*4 (because with LIBOMPTARGET_DEBUG=1 there is a Level0 message "Double is not supported on this platform") ...

  program GPUTests

  use omp_lib

  implicit none

  ! Declare variables
  integer, parameter :: n = 1000

  integer :: devices

  integer :: i,j,k
  real*4, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)

  allocate(a_d(n,n))
  allocate(b_d(n,n))
  allocate(c_d(n,n))

  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
  !$OMP TARGET TEAMS DISTRIBUTE DEFAULT(SHARED) PRIVATE(I,J,K) MAP(TO:A_D,B_D) MAP(TOFROM:C_D)
  do i=1,n
    do j=1,n
      c_d(i,j) = 0.0
    end do
  end do

  !$OMP DO
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$OMP END DO
  
  !!$OMP END TARGET TEAMS DISTRIBUTE !<- commented out; with it in, I get error #7622: Misplaced part of OpenMP parallel directive
  
  ! Print result
  print *, c_d(1,1)

  end program GPUTests

The code compiles without any errors and runs. This is the output with LIBOMPTARGET_DEBUG=1 set.

Libomptarget --> Init target library!
Libomptarget --> Initialized OMPT
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --> Init Level0 plugin!
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Libomptarget --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Found copy command queue for device 0x000001a4946c13b8, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found copy command queue for device 0x000001a4946c13b8, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found copy command queue for device 0x000001a4946c13b8, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found 1 root devices, 3 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> -- 0.0.0
Target LEVEL0 RTL --> -- 0.0.1
Target LEVEL0 RTL --> Root Device Information
Target LEVEL0 RTL --> Device 0
Target LEVEL0 RTL --> -- Name : Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> -- PCI ID : 0x56a0
Target LEVEL0 RTL --> -- Number of total EUs : 512
Target LEVEL0 RTL --> -- Number of threads per EU : 8
Target LEVEL0 RTL --> -- EU SIMD width : 8
Target LEVEL0 RTL --> -- Number of EUs per subslice : 8
Target LEVEL0 RTL --> -- Number of subslices per slice: 8
Target LEVEL0 RTL --> -- Number of slices : 8
Target LEVEL0 RTL --> -- Local memory size (bytes) : 65536
Target LEVEL0 RTL --> -- Global memory size (bytes) : 13623099392
Target LEVEL0 RTL --> -- Cache size (bytes) : 4194304
Target LEVEL0 RTL --> -- Max clock frequency (MHz) : 2400
Target LEVEL0 RTL --> Driver API version is 10003
Target LEVEL0 RTL --> Interop property IDs, Names, Descriptions
Target LEVEL0 RTL --> -- 0, device_num_eus, intptr_t, total number of EUs
Target LEVEL0 RTL --> -- 1, device_num_threads_per_eu, intptr_t, number of threads per EU
Target LEVEL0 RTL --> -- 2, device_eu_simd_width, intptr_t, physical EU simd width
Target LEVEL0 RTL --> -- 3, device_num_eus_per_subslice, intptr_t, number of EUs per sub-slice
Target LEVEL0 RTL --> -- 4, device_num_subslices_per_slice, intptr_t, number of sub-slices per slice
Target LEVEL0 RTL --> -- 5, device_num_slices, intptr_t, number of slices
Target LEVEL0 RTL --> -- 6, device_local_mem_size, intptr_t, local memory size in bytes
Target LEVEL0 RTL --> -- 7, device_global_mem_size, intptr_t, global memory size in bytes
Target LEVEL0 RTL --> -- 8, device_global_mem_cache_size, intptr_t, global memory cache size in bytes
Target LEVEL0 RTL --> -- 9, device_max_clock_frequency, intptr_t, max clock frequency in MHz
Target LEVEL0 RTL --> Found driver extensions:
Target LEVEL0 RTL --> -- ZE_extension_float_atomics
Target LEVEL0 RTL --> -- ZE_experimental_relaxed_allocation_limits
Target LEVEL0 RTL --> -- ZE_experimental_module_program
Target LEVEL0 RTL --> -- ZE_experimental_scheduling_hints
Target LEVEL0 RTL --> -- ZE_experimental_global_offset
Target LEVEL0 RTL --> -- ZE_extension_pci_properties
Target LEVEL0 RTL --> -- ZE_extension_memory_compression_hints
Target LEVEL0 RTL --> -- ZE_extension_memory_free_policies
Target LEVEL0 RTL --> -- ZE_extension_device_memory_properties
Target LEVEL0 RTL --> -- ZE_extension_raytracing
Target LEVEL0 RTL --> -- ZE_experimental_power_saving_hint
Target LEVEL0 RTL --> -- ZE_extension_cache_reservation
Target LEVEL0 RTL --> Returning 1 top-level devices
Libomptarget --> Registering RTL omptarget.rtl.level0.dll supporting 1 devices!
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_base
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_managed
Libomptarget --> Optional interface: __tgt_rtl_data_realloc
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc
Libomptarget --> Optional interface: __tgt_rtl_register_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_unregister_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_get_context_handle
Libomptarget --> Optional interface: __tgt_rtl_init_ompt
Libomptarget --> Optional interface: __tgt_rtl_requires_mapping
Libomptarget --> Optional interface: __tgt_rtl_push_subdevice
Libomptarget --> Optional interface: __tgt_rtl_pop_subdevice
Libomptarget --> Optional interface: __tgt_rtl_add_build_options
Libomptarget --> Optional interface: __tgt_rtl_is_supported_device
Libomptarget --> Optional interface: __tgt_rtl_create_interop
Libomptarget --> Optional interface: __tgt_rtl_release_interop
Libomptarget --> Optional interface: __tgt_rtl_use_interop
Libomptarget --> Optional interface: __tgt_rtl_get_num_interop_properties
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_value
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_info
Libomptarget --> Optional interface: __tgt_rtl_get_interop_rc_desc
Libomptarget --> Optional interface: __tgt_rtl_get_num_sub_devices
Libomptarget --> Optional interface: __tgt_rtl_is_accessible_addr_range
Libomptarget --> Optional interface: __tgt_rtl_notify_indirect_access
Libomptarget --> Optional interface: __tgt_rtl_is_private_arg_on_host
Libomptarget --> Optional interface: __tgt_rtl_command_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_command_batch_end
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_end
Libomptarget --> Optional interface: __tgt_rtl_set_function_ptr_map
Libomptarget --> Optional interface: __tgt_rtl_alloc_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_free_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region
Libomptarget --> Optional interface: __tgt_rtl_get_device_info
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc_shared
Libomptarget --> Optional interface: __tgt_rtl_prefetch_shared_mem
Target LEVEL0 RTL --> Initialized OMPT
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget --> Unable to load library 'omptarget.rtl.x86_64.dll': T!
Libomptarget --> Unable to load library 'omptarget.rtl.x86_64.dll': omptarget.rtl.x86_64.dll: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': libomptarget.rtl.cuda.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': libomptarget.rtl.amdgpu.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: Can't open: The specified module could not be found. (0x7E)!
Libomptarget --> RTLs loaded!
Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image.
Libomptarget --> Image 0x00007ff734a02000 is compatible with RTL omptarget.rtl.level0.dll!
Libomptarget --> RTL 0x000001a492132870 has index 0!
Libomptarget --> Registering image 0x00007ff734a02000 with RTL omptarget.rtl.level0.dll!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Entering target region with entry point 0x00007ff7349f9441 and device Id 0
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target LEVEL0 RTL --> Initialize requires flags to 0
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200180000
Target LEVEL0 RTL --> Initialized device memory pool for device 0x000001a4946c13b8: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Allocated a shared memory 0x000001a499c00000
Target LEVEL0 RTL --> Initialized shared memory pool for device 0x000001a4946c13b8: AllocUnit = 262144, AllocMax = 8388608, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Initialized reduction scratch pool for device 0x000001a4946c13b8: AllocMin = 65536, AllocMax = 268435456, PoolSizeMax = 8589934592
Target LEVEL0 RTL --> Allocated a host memory 0x000001a497ae0000
Target LEVEL0 RTL --> Initialized host memory pool for device 0x000001a4946c13b8: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Created a command queue 0x000001a49472b1c8 (Ordinal: 0, Index: 0) for device 0.
Target LEVEL0 RTL --> Initialized Level0 device 0
Libomptarget --> Device 0 is ready to use.
Target LEVEL0 RTL --> Device 0: Loading binary from 0x00007ff734a02000
Target LEVEL0 RTL --> Expecting to have 1 entries defined
Target LEVEL0 RTL --> Base L0 module compilation options: -cl-std=CL2.0
Target LEVEL0 RTL --> Found a single section in the image
Target LEVEL0 RTL --> Created module from image #0.
Target LEVEL0 RTL --> Module link is not required
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table_size' of size 8 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 8 bytes).
Target LEVEL0 RTL --> Created a command list 0x000001a494949018 (Ordinal: 1) for device 0.
Target LEVEL0 RTL --> Created a command queue 0x000001a49a3126a8 (Ordinal: 1, Index: 0) for device 0.
Target LEVEL0 RTL --> Warning: number of entries in host and device offload tables mismatch (1 != 2).
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table' of size 80 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 80 bytes).
Target LEVEL0 RTL --> Device offload table loaded:
Target LEVEL0 RTL --> 0: _ZL7pone_ld_3d4ae508d8dbf78737978824de0e0216
Target LEVEL0 RTL --> 1: __omp_offloading_6c745552_3c6cc_MAIN___l26
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_6c745552_3c6cc_MAIN___l26_kernel_info' of unknown size on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 72 bytes).
Target LEVEL0 RTL --> Kernel 0: Entry = 0x00007ff7349f9441, Name = __omp_offloading_6c745552_3c6cc_MAIN___l26, NumArgs = 5, Handle = 0x000001a49a2de000
Target LEVEL0 RTL --> Looking up device global variable '__omp_spirv_program_data' of size 64 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 64 bytes).
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206270000
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206570000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206570000, size = 65536, pool size = 65536
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206580000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206580000, size = 65536, pool size = 131072
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206590000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206590000, size = 65536, pool size = 196608
Libomptarget --> Entry 0: Base=0x0000007be9f8f8c0, Begin=0x0000007be9f8f8c0, Size=96, Type=0x20, Name=GPUTESTS$C_D
Libomptarget --> Entry 1: Base=0x0000007be9f8f8c0, Begin=0x000001a49770b050, Size=4000000, Type=0x1000000000013, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Entry 2: Base=0x0000007be9f8f8c0, Begin=0x0000007be9f8f8c8, Size=88, Type=0x1000000000001, Name=GPUTESTS$C_D_dv_len
Libomptarget --> Entry 3: Base=0x0000007be9f8f800, Begin=0x0000007be9f8f800, Size=96, Type=0x20, Name=GPUTESTS$A_D
Libomptarget --> Entry 4: Base=0x0000007be9f8f800, Begin=0x000001a496f41050, Size=4000000, Type=0x4000000000011, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Entry 5: Base=0x0000007be9f8f800, Begin=0x0000007be9f8f808, Size=88, Type=0x4000000000001, Name=GPUTESTS$A_D_dv_len
Libomptarget --> Entry 6: Base=0x0000007be9f8f860, Begin=0x0000007be9f8f860, Size=96, Type=0x20, Name=GPUTESTS$B_D
Libomptarget --> Entry 7: Base=0x0000007be9f8f860, Begin=0x000001a497323050, Size=4000000, Type=0x7000000000011, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Entry 8: Base=0x0000007be9f8f860, Begin=0x0000007be9f8f868, Size=88, Type=0x7000000000001, Name=GPUTESTS$B_D_dv_len
Libomptarget --> Entry 9: Base=0x0000000000000000, Begin=0x0000000000000000, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 10: Base=0x00000000000003e7, Begin=0x00000000000003e7, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 11: Base=0x0000007be9f8f570, Begin=0x0000007be9f8f570, Size=56, Type=0x800, Name=unknown
Libomptarget --> loop trip count is 0.
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=96)...
Target LEVEL0 RTL --> Ptr 0x0000007be9f8f8c0 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb802065a0000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb802065a0000, size = 65536, pool size = 262144
Libomptarget --> Creating new map entry with HstPtrBegin=0x0000007be9f8f8c0, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$C_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0000 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f8c0, TgtPtrBegin=0xffffb802065a0000, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a49770b050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000001a49770b050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206600000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000001a49770b050, TgtPtrBegin=0xffffb80206600000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000001a49770b050) -> (tgt:0xffffb80206600000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000001a49770b050) -> (tgt:0xffffb80206600000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206600000 - is new
Libomptarget --> Update pointer (0xffffb802065a0000) -> [0xffffb80206600000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000001a49a8c7120) -> (tgt:0xffffb802065a0000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0000 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f8c8, TgtPtrBegin=0xffffb802065a0008, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$C_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x0000007be9f8f8c8) -> (tgt:0xffffb802065a0008)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x0000007be9f8f8c8) -> (tgt:0xffffb802065a0008)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0008 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=96)...
Target LEVEL0 RTL --> Ptr 0x0000007be9f8f800 requires mapping
Libomptarget --> Creating new map entry with HstPtrBegin=0x0000007be9f8f800, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$A_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0080 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f800, TgtPtrBegin=0xffffb802065a0080, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0080 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a496f41050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000001a496f41050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206a00000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000001a496f41050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000001a496f41050) -> (tgt:0xffffb80206a00000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000001a496f41050) -> (tgt:0xffffb80206a00000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206a00000 - is new
Libomptarget --> Update pointer (0xffffb802065a0080) -> [0xffffb80206a00000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000001a49a8c7128) -> (tgt:0xffffb802065a0080)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0080 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f808, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f808, TgtPtrBegin=0xffffb802065a0088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$A_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x0000007be9f8f808) -> (tgt:0xffffb802065a0088)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x0000007be9f8f808) -> (tgt:0xffffb802065a0088)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0088 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=96)...
Target LEVEL0 RTL --> Ptr 0x0000007be9f8f860 requires mapping
Libomptarget --> Creating new map entry with HstPtrBegin=0x0000007be9f8f860, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$B_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0100 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f860, TgtPtrBegin=0xffffb802065a0100, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0100 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a497323050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000001a497323050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206e00000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000001a497323050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000001a497323050) -> (tgt:0xffffb80206e00000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000001a497323050) -> (tgt:0xffffb80206e00000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206e00000 - is new
Libomptarget --> Update pointer (0xffffb802065a0100) -> [0xffffb80206e00000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000001a49a8c7040) -> (tgt:0xffffb802065a0100)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0100 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f868, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f868, TgtPtrBegin=0xffffb802065a0108, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$B_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x0000007be9f8f868) -> (tgt:0xffffb802065a0108)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x0000007be9f8f868) -> (tgt:0xffffb802065a0108)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0108 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f8c0, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0000, Offset: 0) from host pointer 0x0000007be9f8f8c0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f800, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0080, Offset: 0) from host pointer 0x0000007be9f8f800
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f860, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0100, Offset: 0) from host pointer 0x0000007be9f8f860
Libomptarget --> Forwarding first-private value 0x0000000000000000 to the target construct
Libomptarget --> Forwarding first-private value 0x00000000000003e7 to the target construct
Libomptarget --> Launching target execution __omp_offloading_6c745552_3c6cc_MAIN___l26 with pointer 0x000001a49a937df0 (index=0).
Target LEVEL0 RTL --> Executing a kernel 0x000001a49a937df0...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred team size is multiple of 32
Target LEVEL0 RTL --> Loop 0: lower bound = 0, upper bound = 0, Stride = 1
Target LEVEL0 RTL --> Loop 1: lower bound = 0, upper bound = 999, Stride = 1
Target LEVEL0 RTL --> Team sizes = {32, 1, 1}
Target LEVEL0 RTL --> Number of teams = {1, 1000, 1}
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0xffffb802065a0000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 1 (value: 0xffffb802065a0080) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 2 (value: 0xffffb802065a0100) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Scalar argument 3 (value: 0x0000000000000000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Scalar argument 4 (value: 0x00000000000003e7) was set successfully for device 0.
Target LEVEL0 RTL --> Setting indirect access flags 0x0000000000000002
Target LEVEL0 RTL --> Created a command list 0x000001a49a6ce038 (Ordinal: 0) for device 0.
Target LEVEL0 RTL --> Submitted kernel 0x000001a49a2de000 to device 0
Target LEVEL0 RTL --> Executed kernel entry 0x000001a49a937df0 on device 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f868, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f868, TgtPtrBegin=0xffffb802065a0108, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0108 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a497323050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000001a497323050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206e00000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f860, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0100 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f808, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f808, TgtPtrBegin=0xffffb802065a0088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0088 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a496f41050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000001a496f41050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206a00000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f800, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0080 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f8c8, TgtPtrBegin=0xffffb802065a0008, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0008 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a49770b050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000001a49770b050, TgtPtrBegin=0xffffb80206600000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206600000 - is last
Libomptarget --> Moving 4000000 bytes (tgt:0xffffb80206600000) -> (hst:0x000001a49770b050)
Target LEVEL0 RTL --> Copied 4000000 bytes (tgt:0xffffb80206600000) -> (hst:0x000001a49770b050)
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x0000007be9f8f8c0, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a497323050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206e00000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206e00000 (Base: 0xffffb80206e00000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000001a497323050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f860, Size=96)...
Libomptarget --> Removing shadow pointer 0x0000007be9f8f860
Libomptarget --> Deleting tgt data 0xffffb802065a0100 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x0000007be9f8f860, TgtPtrBegin=0xffffb802065a0100, Size=96, Name=GPUTESTS$B_D
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a496f41050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206a00000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206a00000 (Base: 0xffffb80206a00000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000001a496f41050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f800, Size=96)...
Libomptarget --> Removing shadow pointer 0x0000007be9f8f800
Libomptarget --> Deleting tgt data 0xffffb802065a0080 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x0000007be9f8f800, TgtPtrBegin=0xffffb802065a0080, Size=96, Name=GPUTESTS$A_D
Libomptarget --> Looking up mapping(HstPtrBegin=0x000001a49770b050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206600000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206600000 (Base: 0xffffb80206600000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000001a49770b050, TgtPtrBegin=0xffffb80206600000, Size=4000000, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x0000007be9f8f8c0, Size=96)...
Libomptarget --> Removing shadow pointer 0x0000007be9f8f8c0
Libomptarget --> Deleting tgt data 0xffffb802065a0000 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x0000007be9f8f8c0, TgtPtrBegin=0xffffb802065a0000, Size=96, Name=GPUTESTS$C_D

I'm still not seeing any load on the GPU. Does any of the above output confirm (or otherwise) that the calculation is being done on the GPU? Maybe my OpenMP directives are still not correct?

 

lxander
Novice
1,360 Views

I've made some small modifications to the OpenMP directives and this compiles and executes ...

  program GPUTests

  use omp_lib

  implicit none

  ! Declare variables
  integer, parameter :: n = 1000

  integer :: devices

  integer :: i,j,k
  real*4, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)

  allocate(a_d(n,n))
  allocate(b_d(n,n))
  allocate(c_d(n,n))

  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
  !$OMP TARGET TEAMS DISTRIBUTE DEFAULT(SHARED) PRIVATE(I,J,K) MAP(TO:A_D,B_D) MAP(TOFROM:C_D)
  do i=1,n
    do j=1,n
      c_d(i,j) = 0.0
    end do
  end do

  !$OMP DO
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$OMP END DO
  
  !!$OMP END TARGET TEAMS DISTRIBUTE !<- commented out because with it in I get error #7622: Misplaced part of OpenMP parallel directive
  
  ! Print result
  print *, c_d(1,1)

  end program GPUTests

This is the output ...

Libomptarget --> Init target library!
Libomptarget --> Initialized OMPT
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --> Init Level0 plugin!
Target LEVEL0 RTL --> omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --> omp_get_max_teams() returned 0
Libomptarget --> Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --> Looking for Level0 devices...
Target LEVEL0 RTL --> Found copy command queue for device 0x000002b90efd4628, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found copy command queue for device 0x000002b90efd4628, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found copy command queue for device 0x000002b90efd4628, ordinal = 1
Target LEVEL0 RTL --> Found a GPU device, Name = Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> Found 1 root devices, 3 total devices.
Target LEVEL0 RTL --> List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --> -- 0
Target LEVEL0 RTL --> -- 0.0.0
Target LEVEL0 RTL --> -- 0.0.1
Target LEVEL0 RTL --> Root Device Information
Target LEVEL0 RTL --> Device 0
Target LEVEL0 RTL --> -- Name                         : Intel(R) Arc(TM) A770 Graphics
Target LEVEL0 RTL --> -- PCI ID                       : 0x56a0
Target LEVEL0 RTL --> -- Number of total EUs          : 512
Target LEVEL0 RTL --> -- Number of threads per EU     : 8
Target LEVEL0 RTL --> -- EU SIMD width                : 8
Target LEVEL0 RTL --> -- Number of EUs per subslice   : 8
Target LEVEL0 RTL --> -- Number of subslices per slice: 8
Target LEVEL0 RTL --> -- Number of slices             : 8
Target LEVEL0 RTL --> -- Local memory size (bytes)    : 65536
Target LEVEL0 RTL --> -- Global memory size (bytes)   : 13623099392
Target LEVEL0 RTL --> -- Cache size (bytes)           : 4194304
Target LEVEL0 RTL --> -- Max clock frequency (MHz)    : 2400
Target LEVEL0 RTL --> Driver API version is 10003
Target LEVEL0 RTL --> Interop property IDs, Names, Descriptions
Target LEVEL0 RTL --> -- 0, device_num_eus, intptr_t, total number of EUs
Target LEVEL0 RTL --> -- 1, device_num_threads_per_eu, intptr_t, number of threads per EU
Target LEVEL0 RTL --> -- 2, device_eu_simd_width, intptr_t, physical EU simd width
Target LEVEL0 RTL --> -- 3, device_num_eus_per_subslice, intptr_t, number of EUs per sub-slice
Target LEVEL0 RTL --> -- 4, device_num_subslices_per_slice, intptr_t, number of sub-slices per slice
Target LEVEL0 RTL --> -- 5, device_num_slices, intptr_t, number of slices
Target LEVEL0 RTL --> -- 6, device_local_mem_size, intptr_t, local memory size in bytes
Target LEVEL0 RTL --> -- 7, device_global_mem_size, intptr_t, global memory size in bytes
Target LEVEL0 RTL --> -- 8, device_global_mem_cache_size, intptr_t, global memory cache size in bytes
Target LEVEL0 RTL --> -- 9, device_max_clock_frequency, intptr_t, max clock frequency in MHz
Target LEVEL0 RTL --> Found driver extensions:
Target LEVEL0 RTL --> -- ZE_extension_float_atomics
Target LEVEL0 RTL --> -- ZE_experimental_relaxed_allocation_limits
Target LEVEL0 RTL --> -- ZE_experimental_module_program
Target LEVEL0 RTL --> -- ZE_experimental_scheduling_hints
Target LEVEL0 RTL --> -- ZE_experimental_global_offset
Target LEVEL0 RTL --> -- ZE_extension_pci_properties
Target LEVEL0 RTL --> -- ZE_extension_memory_compression_hints
Target LEVEL0 RTL --> -- ZE_extension_memory_free_policies
Target LEVEL0 RTL --> -- ZE_extension_device_memory_properties
Target LEVEL0 RTL --> -- ZE_extension_raytracing
Target LEVEL0 RTL --> -- ZE_experimental_power_saving_hint
Target LEVEL0 RTL --> -- ZE_extension_cache_reservation
Target LEVEL0 RTL --> Returning 1 top-level devices
Libomptarget --> Registering RTL omptarget.rtl.level0.dll supporting 1 devices!
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_base
Libomptarget --> Optional interface: __tgt_rtl_data_alloc_managed
Libomptarget --> Optional interface: __tgt_rtl_data_realloc
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc
Libomptarget --> Optional interface: __tgt_rtl_register_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_unregister_host_pointer
Libomptarget --> Optional interface: __tgt_rtl_get_context_handle
Libomptarget --> Optional interface: __tgt_rtl_init_ompt
Libomptarget --> Optional interface: __tgt_rtl_requires_mapping
Libomptarget --> Optional interface: __tgt_rtl_push_subdevice
Libomptarget --> Optional interface: __tgt_rtl_pop_subdevice
Libomptarget --> Optional interface: __tgt_rtl_add_build_options
Libomptarget --> Optional interface: __tgt_rtl_is_supported_device
Libomptarget --> Optional interface: __tgt_rtl_create_interop
Libomptarget --> Optional interface: __tgt_rtl_release_interop
Libomptarget --> Optional interface: __tgt_rtl_use_interop
Libomptarget --> Optional interface: __tgt_rtl_get_num_interop_properties
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_value
Libomptarget --> Optional interface: __tgt_rtl_get_interop_property_info
Libomptarget --> Optional interface: __tgt_rtl_get_interop_rc_desc
Libomptarget --> Optional interface: __tgt_rtl_get_num_sub_devices
Libomptarget --> Optional interface: __tgt_rtl_is_accessible_addr_range
Libomptarget --> Optional interface: __tgt_rtl_notify_indirect_access
Libomptarget --> Optional interface: __tgt_rtl_is_private_arg_on_host
Libomptarget --> Optional interface: __tgt_rtl_command_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_command_batch_end
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_begin
Libomptarget --> Optional interface: __tgt_rtl_kernel_batch_end
Libomptarget --> Optional interface: __tgt_rtl_set_function_ptr_map
Libomptarget --> Optional interface: __tgt_rtl_alloc_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_free_per_hw_thread_scratch
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region
Libomptarget --> Optional interface: __tgt_rtl_get_device_info
Libomptarget --> Optional interface: __tgt_rtl_data_aligned_alloc_shared
Libomptarget --> Optional interface: __tgt_rtl_prefetch_shared_mem
Target LEVEL0 RTL --> Initialized OMPT
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget --> Unable to load library 'omptarget.rtl.x86_64.dll': T!
Libomptarget --> Unable to load library 'omptarget.rtl.x86_64.dll': omptarget.rtl.x86_64.dll: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': libomptarget.rtl.cuda.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.amdgpu.so': libomptarget.rtl.amdgpu.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': T!
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> RTLs loaded!
Target LEVEL0 RTL --> Target binary is a valid oneAPI OpenMP image.
Libomptarget --> Image 0x00007ff6e6d32000 is compatible with RTL omptarget.rtl.level0.dll!
Libomptarget --> RTL 0x000002b90cc07ad0 has index 0!
Libomptarget --> Registering image 0x00007ff6e6d32000 with RTL omptarget.rtl.level0.dll!
Libomptarget --> Done registering entries!
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Entering target region with entry point 0x00007ff6e6d29441 and device Id 0
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_num_devices returning 1
Libomptarget --> Call to omp_get_initial_device returning 1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target LEVEL0 RTL --> Initialize requires flags to 0
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80200180000
Target LEVEL0 RTL --> Initialized device memory pool for device 0x000002b90efd4628: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Allocated a shared memory 0x000002b914600000
Target LEVEL0 RTL --> Initialized shared memory pool for device 0x000002b90efd4628: AllocUnit = 262144, AllocMax = 8388608, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Initialized reduction scratch pool for device 0x000002b90efd4628: AllocMin = 65536, AllocMax = 268435456, PoolSizeMax = 8589934592
Target LEVEL0 RTL --> Allocated a host memory 0x000002b912420000
Target LEVEL0 RTL --> Initialized host memory pool for device 0x000002b90efd4628: AllocUnit = 65536, AllocMax = 1048576, Capacity = 4, PoolSizeMax = 268435456
Target LEVEL0 RTL --> Created a command queue 0x000002b90cc1bab8 (Ordinal: 0, Index: 0) for device 0.
Target LEVEL0 RTL --> Initialized Level0 device 0
Libomptarget --> Device 0 is ready to use.
Target LEVEL0 RTL --> Device 0: Loading binary from 0x00007ff6e6d32000
Target LEVEL0 RTL --> Expecting to have 1 entries defined
Target LEVEL0 RTL --> Base L0 module compilation options: -cl-std=CL2.0
Target LEVEL0 RTL --> Found a single section in the image
Target LEVEL0 RTL --> Created module from image #0.
Target LEVEL0 RTL --> Module link is not required
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table_size' of size 8 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 8 bytes).
Target LEVEL0 RTL --> Created a command list 0x000002b9152d1a48 (Ordinal: 1) for device 0.
Target LEVEL0 RTL --> Created a command queue 0x000002b90cc1b638 (Ordinal: 1, Index: 0) for device 0.
Target LEVEL0 RTL --> Warning: number of entries in host and device offload tables mismatch (1 != 2).
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_entries_table' of size 80 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 80 bytes).
Target LEVEL0 RTL --> Device offload table loaded:
Target LEVEL0 RTL -->   0:      _ZL7pone_ld_3d4ae508d8dbf78737978824de0e0216
Target LEVEL0 RTL -->   1:      __omp_offloading_6c745552_3c6cc_MAIN___l26
Target LEVEL0 RTL --> Looking up device global variable '__omp_offloading_6c745552_3c6cc_MAIN___l26_kernel_info' of unknown size on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 72 bytes).
Target LEVEL0 RTL --> Kernel 0: Entry = 0x00007ff6e6d29441, Name = __omp_offloading_6c745552_3c6cc_MAIN___l26, NumArgs = 5, Handle = 0x000002b90f091bf0
Target LEVEL0 RTL --> Looking up device global variable '__omp_spirv_program_data' of size 64 bytes on device 0.
Target LEVEL0 RTL --> Global variable lookup succeeded (size: 64 bytes).
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206270000
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206570000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206570000, size = 65536, pool size = 65536
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206580000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206580000, size = 65536, pool size = 131072
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206590000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb80206590000, size = 65536, pool size = 196608
Libomptarget --> Entry  0: Base=0x000000c083f6f430, Begin=0x000000c083f6f430, Size=96, Type=0x20, Name=GPUTESTS$C_D
Libomptarget --> Entry  1: Base=0x000000c083f6f430, Begin=0x000002b912047050, Size=4000000, Type=0x1000000000013, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Entry  2: Base=0x000000c083f6f430, Begin=0x000000c083f6f438, Size=88, Type=0x1000000000001, Name=GPUTESTS$C_D_dv_len
Libomptarget --> Entry  3: Base=0x000000c083f6f370, Begin=0x000000c083f6f370, Size=96, Type=0x20, Name=GPUTESTS$A_D
Libomptarget --> Entry  4: Base=0x000000c083f6f370, Begin=0x000002b911888050, Size=4000000, Type=0x4000000000011, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Entry  5: Base=0x000000c083f6f370, Begin=0x000000c083f6f378, Size=88, Type=0x4000000000001, Name=GPUTESTS$A_D_dv_len
Libomptarget --> Entry  6: Base=0x000000c083f6f3d0, Begin=0x000000c083f6f3d0, Size=96, Type=0x20, Name=GPUTESTS$B_D
Libomptarget --> Entry  7: Base=0x000000c083f6f3d0, Begin=0x000002b911c6d050, Size=4000000, Type=0x7000000000011, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Entry  8: Base=0x000000c083f6f3d0, Begin=0x000000c083f6f3d8, Size=88, Type=0x7000000000001, Name=GPUTESTS$B_D_dv_len
Libomptarget --> Entry  9: Base=0x0000000000000000, Begin=0x0000000000000000, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 10: Base=0x00000000000003e7, Begin=0x00000000000003e7, Size=0, Type=0x120, Name=unknown
Libomptarget --> Entry 11: Base=0x000000c083f6f0e0, Begin=0x000000c083f6f0e0, Size=56, Type=0x800, Name=unknown
Libomptarget --> loop trip count is 0.
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=96)...
Target LEVEL0 RTL --> Ptr 0x000000c083f6f430 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb802065a0000
Target LEVEL0 RTL --> New block allocation for device memory pool: base = 0xffffb802065a0000, size = 65536, pool size = 262144
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c083f6f430, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$C_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0000 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f430, TgtPtrBegin=0xffffb802065a0000, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0000 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b912047050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000002b912047050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206600000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000002b912047050, TgtPtrBegin=0xffffb80206600000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000002b912047050) -> (tgt:0xffffb80206600000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000002b912047050) -> (tgt:0xffffb80206600000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206600000 - is new
Libomptarget --> Update pointer (0xffffb802065a0000) -> [0xffffb80206600000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000002b91538b3b0) -> (tgt:0xffffb802065a0000)
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0000 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f438, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f438, TgtPtrBegin=0xffffb802065a0008, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$C_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x000000c083f6f438) -> (tgt:0xffffb802065a0008)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x000000c083f6f438) -> (tgt:0xffffb802065a0008)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0008 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=96)...
Target LEVEL0 RTL --> Ptr 0x000000c083f6f370 requires mapping
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c083f6f370, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$A_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0080 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f370, TgtPtrBegin=0xffffb802065a0080, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0080 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911888050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000002b911888050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206a00000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000002b911888050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000002b911888050) -> (tgt:0xffffb80206a00000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000002b911888050) -> (tgt:0xffffb80206a00000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206a00000 - is new
Libomptarget --> Update pointer (0xffffb802065a0080) -> [0xffffb80206a00000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000002b91538b3b8) -> (tgt:0xffffb802065a0080)
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0080 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f378, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f378, TgtPtrBegin=0xffffb802065a0088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$A_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x000000c083f6f378) -> (tgt:0xffffb802065a0088)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x000000c083f6f378) -> (tgt:0xffffb802065a0088)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0088 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=96)...
Target LEVEL0 RTL --> Ptr 0x000000c083f6f3d0 requires mapping
Libomptarget --> Creating new map entry with HstPtrBegin=0x000000c083f6f3d0, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$B_D
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0100 - is new
Libomptarget --> Has a pointer entry:
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=8)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f3d0, TgtPtrBegin=0xffffb802065a0100, Size=8, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=unknown
Libomptarget --> There are 8 bytes allocated at target address 0xffffb802065a0100 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911c6d050, Size=4000000)...
Target LEVEL0 RTL --> Ptr 0x000002b911c6d050 requires mapping
Target LEVEL0 RTL --> Allocated a device memory 0xffffb80206e00000
Libomptarget --> Creating new map entry with HstPtrBegin=0x000002b911c6d050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, DynRefCount=1, HoldRefCount=0, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Moving 4000000 bytes (hst:0x000002b911c6d050) -> (tgt:0xffffb80206e00000)
Target LEVEL0 RTL --> Copied 4000000 bytes (hst:0x000002b911c6d050) -> (tgt:0xffffb80206e00000)
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206e00000 - is new
Libomptarget --> Update pointer (0xffffb802065a0100) -> [0xffffb80206e00000]
Target LEVEL0 RTL --> Copied 8 bytes (hst:0x000002b91538b210) -> (tgt:0xffffb802065a0100)
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=8)...
Target LEVEL0 RTL --> Notifying indirect access: 0xffffb802065a0100 + 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f3d8, TgtPtrBegin=0xffffb802065a0108, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0, Name=GPUTESTS$B_D_dv_len
Libomptarget --> Moving 88 bytes (hst:0x000000c083f6f3d8) -> (tgt:0xffffb802065a0108)
Target LEVEL0 RTL --> Copied 88 bytes (hst:0x000000c083f6f3d8) -> (tgt:0xffffb802065a0108)
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0108 - is new
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f430, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0000, Offset: 0) from host pointer 0x000000c083f6f430
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f370, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0080, Offset: 0) from host pointer 0x000000c083f6f370
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f3d0, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> Obtained target argument (Begin: 0xffffb802065a0100, Offset: 0) from host pointer 0x000000c083f6f3d0
Libomptarget --> Forwarding first-private value 0x0000000000000000 to the target construct
Libomptarget --> Forwarding first-private value 0x00000000000003e7 to the target construct
Libomptarget --> Launching target execution __omp_offloading_6c745552_3c6cc_MAIN___l26 with pointer 0x000002b915269bc0 (index=0).
Target LEVEL0 RTL --> Executing a kernel 0x000002b915269bc0...
Target LEVEL0 RTL --> Assumed kernel SIMD width is 16
Target LEVEL0 RTL --> Preferred team size is multiple of 32
Target LEVEL0 RTL --> Loop 0: lower bound = 0, upper bound = 0, Stride = 1
Target LEVEL0 RTL --> Loop 1: lower bound = 0, upper bound = 999, Stride = 1
Target LEVEL0 RTL --> Team sizes = {32, 1, 1}
Target LEVEL0 RTL --> Number of teams = {1, 1000, 1}
Target LEVEL0 RTL --> Kernel Pointer argument 0 (value: 0xffffb802065a0000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 1 (value: 0xffffb802065a0080) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Pointer argument 2 (value: 0xffffb802065a0100) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Scalar argument 3 (value: 0x0000000000000000) was set successfully for device 0.
Target LEVEL0 RTL --> Kernel Scalar argument 4 (value: 0x00000000000003e7) was set successfully for device 0.
Target LEVEL0 RTL --> Setting indirect access flags 0x0000000000000002
Target LEVEL0 RTL --> Created a command list 0x000002b90f03fbf8 (Ordinal: 0) for device 0.
Target LEVEL0 RTL --> Submitted kernel 0x000002b90f091bf0 to device 0
Target LEVEL0 RTL --> Executed kernel entry 0x000002b915269bc0 on device 0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d8, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f3d8, TgtPtrBegin=0xffffb802065a0108, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0108 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911c6d050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000002b911c6d050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206e00000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f3d0, TgtPtrBegin=0xffffb802065a0100, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0100 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f378, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f378, TgtPtrBegin=0xffffb802065a0088, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0088 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911888050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000002b911888050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206a00000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f370, TgtPtrBegin=0xffffb802065a0080, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0080 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f438, Size=88)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f438, TgtPtrBegin=0xffffb802065a0008, Size=88, DynRefCount=1 (update suppressed), HoldRefCount=0
Libomptarget --> There are 88 bytes allocated at target address 0xffffb802065a0008 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b912047050, Size=4000000)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000002b912047050, TgtPtrBegin=0xffffb80206600000, Size=4000000, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 4000000 bytes allocated at target address 0xffffb80206600000 - is last
Libomptarget --> Moving 4000000 bytes (tgt:0xffffb80206600000) -> (hst:0x000002b912047050)
Target LEVEL0 RTL --> Copied 4000000 bytes (tgt:0xffffb80206600000) -> (hst:0x000002b912047050)
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=96)...
Libomptarget --> Mapping exists with HstPtrBegin=0x000000c083f6f430, TgtPtrBegin=0xffffb802065a0000, Size=96, DynRefCount=0 (decremented, delayed deletion), HoldRefCount=0
Libomptarget --> There are 96 bytes allocated at target address 0xffffb802065a0000 - is last
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911c6d050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206e00000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206e00000 (Base: 0xffffb80206e00000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000002b911c6d050, TgtPtrBegin=0xffffb80206e00000, Size=4000000, Name=GPUTESTS$B_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f3d0, Size=96)...
Libomptarget --> Removing shadow pointer 0x000000c083f6f3d0
Libomptarget --> Deleting tgt data 0xffffb802065a0100 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c083f6f3d0, TgtPtrBegin=0xffffb802065a0100, Size=96, Name=GPUTESTS$B_D
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b911888050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206a00000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206a00000 (Base: 0xffffb80206a00000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000002b911888050, TgtPtrBegin=0xffffb80206a00000, Size=4000000, Name=GPUTESTS$A_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f370, Size=96)...
Libomptarget --> Removing shadow pointer 0x000000c083f6f370
Libomptarget --> Deleting tgt data 0xffffb802065a0080 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c083f6f370, TgtPtrBegin=0xffffb802065a0080, Size=96, Name=GPUTESTS$A_D
Libomptarget --> Looking up mapping(HstPtrBegin=0x000002b912047050, Size=4000000)...
Libomptarget --> Deleting tgt data 0xffffb80206600000 of size 4000000
Target LEVEL0 RTL --> Deleted device memory 0xffffb80206600000 (Base: 0xffffb80206600000, Size: 4000000)
Libomptarget --> Removing map entry with HstPtrBegin=0x000002b912047050, TgtPtrBegin=0xffffb80206600000, Size=4000000, Name=GPUTESTS$C_D_addr_a0
Libomptarget --> Looking up mapping(HstPtrBegin=0x000000c083f6f430, Size=96)...
Libomptarget --> Removing shadow pointer 0x000000c083f6f430
Libomptarget --> Deleting tgt data 0xffffb802065a0000 of size 96
Libomptarget --> Removing map entry with HstPtrBegin=0x000000c083f6f430, TgtPtrBegin=0xffffb802065a0000, Size=96, Name=GPUTESTS$C_D

I still think something is not right, since there doesn't seem to be any load on the GPU.

jimdempseyatthecove
Black Belt
1,346 Views

...

 

...

Libomptarget --> 
  Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> 
  Unable to load library 'libomptarget.rtl.ppc64.so': T!
Libomptarget --> 
  Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: 
  Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> 
  Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget --> 
  Unable to load library 'omptarget.rtl.x86_64.dll': T!
Libomptarget --> 
  Unable to load library 'omptarget.rtl.x86_64.dll': 
    omptarget.rtl.x86_64.dll: 
      Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> 
  Loading library 'libomptarget.rtl.cuda.so'...
Libomptarget --> 
  Unable to load library 'libomptarget.rtl.cuda.so': T!
Libomptarget --> 
  Unable to load library 'libomptarget.rtl.cuda.so': 
    libomptarget.rtl.cuda.so: 
      Can't open: The specified module could not be found.  (0x7E)!
...
Libomptarget --> RTLs loaded!

 

The first part indicates one of the following:

a) the path to the libraries is not set up properly (environment variable LD_LIBRARY_PATH=...)

b) the path is correct but those files are not there

The last part (RTLs loaded!) seems to contradict the "Unable to load..." messages.

However,

Target LEVEL0 RTL --> Submitted kernel 0x000002b90f091bf0 to device 0
Target LEVEL0 RTL --> Executed kernel entry 0x000002b915269bc0 on device 0
...

This implies that execution occurred (the reported missing libraries were not required).

Jim Dempsey

 

 

jimdempseyatthecove
Black Belt
1,441 Views

The !$OMP directives are a bit picky. The following appears to work on my system:

!  GPUTests.f90 
  program GPUTests

  use omp_lib
  
  implicit none
  
  ! Declare variables
!  integer, parameter :: n = 10000
  integer, parameter :: n = 100
  ! double precision, ALLOCATABLE :: a(:,:), b(:,:), c(:,:)
  integer :: i,j,k
  integer :: devices
  double precision, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)
  
  allocate(a_d(n,n))
  allocate(b_d(n,n))
  allocate(c_d(n,n))
  
  devices = omp_get_num_devices()
  print *,"devices =", devices
  
  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
!$omp target map(to: a_d,b_d) map(from:c_d) 
!$omp parallel do collapse(3) schedule (static, 1) private(i, j, k)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
!$omp end target
  ! Print result
  print *, c_d(1,1)

  end program GPUTests
 devices =           1
  8.813144990012810E-002

I reduced the array size because, without further exploration, I do not know how much memory is available to my GPU.

I will try the 10,000...

... still running. I am on an Intel NUC with a CPU-integrated GPU that does not support double precision (DP) in hardware (IOW, DP is implemented in software).

Jim Dempsey

jimdempseyatthecove
Black Belt
1,438 Views

Using single precision, n=3000, takes ~33 seconds.

 

Barbara_P_Intel
Moderator
1,367 Views

@lxander, thank you for this nice reproducer. I filed a bug, CMPLRLLVM-46214.

@jimdempseyatthecove, thank you for the working matmul! That's the way I would code it. These clauses are optional: collapse(3) schedule (static, 1) private(i, j, k). I don't know whether collapse or schedule affects performance.



Barbara_P_Intel
Moderator
1,365 Views

Some references for using OpenMP TARGET directives:

The OpenMP website posts documents with examples. Here's a link. 

That same website lists books that are available. I have this one on my desk, Using OpenMP – The Next Step – by Ruud van der Pas, Eric Stotzer and Christian Terboven (2017).

Webinar: Three Quick, Practical Examples of OpenMP Offload to GPUs There are links to other webinars there, too, that you may find useful.

For when you're ready to optimize, check this out: oneAPI GPU Optimization Guide

Barbara_P_Intel
Moderator
1,346 Views

I set this environment variable:

On Linux/bash: export LIBOMPTARGET_PLUGIN_PROFILE=T

On Windows: set LIBOMPTARGET_PLUGIN_PROFILE=T

If it offloads, a table of profiling information is printed.


TobiasK
Moderator
1,300 Views

@jimdempseyatthecove @lxander great to see some OpenMP offload work here!


@lxander 

Let me give you a few hints.
In your first example you open a parallel region on the host. The subsequent target region is then created by every host thread separately, which is probably not what you want. Also, the !$omp do at line 28 is directly nested inside the !$omp target region. !$omp do is a worksharing directive: it does not create any threads, it just distributes the following do loop across the threads that are already active. The !$omp target region creates exactly one team with one thread, so no worksharing can happen at the !$omp do directive.
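
To make the "one team with one thread" point concrete, here is a minimal sketch (not from the original posts) that queries the runtime inside a bare target region and again inside a nested parallel region. The program and variable names are made up for illustration, and it assumes omp_get_num_teams() and omp_get_num_threads() can be called from device code, which is the case for the compilers I am aware of.

```fortran
program target_thread_count
  use omp_lib
  implicit none
  integer :: nteams, nthreads_bare, nthreads_par

  nteams = -1
  nthreads_bare = -1
  nthreads_par = -1

  ! Bare target region: executed by a single initial thread in a single team.
  !$omp target map(from: nteams, nthreads_bare)
  nteams = omp_get_num_teams()
  nthreads_bare = omp_get_num_threads()
  !$omp end target

  ! A nested parallel construct is what actually creates more threads.
  !$omp target map(from: nthreads_par)
  !$omp parallel
  !$omp single
  nthreads_par = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  !$omp end target

  print *, "bare target:       teams =", nteams, " threads =", nthreads_bare
  print *, "target + parallel: threads =", nthreads_par

  ! A worksharing !$omp do directly inside the bare region would therefore
  ! have only one thread to share work with.
  if (nteams /= 1 .or. nthreads_bare /= 1) error stop "unexpected team/thread count"
end program target_thread_count
```

The number reported for the nested parallel region is chosen by the runtime and will differ between devices.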

In your second example:
Please be aware that the !$omp do at line 33 will run on the host, not on the device! Since the directive at line 26 is a combined directive (!$omp target teams distribute), it implicitly ends after the loop at line 31. That is why the compiler complains about a misplaced directive at line 43. Essentially, the !$omp do at line 33 has no effect, because there is no parallel region active.

 

@jimdempseyatthecove 
you are using !$omp parallel do directly nested inside the !$omp target region, so the !$omp parallel does run on the device.
However, you do not create any teams, which means that you are quite limited in terms of achievable parallelism on the GPU. To fully occupy the GPU you need to create teams, each of which then creates its own parallel region.
Luckily, you can simply use
!$omp target teams distribute parallel do simd collapse(3)

This creates multiple teams: distribute spreads the work among the teams, each team creates a parallel region with threads, and do distributes the work among those threads.
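
As a small, self-contained illustration of the combined directive (a sketch, not code from this thread), the loop below offloads a simple y = a*x + y update. The array names and sizes are arbitrary, and the simd keyword is left out only to keep the example minimal.

```fortran
program offload_axpy
  implicit none
  integer, parameter :: n = 100000
  real, parameter :: a = 2.0
  real, allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))
  x = 1.0
  y = 3.0

  ! teams + distribute: split the iterations across teams.
  ! parallel + do: split each team's share across its threads.
  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do
  !$omp end target teams distribute parallel do

  print *, "y(1) =", y(1), " y(n) =", y(n)

  ! Every element should now be 2*1 + 3 = 5.
  if (any(y /= 5.0)) error stop "unexpected result"
end program offload_axpy
```

Each iteration writes a different element of y, so no reduction or other synchronization is needed here.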

Another note:
c_d is not initialized and may contain garbage.

 

Since teams are not able to synchronize with each other, only one teams region is allowed inside a target region.
If you want to initialize the array c_d on the device, you have to create two target regions:
  

  !$omp target enter data map (to:a_d,b_d) map (alloc:c_d)
  !$omp target teams distribute parallel do simd collapse(2) private (i,j)
  do i = 1, n
    do j = 1, n
      c_d(i,j) = 0.
    end do
  end do

  !$omp target teams distribute parallel do simd collapse(3) private(i, j, k)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$omp target exit data map(delete:a_d, b_d) map(from:c_d)

Note: the target enter data / target exit data directives are standalone directives that map data to the device. In a more complicated program you can place these directives in the modules where you also allocate module variables. Data mapped this way stays present on the device until you explicitly remove it with a target exit data directive. That way you avoid implicit data copies at each !$omp target region, since the runtime detects that the data is already present on the device and just uses it without additional transfers.
Be careful: if you modify the data on the host between these regions, you have to enforce synchronization via !$omp target update. Also, it is not allowed to change the allocation status while the arrays are mapped!
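
The following sketch (my own illustration, with a made-up array v) shows how target update fits between target enter data and target exit data: the host changes its copy, pushes the change to the device, lets the device work on it, and pulls the result back without unmapping.

```fortran
program target_update_demo
  implicit none
  integer, parameter :: n = 1000
  double precision, allocatable :: v(:)
  integer :: i

  allocate(v(n))
  v = 1.0d0

  ! Map v once; it stays resident on the device until "target exit data".
  !$omp target enter data map(to: v)

  ! The host copy changes, so the device copy has to be refreshed explicitly.
  v = 2.0d0
  !$omp target update to(v)

  ! v is already present on the device, so this region triggers no transfer.
  !$omp target teams distribute parallel do
  do i = 1, n
    v(i) = v(i) * 10.0d0
  end do

  ! Copy the device result back to the host while keeping v mapped.
  !$omp target update from(v)
  print *, "v(1) =", v(1)
  if (any(v /= 20.0d0)) error stop "unexpected result"

  ! Unmap before changing the allocation status.
  !$omp target exit data map(delete: v)
  deallocate(v)
end program target_update_demo
```

Without the target update to(v), the device would still hold the old values (1.0) and the host would read back 10.0 instead of 20.0.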

The number of teams and threads per team is defined by the runtime if you do not specify num_teams() and thread_limit()/ num_threads() clauses. Look for "Team sizes" and "Number of teams" in the LIBOMPTARGET debug output.
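
A minimal sketch of these clauses, assuming the limits 4 and 8 purely for illustration: the runtime treats them as upper bounds, so the program only checks that the reported values stay within them.

```fortran
program team_sizes
  use omp_lib
  implicit none
  integer :: nteams, nthreads

  nteams = 0
  nthreads = 0

  ! Request at most 4 teams with at most 8 threads each.
  !$omp target teams num_teams(4) thread_limit(8) map(from: nteams, nthreads)
  !$omp parallel
  ! Let exactly one thread of one team record what the runtime chose.
  if (omp_get_team_num() == 0 .and. omp_get_thread_num() == 0) then
    nteams = omp_get_num_teams()
    nthreads = omp_get_num_threads()
  end if
  !$omp end parallel
  !$omp end target teams

  print *, "teams =", nteams, " threads per team =", nthreads

  if (nteams < 1 .or. nteams > 4) error stop "unexpected number of teams"
  if (nthreads < 1 .or. nthreads > 8) error stop "unexpected number of threads"
end program team_sizes
```

Small limits like these will normally underuse a GPU; they are only meant to make the effect of the clauses visible.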

Best
Tobias

 

lxander
Novice
1,165 Views

Thank you for that wonderfully thorough explanation.  It’s really helpful and has got me pointed in the right direction.

lxander
Novice
1,165 Views

I’ll give that a try … and thanks for the references.

jimdempseyatthecove
Black Belt
1,276 Views

Tobias, very interesting. Thanks for the reply.

Comment: It seems (grammatically) redundant to combine "teams distribute" together with "parallel" on an !$omp target.

Can you elaborate the nuances regarding this?

IOW the syntax:

       !$omp target parallel do ...

should be sufficient to describe the intent of the programmer to perform the parallel do within a GPU (or multiple GPUs).

I suppose "teams distribute" might be an annotation required to distribute the execution to multiple GPUs, but then it would also seem that

!$omp target enter data map (to:a_d,b_d) map (alloc:c_d)

would require "teams distribute" for multiple GPUs as well as some means to proportionally distribute the mapping.

i.e. it would not be desirable to consume the entire array sizes on each GPU as well as the PCIe bus bandwidth to copy the unnecessary data. This would not apply to shared memory.

 

Jim Dempsey

 

 

TobiasK
Moderator
1,270 Views

Jim,
the problem with !$omp parallel do is that it creates a team of threads that are able to synchronize with each other (via an explicit or implicit !$omp barrier). If you think about GPUs, that is not possible. Hence something else is required: teams are not able to synchronize with each other, but within each team !$omp parallel do creates threads that are able to synchronize. Also, the teams are scheduled in any order. Actually, the compiler optimizes away the loop itself and just starts as many teams * threads per team as there are iterations available in the loop, so each thread inside a team will just work on one i/j/k index. That reminds me that the code I provided is also wrong: we need a reduction over c_d, otherwise we have a race condition in loading and storing c_d(i,j).

  !$omp target teams distribute parallel do simd collapse(3) private(i, j, k) reduction(+:c_d)
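
For comparison, here is an alternative sketch that is not from this thread: it avoids the race without an array reduction by collapsing only the i and j loops and accumulating over k in a private scalar, so each c(i,j) is written by exactly one thread. The names a, b, c and s and the small size n = 64 are assumptions for illustration; I have not compared its performance against the reduction version.

```fortran
program matmul_offload
  implicit none
  integer, parameter :: n = 64
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  double precision :: s
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Only i and j are distributed; the k loop runs sequentially per (i,j) pair.
  !$omp target teams distribute parallel do collapse(2) private(s, k) map(to: a, b) map(from: c)
  do i = 1, n
    do j = 1, n
      s = 0.0d0                      ! private accumulator, also initializes c(i,j)
      do k = 1, n
        s = s + a(i,k) * b(k,j)
      end do
      c(i,j) = s
    end do
  end do

  print *, "c(1,1) =", c(1,1)

  ! Compare against the intrinsic matmul computed on the host.
  if (maxval(abs(c - matmul(a, b))) > 1.0d-10) error stop "result differs from matmul"
end program matmul_offload
```

Because the accumulator starts at zero, this version also removes the need for a separate target region that initializes c on the device.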

Multiple GPUs are not targeted automatically. All !$omp target directives have an optional device clause.
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-1.pdf 
page 199:

If no device clause is present, the behavior is as if the device clause appears without a device-modifier and with an expression equal to the value of the default-device-var ICV.

Using multiple devices is possible by having multiple host threads each start an !$omp target region with a different device id, but unfortunately it does not happen automatically.
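
A rough sketch of that pattern, under the assumption that device numbers run from 0 to omp_get_num_devices() - 1 and that one host thread per device is acceptable: each host thread maps only its own slice of a made-up array v and offloads it with an explicit device clause. I could only reason about this on paper for the multi-GPU case, so treat it as a starting point.

```fortran
program multi_device
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  double precision, allocatable :: v(:)
  integer :: ndev, nchunks, chunk, d, lo, hi, i

  allocate(v(n))
  v = 1.0d0

  ndev = omp_get_num_devices()
  nchunks = max(ndev, 1)               ! with no device, one chunk runs on the host
  chunk = (n + nchunks - 1) / nchunks  ! ceiling division

  ! One host thread per device; each thread offloads its own slice.
  !$omp parallel do num_threads(nchunks) private(lo, hi, i)
  do d = 0, nchunks - 1
    lo = d * chunk + 1
    hi = min(n, (d + 1) * chunk)
    if (lo > hi) cycle                 ! nothing left for this device

    ! Only the slice v(lo:hi) is transferred to device d.
    !$omp target teams distribute parallel do device(d) map(tofrom: v(lo:hi))
    do i = lo, hi
      v(i) = 2.0d0 * v(i)
    end do
  end do
  !$omp end parallel do

  print *, "devices =", ndev, " v(1) =", v(1), " v(n) =", v(n)
  if (any(v /= 2.0d0)) error stop "unexpected result"
end program multi_device
```

Mapping array sections rather than whole arrays keeps each device's memory use and transfer volume proportional to its share of the work.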

 

Best
Tobias

Barbara_P_Intel
Moderator
710 Views

@lxander, the ICE (Internal Compiler Error) that you originally reported is fixed in ifx 2023.2.0. It was released in July 2023 as part of the oneAPI HPC Toolkit 2023.2.


