Internal compiler error (C0000005) when compiling OpenMP program


I am getting the error "xfortcom: Fatal: There has been an internal compiler error (C0000005)" when compiling the following code with the IFX Intel® Fortran Compiler (Version 2023.0.0 Build 20221201).

  program GPUTests

  use omp_lib
  implicit none
  ! Declare variables
  integer, parameter :: n = 10000
  ! double precision, ALLOCATABLE :: a(:,:), b(:,:), c(:,:)
  integer :: i,j,k
  integer :: devices
  double precision, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)
  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)
  ! Set up OpenMP parallel region

  ! Perform multiplication on GPU
  !$omp target map(to: a_d,b_d) map(from:c_d)
  !$omp  do
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$omp end  do
  !$omp end target

  ! End parallel region

  ! Print result
  print *, c_d(1,1)

  end program GPUTests

If I comment out "!$omp target map(to: a_d,b_d) map(from:c_d)" and "!$omp end target" at lines 28 and 38, the code compiles and runs (but, of course, the calculation is done on the CPU instead of the GPU). I'm wondering if I have my OpenMP target directives wrong. Perhaps I have something wrong in my build options?

Build Log

Build started: Project: GPUTests, Configuration: Debug|x64

Deleting intermediate files and output files for project 'GPUTests', configuration 'Debug|x64'.
Compiling with Intel® Fortran Compiler 2023.0.0 [Intel(R) 64]...
ifx /nologo /debug:full /Od /Qopenmp-targets:spir64 /Qopenmp /Qiopenmp /warn:interfaces /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc170.pdb" /libs:dll /threads /dbglibs /c /Qlocation,link,"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\HostX64\x64" /Qm64 "C:\My Files\Repos\Arc\GPUTests\GPUTests.f90"
xfortcom: Fatal: There has been an internal compiler error (C0000005).
compilation aborted for C:\My Files\Repos\Arc\GPUTests\GPUTests.f90 (code 1)

GPUTests - 1 error(s), 0 warning(s)


I made a small change to the OpenMP directives around the loop, changing "target" to "target teams distribute" etc., and changed the array type to REAL*4 (because with LIBOMPTARGET_DEBUG=1, there is a Level0 message "Double is not supported on this platform") ...

  program GPUTests

  use omp_lib

  implicit none

  ! Declare variables
  integer, parameter :: n = 1000

  integer :: devices

  integer :: i,j,k
  real*4, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)


  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
  do i=1,n
    do j=1,n
      c_d(i,j) = 0.0
    end do
  end do

  !$OMP DO
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !!$OMP END TARGET TEAMS DISTRIBUTE !<- commented out because with it in I get error #7622 Misplaced part of OpenMP parallel directive
  ! Print result
  print *, c_d(1,1)

  end program GPUTests

The code compiles without any errors and it runs and this is the output (from setting LIBOMPTARGET_DEBUG=1).

I'm still not seeing any load in the GPU.  Does any of the above output confirm (or otherwise) that the calculation is being done on the GPU?  Maybe my OpenMP directives are still not correct?


I've made some small modifications to the OpenMP directives and this compiles and executes ...

  program GPUTests

  use omp_lib

  implicit none

  ! Declare variables
  integer, parameter :: n = 1000

  integer :: devices

  integer :: i,j,k
  real*4, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)


  devices = omp_get_num_devices()

  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
  do i=1,n
    do j=1,n
      c_d(i,j) = 0.0
    end do
  end do

  !$OMP DO
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !!$OMP END TARGET TEAMS DISTRIBUTE !<- commented out because with it in I get error #7622 Misplaced part of OpenMP parallel directive
  ! Print result
  print *, c_d(1,1)

  end program GPUTests

This is the output ...

I still think something is not right since there doesn't seem to be any load in the GPU.

Libomptarget --> 
  Loading library ''...
Libomptarget --> 
  Unable to load library '': T!
Libomptarget --> 
  Unable to load library '': 
  Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> 
  Loading library 'omptarget.rtl.x86_64.dll'...
Libomptarget --> 
  Unable to load library 'omptarget.rtl.x86_64.dll': T!
Libomptarget --> 
  Unable to load library 'omptarget.rtl.x86_64.dll': 
      Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> 
  Loading library ''...
Libomptarget --> 
  Unable to load library '': T!
Libomptarget --> 
  Unable to load library '': 
      Can't open: The specified module could not be found.  (0x7E)!
Libomptarget --> RTLs loaded!


The first part indicates one of the following:

a) the path to the libraries is not set up properly (environment variable LD_LIBRARY_PATH=...)

b) the path is correct, but those files are not there

The last part ("RTLs loaded!") seems to contradict the "Unable to load..." messages.


Target LEVEL0 RTL --> Submitted kernel 0x000002b90f091bf0 to device 0
Target LEVEL0 RTL --> Executed kernel entry 0x000002b915269bc0 on device 0

This implies that execution occurred (the libraries reported as missing were not required).

Jim Dempsey



The !$OMP directives are a bit picky. The following appears to work on my system:

!  GPUTests.f90 
  program GPUTests

  use omp_lib
  implicit none
  ! Declare variables
!  integer, parameter :: n = 10000
  integer, parameter :: n = 100
  ! double precision, ALLOCATABLE :: a(:,:), b(:,:), c(:,:)
  integer :: i,j,k
  integer :: devices
  double precision, ALLOCATABLE, target :: a_d(:,:), b_d(:,:), c_d(:,:)
  devices = omp_get_num_devices()
  print *,"devices =", devices
  call random_seed()
  call random_number(a_d)
  call random_number(b_d)

  ! Perform multiplication on GPU
!$omp target map(to: a_d,b_d) map(from:c_d) 
!$omp parallel do collapse(3) schedule (static, 1) private(i, j, k)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
!$omp end target
  ! Print result
  print *, c_d(1,1)

  end program GPUTests
 devices =           1

I reduced the array size because I do not know how much memory is available to my GPU without further exploration.

I will try n = 10,000...

... still running. I am on an Intel NUC with a CPU-integrated GPU, which does not have hardware-supported DP (IOW, DP is implemented in software).

Jim Dempsey


Using single precision, n=3000, takes ~33 seconds.



@lxander, thank you for this nice reproducer. I filed a bug, CMPLRLLVM-46214.

@jimdempseyatthecove, thank you for the working matmul! That's the way I would code it. These clauses are optional: collapse(3) schedule (static, 1) private(i, j, k). I don't know if performance is impacted by the collapse or schedule or not.


Some references for using OpenMP TARGET directives:

The OpenMP website posts documents with examples. Here's a link. 

That same website lists books that are available. I have this one on my desk, Using OpenMP – The Next Step – by Ruud van der Pas, Eric Stotzer and Christian Terboven (2017).

Webinar: Three Quick, Practical Examples of OpenMP Offload to GPUs. There are links to other webinars there, too, that you may find useful.

For when you're ready to optimize, check this out: oneAPI GPU Optimization Guide


I set this environment variable:



If it offloads, a table of profiling information is printed.
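
A complementary check can be made from inside the program itself. The following is only a minimal sketch, not code from this thread: it assumes the standard omp_lib routine omp_is_initial_device(), and the program and variable names (check_offload, on_host) are made up for illustration.

```fortran
program check_offload
  use omp_lib
  implicit none
  logical :: on_host   ! hypothetical name: .true. if the region ran on the host

  on_host = .true.

  ! The scalar is mapped back explicitly so the host sees the answer
  ! computed inside the target region.
  !$omp target map(from: on_host)
  on_host = omp_is_initial_device()
  !$omp end target

  print *, "devices available:", omp_get_num_devices()
  if (on_host) then
    print *, "target region ran on the HOST (no offload)"
  else
    print *, "target region ran on a DEVICE"
  end if
end program check_offload
```

If the region falls back to the host, the first message is printed. Setting the standard OMP_TARGET_OFFLOAD=MANDATORY environment variable should make the run fail instead of silently falling back, which can make a missing offload easier to spot.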

@jimdempseyatthecove @lxander great to see some OpenMP offload work here!


Let me give you a few hints.
In your first example you open a parallel region on the host. The subsequent target region is then created by every host thread separately, which is probably not what you want. Also, the !$omp do at line 28 is nested directly inside the !$omp target region. !$omp do is a worksharing directive: it does not create any threads, it only distributes the following do loop across the threads that are already active. The !$omp target region creates exactly one team with one thread, so no worksharing can happen at the !$omp do directive.

In your second example:
Please be aware that the !$omp do at line 33 will run on the host, not on the device! Since the directive at line 26 is a combined directive (!$omp target teams distribute), it implicitly ends after the loop at line 31. That is why the compiler complains about a misplaced directive at line 43. Essentially, the !$omp do at line 33 has no effect, because no parallel region is active.


You are using the !$omp parallel do nested directly inside the !$omp target region, so the !$omp parallel do does run on the device.
However, you do not create any teams, which means that you are quite limited in the parallelism you can achieve on the GPU. To fully occupy the GPU you need to create teams, each of which then creates its own parallel region.
Luckily, you can simply use
!$omp target teams distribute parallel do simd collapse(3)

which creates multiple teams: distribute spreads the work among the teams, each team creates a parallel region with threads, and do distributes the work among those threads.

Another note:
c_d is not initialized and may contain garbage.


Since teams are not able to synchronize, only one teams region is allowed inside a target region.
If you want to initialize the array c_d on the device, you have to create two target regions:

  !$omp target enter data map (to:a_d,b_d) map (alloc:c_d)
  !$omp target teams distribute parallel do simd collapse(2) private (i,j)
  do i = 1, n
    do j = 1, n
      c_d(i,j) = 0.
    end do
  end do

  !$omp target teams distribute parallel do simd collapse(3) private(i, j, k)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        c_d(i,j) = c_d(i,j) + a_d(i,k) * b_d(k,j)
      end do
    end do
  end do
  !$omp target exit data map(delete:a_d, b_d) map(from:c_d)

Note: the target enter / exit data map directives are standalone directives that map data to the device. In a more complicated program you can place these directives in modules where you also allocate module variables. Data mapped this way stays present on the device until you explicitly remove it with a target exit data directive. That way you avoid implicit data copies at each !$omp target region, since the runtime detects that the data is already present on the device and uses it without additional transfers.
Be careful: if you modify the data on the host between these regions, you have to enforce synchronization via !$omp target update. Also, it is not allowed to change the allocation status while the arrays are mapped!
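
To illustrate the synchronization point, here is a small sketch (not from the posts above; the program name update_demo and the array x are made up for illustration). It assumes an OpenMP 4.5 or later compiler and falls back to the host if no device is present.

```fortran
program update_demo
  implicit none
  integer, parameter :: n = 8
  real, allocatable :: x(:)    ! hypothetical array, for illustration only
  integer :: i

  allocate(x(n))
  x = 1.0

  ! Map x once; it stays present on the device until the exit data below.
  !$omp target enter data map(to: x)

  ! Host-side change: the device copy still holds 1.0 at this point.
  x = 2.0
  ! Push the new host values to the device copy.
  !$omp target update to(x)

  ! x is already present, so no implicit transfer happens here.
  !$omp target teams distribute parallel do
  do i = 1, n
    x(i) = x(i) + 1.0
  end do

  ! Pull the device results back; x remains mapped.
  !$omp target update from(x)
  print *, x(1), x(n)          ! both values are 3.0

  ! Unmap before changing the allocation status.
  !$omp target exit data map(delete: x)
  deallocate(x)
end program update_demo
```

Without the target update to(x), a device with its own memory would still see the old value 1.0, and the printed result would be 2.0 instead of 3.0.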

The number of teams and threads per team is defined by the runtime if you do not specify num_teams() and thread_limit()/ num_threads() clauses. Look for "Team sizes" and "Number of teams" in the LIBOMPTARGET debug output.
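
As a rough sketch of how these clauses can be tried out, the following program requests specific limits and reports what the runtime actually chose. The values 4 and 64 and the names team_report, nteams and nthreads are arbitrary choices for illustration, not recommendations.

```fortran
program team_report
  use omp_lib
  implicit none
  integer :: nteams, nthreads   ! hypothetical names for the reported sizes

  nteams = 0
  nthreads = 0

  ! num_teams and thread_limit are upper bounds; the runtime may pick fewer.
  !$omp target teams num_teams(4) thread_limit(64) map(from: nteams, nthreads)
  !$omp parallel
  ! Only one thread of one team writes, so there is no race on the scalars.
  if (omp_get_team_num() == 0 .and. omp_get_thread_num() == 0) then
    nteams = omp_get_num_teams()
    nthreads = omp_get_num_threads()
  end if
  !$omp end parallel
  !$omp end target teams

  print *, "teams:", nteams, " threads per team:", nthreads
end program team_report
```

Dropping the two clauses and running again shows what the runtime picks by default, which can be compared with the debug output.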




Thank you for that wonderfully thorough explanation.  It’s really helpful and has got me pointed in the right direction.


I’ll give that a try … and thanks for the references.


Tobias, very interesting. Thanks for the reply.

Comment: It seems (grammatically) redundant to combine "teams distribute" together with "parallel" on an !$omp target.

Can you elaborate the nuances regarding this?

IOW the syntax:

       !$omp target parallel do ...

should be sufficient to describe the intent of the programmer to perform the parallel do within a GPU (or multiple GPUs).

I suppose "teams distribute" might be an annotation required to distribute the execution to multiple GPUs, but then it would also seem that

!$omp target enter data map (to:a_d,b_d) map (alloc:c_d)

would require "teams distribute" for multiple GPUs as well as some means to proportionally distribute the mapping.

I.e., it would not be desirable to consume the entire array size on each GPU, or to use PCIe bus bandwidth to copy the unnecessary data. This would not apply to shared memory.


Jim Dempsey



The problem with !$omp parallel do is that it creates a team of threads that are able to synchronize with each other (!$omp barrier, explicit or implicit). If you think about GPUs, that is not possible. Hence something else is required: teams are not able to synchronize with each other, but within each team !$omp parallel do creates threads that are able to synchronize. Also, the teams may be scheduled in any order. Actually, the compiler optimizes away the loop and just starts as many teams * threads per team as there are iterations available in the loop nest, so each thread inside a team works on just one i/j/k index. That reminds me that the code I provided is also wrong: we need a reduction over c_d, otherwise we have a race condition in loading and storing c_d(i,j).

  !$omp target teams distribute parallel do simd collapse(3) private(i, j, k) reduction(+:c_d)
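
An alternative way to avoid the race, sketched here as a suggestion rather than as anything posted in this thread, is to collapse only the two outer loops so that each c_d(i,j) is owned by exactly one thread, and to accumulate over k in a private scalar. The variable acc and the size n = 200 are assumptions made for illustration; the arrays are allocated explicitly.

```fortran
program matmul_offload
  implicit none
  integer, parameter :: n = 200
  real, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)
  real :: acc                  ! hypothetical per-thread accumulator
  integer :: i, j, k

  allocate(a_d(n,n), b_d(n,n), c_d(n,n))
  call random_number(a_d)
  call random_number(b_d)

  ! Only i and j are distributed; the k loop runs sequentially inside each
  ! thread, so no two threads ever update the same c_d(i,j).
  !$omp target teams distribute parallel do collapse(2) private(acc, k) map(to: a_d, b_d) map(from: c_d)
  do j = 1, n
    do i = 1, n
      acc = 0.0
      do k = 1, n
        acc = acc + a_d(i,k) * b_d(k,j)
      end do
      c_d(i,j) = acc
    end do
  end do

  ! Compare against the intrinsic as a sanity check on the host.
  print *, "max abs difference vs. matmul:", maxval(abs(c_d - matmul(a_d, b_d)))
end program matmul_offload
```

This also removes the need to zero c_d beforehand, since every element is written exactly once. Whether it is faster than the array reduction on a given GPU would have to be measured.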

Multiple GPUs are not targeted automatically; all !$omp target directives have an optional device clause.
Page 199:

If no device clause is present, the behavior is as if the device clause appears without a device-modifier and with an expression equal to the value of the default-device-var ICV.

Using multiple devices is possible by using multiple host threads, each starting an !$omp target region with a different device id, but unfortunately it does not happen automatically.
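
A minimal sketch of that pattern might look as follows. It is an illustration under assumptions, not tested multi-GPU code: the chunking scheme and the names multi_device, lo and hi are made up, and with zero or one device everything simply runs as a single chunk.

```fortran
program multi_device
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer :: ndev, d, i, lo, hi   ! lo/hi: hypothetical chunk bounds
  real, allocatable :: x(:)

  allocate(x(n))
  x = 1.0

  ! Use one chunk when no offload device is present (host fallback).
  ndev = max(omp_get_num_devices(), 1)

  ! One host thread per device; each offloads its own contiguous chunk of x,
  ! so only that section is transferred to the corresponding device.
  !$omp parallel do private(lo, hi, i) num_threads(ndev)
  do d = 0, ndev - 1
    lo = d * n / ndev + 1
    hi = (d + 1) * n / ndev
    !$omp target teams distribute parallel do device(d) map(tofrom: x(lo:hi))
    do i = lo, hi
      x(i) = 2.0 * x(i)
    end do
  end do
  !$omp end parallel do

  ! Every element was doubled exactly once, so the sum is 2000.0.
  print *, "sum =", sum(x)
end program multi_device
```

Mapping only the section x(lo:hi) per device addresses the concern about copying whole arrays to every GPU, at the cost of doing the partitioning by hand.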



@lxander, the ICE (Internal Compiler Error) that you originally reported is fixed in ifx 2023.2.0. It was released in July 2023 as part of the oneAPI HPC Toolkit 2023.2.
