<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Questions regarding GPUs and OCLOC in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503662#M167197</link>
    <description>&lt;P&gt;Actually, when I turned on the debug output (with double-precision matrices of a rather limited size, n=160), I got the very informative message:&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Libomptarget --&amp;gt; Device 0 is ready to use.
Target LEVEL0 RTL --&amp;gt; Device 0: Loading binary from 0x00007ff63879c000
Target LEVEL0 RTL --&amp;gt; Expecting to have 2 entries defined
Target LEVEL0 RTL --&amp;gt; Base L0 module compilation options: -cl-std=CL2.0  
Target LEVEL0 RTL --&amp;gt; Found a single section in the image
Target LEVEL0 RTL --&amp;gt; Error: addModule:zeModuleCreate failed with error code 1879048196, ZE_RESULT_ERROR_MODULE_BUILD_FAILURE
Target LEVEL0 RTL --&amp;gt; Error: module creation failed
LEVEL0 message: Target build log:
LEVEL0 message:   ''
LEVEL0 message:   'error: Double type is not supported on this platform.'
LEVEL0 message:   'in kernel: 'MAIN__''
LEVEL0 message:   'error: backend compiler failed build.'
LEVEL0 message:   ''&lt;/LI-CODE&gt;&lt;P&gt;I have attached the entire output, but this was the missing information that could have pointed us to the problem directly.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Jul 2023 14:47:42 GMT</pubDate>
    <dc:creator>Arjen_Markus</dc:creator>
    <dc:date>2023-07-11T14:47:42Z</dc:date>
    <item>
      <title>Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502641#M167114</link>
      <description>&lt;P&gt;I want to experiment a bit with GPUs, but I am getting lost wrt the actual hardware and the support from ifx. Here is the situation:&lt;/P&gt;&lt;P&gt;I work with a laptop running Windows. According to the task manager it has two GPUs, Intel UHD Graphics and NVIDIA RTX A1000 Laptop GPU. I have no idea if ifx supports the first (the second certainly is not supported). So, I try to build a program that exploits GPUs via OpenMP offloading. So far so good.&lt;/P&gt;&lt;P&gt;The option -Qopenmp-targets:spir64 does have an effect, in that with the environment variable LIBOMPTARGET_DEBUG set to 1 I get a lot of debugging information. If I unset that variable the program hangs and after an interruption via control-C, I get the message:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
KERNELBASE.dll     00007FFE76522943  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFE76FB7614  Unknown               Unknown  Unknown
ntdll.dll          00007FFE787E26F1  Unknown               Unknown  Unknown
Libomptarget error: Host ptr 0x00007ff6f3f795ec does not have a matching target pointer.
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory&lt;/LI-CODE&gt;&lt;P&gt;My interpretation is that the Intel GPU is not actually used or cannot be connected or is simply not supported. Well, that can happen. But looking for an alternative (or better: looking for the list of devices that are supported), I came across the option -Qopenmp-targets:spir64_gen.&lt;/P&gt;&lt;P&gt;If I try that, I get:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

ifx: warning #10441: The OpenCL offline compiler could not be found and is required for AOT compilation.See "https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html" for more information.
ifx: error #10037: could not find 'ocloc'
ifx: error #10401: error running 'Offline Compiler'&lt;/LI-CODE&gt;&lt;P&gt;So I try to find out how to get ocloc. For Windows it ought to be part of the Intel DPC++/C++ installation. As far as I can tell from the output of icx on my laptop, that has been installed:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

icx: error: no input files&lt;/LI-CODE&gt;&lt;P&gt;But I cannot find a program "ocloc.exe" on the laptop. Or anything that resembles that name.&lt;/P&gt;&lt;P&gt;So I am left with a couple of questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Is the Intel GPU I have supported by ifx or icx?&lt;/LI&gt;&lt;LI&gt;What do I need to do to get ocloc and thereby enable "spir64_gen", if that would be a solution?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Fri, 07 Jul 2023 10:50:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502641#M167114</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T10:50:51Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502650#M167117</link>
      <description>&lt;P&gt;Did you install the Intel GPU device driver? Information on how to do that is in the &lt;A href="https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-fortran-compiler-system-requirements.html" target="_blank" rel="noopener"&gt;System Requirements&lt;/A&gt; article. Supported Intel GPUs are listed with the driver information.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:19:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502650#M167117</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-07T11:19:40Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502659#M167120</link>
      <description>&lt;P&gt;Well, to be sure (I did do an explicit update before) I followed the instructions from that page and hoped I got the right one, as none of the entries lists exactly the GPU my system apparently has. But that was unsuccessful in the sense that I get the same sort of error. The program hangs and upon control-C I get similar messages.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:47:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502659#M167120</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T11:47:27Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502662#M167121</link>
      <description>&lt;P&gt;Judging from the debug output I would say it is working:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Libomptarget --&amp;gt; Loading library 'omptarget.rtl.level0.dll'...
Target LEVEL0 RTL --&amp;gt; Init Level0 plugin!
Target LEVEL0 RTL --&amp;gt; omp_get_thread_limit() returned 2147483647
Target LEVEL0 RTL --&amp;gt; omp_get_max_teams() returned 0
Libomptarget --&amp;gt; Successfully loaded library 'omptarget.rtl.level0.dll'!
Target LEVEL0 RTL --&amp;gt; Looking for Level0 devices...
Target LEVEL0 RTL --&amp;gt; Found a GPU device, Name = Intel(R) UHD Graphics 770
Target LEVEL0 RTL --&amp;gt; Found 1 root devices, 1 total devices.
Target LEVEL0 RTL --&amp;gt; List of devices (DeviceID[.SubID[.CCSID]])
Target LEVEL0 RTL --&amp;gt; -- 0
Target LEVEL0 RTL --&amp;gt; Root Device Information
Target LEVEL0 RTL --&amp;gt; Device 0
Target LEVEL0 RTL --&amp;gt; -- Name                         : Intel(R) UHD Graphics 770
Target LEVEL0 RTL --&amp;gt; -- PCI ID                       : 0x4688&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;and lots more, but I get no output and I have to terminate the program because I am running out of patience (the same program with classic OpenMP statements runs in half a second, the program without any OpenMP takes several seconds).&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 11:57:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502662#M167121</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T11:57:47Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502664#M167122</link>
      <description>&lt;P&gt;Oh, I misinterpreted the behaviour of the program! I added a write statement to the time loop (the body of that loop is in the target section) and I see that this is running, albeit very slowly. The task manager indeed indicates that the GPU is doing a lot of work, but I have succeeded in slowing down the program by at least a factor of 100. Not entirely the result I expected :).&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 12:05:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502664#M167122</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-07T12:05:10Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502805#M167130</link>
      <description>&lt;P&gt;you can set the env var&lt;/P&gt;
&lt;P&gt;LIBOMPTARGET_PLUGIN_PROFILE=T&lt;/P&gt;
&lt;P&gt;to get an idea of how much data movement is occurring, how much time is spent in the kernel, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is the kernel small enough to share?&amp;nbsp; For DO loops, did you use&lt;/P&gt;
&lt;P&gt;!$omp target teams distribute parallel do&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jul 2023 19:18:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502805#M167130</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-07-07T19:18:56Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502950#M167143</link>
      <description>&lt;P&gt;No, I am still learning how exactly to use the various keywords. I know my way around the classic OpenMP keywords, but these are new and I have to experiment. So I did. Using the keywords you suggested does improve the performance of the program, but it is still very much slower than the sequential version. I have copied the code below (attaching did not work &lt;LI-EMOJI id="lia_disappointed-face" title=":disappointed_face:"&gt;&lt;/LI-EMOJI&gt; ) - it is a toy program, easy enough for experimentation.&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;! diffu.f90 --
!     Solve a diffusion-reaction equation: nabla2 u = alpha * exp(u)
!
!     Note:
!     The program is much slower than without OpenMP offloading. This clearly
!     requires more fine-tuning.
!
!
program diffu
    use omp_lib

    implicit none
    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n
    real              :: time1, time2
    integer           :: cnt1, cnt2, cnt_rate

    open( 10, file = 'diffu.out' )

    n     = 1280
    allocate( u(n,n), unew(n,n) )

    delt  = 0.1
    alpha = 0.01
    u     = 0.0

!!  u(1,:) = 1.0

!!    call omp_set_num_threads(128)

    call system_clock( cnt1, cnt_rate )
    call cpu_time( time1 )

    write(*,*) 'Start time loop ...'

    do k = 1,1000
!!        write(*,*) k
!XXXX !$omp target map(tofrom: u) map(from:unew)
!XXXX !$omp teams

!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
            enddo
        enddo
!$omp target teams distribute parallel do
        do j=2,n-1
            do i=2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo

    call cpu_time( time2 )
    call system_clock( cnt2 )

    do j =1,n
        write( 10, '(*(f10.4))' ) u(:,j)
    enddo

    write(*,*) 'CPU time:   ', time2 - time1
    write(*,*) 'Clock time: ', (cnt2 - cnt1) / real(cnt_rate)
end program&lt;/LI-CODE&gt;&lt;P&gt;As you can see, it contains some experiments - the to and tofrom clauses.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Jul 2023 13:51:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502950#M167143</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-08T13:51:15Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502975#M167147</link>
      <description>&lt;P&gt;Arjen,&lt;/P&gt;&lt;P&gt;I am inexperienced with GPU programming. I do have some observations on your coding example.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your 1st !XXXX commented section (when uncommented) can be thought of as an analog of !$omp parallel (without the DO). IOW it starts the encapsulation of a parallel region (terminated with !$omp end parallel). In this case of target (without teams), the encapsulated code (through !$omp end target) is intended to be offloaded to the GPU.&lt;/P&gt;&lt;P&gt;Your code as written is using !$omp target teams ... within the do k= loop.&lt;/P&gt;&lt;P&gt;Meaning each of the two instances specifies both an offload region of code plus an offload teams distribution.&lt;/P&gt;&lt;P&gt;IOW each instance performs a copy in from host to GPU and copy back from GPU to host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Perhaps a better approach is to place the do k= loop, or a portion of that loop, inside the offload region.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;...
stride = 10
do k=1,1000, stride
  write(*,*) k
  !$omp target map(tofrom: u) map(from:unew)
  do kk=k,min(k+stride-1,1000)
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
      enddo
    enddo
    !$omp teams distribute parallel do
    do j=2,n-1
      do i=2,n-1
        u(i,j) = unew(i,j)
      enddo
    enddo
  enddo
  !$omp end target
end do
call cpu_time( time2 )
call system_clock( cnt2 )
...&lt;/LI-CODE&gt;&lt;P&gt;Or place the entire k loop inside the target region.&lt;/P&gt;&lt;P&gt;The code above performs a progress report every 10 steps.&lt;/P&gt;&lt;P&gt;Also, you can use persistent data within the GPU and copy out what is needed when needed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wish the documentation included examples, or links to examples of the various offload features.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Jul 2023 18:23:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1502975#M167147</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-07-08T18:23:17Z</dc:date>
    </item>
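Jim's persistent-data suggestion can be sketched with a `!$omp target data` region: `u` and `unew` stay resident on the device for the whole time loop, and each enclosed combined construct contains nothing besides the `teams` part, which stays within the TARGET/TEAMS nesting restriction the compiler enforces (error #8699). This is an illustrative sketch, not code from the thread; names and sizes follow the diffu.f90 example, and the `collapse(2)` clause is an addition of mine, not something proposed by the posters.

```fortran
! Sketch: keep u/unew resident on the GPU across the whole time loop,
! so only kernel launches (not array transfers) happen per step.
program diffu_data
    implicit none
    real, allocatable :: u(:,:), unew(:,:)
    real              :: alpha, delt
    integer           :: i, j, k, n

    n     = 1280
    allocate( u(n,n), unew(n,n) )
    delt  = 0.1
    alpha = 0.01
    u     = 0.0

!$omp target data map(tofrom: u) map(alloc: unew)
    do k = 1,1000
        ! Each target region below contains only the teams construct,
        ! satisfying the TARGET/TEAMS nesting restriction.
!$omp target teams distribute parallel do collapse(2)
        do j = 2,n-1
            do i = 2,n-1
                unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) &
                          + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j)   &
                          + alpha * exp(u(i,j)) )
            enddo
        enddo
!$omp target teams distribute parallel do collapse(2)
        do j = 2,n-1
            do i = 2,n-1
                u(i,j) = unew(i,j)
            enddo
        enddo
    enddo
!$omp end target data

    write(*,*) 'u(2,2) = ', u(2,2)   ! u is copied back at end target data
end program diffu_data
```

The same structure compiles and runs serially when built without offload options, so it degrades gracefully on machines without a supported GPU.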
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503005#M167150</link>
      <description>&lt;P&gt;That is pretty much what I had in mind: copy the data to the GPU, do all the calculations there and then bring back the results. It is indeed the lack of clear examples that makes it an exercise in patience, trial and error. At the very least I am glad I could establish that the GPU is recognised and is doing the work I wanted it to do. Now I need to figure out what the right invocation is to make it worthwhile - there are quite a few permutations possible. I will look into your suggestions, thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 10:35:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503005#M167150</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T10:35:46Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503007#M167151</link>
      <description>&lt;P&gt;I wrote a Fortran OpenMP offload tutorial that will be published in the oneAPI Samples GitHub when oneAPI 2023.2 is released later this month. It's based on a matrix multiply.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is what the code should look like when the tutorial is complete.&amp;nbsp; The size of the matrix may need to be changed depending on the memory available in the GPU.&lt;/P&gt;
&lt;P&gt;ARGH! The ELSE for line 13 doesn't show. You'll need that.&lt;/P&gt;
&lt;LI-CODE lang="fortran"&gt;program matrix_multiply
   use omp_lib
   implicit none
   integer :: i, j, k, myid, m, n
   real(8), allocatable, dimension(:,:) :: a, b, c, c_serial

   n = 2600

    myid = OMP_GET_THREAD_NUM()
    if (myid .eq. 0) then
      print *, 'matrix size ', n
      print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()

      print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target 
      if (OMP_IS_INITIAL_DEVICE()) then
        print *, ' Running on CPU'
        else
        print *, ' Running on GPU'
      endif
!$omp end target 
    endif

      allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = i + j - 1
            b(i,j) = i - j + 1
         enddo
      enddo
      c = 0.0
      c_serial = 0.0

!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
      do j=1,n
         do i=1,n
            do k=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$omp end target teams

! serial compute matrix multiplication
      do j=1,n
         do i=1,n
            do k=1,n
                c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo

! verify result
      do j=1,n
         do i=1,n
            if (c_serial(i,j) .ne. c(i,j)) then
               print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
            exit
            endif
         enddo
      enddo

      print *,'PASSED'

end program matrix_multiply&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:03:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503007#M167151</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-09T11:03:16Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503008#M167152</link>
      <description>&lt;P&gt;If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents an evolution in time, so it has to be sequential. Or do I misunderstand it?&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:05:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503008#M167152</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T11:05:29Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503009#M167153</link>
      <description>&lt;P&gt;Ah, thanks - again, I will study this. It will certainly be worthwhile to see the directives in action.&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 11:08:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503009#M167153</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-09T11:08:38Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503014#M167155</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;SPAN&gt;If I look at the suggested code change, then the k-loop itself becomes parallelised, but that cannot be done, because it represents an evolution in time, so it has to be sequential. Or do I misunderstand it?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Perhaps Barbara can correct me should I be wrong.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;!$omp target (without teams), through !$omp end target...&lt;/P&gt;&lt;P&gt;can be used to install code and data into the GPU (or reuse code from prior use) .AND. begin a serial sequence within the GPU.&lt;/P&gt;&lt;P&gt;Within the above target region, !$omp teams distribute can be used to form a team and distribute a DO loop (or loops) for parallel processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In Barbara's example, she is using !$omp target teams ... to enter the offload region with a parallel team running, and then within the parallel offload region uses !$omp distribute to partition the DO loop.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Jul 2023 12:55:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503014#M167155</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2023-07-09T12:55:46Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503117#M167160</link>
      <description>&lt;P&gt;This shows the importance of examples :). I hope that the tutorial I found will bring me the correct understanding.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the meantime, things are not entirely straightforward. Jim's suggestion leads to the following error messages from the compiler:&lt;/P&gt;&lt;LI-CODE lang="none"&gt;Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023.1.0 Build 20230320
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

diffu_v5_jim.f90(39): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        do kk=k,min(k+stride-1,1000)
--------^
diffu_v5_jim.f90(46): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
            !$omp teams distribute parallel do
------------------^
diffu_v5_jim.f90(52): error #8699: If a TARGET construct contains the TEAMS construct it must contain no statements or directives outside of the TEAMS construct.
        enddo
--------^
compilation aborted for diffu_v5_jim.f90 (code 1)&lt;/LI-CODE&gt;&lt;P&gt;whereas Barbara's example leads to run-time errors:&lt;/P&gt;&lt;LI-CODE lang="none"&gt; matrix size         2600
 Number of CPU procs is            1
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with
Libomptarget error: LIBOMPTARGET_DEBUG=1 to display basic debug information.
Libomptarget error: LIBOMPTARGET_DEBUG=2 to display calls to the compute runtime.
Libomptarget error: LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 07:13:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503117#M167160</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T07:13:32Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503156#M167161</link>
      <description>&lt;P&gt;Some OpenMP offload references:&lt;/P&gt;
&lt;UL class="added rich-diff-level-zero" dir="auto"&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/developer/videos/three-quick-practical-examples-openmp-offload-gpus.html" target="_blank" rel="nofollow noopener"&gt;Three Quick, Practical Examples of OpenMP Offload to GPUs&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(Intel webinar)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://app.plan.intel.com/e/er?cid=em&amp;amp;source=elo&amp;amp;campid=satg_WW_satgobmcdn_EMNL_EN_2023_Dev%20Newsletter%20April%202023_C-MKA-30705_T-MKA-36702&amp;amp;content=satg_WW_satgobmcdn_EMNL_EN_2023_Dev%20Newsletter%20April%202023_C-MKA-30705_T-MKA-36702_HPC&amp;amp;elq_cid=5093974&amp;amp;em_id=91065&amp;amp;elqrid=84ff8ebfeeda4abeb600f9e3c1073d93&amp;amp;elqcampid=56326&amp;amp;erpm_id=7990181&amp;amp;s=334284386&amp;amp;lid=623351&amp;amp;elqTrackId=36debf346d1a471e9c1e5e84630bea8e&amp;amp;elq=84ff8ebfeeda4abeb600f9e3c1073d93&amp;amp;elqaid=91065&amp;amp;elqat=1" target="_blank" rel="nofollow noopener"&gt;GPU Offloading: The Next Chapter for Intel® Fortran Compiler&lt;/A&gt;&amp;nbsp;(Intel webinar)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;
&lt;P&gt;&lt;A href="https://direct.mit.edu/books/book/4482/Using-OpenMP-The-Next-StepAffinity-Accelerators" target="_blank" rel="nofollow noopener"&gt;Using OpenMP—The Next Step: Affinity, Accelerators, Tasking, and SIMD&amp;nbsp;&lt;/A&gt;(book)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="rich-diff-level-one"&gt;&lt;A href="https://www.openmp.org/wp-content/uploads/openmp-examples-5-2.pdf" target="_blank" rel="noopener"&gt;Examples&lt;/A&gt; from &lt;A href="http://openmp.org" target="_blank" rel="noopener"&gt;openmp.org&lt;/A&gt;. Search for TARGET.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 10:50:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503156#M167161</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T10:50:41Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503158#M167162</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;&lt;SPAN&gt;whereas Barbara's example leads to run-time errors:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Can you try a smaller matrix? I had to do that for one Intel GPU I tested.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 10:55:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503158#M167162</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T10:55:01Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503163#M167163</link>
      <description>&lt;P&gt;I reduced the size by a factor of 10 and got a very similar error message. I am also a trifle puzzled by the statement that there is only one CPU. A "hello" program clearly showed 24 CPUs (or better: a default of 24 threads being started).&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 11:10:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503163#M167163</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T11:10:40Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503169#M167164</link>
      <description>&lt;P&gt;With the matmul, there is only 1 CPU because the OpenMP directives are all for TARGET, not CPU. You can modify the directives to run on the CPU only.&lt;/P&gt;
&lt;P&gt;I just copied what I posted, added the "else" and ran it successfully on Linux with PVC. However, the output didn't say I ran on GPU. So I removed the "else" and got this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;+ a.out
 matrix size         2600
 Number of CPU procs is            1
 Number of OpenMP Device Available:           2
 Running on GPU
 PASSED
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My next step is to load the GPU driver on my laptop and see what's up.&lt;/P&gt;
&lt;P&gt;I also set the environment variable&amp;nbsp;LIBOMPTARGET_PLUGIN_PROFILE=T and got the following profile output.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL_ZERO) for OMP DEVICE(0) Intel(R) Data Center GPU Max 1100, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_45_810c026c_MAIN___l14
Kernel 1                  : __omp_offloading_45_810c026c_MAIN___l35
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     421.63    421.63    421.63    421.63      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       3.11      0.22      0.00      0.81      0.00      0.00      0.00      0.00     14.00
DataRead (Device to Host) :       0.00      0.00      0.00      0.00      2.38      2.38      2.38      2.38      1.00
DataWrite (Host to Device):       4.39      0.49      0.01      1.71      7.39      0.82      0.00      2.47      9.00
Kernel 0                  :       1.87      1.87      1.87      1.87      0.01      0.01      0.01      0.01      1.00
Kernel 1                  :       0.07      0.07      0.07      0.07   2865.43   2865.43   2865.43   2865.43      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :       1.51      1.51      1.51      1.51      0.00      0.00      0.00      0.00      1.00
======================================================================================================================
&lt;/LI-CODE&gt;
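&lt;P&gt;For anyone reproducing this, a typical way to enable that profile table (assuming a bash shell and an executable named a.out) is:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Print the Level Zero plugin profile table when the program exits
export LIBOMPTARGET_PLUGIN_PROFILE=T
./a.out&lt;/LI-CODE&gt;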
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 11:36:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503169#M167164</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T11:36:16Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503187#M167166</link>
      <description>&lt;P&gt;I removed the "else" statement and got the message that one OpenMP device is available, followed by the error message.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;With one version of the original diffusion program, I get the following profile:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="none"&gt; Start time loop ...
 CPU time:      10.23438
 Clock time:    10.23900
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) UHD Graphics 770, Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l34
Kernel 1                  : __omp_offloading_f2b3f24a_efe1e_MAIN___l40
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (msec)                        Device Time (msec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :     604.39    604.39    604.39    604.39      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :       1.49      0.00      0.00      0.05      0.00      0.00      0.00      0.00   8008.00
DataRead (Device to Host) :     797.53      0.20      0.16      0.90      0.00      0.00      0.00      0.00   4000.00
DataWrite (Host to Device):     902.03      0.09      0.00      1.66      0.00      0.00      0.00      0.00  10000.00
Kernel 0                  :    2758.64      2.76      2.51     11.12   2582.38      2.58      2.43      3.55   1000.00
Kernel 1                  :    5127.18      5.13      4.83      6.22   4958.39      4.96      4.75      6.06   1000.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :      12.91     12.91     12.91     12.91      0.00      0.00      0.00      0.00      1.00
======================================================================================================================&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 13:16:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503187#M167166</guid>
      <dc:creator>Arjen_Markus</dc:creator>
      <dc:date>2023-07-10T13:16:22Z</dc:date>
    </item>
    <item>
      <title>Re: Questions regarding GPUs and OCLOC</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503301#M167173</link>
      <description>&lt;P&gt;That's great that you have something running on the GPU! That table is proof, so it seems your environment is set up correctly.&lt;/P&gt;
&lt;P&gt;I don't know why the matmul is failing. I just ran it successfully on an Intel Core i7-8809G @ 3.10 GHz with Intel® HD Graphics 630. It's an older machine, but it worked.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jul 2023 18:50:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Questions-regarding-GPUs-and-OCLOC/m-p/1503301#M167173</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-07-10T18:50:45Z</dc:date>
    </item>
  </channel>
</rss>