<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Do Concurrent offload getting incorrect results in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</link>
    <description>&lt;P&gt;I've done some testing with your reproducer using an early version of the next ifx compiler. It prints the correct answers on the CPU when compiled with -qopenmp and without.&lt;/P&gt;&lt;P&gt;But with the Intel GPU I don't get the right answers. I'm investigating.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 02 Jan 2024 22:26:01 GMT</pubDate>
    <dc:creator>Barbara_P_Intel</dc:creator>
    <dc:date>2024-01-02T22:26:01Z</dc:date>
    <item>
      <title>Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1550908#M169782</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a code that is using "do concurrent" for offload to Intel GPUs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following code yields an incorrect result with IFX on an Intel GPU (but works fine on the CPU and on NVIDIA GPUs with nvfortran):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;do concurrent (i=1:nr)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fn2_fn1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fs2_fs1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(2 ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(2 ,k,i) - x(1 ,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(nt ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR 
/&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=1:npm)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;However, if I modify the code to make the outer-most loop sequential, the code does yield the correct result:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;do i=1,nr&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fn2_fn1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; fs2_fs1 = zero&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(2 ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(2 ,k,i) - x(1 ,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;+ diffusion_coef(nt ,k,i)) &amp;amp;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;* (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; do concurrent (k=1:npm)&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y( 1,k,i) = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; &amp;nbsp; y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;&amp;nbsp; enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt;enddo&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It seems the compiler is not liking having the reduction loop within a DC loop (or maybe just having DC loops within DC 
loops?)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The NVIDIA compiler handles this by parallelizing the outer loop across blocks and the two inner loops across threads:&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;7367, Generating implicit private(fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Generating NVIDIA GPU code&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7367, Loop parallelized across CUDA thread blocks ! blockidx%x&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7370, Loop parallelized across CUDA threads(128) ! threadidx%x&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Generating reduction(+:fn2_fn1,fs2_fs1)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;7378, Loop parallelized across CUDA threads(128) ! threadidx%x&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wanted to bring this to your attention - for now, I may make the outer loop sequential, as "nr" is often small.&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Dec 2023 00:25:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1550908#M169782</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-06T00:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551288#M169799</link>
      <description>&lt;P&gt;What version of ifx are you using? A new version, 2024.0.0, was released a couple of weeks ago as part of the HPC Toolkit.&lt;/P&gt;
&lt;P&gt;Do you have a complete reproducer including the output you expect?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What compiler options are you using?&lt;/P&gt;
&lt;P&gt;Thanks! That info will help with the triage.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Dec 2023 23:29:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551288#M169799</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-12-06T23:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551299#M169803</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using the latest 24.0:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;EDGE_GPU_INTEL: ~ $ ifx -v&lt;BR /&gt;ifx version 2024.0.0&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The compiler options I am using are:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;FC = mpif90 -f90=ifx&lt;/P&gt;&lt;P&gt;FFLAGS = -O3 -xHost -fp-model precise -heap-arrays -fopenmp-target-do-concurrent -fiopenmp -fopenmp-targets=spir64 -Xopenmp-target-backend "-device arc"&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(the same issue happens on the MAX GPU with "-device pvc")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will be releasing the full code on GitHub in a few days, at which point I can post it here.&lt;/P&gt;&lt;P&gt;To reproduce, you would need to replace that small section with the first version I posted.&lt;/P&gt;&lt;P&gt;Then, you can run the testsuite and see that the tests fail, whereas with the revised code (the one on git) it succeeds.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will post back here when the code is released, with the details on how to test it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 00:22:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551299#M169803</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-07T00:22:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551514#M169820</link>
      <description>&lt;P&gt;I should have been more specific about the reproducer. A small reproducer is best: one that we can compile and run quickly, preferably without MPI.&lt;/P&gt;
&lt;P&gt;I meant "complete" in that you only supplied some loops.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 16:42:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551514#M169820</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2023-12-07T16:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551633#M169822</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It should be straightforward to create a reproducer using the two versions of the loops I sent (I do not have time to make one currently).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Otherwise, when the code comes out tomorrow, it can be used to reproduce the error as described above.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;</description>
      <pubDate>Thu, 07 Dec 2023 20:58:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551633#M169822</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-07T20:58:27Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551681#M169827</link>
      <description>&lt;P&gt;Should be straightforward..... People can spend a lot of time trying to figure out what the 'obvious' but missing vital detail is. And on a personal level I will say that the exercise of making a simple reproducer has often thrown up additional details. It is better to make a reproducer when you have time: the support staff will show much more interest, and we all then get some resolution and a better product!&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2023 00:08:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551681#M169827</guid>
      <dc:creator>andrew_4619</dc:creator>
      <dc:date>2023-12-08T00:08:01Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551700#M169829</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I do not have time to make one currently. Sad.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;As my old workmate used to say, there are 3 work days in 24 hours; I would use the spare one. Of course after 3 days of working you can tend to see pink elephants, but as long as you realize they do not exist you are still sane.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Courtesy never hurt anyone, although interestingly, this is the second "challenging" post this afternoon.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;A lot of people who work here do it for free, and most would not ask a question whose answer can be found with a simple search.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In essence, RTFM.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Dec 2023 00:59:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1551700#M169829</guid>
      <dc:creator>JohnNichols</dc:creator>
      <dc:date>2023-12-08T00:59:21Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1553934#M169960</link>
      <description>&lt;P&gt;&lt;SPAN&gt; Assuming fn2_fn1 and fs2_fs1 are scalars, nothing jumps out as being obviously problematic.&amp;nbsp; If they are scalars, we will assume they are LOCAL_INIT.&amp;nbsp; Otherwise, we will assume they are SHARED, which will cause problems.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You could try adding a &lt;A href="https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/do-concurrent.html" target="_self"&gt;locality-spec&lt;/A&gt;. It not only helps the compiler; someone who inherits your code 30 years from now will also understand your data-locality intent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
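&lt;P&gt;For example, on the loop nest from the original post, explicit locality-specs could look something like this (an untested sketch; REDUCE is F2023, the other locality-specs are F2018):&lt;/P&gt;
&lt;LI-CODE lang="fortran"&gt;! Untested sketch: the original nest with explicit locality-specs added.
do concurrent (i=1:nr) local(fn2_fn1,fs2_fs1) &amp;amp;
                       shared(x,y,diffusion_coef,dp,dt_i,nt,ntm,npm)
  fn2_fn1 = zero
  fs2_fs1 = zero
  do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
    fn2_fn1 = fn2_fn1 + (diffusion_coef(1,k,i) + diffusion_coef(2,k,i)) &amp;amp;
                      * (x(2,k,i) - x(1,k,i))*dp(k)
    fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1,k,i) + diffusion_coef(nt,k,i)) &amp;amp;
                      * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
  enddo
  do concurrent (k=1:npm)
    y(1,k,i)   = fn2_fn1*dt_i(1)*dt_i(1)*pi_i
    y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
  enddo
enddo&lt;/LI-CODE&gt;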
&lt;P&gt;Finally, have you tried this without AOT (without &lt;SPAN&gt;-Xopenmp-target-backend)?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 13:35:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1553934#M169960</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-14T13:35:52Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554036#M169976</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, they are scalars.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have not used locality specifiers because several compilers that we use have not supported them, and we need to maintain compatibility.&lt;/P&gt;&lt;P&gt;Also, the default behavior of every compiler that supports DC is the exact locality behavior we require/expect, so adding them would simply be redundant.&lt;/P&gt;&lt;P&gt;As for 30 years from now, I thank you for your suggestion, but the style guide for our code is clear that we do not use arrays without indicating that they are arrays, either through indexing or, when using array syntax, by writing a(:). Therefore, as we provide documentation, it will be known that they are scalars.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;"&lt;SPAN&gt;Finally, have you tried this without AOT (without &lt;/SPAN&gt;&lt;SPAN class=""&gt;-Xopenmp-target-backend)?"&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;Yes - the code works fine on the CPU with all compilers.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I must say that, so far, none of these responses have been helpful, and being cursed at by a fellow user is not professional or useful.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was simply trying to help out by pointing out a bug in the compiler for code that works for GPU offload with other compilers but was failing for IFX in a bad "wrong answer" way.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I may refrain from posting in this forum and rely on my direct Intel contacts from now on.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That said, I can provide a simple reproducer when I am back from travel (as it seems you require that we provide that now) if you are still interested in fixing this bug.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-- Ron&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:15:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554036#M169976</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-14T17:15:12Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554041#M169977</link>
      <description>&lt;P&gt;I did not mean to imply that this is not a bug - it is a bug, I think. Filing a bug report requires that I find an hour to cook up a testcase.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
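&lt;P&gt;Concretely, the two build modes being compared are (options collected from this thread; the source filename is illustrative):&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;# AOT: offload code is compiled to a device-specific binary at build time
ifx -qopenmp -fopenmp-target-do-concurrent -fopenmp-targets=spir64 -Xopenmp-target-backend "-device pvc" reproducer.f90

# JIT: SPIR-V is embedded and compiled for the device at runtime
ifx -qopenmp -fopenmp-target-do-concurrent -fopenmp-targets=spir64 reproducer.f90&lt;/LI-CODE&gt;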
&lt;P&gt;The -Xopenmp-target-backend option passes the offload code to the device compiler and creates a device-specific binary. If you leave that option off, we create SPIR-V code that is JIT-compiled by the device compiler at runtime. There are often differences between JIT and AOT code. The error you are seeing is most likely in the IGC compiler that our Fortran compiler uses to create the device code, or to JIT it at runtime. I do think there is a problem in IGC codegen. The fact that it runs correctly on the CPU makes me fairly sure our Fortran compiler is "doing the right thing" with the loop nest. The AOT case is probably doing a loop-optimization "trick" that has gone bad; JIT may not be as aggressive in loop transforms. It's worth a test.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:22:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554041#M169977</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-14T17:22:38Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554046#M169978</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Like I said, I can send you a reproducer next week.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Otherwise, the code this came from is now on github: (github.com/predsci/hipft).&lt;/P&gt;&lt;P&gt;If you find the loop in question and replace it with the first example I posted (with nested DC loops), when you run the testsuite you will see that the test fails.&lt;/P&gt;&lt;P&gt;The repo provides build scripts for IFX for both CPU and GPU offload.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;-&amp;nbsp; Ron&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 17:27:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554046#M169978</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-14T17:27:39Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554208#M169991</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is a reproducer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It should print the same pairs of numbers each time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the GPU it outputs the wrong answer for the loop with embedded DC loops.&lt;/P&gt;&lt;P&gt;On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.&lt;/P&gt;&lt;P&gt;On the CPU in serial mode, it produces correct results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="fortran"&gt;program dc_gpu_bug1

    implicit none

    integer                                         :: i, j, k, nr,npm,nt,ntm,np
    double precision                                :: fn2_fn1, fs2_fs1
    double precision, dimension(:,:,:), allocatable :: diffusion_coef, x, y
    double precision, dimension(:,:,:,:), allocatable :: coef
    double precision, dimension(:),     allocatable :: dp,dt_i
    double precision, parameter :: zero=0.0
    double precision, parameter :: pi_i=1.0d0

    nr = 1
    ntm = 512
    npm = ntm*2
   
    nt=ntm+1
    np=npm

    allocate (diffusion_coef(nt,np,nr))

    diffusion_coef(:,:,:) = 1.0d0

    allocate (coef(2:ntm-1,2:npm-1,5,nr))

    coef(:,:,:,:)=0.

    coef(2,:,:,:) = 1.0d0
    coef(3:ntm-2,:,:,:) = 3.0d0
    coef(ntm-1,:,:,:) = 1.0d0

    allocate(dp(npm))
    dp(:) = 1.0d0

    allocate (dt_i(ntm))
    dt_i(:) = 1.0d0

    allocate (x(ntm,npm,nr))
    allocate (y(ntm,npm,nr))

    x(:,:,:) = 1.0d0

    x(1,:,:) = 2.0d0
    x(2,:,:) = 4.0d0

    x(ntm,:,:) = 2.0d0
    x(ntm-1,:,:) = 4.0d0

    y(:,:,:) = zero

!$omp target enter data map(to:x,y,coef,diffusion_coef,dp,dt_i)

    do concurrent (i=1:nr)
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;
                           + diffusion_coef(2 ,k,i)) &amp;amp;
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;
                           + diffusion_coef(nt ,k,i)) &amp;amp;
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

! Reset y.

    y(:,:,:) = zero

!$omp target enter data map(to:y)

    do i=1,nr
      fn2_fn1 = zero
      fs2_fs1 = zero
      do concurrent (k=2:npm-1) reduce(+:fn2_fn1,fs2_fs1)
        fn2_fn1 = fn2_fn1 + (diffusion_coef(1 ,k,i) &amp;amp;
                           + diffusion_coef(2 ,k,i)) &amp;amp;
                         * (x(2 ,k,i) - x(1 ,k,i))*dp(k)
        fs2_fs1 = fs2_fs1 + (diffusion_coef(nt-1 ,k,i) &amp;amp;
                           + diffusion_coef(nt ,k,i)) &amp;amp;
                         * (x(ntm-1,k,i) - x(ntm,k,i))*dp(k)
      enddo
      do concurrent (k=1:npm)
        y( 1,k,i)  = fn2_fn1*dt_i( 1)*dt_i( 1)*pi_i
        y(ntm,k,i) = fs2_fs1*dt_i(ntm)*dt_i(ntm)*pi_i
      enddo
    enddo

!$omp target exit data map(from:y)

    print *, y(1,2,1), y(ntm,2,1)

!$omp target exit data map(delete:x,coef,diffusion_coef,dp,dt_i)
    deallocate(coef, x, y, dp, dt_i, diffusion_coef)

end program dc_gpu_bug1&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 15 Dec 2023 02:38:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554208#M169991</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2023-12-15T02:38:44Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554416#M170008</link>
      <description>&lt;P&gt;I will open a bug report.&amp;nbsp; Thank you for the reproducer!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;ron&lt;/P&gt;</description>
      <pubDate>Fri, 15 Dec 2023 16:32:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1554416#M170008</guid>
      <dc:creator>Ron_Green</dc:creator>
      <dc:date>2023-12-15T16:32:34Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</link>
      <description>&lt;P&gt;I've done some testing with your reproducer using an early version of the next ifx compiler. It prints the correct answers on the CPU when compiled with -qopenmp and without.&lt;/P&gt;&lt;P&gt;But with the Intel GPU I don't get the right answers. I'm investigating.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 02 Jan 2024 22:26:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1559288#M170240</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-02T22:26:01Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563396#M170414</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&amp;gt;&amp;gt; On the CPU it actually seg faults on the embedded DC loops when using -fopenmp.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Please use -qopenmp with ifx. -fopenmp is deprecated and behaves differently from -qopenmp.&lt;/SPAN&gt;&lt;/P&gt;
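&lt;P&gt;&lt;SPAN&gt;For example, using the CPU flags posted earlier in this thread (filename illustrative):&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp reproducer.f90 -o test_qopenmp&lt;/LI-CODE&gt;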
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:38:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563396#M170414</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-16T23:38:02Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563400#M170415</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It still segfaults with -qopenmp:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="none"&gt;EDGE_GPU_INTEL: ~/bug1 $ ifx -v
ifx version 2024.0.2
EDGE_GPU_INTEL: ~/bug1 $ ifx -O3 -xHost -fp-model precise -heap-arrays -qopenmp dc_gpu_bug1_v2.f90 -o test_qopenmp
EDGE_GPU_INTEL: ~/bug1 $ ./test_qopenmp
4088.00000000000 0.000000000000000E+000
4088.00000000000 4088.00000000000
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 00007F9221A54DB0 Unknown Unknown Unknown
libiomp5.so 00007F9221F7236C Unknown Unknown Unknown
libiomp5.so 00007F9221F7233A Unknown Unknown Unknown
libiomp5.so 00007F9221F743DF Unknown Unknown Unknown
test_qopenmp 000000000040D7F8 Unknown Unknown Unknown
test_qopenmp 00000000004076A0 Unknown Unknown Unknown
test_qopenmp 000000000040525D Unknown Unknown Unknown
libc.so.6 00007F9221A3FEB0 Unknown Unknown Unknown
libc.so.6 00007F9221A3FF60 __libc_start_main Unknown Unknown
test_qopenmp 0000000000405175 Unknown Unknown Unknown&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Ron&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:49:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563400#M170415</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2024-01-16T23:49:26Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563401#M170416</link>
      <description>&lt;P&gt;Now for the GPU version...&lt;/P&gt;&lt;P&gt;The Fortran OpenMP compiler guys said to add "shared(ntm, npm)" to the outer DO CONCURRENT loop.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;From 11.1.7.5 of the F2018 standard:&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I made that change and ran successfully on the Intel GPU.&lt;/P&gt;&lt;P&gt;My compiler options: ifx -what -fopenmp-target-do-concurrent -qopenmp -fopenmp-targets=spir64&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 16 Jan 2024 23:58:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563401#M170416</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-16T23:58:33Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563402#M170417</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This works!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, I think it should work without the "shared".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ntm and npm are scalars which according to the spec are assumed to be local/private here as they do not have the SAVE attribute.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This should be fine as long as they are treated as "firstprivate", in which case the inner DC loop should have correct values.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It seems the compiler is not handling this correctly since the original code reproducer works fine for GCC and NV (with NV working for both CPU and GPU).&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unless these other compilers are assuming/helping more than the spec requires?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Ron&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:10:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563402#M170417</guid>
      <dc:creator>caplanr</dc:creator>
      <dc:date>2024-01-17T00:10:57Z</dc:date>
    </item>
    <item>
      <title>Re: Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563404#M170418</link>
      <description>&lt;P&gt;I tried your original reproducer with your compiler options, CPU only, on 2 different Linux distros. One ran OK; the other got a segmentation fault.&lt;/P&gt;
&lt;P&gt;However, the version of your reproducer with the added "shared(ntm, npm)" on the outer DO CONCURRENT compiled with your compiler options ran ok on CPU on both distros.&lt;/P&gt;
&lt;P&gt;I'm using&amp;nbsp;ifx (IFX) 2024.0.0 20231017.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:16:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563404#M170418</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-17T00:16:53Z</dc:date>
    </item>
    <item>
      <title>Re:Do Concurrent offload getting incorrect results</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563405#M170419</link>
      <description>&lt;P&gt;I don't speak "standard" very well. There are others on this Forum who do. I'll let them comment.&lt;/P&gt;&lt;P&gt;I wonder how this statement in the F2018 standard translates to nested DO CONCURRENT?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 17 Jan 2024 00:23:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Do-Concurrent-offload-getting-incorrect-results/m-p/1563405#M170419</guid>
      <dc:creator>Barbara_P_Intel</dc:creator>
      <dc:date>2024-01-17T00:23:02Z</dc:date>
    </item>
  </channel>
</rss>

