
Victor_V_ · ‎02-18-2014

Hello,

We are frustrated by a weird problem caused by the latest Intel compiler suite (ifort/icc v14.0.2.144, Linux x86_64/EM64T). Depending on the optimization level, the following code generates different results:

CSTART

        subroutine MakeWwdHlp2 (Ww,W1,dima,dimbe,dimga)
        implicit none
        integer dima,dimb,dimab,dimbe,dimga,key
        real*8 Ww(1:dima,1:dimbe,1:dimga)
        real*8 W1(1:dima,1:dimbe,1:dima,1:dimga)

        integer a,be,ga

        do ga=1,dimga
        do be=1,dimbe
        do a=1,dima
          Ww(a,be,ga)=W1(a,be,a,ga)
        end do
        end do
        end do
C Uncomment to get correct results/loop's counters
c       print *,ga,be,a

        return
        end

CEND

The '-O0' and '-O1' optimization levels give us correct results, while '-O2' doesn't. If the print line is uncommented, the results are always correct, regardless of the optimization level used. The dimensions 'dimga', 'dima', and 'dimbe' are in the range [10,12]. Attached are the assembler listings generated with '-O0', '-O1', and '-O2'.

It would be great to identify the compiler option that causes the problem.

Thanks in advance!

Victor.

Steven_L_Intel1 · ‎02-18-2014

Sorry, there's nothing we can do for you here without sources we can use to build and run the program. The .s files are of no use in diagnostics.

Victor_V_ · ‎02-19-2014

Hi Steve,

Our project is quite a big one. We will try to find a workaround, either by experimenting with compiler options or by reworking this subroutine. Could you please give us a hint on how to list the compiler options that are enabled by default at a given optimization level, i.e. something similar to:

gfortran -v -Q -O2 -c ...

options enabled: -falign-labels -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves ....

Thank you in advance!

Victor.

Kevin_D_Intel · ‎02-19-2014

You can try the ifort -list option which produces a listing file (.lst) containing a section COMPILER OPTIONS BEING USED. I do not know how fruitful that will be.

Does the entire program need to be compiled at -O1 or -O0 to produce correct results, or just the single subroutine shown?

Could you perhaps isolate/create a reproducer by dumping the arrays to a file prior to calling the suspect routine, and then create a driver that reads in the data and calls only the suspect routine, to see whether that reproduces the incorrect results at -O2 and the correct results at -O1/-O0?
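A dump-and-replay reproducer along these lines might look roughly like the sketch below. It is only an illustration: the helper name DumpW1, the file names, and the unit number are assumptions, not part of the original project, and the dummy input merely stands in for the real data. In the real application only the DumpW1 call would be added just before the call to MakeWwdHlp2, and the replay part would be built as a separate driver.

```fortran
C     Sketch only: DumpW1 and the file name w1.dat are hypothetical.
      program replay
      implicit none
      integer dima,dimbe,dimga,i,n
      real*8, allocatable :: W1(:,:,:,:), Ww(:,:,:)

C     Stand-in for the application: build some input and dump it.
      dima=10
      dimbe=11
      dimga=12
      n=dima*dimbe*dima*dimga
      allocate(W1(dima,dimbe,dima,dimga))
      W1=reshape((/(dble(i),i=1,n)/),shape(W1))
      call DumpW1(W1,dima,dimbe,dimga)
      deallocate(W1)

C     The actual driver: read the dump and call only the suspect
C     routine, then print a checksum to compare across -O levels.
      open(10,file='w1.dat',form='unformatted',status='old')
      read(10) dima,dimbe,dimga
      allocate(W1(dima,dimbe,dima,dimga),Ww(dima,dimbe,dimga))
      read(10) W1
      close(10)
      call MakeWwdHlp2(Ww,W1,dima,dimbe,dimga)
      print *,'checksum of Ww:',sum(Ww)
      end

C     Hypothetical helper: write the dimensions and W1 to a file.
      subroutine DumpW1 (W1,dima,dimbe,dimga)
      implicit none
      integer dima,dimbe,dimga
      real*8 W1(1:dima,1:dimbe,1:dima,1:dimga)
      open(10,file='w1.dat',form='unformatted',status='replace')
      write(10) dima,dimbe,dimga
      write(10) W1
      close(10)
      return
      end

C     The suspect routine as posted in this thread.
      subroutine MakeWwdHlp2 (Ww,W1,dima,dimbe,dimga)
      implicit none
      integer dima,dimbe,dimga
      real*8 Ww(1:dima,1:dimbe,1:dimga)
      real*8 W1(1:dima,1:dimbe,1:dima,1:dimga)
      integer a,be,ga
      do ga=1,dimga
      do be=1,dimbe
      do a=1,dima
        Ww(a,be,ga)=W1(a,be,a,ga)
      end do
      end do
      end do
      return
      end
```

Building this once per optimization level and comparing the printed checksums would show whether the dumped data alone is enough to trigger the difference.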

Victor_V_ · ‎02-19-2014

Hi Kevin,

thank you for your reply.

The whole project is compiled with the '-O2' optimization level. However, to get correct results we only have to recompile this single subroutine with '-O0' and relink the binaries; then we always get correct results. Finally, I have identified the compiler options that cause the optimization problem:

"-vec -simd"

If I keep the '-O2' level and disable vectorization of this subroutine via the '-no-vec -no-simd' compiler options, then everything works properly.

> Could you perhaps isolate/create a reproducer by dumping the arrays

No problem, I will do it. It just takes some time.

With best regards,

Victor.

Kevin_D_Intel · ‎02-19-2014

Ok, thank you for the additional clues Victor. I'm wondering if perhaps a driver that simply initializes the arrays with dummy values will show the bad results too. I can try that now.

Kevin_D_Intel · ‎02-19-2014

I'm having no success producing incorrect results with a simple driver for the earlier provided subroutine when varying optimizations so I hope you will be able to isolate something that can help us reproduce this.
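For reference, a simple driver of the kind described here could be sketched as follows. This is not the driver actually used in the thread; the dimensions (taken from the [10,12] range mentioned earlier) and the index-encoding dummy values are assumptions chosen so that a wrong element would be easy to spot. As noted above, a driver like this did not reproduce the incorrect results.

```fortran
C     Minimal driver sketch: fill W1 with dummy values whose expected
C     position in Ww is known, call the routine, count mismatches.
      program driver
      implicit none
      integer dima,dimbe,dimga
      parameter (dima=10,dimbe=11,dimga=12)
      real*8 Ww(dima,dimbe,dimga)
      real*8 W1(dima,dimbe,dima,dimga)
      integer a,be,c,ga,nbad
      real*8 expect

C     Encode the four indices into each value (dummy data).
      do ga=1,dimga
      do c=1,dima
      do be=1,dimbe
      do a=1,dima
        W1(a,be,c,ga)=dble(a+100*be+10000*c+1000000*ga)
      end do
      end do
      end do
      end do

      Ww=-1.0d0
      call MakeWwdHlp2(Ww,W1,dima,dimbe,dimga)

C     Ww(a,be,ga) should equal W1(a,be,a,ga), i.e. c = a.
      nbad=0
      do ga=1,dimga
      do be=1,dimbe
      do a=1,dima
        expect=dble(a+100*be+10000*a+1000000*ga)
        if (Ww(a,be,ga).ne.expect) nbad=nbad+1
      end do
      end do
      end do
      print *,'mismatches:',nbad
      end

C     The suspect routine as posted in this thread.
      subroutine MakeWwdHlp2 (Ww,W1,dima,dimbe,dimga)
      implicit none
      integer dima,dimbe,dimga
      real*8 Ww(1:dima,1:dimbe,1:dimga)
      real*8 W1(1:dima,1:dimbe,1:dima,1:dimga)
      integer a,be,ga
      do ga=1,dimga
      do be=1,dimbe
      do a=1,dima
        Ww(a,be,ga)=W1(a,be,a,ga)
      end do
      end do
      end do
      return
      end
```

A correct build should report zero mismatches; any other count would point at the copy loop.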

Victor_V_ · ‎02-19-2014

Hi Kevin,

Kevin Davis (Intel) wrote:

> I'm having no success producing incorrect results with a simple driver for the earlier provided subroutine when varying optimizations so I hope you will be able to isolate something that can help us reproduce this.

Interestingly, interchanging the loops and giving the compiler a hint solves the problem:

        subroutine MakeWwdHlp2 (Ww,W1,dima,dimbe,dimga)
        implicit none
        integer dima,dimb,dimab,dimbe,dimga,key
        real*8 Ww(1:dima,1:dimbe,1:dimga)
        real*8 W1(1:dima,1:dimbe,1:dima,1:dimga)

        integer a,be,ga

        do a=1,dima
        do ga=1,dimga
cDEC$ VECTOR UNALIGNED
        do be=1,dimbe
           Ww(a,be,ga)=W1(a,be,a,ga)
        end do
        end do
        end do
C Uncomment to get correct results/loop's counters
C        print *,ga,be,a

        return
        end

Could the problem be that the working arrays are not aligned on a 16-byte boundary?

With best regards,

Victor.

Kevin_D_Intel · ‎02-19-2014

Fortunately, the small size of the routine enabled our developer to spot a defect in loop collapsing. I opened a defect report (internal tracking id noted below) and will keep you updated on the status as I learn it. Of the workarounds you found, you can choose whichever best fits your application.

They further wrote about the other items you noted:

“-no-vec -no-simd” helps because that shuts off transformations that enable more/better vectorization.

“Reordering loops” alone doesn’t help, since the compiler itself reorders loops for better memory locality. Adding the directive disables such reordering, which in turn affects the loop-collapsing decision (collapsing then does not kick in).

Thanks for reporting this issue.

(Internal tracking id: DPD200253575)
(Resolution Update on 09/11/2014): This defect is fixed in the Intel® Composer XE 2013 SP1 Update 4 release (2013.1.4.211 - Linux) -AND- the Intel® Parallel Studio XE 2015 Initial Release (2015.0.090 - Linux).
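As a side illustration of why the access pattern in MakeWwdHlp2 is unusual, the sketch below checks the column-major offset of W1(a,be,a,ga). Because the index a appears in two positions, stepping a by one moves 1+dima*dimbe elements through W1, while Ww is written with unit stride. This is only an illustration of the memory layout under small, assumed dimensions; it does not describe the compiler's internal loop-collapse logic or the defect itself.

```fortran
      program strides
      implicit none
      integer dima,dimbe,dimga
      parameter (dima=3,dimbe=2,dimga=2)
      real*8 W1(dima,dimbe,dima,dimga)
      real*8 flat(dima*dimbe*dima*dimga)
      integer a,be,ga,i,off,nbad

C     Give every element a value equal to its linear position.
      do i=1,dima*dimbe*dima*dimga
        flat(i)=dble(i)
      end do
      W1=reshape(flat,shape(W1))

      nbad=0
      do ga=1,dimga
      do be=1,dimbe
      do a=1,dima
C       Column-major offset of W1(a,be,a,ga); a contributes twice.
        off=1+(a-1)*(1+dima*dimbe)+(be-1)*dima
        off=off+(ga-1)*dima*dimbe*dima
        if (W1(a,be,a,ga).ne.flat(off)) nbad=nbad+1
      end do
      end do
      end do
      print *,'stride of a in W1:',1+dima*dimbe
      print *,'offset mismatches:',nbad
      end
```

With the dimensions assumed here, the stride printed is 7 and the mismatch count should be zero, confirming the offset formula.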

Victor_V_ · ‎02-20-2014

Hi Kevin,

thank you very much for your assistance and expertise!

I will keep an eye on it.

With best regards,

Victor.

Kevin_D_Intel · ‎09-11-2014

Development indicates the fix for the earlier identified loop collapse defect is available in the latest Intel® Composer XE 2013 SP1 Update 4 release (2013.1.4.211 - Linux). It is also available in the newest Intel Parallel Studio XE 2015 release for Linux (Version 15.0.0.090 Build 20140723) should you be interested in upgrading to that new release.

Problem with processing multidimensional arrays via nested loops