Compiler 17 and /standard-semantics option: anyone facing issues with intrinsic assignment of allocatable arrays?

FortranFan · ‎09-13-2016

As mentioned in the Release Notes with compiler 17 (Intel Parallel Studio XE 2017),

A significant change in this release is that the default for intrinsic 
assignment to allocatable arrays is now to do the automatic 
(re)allocation if necessary, as specified by the standard. In past 
releases you needed to specify the assume realloc_lhs option to get this 
behavior. In some applications the additional checking may affect 
performance - you can revert to the previous behavior by specifying 
norealloc_lhs or the new option nostandard-realloc-lhs.

I had assumed the above will have no impact on code compiled with the /standard-semantics compiler option. However, I'm encountering a bunch of strange issues involving intrinsic assignment of allocatable arrays even when the assignment does not involve any reallocation i.e., the LHS and RHS arrays indeed have the same shape, only the values are different. This is happening in code that has worked reliably with no change since compiler 14. Unfortunately I have no liberty whatsoever to share the actual code. So I have spent considerable time since last week to put together a generic reproducer but without any success. Hence I was wondering if anyone else has been experiencing any issues.

The problems I face include the following, note the compiler options for the codes are /standard-semantics /libs:static /threads /warn:all /stand /On where n may be d or 1,2, or 3 depending on the tests.

1) Run-time failures with either "insufficient virtual memory" or "access violation" errors.

2) Same exact code with same exact compiler and optimization options will complete execution with results as expected with Win32 (IA32) platform option, but experience runtime errors with x64 (Intel 64) platform option.

3) Some of the codes execute successfully with /Od and /O1 optimization options, but face above-mentioned run-time exceptions with /O2 or higher options.

4) However some of the codes referred to in 3) above can break with an unnamed exception in the debugger; the debugger then breaks into free.c procedure in Microsoft's C run-time library with an "error reading register" on line 50 (retval = HeapFree(_crtheap, 0, pBlock); instruction); the call stack shows Fortran code with instruction on a line where an intrinsic assignment to an allocatable array is taking place, even though no reallocation is necessary in any of the cases. Note for_realloc_lhs is present on the call stack.

5) But now if one steps through the code in the debugger line by line, the above break is sometimes avoided.

6) If compiler version is reverted back to compiler 16 with NO change to the code, all the programs referred to in the above failures execute as expected.

Hence it appears to be some sort of regression with /standard-semantics option. As I mentioned above, I'm struggling to put together a reproducer and it's taking up too much time. Any feedback is appreciated. It'll be great if Intel can report what kind of regression testing was performed and perhaps post a case or two that was used in the testing. I can then try the same and see if I notice expected behavior.

Thanks,

Steven_L_Intel1 · ‎09-13-2016

Please provide test cases. In the process of enabling this as a default we found and fixed many bugs that had not been exercised before, but it's possible we broke something. We ran our entire, VERY large test suite with this change and were in beta test with it for months.

FortranFan · ‎09-13-2016

Steve Lionel (Intel) wrote:

Please provide test cases. In the process of enabling this as a default we found and fixed many bugs that had not been exercised before, but it's possible we broke something. We ran our entire, VERY large test suite with this change and were in beta test with it for months.

Well, I'm raking my brains to create a reproducer.

The actual codes that experience these run-time failures do so with allocatable array components of REAL type in derived types which get defined (and redefined) with instructions involving intrinsic assignment in type-bound procedures, meaning typical OO code with methods working on passed objects and their data. One of my issues I have tons of very, very simple, generic OO type of code and I don't notice problems with these trivial cases. Then we have these extremely heavy-duty code libraries with deep OO hierachies which fail now but for which my hands are totally tied.

I just realized on one of the other build environments that the problems seem to have been introduced with update 1 of compiler 17 beta, meaning the codes all work with the initial compiler 17 beta as well as compiler 16, update 3. I wonder if this helps in anyway in figuring out my problems might be, say in connection with the changes that went into update 1 of compiler 17 beta.

Steven_L_Intel1 · ‎09-13-2016

It doesn't help - sorry. Only a reproducing test case helps.

FortranFan · ‎09-13-2016

Steve Lionel (Intel) wrote:

It doesn't help - sorry. Only a reproducing test case helps.

How about this?

module k_m

   use, intrinsic :: iso_fortran_env, only : WP => real64

   implicit none

   private

   public :: WP

end module k_m

module b_m

   implicit none

   private

   type, abstract, public :: b_t
      private
   end type b_t

end module b_m

module d_m

   use k_m, only : WP
   use b_m, only : b_t

   implicit none

   private

   type, extends(b_t), public :: d_t
      private
      real(WP), allocatable :: m_d(:)
      real(WP), allocatable :: m_norm_d(:)
   contains
      private
      procedure, pass(this), public :: norm_d => get_norm_d
      procedure, pass(this), public :: set => set_d
   end type d_t

contains

   pure subroutine set_d( this, d )

      class(d_t), intent(inout) :: this
      real(WP), intent(in)      :: d(:)

      this%m_d = d
      this%m_norm_d = this%m_d /sum( this%m_d )

      return

   end subroutine set_d

   pure function get_norm_d(this) result( norm_d )

      !.. Argument list
      class(d_t), intent(in) :: this
      !.. Function result
      real(WP), allocatable :: norm_d(:)

      norm_d = this%m_norm_d

      return

   end function get_norm_d

end module d_m

module a_m

   use k_m, only : WP
   use b_m, only : b_t
   use d_m, only : d_t

   implicit none

   private

   type, extends(b_t), abstract, public :: a_t
      private
      type(d_t), allocatable :: m_d
   contains
      private
      procedure, pass(this), public :: norm_d => get_norm_d
      procedure, pass(this), public :: set => set_a_t
   end type a_t

contains

   pure subroutine set_a_t( this, d )

      class(a_t), intent(inout) :: this
      real(WP), intent(in)      :: d(:)

      if ( .not. allocated(this%m_d) ) then
         allocate( this%m_d )
      end if
      call this%m_d%set( d )

      return

   end subroutine set_a_t

   pure function get_norm_d(this) result( norm_d )

      !.. Argument list
      class(a_t), intent(in) :: this
      !.. Function result
      real(WP), allocatable :: norm_d(:)

      norm_d = this%m_d%norm_d()

      return

   end function get_norm_d

end module a_m

module c_m

   use a_m, only : a_t

   implicit none

   private

   type, extends(a_t), public :: c_t
      private
   end type c_t

end module c_m

module f_m

   use k_m, only : WP
   use a_m, only : a_t
   use b_m, only : b_t
   use c_m, only : c_t

   type, extends(b_t), public :: f_t
      private
      class(a_t), allocatable :: m_a(:)
   contains
      private
      procedure, pass(this) :: load_f_t
      procedure, pass(this), public :: norm_d => get_norm_d
      generic, public :: load => load_f_t
   end type f_t

contains

   pure subroutine load_f_t( this, d )

      class(f_t), intent(inout) :: this
      real(WP), intent(in)      :: d(:)

      if ( .not. allocated(this%m_a) ) then
         allocate( c_t :: this%m_a(1) )
      end if

      call this%m_a(1)%set( d )

      return

   end subroutine load_f_t

   pure function get_norm_d(this) result( norm_d )

      !.. Argument list
      class(f_t), intent(in) :: this
      !.. Function result
      real(WP), allocatable :: norm_d(:)

      norm_d = this%m_a(1)%norm_d()

      return

   end function get_norm_d

end module f_m

module xf_m

   use k_m, only : WP
   use b_m, only : b_t
   use f_m, only : f_t

   type, extends(b_t), public :: xf_t
      private
      logical :: m_calc = .false.
   contains
      private
      procedure, pass(this) :: calc_xf_t
      generic, public :: calc => calc_xf_t
   end type xf_t

contains

   pure subroutine calc_xf_t( this, f, d )

      class(xf_t), intent(inout) :: this
      type(f_t), intent(inout)   :: f
      real(WP), intent(in)       :: d(:)

      call f%load( d )

      this%m_calc = .true.

      return

   end subroutine calc_xf_t

end module xf_m

module ef_m

   use k_m, only : WP
   use b_m, only : b_t
   use f_m, only : f_t
   use xf_m, only : xf_t

   type, extends(b_t), public :: ef_t
      private
      logical :: m_calc = .false.
      type(xf_t), allocatable :: m_xf
      type(f_t), allocatable :: m_f
   contains
      private
      procedure, pass(this) :: calc_ef_t
      generic, public :: calc => calc_ef_t
   end type ef_t

contains

   pure subroutine calc_ef_t( this, d )

      class(ef_t), intent(inout) :: this
      real(WP), intent(in)       :: d(:)

      if ( .not. allocated(this%m_f) ) then
         allocate( this%m_f )
      end if
      call this%m_f%load( d )

      if ( .not. allocated(this%m_xf) ) then
         allocate( this%m_xf )
      end if
      call this%m_xf%calc( this%m_f, d=this%m_f%norm_d() )

      this%m_calc = .true.

      return

   end subroutine calc_ef_t

end module ef_m

module ec_m

   use k_m, only : WP
   use b_m, only : b_t
   use ef_m, only : ef_t

   type, extends(b_t), public :: ec_t
      private
      logical :: m_calc = .false.
      type(ef_t), allocatable :: m_ef
   contains
      private
      procedure, pass(this) :: calc_ec_t
      generic, public :: calc => calc_ec_t
   end type ec_t

contains

   pure subroutine calc_ec_t( this, d )

      class(ec_t), intent(inout) :: this
      real(WP), intent(in)       :: d(:)

      if ( .not. allocated(this%m_ef) ) then
         allocate( this%m_ef )
      end if
      call this%m_ef%calc( d )

      this%m_calc = .true.

      return

   end subroutine calc_ec_t

end module ec_m

program p

   use k_m, only : WP
   use ec_m, only : ec_t

   implicit none

   type(ec_t) :: foo
   real(WP) :: d(2)
   integer :: I

   do I = 1, 2

      call random_number( d )
      call foo%calc( d )

   end do

   stop

end program p

C:\..>ifort /c /standard-semantics /O2 /warn:all /stand /libs:static /threads m.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R
) 64, Version 17.0.0.109 Build 20160721
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.


C:\..>ifort /c /standard-semantics /O2 /warn:all /stand /libs:static /threads p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R
) 64, Version 17.0.0.109 Build 20160721
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.


C:\..>link /out:p.exe /subsystem:console p.obj m.obj
Microsoft (R) Incremental Linker Version 12.00.40629.0
Copyright (C) Microsoft Corporation.  All rights reserved.


C:\..>p.exe

Steven_L_Intel1 · ‎09-14-2016

Thanks - I can reproduce the error, but it's complicated. I will comment that the actual implementation of realloc_lhs changed significantly in this release to be better for optimization (much smaller code size), but so far I don't see that itself being the culprit. Looks like a stack corruption issue somewhere. Still working on it.

FortranFan · ‎09-14-2016

Steve Lionel (Intel) wrote:

Thanks - I can reproduce the error, but it's complicated. I will comment that the actual implementation of realloc_lhs changed significantly in this release to be better for optimization (much smaller code size), but so far I don't see that itself being the culprit. Looks like a stack corruption issue somewhere. Still working on it.

Thanks, Steve.

Re: your comments, "it's complicated" and "Looks like a stack corruption issue somewhere", are you thinking the problem(s) lie(s) with the code as shown above? And that the code will need to be fixed to work with compiler 17? Fyi, I tried the above reproducer with gfortran and I also tested one of the actual code libraries that runs into issues with compiler 17; just as with Intel Fortran compiler 16, gfortran didn't raise any issues. So I'll be curious to know if there is something wrong with the code.

Thanks much,

Steven_L_Intel1 · ‎09-14-2016

I have no evidence your source code is at fault, but this sort of error is exceedingly difficult to chase down, because any minor change causes the problem to move or hide. When I have more info I will let you know, and also if I can come up with a workaround. I did determine that it isn't optimization-related.

Steven_L_Intel1 · ‎09-15-2016

I identified a compiler bug that is causing the failure. Right now the only workaround I can offer is to write the code assuming automatic reallocation doesn't occur (and add /assume:norealloc_lhs or /nostd-realloc-lhs). In particular the assignment at line 51 is not being handled correctly.

I have escalated this as issue DPD200414403 and asked that it be given priority as I think this problem will affect many users.

FortranFan · ‎09-15-2016

Steve Lionel (Intel) wrote:

I identified a compiler bug that is causing the failure. Right now the only workaround I can offer is to write the code assuming automatic reallocation doesn't occur (and add /assume:norealloc_lhs or /nostd-realloc-lhs). In particular the assignment at line 51 is not being handled correctly.

I have escalated this as issue DPD200414403 and asked that it be given priority as I think this problem will affect many users.

Steve,

Thanks very, very much for your most prompt attention to this issue and your follow-up with Intel Fortran development team; appreciate it greatly.

Please note I didn't want to prejudice your investigation in any way, especially because you had commented previously on the possible complications and issues such as with stack corruption. But it appeared to me there is perhaps a problem in compiler 17 with plausibly an array temporary associated with the function result used as actual argument in line 252 and what might happen consequently on the stack during the instruction (I didn't do any assembler level checks though, so I might be wrong about this). Given this line of thinking, the workaround I came up with is shown below. It involves making use of a 'work area', a private component of the derived type, and employing it as the actual argument. As you would have realized, the reproducer in Message #5 is a severely emaciated body extracted from one of the actual codes; in this stripped down silly-looking example, the component m_d of derived type d_t represents 'raw data' whereas component norm_d represents some transformed/normalized data, the instruction on line 52 being just a nutty prototype of the actual transformations. Now, in the actual code libraries, the 'classes' i.e., the derived types represented by ef_t in the example above already include private 'scratch space' type of components with the same shape as the input 'data'. Hence the methods of subobjects of instances of ef_t (i.e., those equivalent to m_xf%calc on line 252 in the above example), which need to 'operate' on the data, NOW have these scratch vectors as actual arguments in our 'worked around' code as opposed to function invocations. The tests all seem to run well with such a workaround, so we'll forge ahead with it!

module ef_m

   use k_m, only : WP
   use b_m, only : b_t
   use f_m, only : f_t
   use xf_m, only : xf_t

   type, extends(b_t), public :: ef_t
      private
      logical :: m_calc = .false.
      type(xf_t), allocatable :: m_xf
      type(f_t), allocatable :: m_f
      real(WP), allocatable :: m_norm_d(:)   !** add a 'work area' component here; it's already present in actual code
   contains
      private
      procedure, pass(this) :: calc_ef_t
      generic, public :: calc => calc_ef_t
   end type ef_t

contains

   pure subroutine calc_ef_t( this, d )

      class(ef_t), intent(inout) :: this
      real(WP), intent(in)       :: d(:)

      if ( .not. allocated(this%m_f) ) then
         allocate( this%m_f )
      end if
      call this%m_f%load( d )
      this%m_norm_d = this%m_f%norm_d()      !** load 'transformed' data into 'work area'

      if ( .not. allocated(this%m_xf) ) then
         allocate( this%m_xf )
      end if
      call this%m_xf%calc( this%m_f, d=this%m_norm_d ) !** use 'work area' as actual argument to subobject

      this%m_calc = .true.

      return

   end subroutine calc_ef_t

end module ef_m

A question for you is whether the above workaround provides you and/or the development team with any further hints on the underlying issue(s) with compiler 17 and the steps toward a proper resolution.

By the way, please note the actual codes are totally reliant on the standard feature of automatic reallocation, hence options such as /assume:norealloc_lhs are a no-go for us.

Thanks,

jimdempseyatthecove · ‎09-15-2016

FortranFan,

Great job on the work around.

Why not place the work area array m_norm_d inside the calc_ef_t subroutine? I suppose it may reduce the number of allocations to have additional copies of the array (assuming they are not overly large and not in abundance).

Jim Dempsey

FortranFan · ‎09-15-2016

jimdempseyatthecove wrote:

FortranFan,

Great job on the work around.

Why not place the work area array m_norm_d inside the calc_ef_t subroutine? I suppose it may reduce the number of allocations to have additional copies of the array (assuming they are not overly large and not in abundance).

Jim Dempsey

Jim,

Good point, a local array should also work and in some situations, it may be the better option. In the case of actual code libraries, as I mention above, the work areas of appropriate shape already existed and therefore the workaround was simply to make use of them. By the way, the arrays in most of the use cases of these codes are not large, though they do vary quite a bit from simulation to simulation anywhere from 2**8 to 2**16 elements.

Steven_L_Intel1 · ‎09-15-2016

Do you have evidence that there is a problem with the function value as argument? That would be entirely unrelated to the reallocation issue.

The problem I found is that on a second call to the routine with the assignment, the information passed to the "does this need reallocation?" check contains uninitialized fields, leading to unpredictable behavior. Sometimes it seems to work, sometimes it corrupts heap storage. (I didn't see evidence that it actually corrupts the stack.)

I do want to put in a plug for Intel Inspector XE's "memory inspection" which was a big help in locating the problem for me, but then I had to go digging into assembly to figure out exactly what went wrong.

FortranFan · ‎09-15-2016

Steve Lionel (Intel) wrote:

Do you have evidence that there is a problem with the function value as argument? That would be entirely unrelated to the reallocation issue.

The problem I found is that on a second call to the routine with the assignment, the information passed to the "does this need reallocation?" check contains uninitialized fields, leading to unpredictable behavior. Sometimes it seems to work, sometimes it corrupts heap storage. (I didn't see evidence that it actually corrupts the stack.)

I do want to put in a plug for Intel Inspector XE's "memory inspection" which was a big help in locating the problem for me, but then I had to go digging into assembly to figure out exactly what went wrong.

"Do you have evidence that there is a problem with the function value as argument?" Well, as you can see, it's anecdotal. Outside of the reproducer code in Message #5, I've not been able to create an issue with automatic reallocation; but this very same reproducer seems to work if the function in the actual argument in line 252 is replaced with an array, as shown with the workaround in Message #10.

Re: stack corruption, I was only going by what you said in Message #6.

Thanks,

jimdempseyatthecove · ‎09-15-2016

>> a local array should also work and in some situations

If you know for a fact that threaded and/or recursive use will not be used, the work area can be attributed as SAVE and you could explicitly use array slices for size require, enlarging the array only when necessary. This would reduce the number of allocate/deallocate (assuming that is a performance issue). I noticed in V17 that some of the OpenMP TASK is using the TBB scalable allocator. I haven't looked closer to see if the ALLOCATE and DEALLOCAT statements are now using the TBB scalable allocator. If so, then, depending on the sizes of the allocations the time "saved" on reducing the allocate/deallocate might not be so great.

Jim Dempsey

Steven_L_Intel1 · ‎09-15-2016

I was just speculating regarding stack corruption until I found the real culprit.

Steven_L_Intel1 · ‎11-01-2016

The bug reported here will be fixed in Update 2.