Solved: Why are these automatic arrays problematic although stack limit is unlimited?

efnacy · ‎04-24-2018

I have an issue understanding automatic arrays, especially where in between stack or heap it is stored.

I have the following main program.

program myprog
use mymod
implicit none
integer, parameter :: N = 477, M = 4
complex*16, parameter :: imnum = dcmplx(1.0d0,1.0d0)
complex*16 :: A(N,N,M), B(N,N,M-1), C(N,N,2:M), X(M*N), Y(M*N)
integer :: i, j, k

do k = 1,M
   do i = 1,N
      do j = 1,N
         A(i,j,k) = i*j/dble(N)
         if (k < M) then
            B(i,j,k) = 1.0d0 + imnum/(i+j)
            C(i,j,k+1) = 1.0d0 - imnum/(i+j)
         end if
      end do
   end do
end do

do i = 1,M*N
   X(i) = dble(i)
end do

call mysub('G', N, M, A, B, C, X, Y)

end program

Subroutine mysub is located inside module mymod.f90.

! =============================================================                                                                                                                                                                                                                                                                                                             
subroutine mysub(ulh, N, M, diag, udiag, ldiag, vinp, vout)
implicit none
integer :: i, il, ic, ir
character, intent(in) :: ulh*1
integer, intent(in) :: N, M
complex*16, intent(in) :: diag(:,:,:), udiag(:,:,:), vinp(:) 
complex*16, intent(in), optional :: ldiag(:,:,2:)
complex*16, intent(out) :: vout(M*N)
complex*16, allocatable :: ldiag1(:,:,:)        
complex*16 :: v_l(size(diag,1)), v_c(size(diag,1)), v_r(size(diag,1))


allocate(ldiag1(N,N,2:M))

if (ulh == "G") then
   if (present(ldiag)) then
      ldiag1 = ldiag
   else
      stop "In subroutine tribl_matmul, argument ldiag is required when argument ulh = 'G'."
   end if
end if


!allocate(v_l(N), v_c(N), v_r(N))
!call omp_set_num_threads(2)    
!!$OMP PARALLEL DO PRIVATE(ic, il, ir, v_l, v_r, v_c)
do i = 1,M                                    
   ic = (i-1)*N+1
   il = (i-1-1)*N+1
   ir = (i+1-1)*N+1

   if (i == 1) then
      v_l = dcmplx(0.0d0,0.0d0)
      call zgemv('n', N, N, 1.0d0, udiag(:,:,i), N, vinp(ir:ir+N-1), 1, 0.0d0, v_r, 1)
   else if (i == M) then
      call zgemv('n', N, N, 1.0d0, ldiag1(:,:,i), N, vinp(il:il+N-1), 1, 0.0d0, v_l, 1)
      v_r = dcmplx(0.0d0,0.0d0)
   else
      call zgemv('n', N, N, 1.0d0, ldiag1(:,:,i), N, vinp(il:il+N-1), 1, 0.0d0, v_l, 1)
      call zgemv('n', N, N, 1.0d0, udiag(:,:,i), N, vinp(ir:ir+N-1), 1, 0.0d0, v_r, 1)
   end if
                                                                                                                                                                                                                                                                                             
   call zgemv('n', N, N, 1.0d0, diag(:,:,i), N, vinp(ic:ic+N-1), 1, 0.0d0, v_c, 1)    
   write(*,"(a25,i3,3(f20.3,2x))") "problem, v_l, v_c, v_r", i, abs(sum(v_l)), abs(sum(v_c)), abs(sum(v_r)) 
   vout(ic:ic+N-1) = v_l + v_c + v_r
end do
!!$OMP END PARALLEL DO
!deallocate(v_l, v_c, v_r) 
deallocate(ldiag1)

end subroutine

I compiled and ran it using these commands

#!/bin/sh                                                                                                                                                                                                  
ulimit -s unlimited
longflag="-L/opt/intel/composerxe-2011/mkl/10.2.3.029/lib/em64t -lmkl_solver_ilp64_sequential -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"
ifort -openmp -check arg_temp_created -i8 -c ./mymod.f90 ${longflag}
ifort -openmp -check arg_temp_created -i8 ./myprog.f90 mymod.o ${longflag}
./a.out

The calculation was erroneous as it prints out

   problem, v_l, v_c, v_r  1               0.000        8673538245.000                   NaN
   problem, v_l, v_c, v_r  2        54265785.750       23341568032.702                   NaN
   problem, v_l, v_c, v_r  3       171280368.928       33841597383.353                   NaN
   problem, v_l, v_c, v_r  4       270608107.622       36760173582.811                 0.000

Then if I add -heap-arrays flag in the compilation of mymod.f90, the problem is remedied and those NaN values become just valid numbers. This means without heap-arrays all automatic arrays (no temporary arrays were created) are most likely located in stack. But I have removed the stack limit so I expected the execution should be fine without heap-arrays flag. So, where are automatic arrays located actually?

As for which arrays are automatic inside mysub, they must be v_l, v_c, and v_r. This is because when I declared them as allocatable and remove the comment marks in the allocatable and deallocatable statements of these arrays, without heap-arrays flag, the calculation didn't yield NaN. The reason I prefer these three arrays to be automatic rather than allocatable is that when I went with the latter and run the program with those openmp directives activated, the program did not print anything, not even error or warning flags, which I interpreted as an error anyway.

mecej4 · ‎04-26-2018

It turns out that there is a rather elusive bug in your calls to ZGEMV, and it took me a while to realize its presence. Among the arguments are the scalar multipliers alpha and beta, which must be complex*16. You are only passing 1.0d0 and 0.0d0, which are of the incorrect type. The BLAS routine then picks up a junk value from the next eight bytes for the absent imaginary part.

Fix this by creating scalar multipliers, say, Z_ONE = cmplx(1d0,0d0,kind=kind(0d0)) and Z_Zero = cmplx(0d0,0d0,kind=kind(0d0)), and use these in the argument list.

The compiler would have caught this error if you had placed include 'mkl_blas.fi' after the implicit none line in MYSUB.

View solution in original post

mecej4 · ‎04-25-2018

I think that the bug has more to do with code generation/optimization than stack, heap, etc. Here is a greatly simplified reproducer:

module mymod
contains
subroutine mysub(N, A, vinp, vout)
implicit none
integer, parameter :: dp = kind(0.0)
integer, intent(in) :: N
complex, intent(in) :: A(N,N), vinp(N) 
complex, intent(out) :: vout(N)
complex :: v_c(N)

call cgemv('n', N, N, 1.0, A, N, vinp, 1, 0.0, v_c, 1)    
write(*,10) v_c
10 format(2(3x,'(',ES10.3,',',ES10.3,')',:))
vout(1:N) = v_c
end subroutine
end module

program myprog
use mymod
implicit none
integer, parameter :: N = 2
complex :: A(N,N), X(N), Y(N)
integer :: i, j

   do i = 1,N
      do j = 1,N
         A(i,j) = i*j/real(N)
      end do
   end do

do i = 1,N
   X(i) = i*4.0
end do

call mysub(N, A, X, Y)

end program

When I compile and run the program with the 18.0.2 Windows compiler with /O2 /Qmkl, I find that each run produces a different value for the imaginary part of v_c(2).

S:\LANG>efns
   ( 1.000E+01, 2.609E-38)   ( 2.000E+01, 7.977E+17)

S:\LANG>efns
   ( 1.000E+01, 2.609E-38)   ( 2.000E+01, 1.030E+15)

S:\LANG>efns
   ( 1.000E+01, 2.609E-38)   ( 2.000E+01,-1.626E-18)

S:\LANG>efns
   ( 1.000E+01, 2.609E-38)   ( 2.000E+01, 7.715E-07)

Please try this code with your version of the compiler. I find that with the 17.0.4 version and the same compiler options the bug does not occur.

P.S. The real bug, discovered a day later, is the use of real constants 1.0 and 0.0 instead of complex constants with those values, on Line 11.

efnacy · ‎04-25-2018

Hi, thanks for replying.

My compiler produces

   ( 1.000E+01, 0.000E+00)   ( 2.000E+01, 0.000E+00)
   ( 1.000E+01, 0.000E+00)   ( 2.000E+01, 0.000E+00)
   ( 1.000E+01, 0.000E+00)   ( 2.000E+01, 0.000E+00)
   ( 1.000E+01, 0.000E+00)   ( 2.000E+01, 0.000E+00)

It was compiled with ifort -O2 -mkl test_autoarr.f90. The result of ifort -V is

Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.2.137 Build 20110112
Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.

jimdempseyatthecove · ‎04-26-2018

>>Then if I add -heap-arrays flag in the compilation of mymod.f90, the problem is remedied and those NaN values become just valid numbers.

Heap arrays have an implicit alignment of that of the allocation granularity, stack/automatic arrays have implicit natural boundary alignment of the cell type (real*8 in this case). Depending on the O/S the heap granularity may be 16 bytes. This may have prevented a complex number from spanning a cache line.

Try using aligned automatic arrays:

!dir$ attributes align: 16:: your, complex, arrays, here, ...

or better yet, use cache line aligned arrays:

!dir$ attributes align: 64:: your, complex, arrays, here, ...

Jim Dempsey

mecej4 · ‎04-26-2018

It turns out that there is a rather elusive bug in your calls to ZGEMV, and it took me a while to realize its presence. Among the arguments are the scalar multipliers alpha and beta, which must be complex*16. You are only passing 1.0d0 and 0.0d0, which are of the incorrect type. The BLAS routine then picks up a junk value from the next eight bytes for the absent imaginary part.

Fix this by creating scalar multipliers, say, Z_ONE = cmplx(1d0,0d0,kind=kind(0d0)) and Z_Zero = cmplx(0d0,0d0,kind=kind(0d0)), and use these in the argument list.

The compiler would have caught this error if you had placed include 'mkl_blas.fi' after the implicit none line in MYSUB.

efnacy · ‎04-26-2018

Hi mecej4,

yes I know that alpha and beta are complex*16 in the dummy arguments of zgemv but I passed real*8 anyway because I presumed the variable passing process behaves the same way as the usual variable assignments. For example if I declare variable X as complex*16 and then later it is assigned with value using e.g. X = 1.0d0 then the compiler by defaults will understand it as being the same as X = cmplx(1.0d0, 0.0d0). But I will try following your suggestion.

Jim,

Array alignment is a new concept to me, I have heard about this several times but never encounter a situation where I have to use it. I will try to get some information about aligning an array but I will also appreciate if you wouldn't mind telling me briefly what array alignment means.

efnacy · ‎04-26-2018

Mecej4,

I tried your suggestion and it worked! Still I am wondering though, if passing real*8 for scalars alpha and beta would prompt the compiler to pick up undefined value for the imaginary parts, then compiling with heap-arrays flag should not have solved the problem as it did in my first post because this flag has nothing to do with scalars right?

mecej4 · ‎04-26-2018

efnacy wrote:
Yes, I know that alpha and beta are complex*16 in the dummy arguments of zgemv but I passed real*8 anyway because I presumed the variable passing process behaves the same way as the usual variable assignments.

Sorry, this assumption is quite wrong. In an assignment statement V = E, both the variable V and the expression E have types that the compiler can infer from that line and any other type declarations in the subprogram containing that statement. When you invoke a subroutine or function using an implicit interface (in this case, the implicit interface was wrong), you are essentially telling the compiler, "I'm passing the correct arguments, just obey me". In effect, the compiler sees only E in your code, since V is in the source code of MKL or another BLAS library, which was compiled separately. Therefore, it has no way of detecting the mismatch in order to do the type conversion.

Fiddling with compiler options, etc., will only add to the confusion (I confess to such confusion myself!). What you see the code do will depend on the junk contents of the undefined imaginary parts. Those values may change from run to run and may cause the program to (i) crash, (ii) run and give incorrect results, or even (iii) give correct results and keep you unaware of the bug!

Please try adding the include 'mkl_blas.fi' statement as I suggested earlier.

efnacy · ‎04-26-2018

Yes you are right, upon including that include line in mysub code and passing real*8 values, the compilation produced errors saying that the type do not match. So, if in a code I have lapack routines then to be safe it is better to include include 'mkl_lapack.fi' in the calling unit?

mecej4 · ‎04-27-2018

Yes, you can include mkl_lapack.fi, but there is a better way. You may have noticed that inclusion of that file makes the compilation take significantly longer, because the compiler has to read and digest the thousands of interface declarations in the file. To avoid this delay, you can do the following:

Create a new source file, say, mkli.f, containing

!
      module mkl_lapack
      include 'mkl_lapack.fi'
      end module

Compile this file once. This will produce the file mkl_lapack.mod for the chosen compiler configuration (IA32, LP64 or ILP64). Place the module file along the include path of the compiler or in the current working directory.

In any of your Fortran source files that call Lapack routines, add use mkl_lapack after the subprogram heading and before any declarations.