OpenMP Fortran Windows 10

Emil_J_ · 07-07-2016

I have Fortran code that works fine when I compile it for a 32-bit Windows 10 computer, but it does not work when I compile it for a 64-bit Windows 10 computer. On the 64-bit computer it just stops at: !$OMP DO SCHEDULE(STATIC,chunk)

These are the switches I use:

ifort a.f90 libiomp5md.lib /heap-arrays /assume:byterecl /assume:buffered_io /Qip- /Ob0 /Qopenmp /auto-scalar /exe:a.exe

subroutine colsol(a,v,ColTop,ColDONE,maxa,nn,kkk,na,nn1,ierr)
!   **************************************************************
!   *   Cholesky  Factorisation
!   ***************************************************************    
      implicit none
 
      real*8 a(na),v(nn),b,c
      integer*4 maxa(nn1),nn,l,n,kk,ic,nd,ki,j,k,nn1,na,kh,kl,kn,i_cnt,i_cnt_old, klt,ku,kkk,ierr
      real*8 sum1, amaxak
      integer *4  ColTop(nn),ColDONE(nn)     !...Cholesky
      integer *4  i,TOPij, chunk
      integer *4  iperct,iperct1, maxai, maxaj
!-----------------------------------------------------------
      ierr=0
      iperct=0
      iperct1=0
      
      chunk=1
 
       !...prepare 'ColTop'   
       do i = 1, nn
           ColTop(i) = i - (maxa(i + 1) - maxa(i)) + 1
       end do  
 
       !...Columns Done
       do i = 1, nn
           ColDONE(i) = 0   !... mark all columns as not done '0'
       end do  
       
!---------------------------------------------------------------------------------       
       !...factorisation  (Skyline)
        a(1) = dSqrt(a(1))
        ColDONE(1) = 1   !...column 1 is done
        
!$OMP PARALLEL PRIVATE (i,j,k,maxaj,maxai,sum1,amaxak,TOPij) 
!$OMP DO SCHEDULE(STATIC,chunk)        
        do j = 2, nn                   !...loop for COL from 2 to nn
           maxaj=maxa(j) + j 
           do i = ColTop(j), j - 1     !...loop for ROW from top going down to diagonal

                !...wait until column 'i' is done
                do while(ColDONE(i) .ne. 1)
                end do
                
                sum1 = 0.0d0
                TOPij = Max(ColTop(i), ColTop(j))     !...find min column height for dot product
 
                maxai=maxa(i) + i 
                do k = TOPij, i - 1
                    sum1 = sum1 + a(maxai- k) * a(maxaj - k)
                end do
                a(maxaj - i) = (a(maxaj - i) - sum1) / a(maxai - i)
                
            end do
 
            !...do diagonal term J separately
            sum1 = 0.0d0
            do k = ColTop(j), j-1
                amaxak=a(maxaj - k)
                sum1 = sum1 + amaxak * amaxak
            end do
            
            a(maxa(j)) = dSqrt(a(maxa(j)) - sum1)
            ColDONE(j) = 1    !...column 'j' is done
                       
        end do
!$OMP END DO
!$OMP END PARALLEL       
      
      return
      end

John_Campbell · 07-07-2016

The problem could be either with the compiler options you are now using or changes to the compiler's approach to optimisation for these options.

ifort could be optimising the empty DO WHILE loop: nothing inside the loop body changes ColDONE(i), so the compiler may assume its value is fixed and test it only once.

You may be better off selecting a lower optimisation level and replacing the inner loops with dot_product or an optimised vector routine.
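As a rough sketch of the dot_product suggestion, the inner k loop could be written with array slices. The module and function names (skyline_dot_sketch, dot_loop, dot_slices) are made up for illustration and are not part of the posted routine; the arguments follow the meaning of maxai, maxaj, TOPij and i in the original loop. The slices pair up the same elements as the loop, but in the opposite order, so the result may differ in the last bits of rounding.

```fortran
module skyline_dot_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Explicit loop, same form as the inner k loop in the posted routine.
  function dot_loop(a, maxai, maxaj, topij, i) result(s)
    real(dp), intent(in) :: a(:)
    integer,  intent(in) :: maxai, maxaj, topij, i
    real(dp) :: s
    integer  :: k
    s = 0.0_dp
    do k = topij, i - 1
       s = s + a(maxai - k) * a(maxaj - k)
    end do
  end function dot_loop

  ! Same sum via the dot_product intrinsic on two array slices.
  ! k = i-1 maps to the first slice element and k = topij to the last.
  ! If topij > i-1 both slices are zero-sized and the result is 0.
  function dot_slices(a, maxai, maxaj, topij, i) result(s)
    real(dp), intent(in) :: a(:)
    integer,  intent(in) :: maxai, maxaj, topij, i
    real(dp) :: s
    s = dot_product(a(maxai - i + 1 : maxai - topij), &
                    a(maxaj - i + 1 : maxaj - topij))
  end function dot_slices

end module skyline_dot_sketch
```

In the posted loop this would amount to sum1 = dot_product(a(maxai-i+1:maxai-TOPij), a(maxaj-i+1:maxaj-TOPij)). Whether it is actually faster depends on the compiler and the column heights, so it is worth timing both forms.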

This is an interesting approach to COLSOL / omp. Why use Cholesky, given that "a(maxa(j)) = dSqrt(a(maxa(j)) - sum1)" requires a positive definite matrix, while other COLSOL approaches do not? I would be interested to know the history of this routine, as it has a backwards storage order for A.

John

John_Campbell · 07-08-2016

I adapted your approach to a COLSOL - Crout solver and found a problem with your DO WHILE loop being optimised away. I did get it to work with a more complex wait loop and included a timer call for the first wait cycle:

      DO Jeq = JB,JT
!
!       Wait until this column is complete
         iw = 0        
         DO
           if ( NA_done(JEQ) ) exit
           call small_delay (iw)
           iw = iw + 1
         END DO
...
  subroutine small_delay (iw)
      integer*4 :: iw
      integer*8 :: tick
      integer*8 QueryPerformance_tick
      external  QueryPerformance_tick
!
      if ( iw==0 ) tick = QueryPerformance_tick ()
  end subroutine small_delay
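
Another way to keep the wait loop from being optimised away, without a timer call, might be to read and write the "done" flag with OpenMP atomic constructs and add flushes around them. The following is only a minimal sketch, assuming a compiler with OpenMP 3.1 atomic read/write support; the module and function names (spin_wait_sketch, chained_sum) and the toy recurrence are invented for illustration. Each iteration j waits for iteration j-1, which mimics the ColDONE scheme. SCHEDULE(STATIC,1) matters here: every thread runs its iterations in increasing order, so the iteration being waited on is either on another thread or already finished on this one.

```fortran
module spin_wait_sketch
  implicit none
contains

  ! Hypothetical demo: x(j) depends on x(j-1), so iteration j spins
  ! until iteration j-1 has raised its flag. Assumes n >= 1.
  function chained_sum(n) result(total)
    integer, intent(in) :: n
    integer :: total
    integer :: x(n), done(n)
    integer :: j, flag

    done = 0
    x(1) = 1
    done(1) = 1

    !$omp parallel do schedule(static,1) private(flag)
    do j = 2, n
       ! The atomic read forces a fresh load of done(j-1) on every pass,
       ! so the compiler cannot hoist the test out of the loop.
       do
          !$omp atomic read
          flag = done(j-1)
          if (flag == 1) exit
       end do
       !$omp flush          ! make x(j-1) visible before using it

       x(j) = x(j-1) + j

       !$omp flush          ! publish x(j) before raising the flag
       !$omp atomic write
       done(j) = 1
    end do
    !$omp end parallel do

    total = x(n)            ! equals n*(n+1)/2 for this toy recurrence
  end function chained_sum

end module spin_wait_sketch
```

Compiled without /Qopenmp the directives are plain comments and the loop runs serially, which gives a simple way to check results against the parallel build.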

Your OMP solver approach works well for small problems but becomes constrained by a cache/memory bottleneck for larger problems.