Subsetting question regarding DGER

Mazur__Luke · 01-30-2018

I am calling dger on a subset of a two-dimensional matrix, and I call it in two different ways. One way has an extra set of brackets (parentheses) around the subscript bounds in the section of the matrix A, and it runs about 4 times slower than the way without the brackets.

Whether I compile this with

ifort -heap-arrays -O3 -mkl=sequential blas9F.f90 -o blas9test

or

ifort -heap-arrays -O3 blas9F.f90 -o blas9test -lblas

the second version runs much slower than the first, which doesn't make sense to me. When I compile with gfortran and -lblas, the two versions take the same time to run, as expected. That time is much faster than the slow version built with the Intel compiler, and slightly slower than the fast version built with the Intel compiler. What is the cause of this? Note that if NBR = 1, the two versions take the same amount of time to run. My theory is that a temporary copy is being made: the brackets somehow lead the compiler to conclude that a copy of the array section is necessary, even though with NBR = 0 the section spans all rows (1:NROW) and dger only ever uses its leading submatrix.

This is my first time posting here, so let me know if I have broken the rules in some way.

program blas9F

      USE ISO_C_BINDING
      implicit none

      REAL(C_DOUBLE), DIMENSION(:,:), ALLOCATABLE :: MAT1
      REAL(C_DOUBLE), DIMENSION(:,:), ALLOCATABLE :: MAT2
      INTEGER(C_LONG) :: NROW
      INTEGER(C_LONG) :: NCOL
      INTEGER(C_LONG) :: FIRSTCOLUMNOFSQUARE
      INTEGER(C_LONG) :: NBR
      INTEGER(C_LONG) :: NUMBEROFROWSAWAYFROMBOTTOM
      INTEGER(C_LONG) :: CURRENTROW
      INTEGER(C_LONG) :: CURRENTCOLUMN
      INTEGER(C_LONG) :: I
      REAL :: STARTTIME, FINISHTIME
      REAL :: TIMETAKENVERSION3 = 0
      REAL :: TIMETAKENVERSION4 = 0
      REAL(C_DOUBLE) :: BLASALPHA = 2.0
      REAL(C_DOUBLE) :: BLASBETA = 2.0
      external dger
      NROW = 3001
      NCOL = 4001
      NBR = 0
      FIRSTCOLUMNOFSQUARE = NCOL - NROW + 1
      ALLOCATE(MAT1(NROW,NCOL))
      ALLOCATE(MAT2(NROW,NCOL))
      MAT1 = 1
      DO I = 1,NROW
         MAT1(I,FIRSTCOLUMNOFSQUARE+I-1) = 10*NCOL !diagonally dominant matrix
      END DO
      MAT2 = 1
      DO I = 1,NROW
         MAT2(I,FIRSTCOLUMNOFSQUARE+I-1) = 10*NCOL !diagonally dominant matrix
      END DO

      ! Version 3: bounds of the matrix section written without brackets
      DO NUMBEROFROWSAWAYFROMBOTTOM=0,NROW-1
         CURRENTROW = NROW - NUMBEROFROWSAWAYFROMBOTTOM
         CURRENTCOLUMN = NCOL - NUMBEROFROWSAWAYFROMBOTTOM
         IF(CURRENTROW .NE. NBR) THEN
            call cpu_time(STARTTIME)
            call dger(CURRENTROW-1-NBR, CURRENTROW-1-NBR, BLASALPHA, &
               MAT1(1+NBR:CURRENTROW-1,CURRENTCOLUMN), 1, &
               MAT1(CURRENTROW,FIRSTCOLUMNOFSQUARE+NBR:CURRENTCOLUMN-1), 1, &
               MAT1(1+NBR:NROW,FIRSTCOLUMNOFSQUARE:CURRENTCOLUMN-1), &
               NROW - NBR)
            call cpu_time(FINISHTIME)
            TIMETAKENVERSION3 = TIMETAKENVERSION3 + FINISHTIME - STARTTIME
         END IF
      END DO

      ! Version 4: same call, but with brackets around the bounds of the matrix section
      DO NUMBEROFROWSAWAYFROMBOTTOM=0,NROW-1
         CURRENTROW = NROW - NUMBEROFROWSAWAYFROMBOTTOM
         CURRENTCOLUMN = NCOL - NUMBEROFROWSAWAYFROMBOTTOM
         IF(CURRENTROW .NE. NBR) THEN
            call cpu_time(STARTTIME)
            call dger(CURRENTROW-1-NBR, CURRENTROW-1-NBR, BLASALPHA, &
               MAT2(1+NBR:CURRENTROW-1,CURRENTCOLUMN), 1, &
               MAT2(CURRENTROW,FIRSTCOLUMNOFSQUARE+NBR:CURRENTCOLUMN-1), 1, &
               MAT2((1+NBR):NROW,(FIRSTCOLUMNOFSQUARE+NBR):(CURRENTCOLUMN-1)), &
               NROW - NBR)
            call cpu_time(FINISHTIME)
            TIMETAKENVERSION4 = TIMETAKENVERSION4 + FINISHTIME - STARTTIME
         END IF
      END DO

      PRINT *, "TIMETAKENVERSION3"
      PRINT *, TIMETAKENVERSION3
      PRINT *, "TIMETAKENVERSION4"
      PRINT *, TIMETAKENVERSION4
      end
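
A minimal sketch of the temporary-copy theory, not taken from the program above. It uses a hand-written stand-in for dger (the hypothetical rank1_update, with the same A(LDA,*) style of dummy argument that reference BLAS declares) and small made-up sizes, so it runs without a BLAS library. It asks the Fortran 2008 intrinsic is_contiguous whether a section is contiguous, then compares passing a non-contiguous section (which a compiler typically has to pack into a temporary) against passing the first element of the section together with the full leading dimension.

```fortran
program section_contiguity_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Small stand-ins for the 3001 x 4001 matrices (assumed sizes, illustration only)
   integer, parameter :: nrow = 6, ncol = 8
   real(dp), allocatable :: a(:,:), b(:,:)
   real(dp) :: x(nrow), y(ncol)
   integer :: nbr, firstcol, m, n, i

   allocate(a(nrow,ncol), b(nrow,ncol))
   a = 1.0_dp
   b = 1.0_dp
   x = [(real(i, dp), i = 1, nrow)]
   y = [(real(10*i, dp), i = 1, ncol)]
   firstcol = ncol - nrow + 1

   ! A section made of whole columns occupies one unbroken block of memory;
   ! dropping the first row leaves a gap at the top of every column.
   print *, 'rows 1:nrow contiguous? ', is_contiguous(a(1:nrow, firstcol:ncol-1))
   print *, 'rows 2:nrow contiguous? ', is_contiguous(a(2:nrow, firstcol:ncol-1))

   nbr = 1
   m = nrow - 1 - nbr
   n = nrow - 1 - nbr

   ! (1) Pass the section itself. It is not contiguous, so it is packed into a
   !     temporary whose leading dimension is nrow - nbr, and copied back afterwards.
   call rank1_update(m, n, 2.0_dp, x, y, &
                     a(1+nbr:nrow, firstcol+nbr:ncol-1), nrow - nbr)

   ! (2) Pass the first element of the section and the allocated leading
   !     dimension nrow. The routine then works in place, with no copy needed.
   call rank1_update(m, n, 2.0_dp, x, y, b(1+nbr, firstcol+nbr), nrow)

   print *, 'max difference between the two results: ', maxval(abs(a - b))

contains

   ! Hypothetical stand-in for dger: A := alpha*x*y**T + A on the leading m x n block
   subroutine rank1_update(m, n, alpha, x, y, a, lda)
      integer, intent(in) :: m, n, lda
      real(dp), intent(in) :: alpha, x(*), y(*)
      real(dp), intent(inout) :: a(lda, *)
      integer :: i, j
      do j = 1, n
         do i = 1, m
            a(i, j) = a(i, j) + alpha * x(i) * y(j)
         end do
      end do
   end subroutine rank1_update

end program section_contiguity_sketch
```

Note that the leading dimension differs between the two calls: a packed temporary has NROW - NBR rows, whereas the in-place form must be given the allocated row count NROW. To see whether a compiler really creates a temporary for a given call, ifort has a run-time check, -check arg_temp_created, and gfortran has a compile-time warning, -Warray-temporaries.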