Nathan_Weeks · ‎02-08-2015

For ifort 15.0.1.133, it appears that when -qopt-matmul is specified, the multi-threaded MKL is used regardless of whether -mkl=sequential is also specified.

Consider the following MATMUL benchmark code, loosely based on the matvec driver from the Intel Composer XE Fortran vec_samples:

program matmul_test
   use iso_fortran_env, only: int64, real64
   implicit none
  
   integer, parameter :: N=1024
   integer, parameter :: REPEAT_COUNT = 50

   integer :: i
   real(kind=real64), dimension(N,N) :: A, B, C
   real(kind=real64)   :: cputime1, cputime2
   integer(kind=int64) :: count_rate, walltime1, walltime2

   call RANDOM_NUMBER(A)
   call RANDOM_NUMBER(B)

   call cpu_time(cputime1)
   call system_clock(walltime1, count_rate)
       
   do i=1,REPEAT_COUNT
      C = MATMUL(A, B)
      B(1,1) = B(1,1) + 0.000001
   enddo

   call cpu_time(cputime2)
   call system_clock(walltime2, count_rate)
   write (*,'(A,1X,F8.3)') 'wall time:', (walltime2-walltime1)/REAL(count_rate, KIND=real64)
   write (*,'(A,1X,F8.3)') 'cpu time:', cputime2-cputime1
   write (*,*) 'SUM(c):', SUM(C)
 
end program matmul_test

If -qopt-matmul is specified, the multi-threaded MKL is used by default:

$ ifort -qopt-matmul matmul_test.f90
$ ./a.out
wall time:    0.551
cpu time:    8.399
 SUM(c):   268554303.414090

If -mkl=sequential is also specified, the multi-threaded MKL is still used:

$ ifort -qopt-matmul -mkl=sequential matmul_test.f90
$ OMP_DISPLAY_ENV=1 ./a.out

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201307'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


wall time:    0.545
cpu time:    8.375
 SUM(c):   268554303.414090

I realize that one could set the environment variable OMP_NUM_THREADS to 1 to effectively get sequential execution, but it would be preferable for -mkl=sequential to be honored in this case or, if the current behavior is intentional, for the man page to be updated to clarify the interaction between -qopt-matmul and -mkl=sequential.

jimdempseyatthecove · ‎02-08-2015

Try adding /O2 or /O3

The reference manual under -qopt-matmul states: This option has no effect unless option O2 or higher is set.

In your second test, listing the OpenMP environment variables does not indicate whether the application is using OpenMP.

Try using

export KMP_AFFINITY=verbose,none
./a.out

Jim Dempsey

Nathan_Weeks · ‎02-08-2015

Hi Jim,

Thanks for the info regarding the KMP_AFFINITY environment variable; it appeared to confirm that the multi-threaded MKL is still being used when both -qopt-matmul and -mkl=sequential are specified.

Per the ifort 15.0.1.133 man page, -O2 is the default. Explicitly specifying -O2 or -O3 did not seem to affect whether or not -mkl=sequential is honored when -qopt-matmul is specified; e.g.:

$ ifort -O2 -qopt-matmul -mkl=sequential matmul_test.f90
$ KMP_AFFINITY=verbose,none ./a.out
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 8 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 13 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 12 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 14 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 15 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
wall time:    0.588
cpu time:    9.011
 SUM(c):   268554303.414090
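
The verbose affinity output above can be reduced to a thread count. A minimal sketch, assuming a POSIX shell and that the runtime prints one "bound to OS proc set" line per thread, as in the listing above; count_omp_threads is a hypothetical helper name, and stderr is merged into stdout in case the messages are written there:

```shell
# Hypothetical helper: count the threads reported by KMP_AFFINITY=verbose.
# Reads the program's combined output on stdin and prints the number of
# "thread N bound to OS proc set" lines (one per OpenMP thread).
count_omp_threads() {
    grep -c 'KMP_AFFINITY: pid [0-9]* thread [0-9]* bound to OS proc set'
}

# Usage (assumes ./a.out was built as shown above):
#   KMP_AFFINITY=verbose,none ./a.out 2>&1 | count_omp_threads
```

Applied to the listing above, the count is 16, one per thread 0 through 15.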

TimP · ‎02-08-2015

ifort defaults to -O2, so -qopt-matmul would be expected to take effect. ifort -help states that -qopt-matmul can only call threaded libraries; the full documentation does not discuss this point, and neither indicates that the -mkl link options have any effect on it.

Nathan has a point in that this adds further confusion to the usage of MATMUL. For single-threaded matmul, I used to prefer -O3 -qno-opt-matmul, as that can generate fully optimized in-line code with a lower size threshold for efficiency.

If this test is being run on a dual 8-core platform with threading, affinity setting would be important.

Nathan_Weeks · ‎02-08-2015

Hi Tim,

Thanks for the info. The goal of this exercise was to get the fastest sequential MATMUL possible on my hardware (Xeon E5-2650). This would be useful in, e.g., a program that utilizes coarrays, where there is likely to be one image per execution unit, and a multi-threaded MATMUL could oversubscribe the system (note that I haven't actually verified that in such a case the MKL runtime library doesn't dynamically adjust the number of threads to avoid oversubscription).

The single-thread performance obtained by specifying -qopt-matmul and OMP_NUM_THREADS=1 was about 2.2X that obtained with -O3 -qno-opt-matmul:

$ ifort -O3 -qno-opt-matmul matmul_test.f90
$ ./a.out
wall time:   11.807
cpu time:   11.778
 SUM(c):   268554303.414090
$ ifort -qopt-matmul matmul_test.f90
$ OMP_NUM_THREADS=1 ./a.out
wall time:    5.314
cpu time:    5.298
 SUM(c):   268554303.414090

In a semi-related issue, setting MKL_NUM_THREADS to 1 does not appear to cause only a single thread to be used in this case---possibly a bug?

 $ ifort -qopt-matmul matmul_test.f90
 $ MKL_NUM_THREADS=1 ./a.out
 wall time:    0.522
 cpu time:    7.918
  SUM(c):   268554303.414090

In any case, I think it would be a nice feature to have -mkl=sequential cause the sequential MKL library to be linked when -qopt-matmul is specified, so one doesn't have to remember to set yet another environment variable before executing the application to get this behavior.

Also, I think it would be helpful to update the man page to explicitly state that -qopt-matmul also causes MATMUL to utilize the MKL, as the current wording doesn't give me that impression:

The -qopt-matmul options tell the compiler to identify matrix multiplication loop nests (if any) and replace them with a matmul library call for improved performance.

TimP · ‎02-08-2015

You would set OMP_NUM_THREADS to an appropriate value to avoid over-subscription. I don't know whether coarrays inherit the MPI facility for pinning threads to contiguous groups of cores. I_MPI_DEBUG=5 includes pinning information in the diagnostics.

Your platform should have good flexibility for varying numbers of coarray images and omp threads per image.

jimdempseyatthecove · ‎02-08-2015

Nathan,

You might be able to get what you want by separating your test program into two parts: a PROGRAM unit that simply calls a subroutine containing your code, and the subroutine itself. Compile the subroutine separately with -qopt-matmul, then compile the main program with the single-threaded MKL option and supply the .o file containing the subroutine on the same command line. Hopefully, everything links together the way you want.

Caution: you will need to disable IPO when compiling the main program.

Jim Dempsey

Nathan_Weeks · ‎02-08-2015

Hi Jim & Tim,

Thanks for the suggestions---I'll keep those in mind if/when I need to implement this in a production code. I'll reiterate my wish to any Intel employee reading this that I would appreciate it if:

1. When both -qopt-matmul and -mkl=sequential are specified, the sequential MKL libraries are used, so that what seem like workarounds are not required to achieve the desired/expected behavior---or, if this is undesirable for some reason, the documentation (man page) is updated to reflect this.
2. The documentation (man page) for -qopt-matmul is clarified (I'd be happy to offer suggestions or provide feedback!).

jimdempseyatthecove · ‎02-09-2015

Nathan,

Have you experimented with swapping the order of -qopt-matmul and -mkl=sequential switches on the command line?

Or how about 1.b) adding -qopt-matmul=sequential

Then you would have a similar issue with the combination of -qopt-matmul=sequential and -mkl=parallel.

Jim Dempsey

Steven_L_Intel1 · ‎02-09-2015

Lorri asked me to post this on her behalf (for some reason she's having trouble posting herself):

Please note that -qopt-matmul enables OpenMP, but does not affect the set of MKL libraries.
To confirm what libraries are being linked against, use the command line switch -dryrun.
This displays the internal commands used to invoke the ld executable, and will list all libraries being linked against.
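
A minimal sketch of that check, assuming a POSIX shell with ifort on the PATH. The helper name list_link_libs is hypothetical, the grep pattern assumes the libraries appear as -l flags (MKL components, libmatmul, and the Intel OpenMP runtime libiomp5), and both output streams are captured because which one -dryrun writes to is an assumption here:

```shell
# Hypothetical helper: show which MKL, libmatmul, and OpenMP runtime
# libraries the ifort driver would pass to the linker for a given source.
list_link_libs() {
    if ! command -v ifort >/dev/null 2>&1; then
        echo "ifort not found on PATH" >&2
        return 1
    fi
    # -dryrun prints the driver's commands without executing them.
    ifort -dryrun -qopt-matmul -mkl=sequential "$1" 2>&1 |
        grep -E -o -e '-l(mkl[A-Za-z0-9_]*|matmul|iomp5)' |
        sort -u
}

# Usage: list_link_libs matmul_test.f90
```

Comparing the list with and without -mkl=sequential should show whether the link line changes, independent of how the program behaves at run time.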

Nathan_Weeks · ‎02-11-2015

Steve,

Thanks for forwarding that info; it helped clear things up. The -dryrun option
indicates that -qopt-matmul causes libmatmul, not the MKL per se, to be
statically linked.  Before seeing this diagnostic information, I mistakenly
thought the MATMUL library (libmatmul) was synonymous with the MKL, when it
only seems to leverage MKL components (per the assembly listing produced by
"ifort -S", and symbols in libmatmul listed by the "nm" utility). Adding
-mkl=sequential *does* cause the sequential MKL library to be dynamically
linked, but it has no apparent effect on the resulting execution.

This (at least partially) explains why MKL_NUM_THREADS doesn't seem to affect
the number of threads involved in the MATMUL when -qopt-matmul is used instead
of GEMM from the MKL proper; rather, only OMP_NUM_THREADS seems to have an
effect.
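
One way to see which variable actually takes effect is to run the same binary under each setting and compare wall time against cpu time. A small sketch, assuming a POSIX shell; compare_thread_vars is a hypothetical helper name, and the command to run is passed as arguments:

```shell
# Hypothetical helper: run a command once with OMP_NUM_THREADS=1 and once
# with MKL_NUM_THREADS=1 so the reported timings can be compared.
compare_thread_vars() {
    for var in OMP_NUM_THREADS MKL_NUM_THREADS; do
        printf '== %s=1 ==\n' "$var"
        # env sets the variable for this one run only.
        env "$var=1" "$@"
    done
}

# Usage: compare_thread_vars ./a.out
```

In the earlier timings, cpu time roughly equal to wall time corresponded to a single-threaded run, while cpu time many times larger than wall time corresponded to a multi-threaded one.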

It would be nice if the ifort man page documented that the MATMUL library is
parallelized using OpenMP, and that the OMP_NUM_THREADS environment variable
can be used to control the number of threads involved in a call to MATMUL.
This would be particularly relevant to coarray (or pure MPI) programs that
could benefit from the optimized MATMUL provided by -qopt-matmul, but could
possibly cause a node to be oversubscribed if it is used with the default
number of threads.

TimP · ‎02-11-2015

Some of the hints in the docs imply that opt-matmul is a part of auto-parallel, thus implying reference to OpenMP library, but I agree that it leaves us guessing sometimes.

Nathan_Weeks · ‎02-16-2015

Hi all,

With ifort 15.0.1, the -qopt-matmul option appears to be effectively disabled when -qopenmp is specified. Consider the following matrix multiplication code (culled & modified from a matrix multiplication benchmark by Professor Glenn Luecke at Iowa State University):

program test_matmul
   use omp_lib
   implicit none
   integer, parameter ::  N = 1024, NTRIAL = 128, NTESTS = 3
   integer, parameter ::  NFLUSH = 3*1024*1024
   double precision   ::  flush(NFLUSH) ! flush a 20 MB cache
   double precision,dimension(N,N) :: A, B, C, D, C_init
   double precision :: t0, time(0:NTRIAL), avg_time
   character(len=16) :: test_name(NTESTS) = [character(len=16) :: 'OpenMP', 'DGEMM', 'MATMUL']
   integer :: i, j, k, test, trial

   call random_number(A)
   call random_number(B)
   call random_number(C)
   call random_number(flush)
   C_init = C ! keep initial value of C for error checking
   D = C + matmul(A,B) ! to be used to compute error

   do test = 1, NTESTS
      do trial = 0, NTRIAL
         C = C_init
         flush = flush + 0.0001d0 ! flush caches before timing
         t0 = omp_get_wtime()

         select case (test)
         case (1)
         !$omp parallel do schedule(static)
            do j = 1, N
               do k = 1, N
                  do i = 1, N
                     C(i,j) = C(i,j) + A(i,k)*B(k,j)
                  enddo
               enddo
            enddo

         case (2)
            call DGEMM('N', 'N', N, N, N, 1.d0, A, N, B, N, 1.d0, C, N)

         case (3)
            C =  C + matmul(A,B)

         end select
         time(trial) = omp_get_wtime() - t0 ! seconds
      enddo ! trial loop

      avg_time = SUM(time(1:NTRIAL))/DBLE(NTRIAL)
      write(*,*) test_name(test), avg_time, 'seconds', &
                 2.d-9*DBLE(N)**3/avg_time, 'gflops'
      write(*,*) 'Error  = ', MAXVAL(ABS(C-D))
      write(*,*) '        ', C(2,3)*flush(7)
      write(*,*) ' '
   end do
end program test_matmul

Using -qopenmp-stubs in addition to -qopt-matmul, the run time suggests---and the assembly output, which contains calls to matmul_mkl_f64_, appears to confirm---that the first loop is recognized as a matrix multiplication and uses the MATMUL library. In addition, the call to MATMUL also uses the MATMUL library. Using -mkl=sequential allows us to compare this with the sequential performance of DGEMM:

$ ifort -qopenmp-stubs -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ ./a.out
 OpenMP            1.026251353323460E-002 seconds   209.255134333854      gflops
 Error  =   1.136868377216160E-013
            201.752316821375

 DGEMM             0.105580719187856      seconds   20.3397330925457      gflops
 Error  =   1.136868377216160E-013
            205.064157088606

 MATMUL            1.218027807772160E-002 seconds   176.308261133042      gflops
 Error  =   0.000000000000000E+000
            208.375997355837

Setting OMP_NUM_THREADS=1 apparently scales back the matmul library to using a single thread, and all three variants appear to have comparable performance:

$ ifort -qopenmp-stubs -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ OMP_NUM_THREADS=1 ./a.out
 OpenMP            0.105887426063418      seconds   20.2808182976685      gflops
 Error  =   1.136868377216160E-013
            201.752316821375

 DGEMM             0.104962434619665      seconds   20.4595449389248      gflops
 Error  =   1.136868377216160E-013
            205.064157088606

 MATMUL            0.108032410964370      seconds   19.8781423910669      gflops
 Error  =   0.000000000000000E+000
            208.375997355837

However, when compiled with -qopenmp, the -qopt-matmul option appears to be effectively disabled: while the -dryrun option indicates that -lmatmul is still linked, the assembly listing (via -S) appears to imply that no code is generated that calls matmul_mkl_f64_. The result is that the MATMUL performance is halved (note that OMP_STACKSIZE has to be explicitly set on this machine to get the program to run):

$ ifort -qopenmp -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ OMP_NUM_THREADS=16 OMP_STACKSIZE=128M ./a.out
 OpenMP            1.436026766896248E-002 seconds   149.543427567263      gflops
 Error  =   2.842170943040401E-013
            201.752316821375

 DGEMM             0.105524888262153      seconds   20.3504944034157      gflops
 Error  =   1.136868377216160E-012
            205.064157088606

 MATMUL            0.220026727765799      seconds   9.76010355562725      gflops
 Error  =   0.000000000000000E+000
            208.375997355837

It seems counterintuitive that -qopt-matmul wouldn't generate code that calls the MATMUL library for the call to MATMUL in the sequential region.
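
The assembly check described above can be scripted. A minimal sketch, assuming a POSIX shell; count_matmul_calls is a hypothetical helper name, the symbol matmul_mkl_f64_ is the one reported in this thread, and hw3.s is assumed to be the listing that -S writes for hw3.f90:

```shell
# Hypothetical helper: count the lines in an assembly listing that
# reference the libmatmul entry point named in this thread.
count_matmul_calls() {
    grep -c 'matmul_mkl_f64_' "$1"
}

# Usage (flags as in the builds above):
#   ifort -S -qopenmp -qopt-matmul -Ofast -xHost hw3.f90
#   count_matmul_calls hw3.s
```

A count of 0 for the -qopenmp build and a nonzero count for the -qopenmp-stubs build would match the observation above.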

-qopt-matmul with -mkl=sequential