For ifort 15.0.1.133, it appears that when -qopt-matmul is specified, the multi-threaded MKL is used regardless of whether -mkl=sequential is also specified.
Consider the following MATMUL benchmark code, loosely based on the matvec driver from the Intel Composer XE Fortran vec_samples:
program matmul_test
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: N=1024
  integer, parameter :: REPEAT_COUNT = 50
  integer :: i
  real(kind=real64), dimension(N,N) :: A, B, C
  real(kind=real64) :: cputime1, cputime2
  integer(kind=int64) :: count_rate, walltime1, walltime2

  call RANDOM_NUMBER(A)
  call RANDOM_NUMBER(B)
  call cpu_time(cputime1)
  call system_clock(walltime1, count_rate)
  do i=1,REPEAT_COUNT
    C = MATMUL(A, B)
    B(1,1) = B(1,1) + 0.000001
  enddo
  call cpu_time(cputime2)
  call system_clock(walltime2, count_rate)
  write (*,'(A,X,F8.3)') 'wall time:', (walltime2-walltime1)/REAL(count_rate, KIND=real64)
  write (*,'(A,X,F8.3)') 'cpu time:', cputime2-cputime1
  write (*,*) 'SUM(c):', SUM(C)
end program matmul_test
If -qopt-matmul is specified, the multi-threaded MKL is used by default:
$ ifort -qopt-matmul matmul_test.f90
$ ./a.out
wall time: 0.551
cpu time: 8.399
SUM(c): 268554303.414090
If -mkl=sequential is also specified, the multi-threaded MKL is still used:
$ ifort -qopt-matmul -mkl=sequential matmul_test.f90
$ OMP_DISPLAY_ENV=1 ./a.out
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201307'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DISPLAY_ENV='TRUE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='2147483647'
[host] OMP_NESTED='FALSE'
[host] OMP_NUM_THREADS: value is not defined
[host] OMP_PLACES: value is not defined
[host] OMP_PROC_BIND='false'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='4M'
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
wall time: 0.545
cpu time: 8.375
SUM(c): 268554303.414090
I realize that one could set the environment variable OMP_NUM_THREADS to 1 to effectively get sequential execution, but it would be preferable for -mkl=sequential to be honored or, if the current behavior is intentional, for the man page to be updated to clarify this interaction between -qopt-matmul and -mkl=sequential.
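A rough way to read the timings in this post: the ratio of CPU time to wall time approximates the average number of cores kept busy, so a ratio far above 1 suggests a multi-threaded MATMUL. The sketch below just does that arithmetic; the helper name busy_cores is made up for illustration, and the inputs are the figures reported above.

```shell
# busy_cores CPU_SECONDS WALL_SECONDS
# Prints cpu/wall rounded to one decimal: a rough estimate of how many
# cores were busy on average (hypothetical helper, not part of ifort).
busy_cores() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", cpu / wall }'
}

busy_cores 8.399 0.551   # -qopt-matmul run
busy_cores 8.375 0.545   # -qopt-matmul -mkl=sequential run
```

Both ratios come out at roughly 15, which is what one would expect if neither build ran MATMUL on a single thread.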
Try adding /O2 or /O3
The reference manual under -qopt-matmul states: This option has no effect unless option O2 or higher is set.
In your second test, listing the OpenMP environment variables does not indicate if the application is using OpenMP.
Try using
export KMP_AFFINITY=verbose,none
./a.out
Jim Dempsey
Hi Jim,
Thanks for the info regarding the KMP_AFFINITY environment variable; it appeared to confirm that the multi-threaded MKL is still being used when both -qopt-matmul and -mkl=sequential are specified.
Per the ifort 15.0.1.133 man page, -O2 is the default. Explicitly specifying -O2 or -O3 did not seem to affect whether or not -mkl=sequential is honored when -qopt-matmul is specified; e.g.:
$ ifort -O2 -qopt-matmul -mkl=sequential matmul_test.f90
$ KMP_AFFINITY=verbose,none ./a.out
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #156: KMP_AFFINITY: 16 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 8 cores/pkg x 1 threads/core (16 total cores)
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 13 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 12 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 14 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
OMP: Info #242: KMP_AFFINITY: pid 21297 thread 15 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
wall time: 0.588
cpu time: 9.011
SUM(c): 268554303.414090
ifort defaults to -O2, so -qopt-matmul would be expected to take effect. ifort -help states that -qopt-matmul can only call threaded libraries, while the full help documentation does not discuss this point; neither indicates that the -mkl link options have any effect on it.
Nathan has a point in that this adds further confusion to the use of MATMUL. For single-threaded matmul, I used to prefer -O3 -qno-opt-matmul, as that can generate fully optimized in-line code with a lower size threshold for efficiency.
If this test is being run on a dual 8-core platform with threading, affinity setting would be important.
Hi Tim,
Thanks for the info. The goal of this exercise was to get the fastest sequential MATMUL possible on my hardware (Xeon E5-2650). This would be useful in, e.g., a program that utilizes coarrays, where there is likely to be one image per execution unit, and a multi-threaded MATMUL could oversubscribe the system (note that I haven't actually verified that in such a case the MKL runtime library doesn't dynamically adjust the number of threads to avoid oversubscription).
The single-thread performance obtained by specifying -qopt-matmul and OMP_NUM_THREADS=1 was about 2.2x that obtained with -O3 -qno-opt-matmul:
$ ifort -O3 -qno-opt-matmul matmul_test.f90
$ ./a.out
wall time: 11.807
cpu time: 11.778
SUM(c): 268554303.414090
$ ifort -qopt-matmul matmul_test.f90
$ OMP_NUM_THREADS=1 ./a.out
wall time: 5.314
cpu time: 5.298
SUM(c): 268554303.414090
In a semi-related issue, setting MKL_NUM_THREADS to 1 does not appear to restrict execution to a single thread in this case---possibly a bug?
$ ifort -qopt-matmul matmul_test.f90
$ MKL_NUM_THREADS=1 ./a.out
wall time: 0.522
cpu time: 7.918
SUM(c): 268554303.414090
In any case, I think it would be a nice feature to allow -mkl=sequential to cause the sequential MKL library to be linked when -qopt-matmul is specified, so one doesn't have to remember to set yet another environment variable before executing their application to get this behavior.
Also, I think it would be helpful to update the man page to explicitly state that -qopt-matmul also causes MATMUL to utilize the MKL, as the current wording doesn't give me that impression:
The -qopt-matmul options tell the compiler to identify matrix multiplication loop nests (if any) and replace them with a matmul library call for improved performance.
You would set OMP_NUM_THREADS to an appropriate value to avoid over-subscription. I don't know whether coarrays inherit the MPI facility for pinning threads to contiguous groups of cores. I_MPI_DEBUG=5 includes pinning information in the diagnostics.
Your platform should have good flexibility for varying numbers of coarray images and omp threads per image.
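As a rough illustration of choosing that value, the sketch below divides the node's cores evenly among images and never goes below one thread. The helper name threads_per_image and the choice of 4 images are made up for the example; the 16 cores match the dual 8-core node discussed in this thread.

```shell
# threads_per_image TOTAL_CORES NUM_IMAGES
# Integer share of cores per image, never less than 1
# (hypothetical helper for illustration).
threads_per_image() {
    cores=$1
    images=$2
    n=$(( cores / images ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example: 16 cores shared by 4 images (the image count is illustrative).
OMP_NUM_THREADS=$(threads_per_image 16 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With one image per core the helper returns 1, which corresponds to the sequential MATMUL case Nathan is after.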
Nathan,
You might be able to get what you want by separating your test program into two parts: a PROGRAM part that simply calls a subroutine, and the subroutine itself, which contains your code. Compile the subroutine separately with -qopt-matmul, then compile the main program with the single-threaded MKL option and supply the .o file containing the subroutine. You would then have a main program that specifies the single-threaded MKL library and uses an externally compiled object file. Hopefully, everything links together the way you want.
Caution: you will need to disable IPO when compiling the main program.
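An untested sketch of that two-step build might look like the following. The file names matmul_kernel.f90 and main.f90 and the output name matmul_split are hypothetical, and nothing in this thread establishes that the resulting executable actually runs MATMUL on one thread.

```shell
# Hypothetical file names: matmul_kernel.f90 holds the subroutine that calls
# MATMUL; main.f90 holds the PROGRAM that calls that subroutine.
KERNEL_FLAGS="-c -qopt-matmul"
MAIN_FLAGS="-no-ipo -mkl=sequential"

build_split() {
    # Step 1: compile only the kernel with -qopt-matmul.
    ifort $KERNEL_FLAGS matmul_kernel.f90 -o matmul_kernel.o || return 1
    # Step 2: compile the main program with IPO disabled and link it
    # against the sequential MKL plus the separately compiled object.
    ifort $MAIN_FLAGS main.f90 matmul_kernel.o -o matmul_split
}

if command -v ifort >/dev/null 2>&1; then
    build_split || echo "build failed"
else
    echo "ifort not found; skipping build"
fi
```

Checking the CPU-to-wall-time ratio of the resulting program, as in the first post, would show whether the split had any effect.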
Jim Dempsey
Hi Jim & Tim,
Thanks for the suggestions---I'll keep those in mind if/when I need to implement this in a production code. To any Intel employee reading this, I'll reiterate that I would appreciate it if:
- When both -qopt-matmul and -mkl=sequential are specified, the sequential MKL libraries were used, so that what seem like workarounds would not be required to achieve the desired/expected behavior---or, if this is undesirable for some reason, the documentation (man page) were updated to reflect this.
- The documentation (man page) for -qopt-matmul were clarified (I'd be happy to offer suggestions or provide feedback!).
Nathan,
Have you experimented with swapping the order of -qopt-matmul and -mkl=sequential switches on the command line?
Or how about (1.b) adding a -qopt-matmul=sequential option?
Then again, you would have a similar issue with the combination of -qopt-matmul=sequential and -mkl=parallel.
Jim Dempsey
Lorri asked me to post this on her behalf (for some reason she's having trouble posting herself):
Please note that -qopt-matmul enables OpenMP, but does not affect the set of MKL libraries.
To confirm what libraries are being linked against, use the command line switch -dryrun.
This displays the internal commands used to invoke the ld executable, and will list all libraries being linked against.
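A minimal sketch of that check is below. It filters the -dryrun output down to the distinct -l options so the linked libraries are easier to scan. The helper name list_libs is made up for illustration, the source file name is the one from the first post, and libraries passed by full path would not be caught by this simple filter.

```shell
# list_libs: read linker command text on stdin and print the distinct
# -l options, one per line (hypothetical helper for illustration).
list_libs() {
    tr -s '[:space:]' '\n' | grep '^-l' | sort -u
}

# 2>&1 because this sketch does not assume which stream -dryrun writes to.
if command -v ifort >/dev/null 2>&1; then
    ifort -dryrun -qopt-matmul -mkl=sequential matmul_test.f90 2>&1 | list_libs
else
    echo "ifort not found; nothing to inspect"
fi
```

Running it once with and once without -mkl=sequential and comparing the two lists shows which libraries each option adds.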
Steve,
Thanks for forwarding that info; it helped clear things up. The -dryrun option indicates that -qopt-matmul causes libmatmul, not the MKL per se, to be statically linked. Before seeing this diagnostic information, I mistakenly thought the MATMUL library (libmatmul) was synonymous with the MKL, when it only seems to leverage MKL components (per the assembly listing produced by "ifort -S", and symbols in libmatmul listed by the "nm" utility). Adding -mkl=sequential *does* cause the sequential MKL library to be dynamically linked, but it has no apparent effect on the resulting execution. This (at least partially) explains why MKL_NUM_THREADS doesn't seem to affect the number of threads involved in the MATMUL when -qopt-matmul is used instead of GEMM from the MKL proper; rather, only OMP_NUM_THREADS seems to have an effect. It would be nice if the ifort man page documented that the MATMUL library is parallelized using OpenMP, and that the OMP_NUM_THREADS environment variable can be used to control the number of threads involved in a call to MATMUL. This would be particularly relevant to coarray (or pure MPI) programs that could benefit from the optimized MATMUL provided by -qopt-matmul, but could possibly cause a node to be oversubscribed if it is used with the default number of threads.
Some of the hints in the docs imply that opt-matmul is a part of auto-parallel, thus implying reference to OpenMP library, but I agree that it leaves us guessing sometimes.
Hi all,
With ifort 15.0.1, the -qopt-matmul option appears to be effectively disabled when -qopenmp is specified. Consider the following matrix multiplication code (culled & modified from a matrix multiplication benchmark by Professor Glenn Luecke at Iowa State University):
program test_matmul
  use omp_lib
  implicit none
  integer, parameter :: N = 1024, NTRIAL = 128, NTESTS = 3
  integer, parameter :: NFLUSH = 3*1024*1024
  double precision :: flush(NFLUSH) ! flush a 20 MB cache
  double precision,dimension(N,N) :: A, B, C, D, C_init
  double precision :: t0, time(0:NTRIAL), avg_time
  character(len=16) :: test_name(NTESTS) = ['OpenMP', 'DGEMM', 'MATMUL']
  integer :: i, j, k, test, trial

  call random_number(A)
  call random_number(B)
  call random_number(C)
  call random_number(flush)
  C_init = C ! keep initial value of C for error checking
  D = C + matmul(A,B) ! to be used to compute error

  do test = 1, NTESTS
    do trial = 0, NTRIAL
      C = C_init
      flush = flush + 0.0001d0 ! flush caches before timing
      t0 = omp_get_wtime()
      select case (test)
      case (1)
        !$omp parallel do schedule(static)
        do j = 1, N
          do k = 1, N
            do i = 1, N
              C(i,j) = C(i,j) + A(i,k)*B(k,j)
            enddo
          enddo
        enddo
      case (2)
        call DGEMM('N', 'N', N, N, N, 1.d0, A, N, B, N, 1.d0, C, N)
      case (3)
        C = C + matmul(A,B)
      end select
      time(trial) = omp_get_wtime() - t0 ! seconds
    enddo ! trial loop
    avg_time = SUM(time(1:NTRIAL))/DBLE(NTRIAL)
    write(*,*) test_name(test), avg_time, 'seconds', &
      2.d-9*DBLE(N)**3/avg_time, 'gflops'
    write(*,*) 'Error = ', MAXVAL(ABS(C-D))
    write(*,*) ' ', C(2,3)*flush(7)
    write(*,*) ' '
  end do
end program test_matmul
With -qopenmp-stubs in addition to -qopt-matmul, the run time suggests---and the assembly output, which contains calls to matmul_mkl_f64_, appears to confirm---that the first loop nest is recognized as a matrix multiplication and uses the MATMUL library. In addition, the call to MATMUL also uses the MATMUL library. Using -mkl=sequential allows us to compare this with the sequential performance of DGEMM:
$ ifort -qopenmp-stubs -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ ./a.out
OpenMP 1.026251353323460E-002 seconds 209.255134333854 gflops
Error = 1.136868377216160E-013
201.752316821375

DGEMM 0.105580719187856 seconds 20.3397330925457 gflops
Error = 1.136868377216160E-013
205.064157088606

MATMUL 1.218027807772160E-002 seconds 176.308261133042 gflops
Error = 0.000000000000000E+000
208.375997355837
Setting OMP_NUM_THREADS=1 apparently scales back the matmul library to using a single thread, and all three variants appear to have comparable performance:
$ ifort -qopenmp-stubs -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ OMP_NUM_THREADS=1 ./a.out
OpenMP 0.105887426063418 seconds 20.2808182976685 gflops
Error = 1.136868377216160E-013
201.752316821375

DGEMM 0.104962434619665 seconds 20.4595449389248 gflops
Error = 1.136868377216160E-013
205.064157088606

MATMUL 0.108032410964370 seconds 19.8781423910669 gflops
Error = 0.000000000000000E+000
208.375997355837
However, when compiled with -qopenmp, the -qopt-matmul option appears to be effectively disabled: while the -dryrun option indicates that -lmatmul is still linked, the assembly listing (via -S) appears to imply that no code is generated that calls matmul_mkl_f64_. The result is that the MATMUL performance is halved (note that OMP_STACKSIZE has to be explicitly set on this machine to get the program to run):
$ ifort -qopenmp -mkl=sequential -qopt-matmul -Ofast -xHost hw3.f90
$ OMP_NUM_THREADS=16 OMP_STACKSIZE=128M ./a.out
OpenMP 1.436026766896248E-002 seconds 149.543427567263 gflops
Error = 2.842170943040401E-013
201.752316821375

DGEMM 0.105524888262153 seconds 20.3504944034157 gflops
Error = 1.136868377216160E-012
205.064157088606

MATMUL 0.220026727765799 seconds 9.76010355562725 gflops
Error = 0.000000000000000E+000
208.375997355837
It seems counterintuitive that -qopt-matmul wouldn't generate code that calls the MATMUL library for the call to MATMUL in the sequential region.
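For reference, the gflops figures in the runs above follow from the benchmark's own formula, 2*N^3 floating-point operations divided by the average time. The sketch below repeats that arithmetic so the timings can be checked by hand; the helper name gflops is made up for illustration.

```shell
# gflops N SECONDS
# An N x N by N x N multiply is counted as 2*N^3 floating-point operations,
# as in the benchmark's write statement (hypothetical helper).
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 2 * n^3 / t / 1e9 }'
}

gflops 1024 0.105580719187856   # sequential DGEMM time from the first run
gflops 1024 0.220026727765799   # MATMUL time from the -qopenmp build
```

These reproduce, to one decimal place, the 20.3 and 9.8 gflops reported by the program for those two timings.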