I have detected a performance regression in ifort 14.0 when using allocatable arrays. Optimization is perfect with fixed-size arrays, but it does not work properly when auto-parallelization is combined with allocatable arrays. The same code ran at full performance with ifort 13.1.
When single-threaded code is produced, there is a partial, but extremely significant, loss of performance when using allocatable arrays.
I have attached two files: ifortregression.txt, showing compiler version, compiler parameters, and execution times, and matmul.F, the source code of the test program.
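For readers without the attachments, the essential difference between the two variants boils down to something like this (a condensed sketch, not the actual matmul.F; ARDIM and SIZEARGS mirror the -D macros that appear on the compile lines quoted later in this thread):
      program matmul_sketch
c     Condensed sketch of the fixed-size vs. allocatable variants
c     (illustration only; the attached matmul.F differs in detail)
      implicit none
      integer n
#ifdef SIZEARGS
c     allocatable arrays: dimensions known only at run time
      real*8, allocatable :: a(:,:), b(:,:), c(:,:)
      character(len=16) arg
      call get_command_argument(1, arg)
      read (arg, *) n
      allocate (a(n,n), b(n,n), c(n,n))
#else
c     fixed-size arrays: dimensions known at compile time
      real*8 a(ARDIM,ARDIM), b(ARDIM,ARDIM), c(ARDIM,ARDIM)
      n = ARDIM
#endif
      print *, 'Running with array sizes', n, 'by', n
      end program matmul_sketch
With SIZEARGS the dimensions are known only at run time, so the optimizer must cope with arbitrary sizes; without it, everything is a compile-time constant.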
I can run the program with "time"; the CPU and wall-time output is consistent with the overall values reported by time:
cp003421_matmul> OMP_NUM_THREADS=4 time ./a.out 6000
Running with array sizes 6000 by 6000
dtime: 0.480 real time: 0.121 init
dtime: 17.740 real time: 4.437 ikj
dtime: 17.960 real time: 4.494 jki
Sum of elements: 533082291842201.125
36.09user 0.16system 0:09.21elapsed 393%CPU (0avgtext+0avgdata 877168maxresident)k
0inputs+0outputs (0major+3881minor)pagefaults 0swaps
-opt-report reveals that (with option -parallel) the 14.0 compiler "forgets" to replace the matrix multiplication with the matmul intrinsic, while the 13.1 compiler does perform this replacement. I wonder why this feature has been dropped.
Many loops require a par-threshold of 99 or less instead of the default 100 to be parallelized:
> ifort -O3 -parallel -par-threshold99 -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F
matmul.F(77): (col. 18) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
> ifort -O3 -parallel -par-threshold -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F
matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.
matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
In parallel mode both matrix multiplications are treated the same; loops are permuted if necessary. So I wonder why loop order jki is treated differently in sequential mode. In fact, starting with compiler version 10, jki is the only one of the six possible loop orders to show reduced performance.
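For reference, the loop order in question looks like this (a sketch, not the exact code from matmul.F; note that the innermost index i is the first, i.e. column-major, subscript of both c and a, so the inner loop is stride-1, which makes the sequential slowdown all the more puzzling):
      subroutine mm_jki(a, b, c, n)
c     jki loop order: the inner loop walks c(:,j) and a(:,k)
c     contiguously, normally the friendliest order for Fortran's
c     column-major layout
      implicit none
      integer n, i, j, k
      real*8 a(n,n), b(n,n), c(n,n)
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
      end subroutine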
Unfortunately, -opt-matmul is ignored by the 14.0 compiler if allocatable arrays are used. My full test program (source attached) shows that even calling the matmul intrinsic directly gives a slower executable than with compiler version 13.1. Calling dgemm directly is also slower, and this is definitely a compiler issue, not a library issue: I can run both executables against either MKL library, and there is no difference in execution speed.
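For illustration, the direct dgemm call takes roughly this form (a condensed sketch, not the attached crunchtbvar.F; built with something like ifort -O3 -mkl):
      program dgemm_sketch
c     Condensed sketch of a direct dgemm call on allocated arrays
      implicit none
      integer, parameter :: n = 4000
      real*8, allocatable :: a(:,:), b(:,:), c(:,:)
      allocate (a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
c     c := 1.0*a*b + 0.0*c; all leading dimensions equal n
c     because the allocations are exactly n by n
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      print *, 'Sum of elements:', sum(c)
      end program dgemm_sketch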
> ifort14.0 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F
> OMP_NUM_THREADS=4 time ./a.out 4000
Running with array sizes 4000 by 4000
dtime: 0.210 real time: 0.056 init
dtime: 9.520 real time: 2.379 ijk
dtime: 9.770 real time: 2.444 ikj
dtime: 9.730 real time: 2.434 jik
dtime: 9.590 real time: 2.398 jki
dtime: 9.730 real time: 2.436 kij
dtime: 9.540 real time: 2.385 kji
Sum of elements: 105312418747995.297
dtime: 9.520 real time: 2.387 matmul
dtime: 6.840 real time: 2.050 dgemm
Sum of elements: 105312418747995.297
74.41user 0.10system 0:19.02elapsed 391%CPU (0avgtext+0avgdata 400392maxresident)k
0inputs+0outputs (0major+5068minor)pagefaults 0swaps
> ifort13.1 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F
> OMP_NUM_THREADS=4 time ./a.out 4000
Running with array sizes 4000 by 4000
dtime: 0.230 real time: 0.059 init
dtime: 6.300 real time: 1.574 ijk
dtime: 5.300 real time: 1.329 ikj
dtime: 5.380 real time: 1.346 jik
dtime: 5.390 real time: 1.348 jki
dtime: 5.420 real time: 1.355 kij
dtime: 5.450 real time: 1.361 kji
Sum of elements: 105312418747995.312
dtime: 5.390 real time: 1.356 matmul
dtime: 5.420 real time: 1.357 dgemm
Sum of elements: 105312418747995.297
44.27user 0.08system 0:11.22elapsed 395%CPU (0avgtext+0avgdata 400376maxresident)k
0inputs+0outputs (0major+3820minor)pagefaults 0swaps
You have so many different combinations of compiler options and tests that it's hard to see whether there is any regression at all. Can we focus on one case at a time and make sure we do apples-to-apples comparisons? In fact, the following case shows that 14.0.2 is about 2x faster than 13.1.2:
===================14.0.2=============
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120
$ ifort -O3 -parallel -par_threshold90 -par-report -DARDIM=4000 matmul.F -o matmul-14.0.2.144-pt90.x
matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
$ export OMP_NUM_THREADS=1
$ time ./matmul-14.0.2.144-pt90.x
Running with array sizes 4000 by 4000
dtime: 0.190 real time: 0.195 init
dtime: 5.500 real time: 5.504 ikj
dtime: 5.500 real time: 5.498 jki
Sum of elements: 105312418747995.250
real 0m15.850s
user 0m11.123s
sys 0m0.087s
$ export OMP_NUM_THREADS=4
$ time ./matmul-14.0.2.144-pt90.x
Running with array sizes 4000 by 4000
dtime: 0.200 real time: 0.054 init
dtime: 5.620 real time: 1.406 ikj
dtime: 5.600 real time: 1.400 jki
Sum of elements: 105312418747995.031
real 0m3.173s
user 0m11.336s
sys 0m0.106s
$
=========================13.1.2=====================
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514
$ ifort -O3 -parallel -par_threshold90 -DARDIM=4000 matmul.F -par-report -o matmul-13.1.2.183-pt90.x
matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
$ export OMP_NUM_THREADS=1
$ time ./matmul-13.1.2.183-pt90.x
Running with array sizes 4000 by 4000
dtime: 0.190 real time: 0.195 init
dtime: 11.000 real time: 10.999 ikj
dtime: 10.970 real time: 10.970 jki
Sum of elements: 105312418747995.250
real 0m22.492s
user 0m22.101s
sys 0m0.075s
$ export OMP_NUM_THREADS=4
$ time ./matmul-13.1.2.183-pt90.x
Running with array sizes 4000 by 4000
dtime: 0.200 real time: 0.053 init
dtime: 10.990 real time: 2.753 ikj
dtime: 10.990 real time: 2.747 jki
Sum of elements: 105312418747995.016
real 0m7.385s
user 0m22.113s
sys 0m0.088s
$
Perhaps this is related to your test machine, which you indicated was a Xeon E3-1240. That's an older Sandy Bridge box with a smallish 8 MB cache. I ran my tests on an Ivy Bridge box with a 20 MB cache.
Patrick
-opt-matmul does not use MKL directly - it calls into an MKL-derived routine in the Fortran support library, since Fortran needs more than the generic xGEMM can supply.
Remarks on Patrick's comment:
As I mentioned in ifortregression.txt, the two compiler versions have different strategies regarding the use of AVX instructions. In single-threaded mode both require the -mavx flag, and linking against MKL does not improve performance; execution speed is the same for both compiler versions. In parallel mode the 14.0.2 compiler's matmul intrinsic uses AVX by default, while the 13.1.2 compiler ignores the -mavx flag and requires linking against MKL to use AVX. That's why I use different compiler flags for the two versions.
Without SIZEARGS both compilers produce executables with the same performance; in fact the 14.0.2 compiler is somewhat better, since both loop orders give the same speed, whereas with the 13.1.2 compiler loop order ikj is slower (and varies considerably when the run is repeated). With SIZEARGS the 14.0.2 compiler does not replace the loops with the matmul intrinsic, even when -opt-matmul is specified.
I believe that allocatable arrays are the standard for production software; it is therefore regrettable that the latest compiler version generates slower code.
What really worries me is that the execution time of dgemm increases significantly when it is called with allocated arrays. It looks as if the memory layout of the allocated arrays is not as optimized as it used to be with the 13.1.2 compiler.
Axel
Is your point that there are too many compiler options and directives to play with?
To my surprise, the option -align array32byte doesn't prove useful. AVX and AVX2 are helpful only at -O3.
As I suggested, the loop count directives make a huge difference (>10x if par-threshold is not set) for your allocatable-array case. Do you consider that unacceptable? They don't entirely eliminate the difference, presumably because the generated code must still allow for run-time determination of the array sizes and therefore contains more code-version branches.
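For readers who have not used them, such a hint looks roughly like this (a sketch using Intel's !DIR$ LOOP COUNT directive; the value 4000 matches the sizes used in this thread, and the routine itself is invented for illustration):
      subroutine mm_hinted(a, b, c, n)
c     LOOP COUNT hints give the optimizer an expected trip count
c     when the (allocatable) dimensions are unknown at compile
c     time, so its cost model need not assume small loops
      implicit none
      integer n, i, j, k
      real*8 a(n,n), b(n,n), c(n,n)
!DIR$ LOOP COUNT (4000)
      do j = 1, n
!DIR$ LOOP COUNT (4000)
         do k = 1, n
!DIR$ LOOP COUNT (4000)
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
      end subroutine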
Auto-parallelization seems unusually effective for this case once you learn to use loop count directives when the compiler is denied the fixed-dimension information. Given that various MKL solutions are effective, though, this doesn't make a strong case for auto-parallelization.
This seems to show a weakness of the opt-matmul scheme in that the loop count directives don't work for that case.
Correction:
Some of the performance deficits I detected look like an MKL initialization effect. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code section runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much between runs.
Two problems remain. First, both compiler versions produce slow code for the jki loop order in single-threaded mode (when using -parallel, all six loop orders and the explicit matmul call run at the same speed). Second, compiler version 14.0.2 produces slower code when auto-parallelization is used in connection with allocatable arrays; this seems to be because specifying -mkl no longer offers the option of using MKL routines.
The -mkl option is only a shortcut for linking the commonly used groups of MKL libraries. It has no other effects, such as implying -opt-matmul.
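For reference, on Linux the shortcut expands to roughly the following explicit link line (the exact library names depend on the MKL version and the interface/threading layers chosen, so treat this as an approximation):
> ifort -O3 -parallel matmul.F -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm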
As far as compiler version 13.1.2 is concerned, -mkl does improve performance. I can only guess that identically named internal procedures are replaced by higher-performance MKL functions:
> ifort --version
ifort (IFORT) 13.1.2 20130514
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
> ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
Running with array sizes 4000 by 4000
dtime: 0.250 real time: 0.068 init
dtime: 10.140 real time: 2.537 ikj
dtime: 10.160 real time: 2.540 jki
Sum of elements: 105312418747995.031
> ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
Running with array sizes 4000 by 4000
dtime: 0.210 real time: 0.056 init
dtime: 6.080 real time: 1.518 ikj
dtime: 5.240 real time: 1.311 jki
Sum of elements: 105312418747995.016
With compiler version 14.0.2, -mkl has no influence on performance:
> ifort --version
ifort (IFORT) 14.0.2 20140120
Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
> ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
Running with array sizes 4000 by 4000
dtime: 0.220 real time: 0.058 init
dtime: 12.980 real time: 3.246 ikj
dtime: 12.810 real time: 3.204 jki
Sum of elements: 105312418747995.016
> ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
Running with array sizes 4000 by 4000
dtime: 0.220 real time: 0.058 init
dtime: 12.810 real time: 3.203 ikj
dtime: 12.820 real time: 3.205 jki
Sum of elements: 105312418747995.016
>> Some of the performance deficits I detected look like an MKL initialization effect. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code section runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much between runs.
The default MKL library is multithreaded through use of OpenMP. The first call carries the overhead of creating the OpenMP thread pool (*** internal to MKL). Therefore, for timing purposes, the SOP is to discard the timing results for the first pass .OR. insert into your code, prior to the timed section, a call to MKL that you know establishes its thread pool (see the sketch at the end of this post).
RE the ***
When your application is multithreaded you might want to consider/experiment with linking against the single-threaded MKL. IOW, each of your application threads can call MKL concurrently, and each call into MKL continues on the calling thread. Should each of your application threads call the multithreaded MKL concurrently, you tend to oversubscribe: (number of concurrent calls) * (number of threads spawned per MKL instance). On a 4-core/8-thread system this could explode to 64 threads if done improperly.
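Returning to the timing SOP, a sketch of the warm-up approach (a drop-in fragment that assumes a, b, c, and n are set up as in the tests above; dsecnd is MKL's timing function, used here in place of dtime):
      real*8 t0, t1, dsecnd
      external dsecnd
c     untimed warm-up call absorbs MKL's one-time OpenMP
c     thread-pool creation; only the second call is measured
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      t0 = dsecnd()
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      t1 = dsecnd()
      print *, 'dgemm time (s):', t1 - t0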
Jim Dempsey
If your application offers an opportunity for parallelism at a higher level than individual MKL function calls, that should be useful. As Jim said, this might be done with the sequential MKL library; a sketch follows at the end of this post.
Even if you call the threaded MKL from an OpenMP threaded region, MKL shouldn't use additional threads until you set OMP_NESTED. Even though the library may be able to prevent over-subscription, there aren't satisfactory methods to maintain data locality, so I agree in general with Jim's cautions.
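A sketch of that pattern (illustrative only; the batched layout and routine name are invented, and mkl_set_num_threads is MKL's standard service routine):
      subroutine batched_mm(a, b, c, m, nblocks)
c     OpenMP parallelism across independent multiplies, with MKL
c     pinned to one thread per call so no nested pool is spawned
c     (equivalently, link against the sequential MKL)
      implicit none
      integer m, nblocks, ib
      real*8 a(m,m,nblocks), b(m,m,nblocks), c(m,m,nblocks)
      call mkl_set_num_threads(1)
!$omp parallel do
      do ib = 1, nblocks
         call dgemm('N', 'N', m, m, m, 1.0d0,
     &              a(1,1,ib), m, b(1,1,ib), m,
     &              0.0d0, c(1,1,ib), m)
      end do
!$omp end parallel do
      end subroutine
Compile with -openmp and link with -mkl (or explicitly with the sequential MKL libraries).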
