New in Intel MKL 10.2 Update 6:
New Features
o Integrated Netlib LAPACK 3.2.2 including one new computational routine (?GEQRFP) and two new auxiliary routines (?GEQR2P and ?LARFGP)
Performance improvements
o Improved DZGEMM performance on Intel Xeon processors series 5300 and 5400 with 64-bit operating systems
o Improved DSYRK performance on Intel Xeon processors series 5300 with 32-bit operating systems, with the most significant improvements for small oblong matrices on 8 or more threads
o Improved the scalability of (C/Z)GGEV by parallelizing the reduction to generalized Hessenberg form ((C/Z)GGHRD)
o Improved performance for ?(SY/HE)EV and ?(SP/HP)TRS on very small matrices (< 20)
o Improved performance of FFTW2 wrappers for those cases where the descriptor remains constant from call to call
o Improved scalability of threaded applications that use non-threaded FFTs on multi-socket systems
o Significantly improved performance of cluster FFTs through better load balancing when the input data cannot be evenly distributed between MPI processes
o Improved scalability of cluster FFTs on systems with a non-power-of-2 number of cores/processors
o Improved performance of factorization step in PARDISO out-of-core for huge matrices through reduction in the number of disk IO operations
o Parallelized solve step in PARDISO
Usability/Interface improvements
o Improved support for F77 in FFTW2 and MPI FFTW2 interfaces
o Implemented rfftwnd_create_plan_specific and its 2d and 3d variants
o Added 2D Convolution/Correlation examples
The solver is now able to compute the exact bit-identical solution independent of the number of cores, without affecting scalability.
Here are some results for a nonlinear FE model with 500'000 elements.
Intel MKL PARDISO 10.2:
1 core - factor: 17.980 sec., solve: 1.13 sec.
2 cores - factor: 9.790 sec., solve: 1.13 sec.
4 cores - factor: 6.120 sec., solve: 1.05 sec.
8 cores - factor: 3.830 sec., solve: 1.05 sec.
U Basel PARDISO 4.0.0:
1 core - factor: 16.820 sec., solve: 1.09 sec.
2 cores - factor: 9.021 sec., solve: 0.67 sec.
4 cores - factor: 5.186 sec., solve: 0.53 sec.
8 cores - factor: 3.170 sec., solve: 0.43 sec.
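To make the scaling easier to compare, the timings above can be turned into speedups relative to the 1-core run. The following is a minimal sketch that only re-uses the numbers quoted in this post; the dictionary layout and the `speedup` helper are illustrative names, not part of either PARDISO package.

```python
# Timings in seconds, copied from the results above; inner keys are core counts.
timings = {
    "Intel MKL PARDISO 10.2": {
        "factor": {1: 17.980, 2: 9.790, 4: 6.120, 8: 3.830},
        "solve": {1: 1.13, 2: 1.13, 4: 1.05, 8: 1.05},
    },
    "U Basel PARDISO 4.0.0": {
        "factor": {1: 16.820, 2: 9.021, 4: 5.186, 8: 3.170},
        "solve": {1: 1.09, 2: 0.67, 4: 0.53, 8: 0.43},
    },
}


def speedup(times):
    """Return the speedup of each run relative to the 1-core time."""
    base = times[1]
    return {cores: base / t for cores, t in times.items()}


for solver, phases in timings.items():
    for phase, times in phases.items():
        row = ", ".join(f"{c} cores: {s:.2f}x" for c, s in speedup(times).items())
        print(f"{solver} {phase}: {row}")
```

On these numbers the factorization scales similarly in both packages (roughly 4.7x versus 5.3x on 8 cores), while the solve step is where they differ most (roughly 1.1x versus 2.5x on 8 cores).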
Rene,
PARDISO 4.0 from the PARDISO website supports bit-to-bit correspondence only for symmetric indefinite matrices. Migrating to a machine with a different instruction set breaks this bit-to-bit compatibility. So the compatibility can be observed only for a prescribed set of machines with an identical instruction set and the same number of CPUs.
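As general background on why bit-to-bit reproducibility is hard to guarantee: floating-point addition is not associative, so any change in the order in which partial results are combined (for example, a different thread count or a different vectorized code path) can change the last bits of a result. The sketch below only illustrates this general effect in double precision; it is not a description of how either PARDISO version works internally.

```python
# The same three doubles, summed in two different orders.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The two results differ in the last bit, even though the math is "the same".
print(repr(left_to_right))             # 0.6000000000000001
print(repr(right_to_left))             # 0.6
print(left_to_right == right_to_left)  # False
```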
Moreover, sparse direct solvers are quite sensitive to matrix structure. So performance should suffer in cases where dynamic parallelization gives a significant advantage over static parallelization with a prescribed list of jobs for each thread. In most cases, theoretically, the performance has to suffer.
We have been unable to verify the performance information you quote because we lack the information needed to reproduce it.
All the best
Sergey