Intel MKL 10.2 Update 6 is now available

Gennady_F_Intel · ‎09-01-2010

New in Intel MKL 10.2 Update 6:

New Features

o Integrated Netlib LAPACK 3.2.2 including one new computational routine (?GEQRFP) and two new auxiliary routines (?GEQR2P and ?LARFGP)

Performance improvements

o Improved DZGEMM performance on Intel Xeon processors series 5300 and 5400 with 64-bit operating systems

o Improved DSYRK performance on Intel Xeon processors series 5300 with 32-bit operating systems with the most significant improvements for small oblong matrices on 8 and more threads

o Improved the scalability of (C/Z)GGEV by parallelizing the reduction to generalized Hessenberg form ((C/Z)GGHRD)

o Improved performance for ?(SY/HE)EV and ?(SP/HP)TRS on very small matrices (< 20)

o Improved performance of FFTW2 wrappers for those cases where the descriptor remains constant from call to call

o Improved scalability of threaded applications that use non-threaded FFTs on multi-socket systems

o Significantly improved performance of cluster FFTs through better load balancing when the input data cannot be evenly distributed between MPI processes

o Improved scalability of cluster FFTs on systems with a non-power-of-2 number of cores/processors

o Improved performance of factorization step in PARDISO out-of-core for huge matrices through reduction in the number of disk IO operations

o Parallelized solve step in PARDISO

Usability/Interface improvements

o Improved support for F77 in FFTW2 and MPI FFTW2 interfaces

o Implemented rfftwnd_create_plan_specific and its 2d and 3d variants

o Added 2D Convolution/Correlation examples

Bug fixes

MotivatedUser · ‎10-07-2010

Dear MKL developers,

as far as I know, the feature 'Parallelized solve step in PARDISO' is a new feature in PARDISO 4 from the University of Basel. I'm very interested in this, but also in another important feature of that version: reproducibility of exact numerical results on multi-core architectures. Is this also included in MKL 10.2 Update 6? If not, are there any plans to integrate this feature?

Gennady_F_Intel · ‎10-07-2010

Hi,

"Parallelized solve step in PARDISO" is the latest feature of the PARDISO API provided by Intel MKL; it is new in Update 6.

MKL doesn't support the PARDISO 4 API at all. Regarding the "Reproducibility of exact numerical results on multi-core architectures" feature: the current version of MKL doesn't support it. I need to check our plans and will inform you soon.

--Gennady

MotivatedUser · ‎10-19-2010

Hello Gennady,
do you have any news regarding the "reproducibility feature"? I have seen in other threads that some MKL users also report non-deterministic behaviour of PARDISO when used in parallel. So I think this feature would be much appreciated.
Kind regards,
Rene

Gennady_F_Intel · ‎10-19-2010

Hello Rene,

actually, there are big concerns about implementing such features, at least without significant performance degradation, and there is no such plan (I have to double-check our plans and will get back to you if I am mistaken).

Could you please look at the article "Getting reproducible Results"? Does it look reasonable to you?

--Gennady

Gennady_F_Intel · ‎10-19-2010

Sorry, I forgot to add the link:

http://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl/

MotivatedUser · ‎10-19-2010

Hello Gennady,
thank you for the information. I will try it, although I'm not very optimistic that it helps in my case.
I want to point out again that I'm very interested in a feature that eliminates non-determinism (within a certain range) in PARDISO. Regarding your concerns, I don't believe that the performance will suffer too much. I got this comparison from the PARDISO website:

    The solver is now able to compute the exact bit identical solution
    independent on the number of cores without effecting the scalability.
    Here are some results for a nonlinear FE model with 500'000 elements.

    Intel MKL PARDISO 10.2
    1 core  - factor: 17.980 sec., solve: 1.13 sec.
    2 cores - factor:  9.790 sec., solve: 1.13 sec.
    4 cores - factor:  6.120 sec., solve: 1.05 sec.
    8 cores - factor:  3.830 sec., solve: 1.05 sec.

    U Basel PARDISO 4.0.0:
    1 core  - factor: 16.820 sec., solve: 1.09 sec.
    2 cores - factor:  9.021 sec., solve: 0.67 sec.
    4 cores - factor:  5.186 sec., solve: 0.53 sec.
    8 cores - factor:  3.170 sec., solve: 0.43 sec.

Kind regards,
Rene
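
For readers comparing the two sets of figures, the speedups implied by the quoted timings can be worked out as in the sketch below. The numbers are copied from the quotation in the post above; the names `TIMINGS` and `speedups` are illustrative only, and the timings themselves are taken as quoted, not independently measured.

```python
# Timings in seconds, copied from the quoted comparison.
# Layout: solver name -> {number of cores: (factor time, solve time)}
TIMINGS = {
    "Intel MKL PARDISO 10.2": {
        1: (17.980, 1.13), 2: (9.790, 1.13), 4: (6.120, 1.05), 8: (3.830, 1.05),
    },
    "U Basel PARDISO 4.0.0": {
        1: (16.820, 1.09), 2: (9.021, 0.67), 4: (5.186, 0.53), 8: (3.170, 0.43),
    },
}


def speedups(per_core):
    """Return {cores: (factor speedup, solve speedup)} relative to the 1-core run."""
    base_factor, base_solve = per_core[1]
    return {
        cores: (base_factor / factor, base_solve / solve)
        for cores, (factor, solve) in per_core.items()
    }


if __name__ == "__main__":
    for name, per_core in TIMINGS.items():
        print(name)
        for cores, (f_up, s_up) in sorted(speedups(per_core).items()):
            print(f"  {cores} cores: factor x{f_up:.2f}, solve x{s_up:.2f}")
```

On these quoted figures, the factorization speedup at 8 cores is roughly 4.7x for MKL PARDISO 10.2 and 5.3x for PARDISO 4.0.0, while the solve step goes from about 1.08x to about 2.5x.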

Sergey_K_Intel1 · ‎10-22-2010

Rene,

PARDISO 4.0 from the PARDISO website supports bit-to-bit correspondence only for symmetric indefinite matrices. Migration to a machine with another instruction set breaks this bit-to-bit compatibility. So the compatibility can be observed only for a prescribed set of machines with an identical instruction set and the same number of CPUs.

Moreover, sparse direct solvers are quite sensitive to matrix structure. So the performance should suffer in cases where dynamic parallelization gives an essential advantage over static parallelization with a prescribed list of jobs for each thread. In most cases, theoretically, the performance has to suffer.

We have been unable to verify the performance information you quote due to a lack of information on how to reproduce it.

All the best
Sergey
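
As background to the reproducibility discussion: floating-point addition is not associative, so splitting the same sum across a different number of workers can change the rounded result. The sketch below is a generic illustration of that effect in plain Python; it does not model how PARDISO or MKL schedule work internally, and the helper names `ordered_sum` and `chunked_sum` are hypothetical.

```python
import math


def ordered_sum(values):
    """Add values strictly left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a static partition: each 'thread' sums one contiguous chunk,
    then the partial sums are combined in chunk order."""
    size = math.ceil(len(values) / n_chunks)
    partials = [
        ordered_sum(values[i:i + size]) for i in range(0, len(values), size)
    ]
    return ordered_sum(partials)


# Near 1e16 the spacing between doubles is 2, so adding 1.0 is rounded away.
values = [1.0, 1e16, -1e16, 1.0]

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} chunk(s): {chunked_sum(values, n)}")
```

With one chunk the large terms cancel before the final 1.0 is added, giving 1.0; with two chunks each partial sum absorbs its 1.0 into the large term, giving 0.0. A fixed (static) partition makes the result depend on the number of chunks but stay repeatable for a given count, whereas a dynamically scheduled reduction can change the grouping from run to run.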

MotivatedUser · ‎10-28-2010

Hello Sergey,
I can imagine that the performance potentially suffers in cases when a dynamic parallelization has advantages over a static one. However, I thought that my request would be an option which you would provide to the user. So if the user wants to get out the last drop of performance he/she has the freedom not to use it.

I provided some data to a colleague of yours (Sergey Gololobov). This data is of course different from that in my quotation, but it also comes from a nonlinear FE model. If you like, you can use this instead.

Kind regards,
Rene