CPardiso phase 33 scaling

Emond__Guillaume · 12-18-2018

Hi,

We want to use Cluster Pardiso for our finite element application. To estimate performance, we used a simple code (attached file) that reads a matrix from the SuiteSparse Matrix Collection (Matrix Market format) and then measures the execution time of each phase (11, 22 and 33).

The factorisation phase (22) scales well with both MPI and OpenMP parallelization, but the solve phase (33) does not scale nearly as well.

For example, the table below shows run times (in seconds) for different combinations of MPI processes and OpenMP threads (per process).

Serena.mtx:

MPI / OMP    Phase=22 (s)    Phase=33 (s)
 2 /  2      408.70          2.4668
 2 /  4      249.52          1.3382
 2 /  8      234.87          3.7524
 2 / 16      93.879          1.3181
 4 /  2      327.69          1.8661
 4 /  4      162.16          1.9664
 4 /  8      96.526          4.4899
 4 / 16      58.619          1.3763
 8 /  2      175.61          1.1638
 8 /  4      90.975          1.1006
 8 /  8      67.704          2.4264
 8 / 16      39.654          0.9049
16 /  2      127.61          1.4321
16 /  4      62.155          0.9136
16 /  8      53.761          2.0407
16 / 16      26.957          0.7122
32 /  8      36.447          2.1856
32 / 16      24.977          0.3729

We can observe that the solve time does not always decrease with more MPI processes or OpenMP threads. We tested other matrices (RM07R) and observed the same behaviour. Is this normal, or is it an issue? Is there a way to get better scaling?
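One way to make the irregular phase 33 behaviour concrete is to scan each MPI row of the table for steps where adding OpenMP threads made the solve slower. The Python sketch below is only an illustration: the timings are copied from the Serena.mtx table above (MKL 2017), while the names `phase33` and `find_regressions` are made up for this example and are not part of the attached test code.

```python
from collections import defaultdict

# Phase 33 times in seconds, keyed by (MPI processes, OMP threads per process).
# Values are copied from the Serena.mtx table above.
phase33 = {
    (2, 2): 2.4668, (2, 4): 1.3382, (2, 8): 3.7524, (2, 16): 1.3181,
    (4, 2): 1.8661, (4, 4): 1.9664, (4, 8): 4.4899, (4, 16): 1.3763,
    (8, 2): 1.1638, (8, 4): 1.1006, (8, 8): 2.4264, (8, 16): 0.9049,
    (16, 2): 1.4321, (16, 4): 0.9136, (16, 8): 2.0407, (16, 16): 0.7122,
    (32, 8): 2.1856, (32, 16): 0.3729,
}


def find_regressions(times):
    """Return (mpi, omp_before, omp_after) wherever more threads ran slower."""
    by_mpi = defaultdict(list)
    for (mpi, omp), seconds in times.items():
        by_mpi[mpi].append((omp, seconds))

    regressions = []
    for mpi in sorted(by_mpi):
        rows = sorted(by_mpi[mpi])  # order by thread count
        for (omp_a, t_a), (omp_b, t_b) in zip(rows, rows[1:]):
            if t_b > t_a:
                regressions.append((mpi, omp_a, omp_b))
    return regressions


regressions = find_regressions(phase33)
for mpi, omp_a, omp_b in regressions:
    print(f"MPI={mpi}: going from {omp_a} to {omp_b} OMP threads was slower")
```

With these numbers, every MPI count from 2 to 16 shows a slowdown when going from 4 to 8 threads, which is the pattern the question is about.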

Thanks a lot for any advice

Guillaume

Gennady_F_Intel · 12-19-2018

Hi Guillaume,

This might be a scalability issue at the solution stage.

Which version of MKL are you running?

Thanks, Gennady

Emond__Guillaume · 12-19-2018

Hi,

We used MKL 2017.0.4 for the Intel 64 architecture.

Gennady_F_Intel · 12-20-2018

Have you had a chance to try version 2019 and check the scalability with that version of MKL? You can get the latest update for free.

If not, we will check these numbers on our side.

Emond__Guillaume · 12-20-2018

Our application runs on Graham, a cluster at Compute Canada (https://docs.computecanada.ca/wiki/Graham). At the moment, the most recent version available is MKL 2018.0.3 and a request to install MKL 2019 could take a while before it is processed.

It seems that switching to MKL 2018 does not solve the problem. Here are the results for Serena.mtx (with MKL 2018).

MPI / OMP    Phase=22 (s)    Phase=33 (s)
 2 /  2      431.21          2.4546
 2 /  4      266.05          1.5849
 2 /  8      266.05          1.7487
 2 / 16      83.262          1.3433
 4 /  2      245.56          1.4342
 4 /  4      134.69          1.3420
 4 /  8      87.154          2.0223
 4 / 16      54.010          2.1256
 8 /  2      152.77          1.0543
 8 /  4      89.457          1.0726
 8 /  8      60.434          2.1468
 8 / 16      38.522          1.0357
16 /  2      110.03          1.5032
16 /  4      55.077          0.5922
16 /  8      40.506          0.9916
16 / 16      28.883          0.7282
32 /  8      34.848          0.7771
32 / 16      27.193          0.4550
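As a rough cross-check, the end points of the two Serena.mtx tables (2 MPI x 2 OMP and 32 MPI x 16 OMP) can be turned into speedups for both MKL versions. The Python sketch below uses only numbers from those tables. The thread counts 4 and 512 assume one core per OpenMP thread, which the thread does not state, and the names `runs` and `speedups` are illustrative.

```python
# (phase 22 seconds, phase 33 seconds) at the smallest and largest configuration
# of each Serena.mtx table: 2 MPI x 2 OMP versus 32 MPI x 16 OMP.
runs = {
    "MKL 2017": {"small": (408.70, 2.4668), "large": (24.977, 0.3729)},
    "MKL 2018": {"small": (431.21, 2.4546), "large": (27.193, 0.4550)},
}

# Assumption: one core per OpenMP thread, so 2*2 = 4 and 32*16 = 512 cores.
CORE_RATIO = (32 * 16) / (2 * 2)


def speedups(small, large):
    """Speedup of each phase when moving from the small to the large run."""
    return tuple(t_small / t_large for t_small, t_large in zip(small, large))


results = {name: speedups(r["small"], r["large"]) for name, r in runs.items()}

for name, (s22, s33) in results.items():
    print(f"{name}: phase 22 speedup {s22:.1f}x, phase 33 speedup {s33:.1f}x "
          f"for {CORE_RATIO:.0f}x more threads")
```

Both versions give roughly a 16x speedup for factorisation and between 5x and 7x for the solve, so on these end points the upgrade does not change the picture.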

I would appreciate it if you could check these results.

Thank you!

Guillaume

PS: Our application is compiled with these flags:

-O3 -qopenmp -mkl=parallel -std=c++11 -Wall

-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl -lmpi

Gennady_F_Intel · 12-21-2018

Sure, I will try to check and get back to you.

Gennady_F_Intel · 12-23-2018

Here is what I obtained with OMP threads only, due to a cluster access problem.

RM07M. The scaling is poor.

MKL version 2019 u1.

OMP threads    phase == 22 (sec)    phase == 33 (sec)
1              973.9                5.04
2              567.57               3.01
4              313.77               2.37
8              192.01               1.94
16             124.3                1.76
32             107.69               1.72
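To put a number on the poor scaling, the table above can be converted into speedups and into the Karp-Flatt experimentally determined serial fraction, e = (1/S - 1/p) / (1 - 1/p), where S is the speedup on p threads. This metric is an analysis choice made for this sketch, not something used in the thread. The timings are copied from the table, and the names `rows`, `karp_flatt` and `report` are illustrative.

```python
# (OMP threads, phase 22 seconds, phase 33 seconds), copied from the table above.
rows = [
    (1, 973.9, 5.04),
    (2, 567.57, 3.01),
    (4, 313.77, 2.37),
    (8, 192.01, 1.94),
    (16, 124.3, 1.76),
    (32, 107.69, 1.72),
]


def karp_flatt(speedup, p):
    """Karp-Flatt serial fraction for a measured speedup on p threads (p > 1)."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


base22, base33 = rows[0][1], rows[0][2]  # single-thread reference times

report = []
for p, t22, t33 in rows[1:]:
    s22, s33 = base22 / t22, base33 / t33
    report.append((p, s22, s33, karp_flatt(s22, p), karp_flatt(s33, p)))

print("threads  S(22)  S(33)  e(22)  e(33)")
for p, s22, s33, e22, e33 in report:
    print(f"{p:7d}  {s22:5.2f}  {s33:5.2f}  {e22:5.2f}  {e33:5.2f}")
```

At 32 threads this gives a speedup of about 9x for phase 22 and just under 3x for phase 33, with an apparent serial fraction of roughly 0.08 and 0.32 respectively. The metric only summarizes the measurements; it does not say what limits the solve phase.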