Making the cluster_sparse_solver Solve phase scalable in parallel

Marcos_V_ · 08-09-2016

Hello,

I have a problem with a symmetric matrix (either positive definite or positive semi-definite) that I need to solve many times as the time integration of my method progresses. So I'm mostly interested in getting parallel scalability out of the solve phase of cluster_sparse_solver.

So far I haven't had much luck testing a 5-point stencil Poisson problem with ~250K unknowns divided among 2, 4, or 8 MPI processes, with 6 OpenMP threads per process. The calculation is done on an InfiniBand Intel Xeon cluster where each node has two CPUs with 6 cores each, so for those numbers of MPI processes I used 1, 2, or 4 nodes. For MPI I use Open MPI 1.8.2 for InfiniBand and the custom BLACS built for this MPI library version. The timings are pretty much flat. I use 0 iterative refinement steps in the solve.

I would very much appreciate any hints or suggestions on how to speed up this solve phase and make it more scalable.

Thank you,

Marcos

Marcos_V_ · 08-11-2016

Should I post this as a question?

Ying_H_Intel · 08-15-2016

Hi Marcos,

You can attach your test code and the compilation command line here so that more people can help.

As I understand it, if you have, for example, multiple right-hand sides to feed to the solve phase of cluster_sparse_solver, that phase should already run in parallel. Could you please show your code and your results?

Best Regards,

Ying

Marcos_V_1 · 08-16-2016

Hi Ying, thank you for your reply!

My code currently solves the constant-coefficient Poisson equation and is implemented within a much larger solver for the low-Mach equations of fire dynamics. Although the matrix is factored once at the beginning, I don't know multiple right-hand sides at once: I know one RHS per integration time step, because it depends on the evolution of other variables. So I can only call the solve phase for one right-hand side at a time.

The matrix is built in parallel, with each process building a set of consecutive rows of it. I store it in distributed CSR format and feed it to cluster_sparse_solver for symbolic and numerical factorization before entering the time step loop. That works just fine and doesn't take much time.

Then, as time integration progresses, I call the solve phase twice per time step (we use an explicit RK2 time integrator) in the form:

1. Build the right-hand side F_H (H is the unknown here, the head or Bernoulli integral).

2. Solve with factored matrix given by handle PT_H:

   IF ( H_MATRIX_INDEFINITE ) THEN
      MTYPE  = -2 ! symmetric indefinite
   ELSE ! positive definite
      MTYPE  =  2
   ENDIF

   !.. Back substitution and iterative refinement
   IPARM(8) =  0 ! max number of iterative refinement steps
   PHASE    = 33 ! only solving
#ifdef WITH_PARDISO
   CALL PARDISO(PT_H, MAXFCT, MNUM, MTYPE, PHASE, NUNKH_TOTAL, &
              A_H, IA_H, JA_H, PERM, NRHS, IPARM, MSGLVL, F_H, X_H, ERROR)
#elif defined(WITH_CLUSTER_SPARSE_SOLVER)
   CALL CLUSTER_SPARSE_SOLVER(PT_H, MAXFCT, MNUM, MTYPE, PHASE, NUNKH_TOTAL, &
                A_H, IA_H, JA_H, PERM, NRHS, IPARM, MSGLVL, F_H, X_H, MPI_COMM_WORLD, ERROR)
#endif

3. Finally, with the solution in X_H, continue the computations of the time integrator.

So I basically need to know if there is a way to make PHASE=33 (solve) with one RHS F_H more scalable. To give you timings: the code was compiled with the -O2 -ipo optimization flags and run on the cluster I described before, on a problem with 360K unknowns distributed evenly among the MPI processes. The time-stepping wall times (~460 time steps) are:

MPI processes   OMP threads per MPI process   Time-stepping wall time on cluster (sec)
      2                      6                               270.4
      4                      6                               239.7
      8                      6                               549.2

You can see that scaling is pretty much flat between 2 and 4 MPI processes, and that the wall time more than doubles going from 4 to 8. I understand this example is not very detailed, but if you want to compile and test the code, I can give you access to it in a more private communication. It would definitely help our efforts to upgrade our solver.
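
As a quick sanity check on these numbers, the relative speedup and parallel efficiency can be computed from the wall times above. This is only a minimal sketch and not part of the solver: the program name SCALING_CHECK and its variables are made up for illustration, the 2-process run is taken as the baseline, and the wall times cover the whole time-stepping loop rather than the solve phase alone.

```fortran
! Sketch only: relative speedup and parallel efficiency from the reported
! time-stepping wall times. All names here are illustrative assumptions.
PROGRAM SCALING_CHECK
   IMPLICIT NONE
   INTEGER, PARAMETER :: DP    = KIND(1.D0)
   INTEGER, PARAMETER :: NRUNS = 3
   INTEGER,  PARAMETER :: NPROCS(NRUNS) = (/ 2, 4, 8 /)
   REAL(DP), PARAMETER :: WALL(NRUNS)   = (/ 270.4_DP, 239.7_DP, 549.2_DP /)
   REAL(DP) :: SPEEDUP, EFFICIENCY
   INTEGER  :: I

   DO I = 1, NRUNS
      ! Speedup relative to the 2-process run (the smallest case measured)
      SPEEDUP    = WALL(1) / WALL(I)
      ! Efficiency: measured speedup divided by the ideal NPROCS(I)/NPROCS(1)
      EFFICIENCY = SPEEDUP * REAL(NPROCS(1),DP) / REAL(NPROCS(I),DP)
      ! Columns: MPI processes, wall time (sec), speedup, efficiency
      WRITE(*,'(I4,3F10.3)') NPROCS(I), WALL(I), SPEEDUP, EFFICIENCY
   ENDDO
END PROGRAM SCALING_CHECK
```

With the timings above this gives a speedup of about 1.13 going from 2 to 4 processes (efficiency about 0.56) and about 0.49 going from 2 to 8 (efficiency about 0.12).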

I have seen, in a publication on the Intel website with timings for cluster_sparse_solver, that on a similar cluster and a similar problem there was good parallel scaling with MPI processes, especially when several OpenMP threads are used. I assume those timings covered the whole solution process and that the scaling came from the numerical factorization, which is the most expensive part. Is this something you guys have seen?

Thank you very much for your time and help,

Marcos