Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Very large sparse matrix solving problem

YONGHEE_L_

Hi dear Intel,
I am using MKL 2017 Update 1 and MPICH 3.1.4.
I have 2 machines with 512 GB of memory each.

When I tried to solve the SPD sparse matrix with 488 and 1500 million elements on these two machines, MKL failed with the following error:

===============================================================================================
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1610)........: MPI_Bcast(buf=0x2ab9fd981080, count=438400118, MPI_LONG_LONG_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1462)...:
MPIR_Bcast(1486)........:
MPIR_Bcast_intra(1295)..:
MPIR_Bcast_binomial(252): message sizes do not match across processes in the collective routine: Received -32766 but expected -787766352
[proxy:0:0@phas0007] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@phas0007] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@phas0007] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@phas0007] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@phas0007] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@phas0007] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@phas0007] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
=============================================================================================== 

(However, in the 100 million case, the 'cluster_sparse_solver_64' function worked correctly.)

Could you please let me know how I can deal with this problem?
Thank you so much in advance.
Have a nice day.

Regards,
Yong-hee


P.S. If I use Intel MPI instead of MPICH, is it likely to work?

Gennady_F_Intel

Hi Yong-hee. Actually, we validate MPICH 3.1 as well as 3.2. Yes, you can try Intel MPI. Please let us know how it works. Could you identify at which PARDISO computation stage (reordering, factorization, or solution) the problem appears?

YONGHEE_L_

Hi again, Gennady!
I checked the output file and found that my program stopped at the factorization stage. (The reordering stage completed.)
The count in my original post, 438400118, is the number of elements in the CSR 'ia' array.
(Therefore, the 'ia' array occupies 438,400,118 * 8 bytes = 3,507,200,944 bytes, about 3.27 GiB. 'ja' is probably even larger than 'ia'.)
As you can see, my matrix exceeds the 'MPI_Bcast' limit.
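
A quick check of that arithmetic, as a minimal sketch. It assumes the message size in bytes is stored in a signed 32-bit integer somewhere in the broadcast path. That is an assumption, but it is consistent with the log: wrapping count * 8 reproduces the "expected -787766352" value from the error stack. The helper name wrap_int32 is hypothetical.

```python
# Broadcast count from the error log; MPI_LONG_LONG_INT is assumed to be 8 bytes.
COUNT = 438_400_118
ELEM_BYTES = 8
INT32_MAX = 2**31 - 1


def wrap_int32(n: int) -> int:
    """Reinterpret n as a signed 32-bit integer (two's-complement wraparound)."""
    return (n + 2**31) % 2**32 - 2**31


total_bytes = COUNT * ELEM_BYTES
print(total_bytes)               # 3507200944
print(total_bytes > INT32_MAX)   # True: does not fit in a signed 32-bit int
print(wrap_int32(total_bytes))   # -787766352, the "expected" size in the log
```
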
I expected Intel MKL to split the large array into smaller arrays to satisfy the 'MPI_Bcast' limit, but it did not work as I expected.
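
The kind of splitting meant here can be sketched as follows. This is only the index arithmetic, not what MKL does internally: the function chunk_ranges and the 2^31 - 1 byte limit per piece are assumptions made for illustration. Each (offset, length) pair would then be sent with its own broadcast call.

```python
from typing import Iterator, Tuple

INT32_MAX = 2**31 - 1  # assumed per-message byte limit


def chunk_ranges(count: int, elem_bytes: int,
                 max_bytes: int = INT32_MAX) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs, in elements, that cover `count` elements
    so that each piece is at most `max_bytes` bytes."""
    per_chunk = max_bytes // elem_bytes
    if per_chunk < 1:
        raise ValueError("max_bytes is smaller than one element")
    offset = 0
    while offset < count:
        length = min(per_chunk, count - offset)
        yield offset, length
        offset += length


# The 'ia' array from the log: 438,400,118 elements of 8 bytes each.
pieces = list(chunk_ranges(438_400_118, 8))
for offset, length in pieces:
    print(offset, length, length * 8)
```

For the 'ia' array this gives two pieces, each below the assumed limit. Since the failing broadcast appears to be issued inside the solver rather than in user code, this is something the library would have to do, not the caller.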

Actually, I wrote my MPI code from the MKL cluster example, and it has no additional lines apart from the data loading part.
So you can easily imagine what my code looks like.
(If you want to check my code, I can upload it together with the matrix generator.)

Thank you in advance for your support.

Regards,
Yong-hee
