Intel® oneAPI Math Kernel Library

cluster sparse solver returns error = -1

Arrigoni__Viviana

I can't understand why I get the "input inconsistent" error (error = -1) when I run the code that I am attaching. I am using a toy example to check it, and that works for P = 4, where P is the number of MPI processes.
I set: 
iparm[34] = 1; // zero-based indexing
iparm[36] = 0; // CSR format
iparm[39] = 2; // matrix distributed among processes

In the csrA_rand.txt file, the first element is n, the number of rows and columns of the sparse matrix A. Process 0 reads n and broadcasts it to all the other processes. The first n % P processes hold (n / P) + 1 rows each, while the remaining P - (n % P) processes hold n / P rows each. In this case the matrix size n is 48 anyway, so all processes hold the same number of rows (12). Since indexing starts from 0, every local ia satisfies ia[0] = 0 and ia[myn] = nA, where myn is the local number of rows and nA the local number of nonzeros. All processes read their portions of a, ja and ia correctly, or at least that is how it seems to me.
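Below is a simplified sketch of the kind of call I am making (not the attached code). The function name solve_local_part is just illustrative, and first_row / last_row stand for the 0-based global indices of the first and last rows held by the calling process; mtype = 11 assumes a real nonsymmetric matrix, and iparm[26] = 1 additionally switches on the matrix checker, which reports which input array is inconsistent.

#include <mpi.h>
#include <mkl_cluster_sparse_solver.h>  /* with the *_ilp64 libraries, compile with -DMKL_ILP64 so MKL_INT is 64-bit */

MKL_INT solve_local_part(MKL_INT n, MKL_INT first_row, MKL_INT last_row,
                         double *a, MKL_INT *ia, MKL_INT *ja,
                         double *b, double *x)
{
    void    *pt[64]    = { 0 };                   /* solver handle, must start zeroed           */
    MKL_INT  iparm[64] = { 0 };
    MKL_INT  maxfct = 1, mnum = 1, mtype = 11;    /* 11 = real nonsymmetric; adjust if needed   */
    MKL_INT  nrhs = 1, msglvl = 1, phase, idum = 0;
    MKL_INT  error = 0, release_error = 0;
    double   ddum = 0.0;
    int      comm = MPI_Comm_c2f(MPI_COMM_WORLD); /* the solver expects a Fortran MPI handle    */

    iparm[0]  = 1;           /* do not use the solver defaults                     */
    iparm[1]  = 2;           /* nested-dissection (METIS) fill-in reordering       */
    iparm[26] = 1;           /* matrix checker: reports exactly which input is bad */
    iparm[34] = 1;           /* zero-based ia/ja                                   */
    iparm[36] = 0;           /* CSR format                                         */
    iparm[39] = 2;           /* distributed A, distributed rhs/solution            */
    iparm[40] = first_row;   /* first global row of this process (0-based)         */
    iparm[41] = last_row;    /* last  global row of this process (0-based)         */

    phase = 13;              /* analysis + factorization + solve                   */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);

    phase = -1;              /* release the solver's internal memory               */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &comm, &release_error);
    return error;            /* error == -1 is the "input inconsistent" code       */
}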


*** Edit: found an error when generating ja.

Gennady_F_Intel
Moderator

What do you mean by "*** Edit, found an error when generating ja"?  

 

Arrigoni__Viviana

I mean that ja was not correct: it had two elements in the same position (same row and column).

By the way, when I run the attached program on bigger data, it doesn't seem very efficient. I have access to a cluster and I run experiments with P = k^2 MPI processes, for different values of k. To do so, I launch jobs that use k compute nodes, with k tasks per node and 4 CPUs per task (the compute nodes have 2 x 18-core Intel Xeon E5-2697 v4 (Broadwell) processors at 2.30 GHz). I see that performance improves a lot when I instead launch jobs with k^2 compute nodes and 1 task per node (again with 4 CPUs per task).

Where the MKL Developer Reference for C describes cluster_sparse_solver, it says: "A hybrid implementation combines Message Passing Interface (MPI) technology for data exchange between parallel tasks (processes) running on different nodes, and OpenMP* technology for parallelism inside each node of the cluster. This approach effectively uses modern hardware resources such as clusters consisting of nodes with multi-core processors. The solver code is optimized for the latest Intel processors, but also performs well on clusters consisting of non-Intel processors."

When it says "for data exchange between parallel tasks (processes) running on different nodes", does it mean that a configuration where distributed (MPI) processes run on the same compute node (as in the case of k compute nodes and k tasks per node) is poorly supported?
I am compiling with mpiicc and linking the following (the full compile/link commands are sketched after the list):
-lmkl_intel_ilp64
-lmkl_intel_thread
-lmkl_core
-lmkl_blacs_intelmpi_ilp64
-liomp5
-lpthread
-lm
-ldl
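
(With the ILP64 libraries above, the compile step also needs -DMKL_ILP64 so that MKL_INT matches the 64-bit integer interface. Roughly, with solver.c and ${MKLROOT} as placeholders:

mpiicc -DMKL_ILP64 -I${MKLROOT}/include -c solver.c
mpiicc solver.o -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl -o solver)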

Kirill_V_Intel
Employee

Hello,

I'd like to ask a couple of clarifying questions. What does k equal in your setup? What do you mean by "task": is it an MPI process? Also, what is the size of the matrix for your problem? And finally, approximately what timings do you get?

As a rule of thumb, for best performance one should use one MPI process per computational node (or per socket) and dedicate all the cores available to OpenMP. With that, you need to set thread affinity correctly (for example, if you have hyperthreading enabled, you would not want to use secondary threads).
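
For example, on nodes like yours (2 x 18 cores), a launch along the following lines could be a starting point. This is only a sketch assuming Intel MPI and a plain mpirun command (./solver and the rank count are placeholders); adapt it to your batch scheduler.

export OMP_NUM_THREADS=18                           # one rank per socket -> 18 OpenMP threads per rank
export I_MPI_PIN_DOMAIN=omp                         # give each rank a domain of OMP_NUM_THREADS cores
export KMP_AFFINITY=granularity=fine,compact,1,0    # pin threads to physical cores, avoiding hyperthread contexts
mpirun -n <ranks> -ppn 2 ./solver                   # 2 ranks per node = 1 per socket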

Hope this helps.

Best,
Kirill

Arrigoni__Viviana

Hello, 

In my experiments, k is in {3, 4, 6, 7, 9}. I have run experiments on several matrix sizes, up to 300000; for that size the execution time for factorization + computation that I get is approximately 30 seconds, if not more, for P = 36, 49, 81.
I am actually not sure about hyperthreading.
