<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Making the cluster_sparse_solver Solve phase scalable in parallel in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091479#M23281</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I have a problem with a symmetric matrix (either positive definite or semi-definite) for which I need to compute many solutions as the time integration of my method progresses. I am therefore mostly interested in the parallel scalability of the solve phase of cluster_sparse_solver.&lt;/P&gt;

&lt;P&gt;So far I haven't had much luck testing a 5-point stencil Poisson problem with ~250K unknowns divided among 2, 4, or 8 MPI processes, with 6 OpenMP threads per process. The calculation is done on an InfiniBand Intel Xeon cluster where each node has two CPUs with 6 cores each, so for these numbers of MPI processes I used 1, 2, or 4 nodes. For MPI I use Open MPI 1.8.2 for InfiniBand and the custom BLACS for this MPI library version. The timings are pretty much flat. I use 0 iterative refinement steps in the solve.&lt;/P&gt;

&lt;P&gt;I would very much appreciate any hints or suggestions on how to speed up this solve phase and make it more scalable.&lt;/P&gt;

&lt;P&gt;Thank you,&lt;/P&gt;

&lt;P&gt;Marcos&lt;/P&gt;</description>
    <pubDate>Tue, 09 Aug 2016 21:43:16 GMT</pubDate>
    <dc:creator>Marcos_V_</dc:creator>
    <dc:date>2016-08-09T21:43:16Z</dc:date>
    <item>
      <title>Making the cluster_sparse_solver Solve phase scalable in parallel</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091479#M23281</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I have a problem with a symmetric matrix (either positive definite or semi-definite) for which I need to compute many solutions as the time integration of my method progresses. I am therefore mostly interested in the parallel scalability of the solve phase of cluster_sparse_solver.&lt;/P&gt;

&lt;P&gt;So far I haven't had much luck testing a 5-point stencil Poisson problem with ~250K unknowns divided among 2, 4, or 8 MPI processes, with 6 OpenMP threads per process. The calculation is done on an InfiniBand Intel Xeon cluster where each node has two CPUs with 6 cores each, so for these numbers of MPI processes I used 1, 2, or 4 nodes. For MPI I use Open MPI 1.8.2 for InfiniBand and the custom BLACS for this MPI library version. The timings are pretty much flat. I use 0 iterative refinement steps in the solve.&lt;/P&gt;

&lt;P&gt;I would very much appreciate any hints or suggestions on how to speed up this solve phase and make it more scalable.&lt;/P&gt;

&lt;P&gt;Thank you,&lt;/P&gt;

&lt;P&gt;Marcos&lt;/P&gt;</description>
      <pubDate>Tue, 09 Aug 2016 21:43:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091479#M23281</guid>
      <dc:creator>Marcos_V_</dc:creator>
      <dc:date>2016-08-09T21:43:16Z</dc:date>
    </item>
    <item>
      <title>Should I post this as a</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091480#M23282</link>
      <description>&lt;P&gt;Should I post this as a question?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Aug 2016 19:35:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091480#M23282</guid>
      <dc:creator>Marcos_V_</dc:creator>
      <dc:date>2016-08-11T19:35:41Z</dc:date>
    </item>
    <item>
      <title>Hi Marcos, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091481#M23283</link>
      <description>&lt;P&gt;Hi Marcos,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;You can attach your test code and the compilation command line here so that more people can help.&lt;/P&gt;

&lt;P&gt;As I understand it, if you have, for example, multiple right-hand sides to feed to the solve phase of cluster_sparse_solver, it should already run in parallel. Could you please show your code and your results?&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2016 06:20:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091481#M23283</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2016-08-16T06:20:37Z</dc:date>
    </item>
    <item>
      <title>Hi Ying, thank you for your</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091482#M23284</link>
      <description>&lt;P&gt;Hi Ying, thank you for your reply!&lt;/P&gt;

&lt;P&gt;My code currently solves the constant-coefficient Poisson equation, and is implemented within a much larger solver for the low-Mach-number equations of fire dynamics. Although the matrix is factored once at the beginning, I do not know multiple right-hand sides at once: I know only one RHS per integration time step, since it depends on the evolution of other variables. So I can only call the solve phase for one right-hand side at a time.&lt;/P&gt;

&lt;P&gt;The matrix is built in parallel, with each process building a set of consecutive rows. I store it in distributed CSR format and pass it to cluster_sparse_solver for symbolic and numerical factorization before entering the time-step loop. That works just fine and doesn't take much time.&lt;/P&gt;

&lt;P&gt;Then, as time integration progresses, I call the solve phase twice per time step (we use an explicit RK2 time integrator) as follows:&lt;/P&gt;

&lt;P&gt;1. Build the right-hand side F_H (H is the unknown here, the head or Bernoulli integral).&lt;/P&gt;

&lt;P&gt;2. Solve with the factored matrix given by handle PT_H:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;   IF ( H_MATRIX_INDEFINITE ) THEN
      MTYPE  = -2 ! symmetric indefinite
   ELSE ! positive definite
      MTYPE  =  2
   ENDIF

   !.. Back substitution and iterative refinement
   IPARM(8) =  0 ! max numbers of iterative refinement steps
   PHASE    = 33 ! only solving
#ifdef WITH_PARDISO
   CALL PARDISO(PT_H, MAXFCT, MNUM, MTYPE, PHASE, NUNKH_TOTAL, &amp;amp;
              A_H, IA_H, JA_H, PERM, NRHS, IPARM, MSGLVL, F_H, X_H, ERROR)
#elif WITH_CLUSTER_SPARSE_SOLVER
   CALL CLUSTER_SPARSE_SOLVER(PT_H, MAXFCT, MNUM, MTYPE, PHASE, NUNKH_TOTAL, &amp;amp;
                A_H, IA_H, JA_H, PERM, NRHS, IPARM, MSGLVL, F_H, X_H, MPI_COMM_WORLD, ERROR)
#endif&lt;/PRE&gt;

&lt;P&gt;3. Finally, with the solution in X_H, continue the computations of the time integrator.&lt;/P&gt;
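
&lt;P&gt;One way to check whether the solve call itself is what fails to scale would be to accumulate the time spent in step 2 separately from the rest of the time step. The following is only a minimal, self-contained sketch under stated assumptions: solve_stub is a hypothetical stand-in for the PHASE = 33 call, the problem size n is an arbitrary illustrative value, and the Fortran intrinsic SYSTEM_CLOCK is used so that the sketch compiles without MPI (in the real MPI code, MPI_WTIME would be the more natural timer).&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program time_solve_phase
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
   integer, parameter :: n_steps = 460   ! number of time steps in the run
   integer, parameter :: n = 1000        ! hypothetical size, illustration only
   integer(i8) :: t0, t1, rate
   double precision :: f_h(n), x_h(n), t_solve
   integer :: step, stage

   t_solve = 0.0d0
   call system_clock(count_rate=rate)    ! clock ticks per second
   do step = 1, n_steps
      do stage = 1, 2                    ! two solves per time step (RK2)
         f_h = dble(step + stage)        ! stand-in for building the RHS
         call system_clock(t0)
         call solve_stub(n, f_h, x_h)    ! stand-in for the PHASE = 33 call
         call system_clock(t1)
         ! accumulate only the time spent inside the solve call
         t_solve = t_solve + dble(t1 - t0) / dble(rate)
      end do
   end do
   print '(a, f10.4, a)', 'accumulated solve time: ', t_solve, ' s'

contains

   ! Hypothetical placeholder: NOT the MKL solver, just some work to time.
   subroutine solve_stub(m, f, x)
      integer, intent(in) :: m
      double precision, intent(in)  :: f(m)
      double precision, intent(out) :: x(m)
      x = 0.5d0 * f
   end subroutine solve_stub

end program time_solve_phase&lt;/PRE&gt;

&lt;P&gt;Comparing the accumulated solve time with the total time-stepping wall time for each process count would show how much of the total is actually spent in the solve phase.&lt;/P&gt;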

&lt;P&gt;So I basically need to know whether there is a way to make PHASE=33 (solve) with one RHS F_H more scalable. To give you timings: the code, compiled with the -O2 -ipo optimization flags and run on the cluster I described before on a problem with 360K unknowns distributed evenly among the MPI processes, gives the following time-stepping wall times (~460 time steps):&lt;/P&gt;

&lt;TABLE&gt;
&lt;TR&gt;&lt;TH&gt;MPI processes&lt;/TH&gt;&lt;TH&gt;OMP_THREADS per MPI process&lt;/TH&gt;&lt;TH&gt;Cluster wall time for time-stepping (sec)&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;6&lt;/TD&gt;&lt;TD&gt;270.4&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;4&lt;/TD&gt;&lt;TD&gt;6&lt;/TD&gt;&lt;TD&gt;239.7&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;8&lt;/TD&gt;&lt;TD&gt;6&lt;/TD&gt;&lt;TD&gt;549.2&lt;/TD&gt;&lt;/TR&gt;
&lt;/TABLE&gt;

&lt;P&gt;You can see that scaling is pretty much flat between 2 and 4 MPI processes (the wall time drops by only about 11%), and gets much worse going from 4 to 8, where the wall time more than doubles. I understand this example is not very detailed, but if you want to compile and test the code, I can give you access to it in a more private communication. It would definitely help our efforts on upgrading our solver.&lt;/P&gt;
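
&lt;P&gt;To put numbers on this, the speedup and parallel efficiency relative to the 2-process run can be computed directly from the wall times above. This is a minimal sketch: the choice of the 2-process run as the baseline is an assumption made for illustration, and the figures describe the whole time-stepping loop, not the solve phase in isolation.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program scaling_summary
   implicit none
   integer, parameter :: n_runs = 3
   ! MPI process counts and time-stepping wall times (s) from the runs above
   integer :: procs(n_runs) = [2, 4, 8]
   double precision :: wall(n_runs) = [270.4d0, 239.7d0, 549.2d0]
   double precision :: speedup, efficiency
   integer :: i

   print '(a)', ' MPI procs   wall (s)   speedup   efficiency'
   do i = 1, n_runs
      ! speedup relative to the first (2-process) run
      speedup = wall(1) / wall(i)
      ! efficiency: speedup divided by the increase in process count
      efficiency = speedup * dble(procs(1)) / dble(procs(i))
      print '(i10, f11.1, f10.2, f13.2)', procs(i), wall(i), speedup, efficiency
   end do
end program scaling_summary&lt;/PRE&gt;

&lt;P&gt;With these wall times the speedup over the 2-process run is about 1.13 on 4 processes and about 0.49 on 8, that is, a parallel efficiency of roughly 0.56 and 0.12 respectively.&lt;/P&gt;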

&lt;P&gt;I have seen in a publication on the Intel website with cluster_sparse_solver timings that, on a similar cluster and a similar problem, there was good parallel scaling with the number of MPI processes, especially when several OpenMP threads are used. I assume this considered the whole solution process, and that the scaling was obtained in the more expensive numerical factorization. Is this something you guys have seen?&lt;/P&gt;

&lt;P&gt;Thank you very much for your time and help,&lt;/P&gt;

&lt;P&gt;Marcos&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2016 14:35:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Making-the-cluster-sparse-solver-Solve-phase-scalable-in/m-p/1091482#M23284</guid>
      <dc:creator>Marcos_V_1</dc:creator>
      <dc:date>2016-08-16T14:35:24Z</dc:date>
    </item>
  </channel>
</rss>

