<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster sparse solver slower when using three machines instead of two in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-sparse-solver-slower-when-using-three-machines-instead/m-p/1325708#M32248</link>
    <description>&lt;P&gt;Hello!&lt;/P&gt;
&lt;P&gt;To be honest, your finding doesn't strike me as unexpected. Parallelism inside direct sparse solvers is very complex. It is known to use a kind of elimination tree, which is distributed in some way among parallel workers (threads, processes). Because this is typically a binary tree, it is easier to balance when the number of workers is a power of two (with MPI processes as the workers at the outermost level).&lt;/P&gt;
&lt;P&gt;So I suspect that some load imbalance is hindering performance in the case of 3 MPI processes. This is partially supported by the fact that going from 2 to 4 nodes didn't scale perfectly, which might hint that the available parallelism is becoming limited, so any imbalance becomes more visible as a side effect.&lt;/P&gt;
&lt;P&gt;As for how to evaluate the imbalance quantitatively, the cluster sparse solver has no special features for that. I believe a general-purpose profiler would show it.&lt;/P&gt;
&lt;P&gt;With the information you provided and without diving deep into the actual case, it's hard to say more.&lt;/P&gt;
&lt;P&gt;Best,&lt;BR /&gt;Kirill&lt;/P&gt;</description>
    <pubDate>Fri, 29 Oct 2021 04:21:44 GMT</pubDate>
    <dc:creator>Kirill_V_Intel</dc:creator>
    <dc:date>2021-10-29T04:21:44Z</dc:date>
    <item>
      <title>Cluster sparse solver slower when using three machines instead of two</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-sparse-solver-slower-when-using-three-machines-instead/m-p/1325575#M32246</link>
      <description>&lt;P&gt;My machines have 40 physical cores and are on an InfiniBand network. I get an excellent speedup when going from one to two machines ( ~30% ). But going from two to three machines, there is no speedup. In fact, the factorization takes a tiny bit longer. But then if I use four machines, I get a decent speedup compared to two machines ( ~20% ).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;See my factorization times below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Larger matrix ( 4.7 million equations ):&lt;/P&gt;
&lt;P&gt;1 machine: 31 s&lt;BR /&gt;2 machines: 18 s&lt;BR /&gt;3 machines:&amp;nbsp; 19 s&lt;BR /&gt;4 machines: 12 s&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Small matrix ( 1.2 million equations ):&lt;/P&gt;
&lt;P&gt;1 machine: 3.5 s&lt;BR /&gt;2 machines: 2.6 s&lt;BR /&gt;3 machines:&amp;nbsp; 2.9 s&lt;BR /&gt;4 machines: 2.4 s&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I see there have been some other threads about this (linked below). Is this expected behavior, or am I doing something wrong?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/No-speedup-of-cluster-sparse-solver-beyond-32-cpus/m-p/1082933#M22872" target="_blank" rel="noopener"&gt;https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/No-speedup-of-cluster-sparse-solver-beyond-32-cpus/m-p/1082933#M22872&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Direct-Sparse-Solver-for-Clusters-poor-scaling/m-p/1147400#M26817" target="_blank" rel="noopener"&gt;https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Direct-Sparse-Solver-for-Clusters-poor-scaling/m-p/1147400#M26817&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 17:42:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-sparse-solver-slower-when-using-three-machines-instead/m-p/1325575#M32246</guid>
      <dc:creator>segmentation_fault</dc:creator>
      <dc:date>2021-10-28T17:42:16Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster sparse solver slower when using three machines instead of two</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-sparse-solver-slower-when-using-three-machines-instead/m-p/1325708#M32248</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;
&lt;P&gt;To be honest, your finding doesn't strike me as unexpected. Parallelism inside direct sparse solvers is very complex. It is known to use a kind of elimination tree, which is distributed in some way among parallel workers (threads, processes). Because this is typically a binary tree, it is easier to balance when the number of workers is a power of two (with MPI processes as the workers at the outermost level).&lt;/P&gt;
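&lt;P&gt;To make the power-of-two argument concrete, here is a toy model in Python. It is only an illustration under strong assumptions (a perfectly balanced binary tree cut into four equal-cost subtrees, each handled entirely by one worker) and not a description of how the cluster sparse solver actually schedules work. The function name makespan is hypothetical.&lt;/P&gt;

```python
import math

def makespan(num_subtrees, workers):
    """Elapsed time units when equal-cost subtrees are dealt out to workers.

    Toy assumption: a subtree cannot be split, so the busiest worker
    (the one holding the most subtrees) determines the elapsed time.
    """
    return math.ceil(num_subtrees / workers)

# Two levels of a balanced binary tree give four equal subtrees (assumption).
subtrees = 4
for workers in (1, 2, 3, 4):
    print(f"{workers} workers: {makespan(subtrees, workers)} time units")
```

&lt;P&gt;In this toy model, three workers finish no sooner than two, because one of the three still has to process two subtrees. Only the fourth worker removes that bottleneck.&lt;/P&gt;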
&lt;P&gt;So I suspect that some load imbalance is hindering performance in the case of 3 MPI processes. This is partially supported by the fact that going from 2 to 4 nodes didn't scale perfectly, which might hint that the available parallelism is becoming limited, so any imbalance becomes more visible as a side effect.&lt;/P&gt;
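&lt;P&gt;As a rough check of the scaling, the short Python sketch below computes speedup and parallel efficiency from the larger-matrix factorization times quoted in the original post. The function names are illustrative only.&lt;/P&gt;

```python
# Factorization times in seconds for the larger matrix, copied from the original post.
times = {1: 31.0, 2: 18.0, 3: 19.0, 4: 12.0}

def speedup(times, n):
    """Speedup on n machines relative to the single-machine run."""
    return times[1] / times[n]

def efficiency(times, n):
    """Parallel efficiency: speedup divided by the number of machines."""
    return speedup(times, n) / n

for n in sorted(times):
    print(f"{n} machines: speedup {speedup(times, n):.2f}, "
          f"efficiency {efficiency(times, n):.2f}")
```

&lt;P&gt;With these numbers, efficiency drops from about 0.86 on two machines to about 0.65 on four, and to about 0.54 on three.&lt;/P&gt;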
&lt;P&gt;As for how to evaluate the imbalance quantitatively, the cluster sparse solver has no special features for that. I believe a general-purpose profiler would show it.&lt;/P&gt;
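&lt;P&gt;If per-process timings are available (for example from a profiler, or from timing the factorization call on each MPI rank), one simple way to summarize imbalance is the ratio of the slowest rank's time to the mean. The Python sketch below assumes such timings have already been collected. The function name and the sample values are hypothetical and are not measurements from this thread.&lt;/P&gt;

```python
def imbalance(per_rank_seconds):
    """Ratio of the slowest rank's time to the mean time.

    A value of 1.0 means perfectly balanced; larger values mean the
    slowest rank is holding the others back.
    """
    mean = sum(per_rank_seconds) / len(per_rank_seconds)
    return max(per_rank_seconds) / mean

# Hypothetical per-rank factorization times in seconds (NOT measured data).
example = [19.0, 11.0, 10.0]
print(f"imbalance factor: {imbalance(example):.3f}")
```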
&lt;P&gt;With the information you provided and without diving deep into the actual case, it's hard to say more.&lt;/P&gt;
&lt;P&gt;Best,&lt;BR /&gt;Kirill&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 04:21:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Cluster-sparse-solver-slower-when-using-three-machines-instead/m-p/1325708#M32248</guid>
      <dc:creator>Kirill_V_Intel</dc:creator>
      <dc:date>2021-10-29T04:21:44Z</dc:date>
    </item>
  </channel>
</rss>

