<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic As an update, I used export in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917791#M12759</link>
    <description>&lt;P&gt;As an update, I used&amp;nbsp;export KMP_AFFINITY=verbose,granularity=find,proclist=[0,1,2,3],explicit to bind each thread to a different processor. Here is the output:&lt;/P&gt;
&lt;P&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,4}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {1,5}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {2,6}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {3,7}&lt;/P&gt;
&lt;P&gt;The threaded version of LU factorization takes 0.238 sec while the sequential version takes 0.757 sec, so threading definitely works here. However, the threaded version of dss_solve_real takes 6 times longer than the sequential version, which is very weird.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Sep 2013 05:41:09 GMT</pubDate>
    <dc:creator>Pouya_Z_</dc:creator>
    <dc:date>2013-09-04T05:41:09Z</dc:date>
    <item>
      <title>dss_solve_real takes more time to solve a linear system</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917788#M12756</link>
      <description>&lt;P&gt;I need to solve a system of linear equations with three right-hand-side (rhs) vectors. Initially, I was using the sequential version of MKL (linking with "libmkl_sequential.a")&amp;nbsp;and solving for each rhs vector in turn:&lt;/P&gt;
&lt;P&gt;dss_solve_real(DSS_handle, solOpt, rhs1, 1, x1);&lt;/P&gt;
&lt;P&gt;dss_solve_real(DSS_handle, solOpt, rhs1 + numOfVars, 1, x1 + numOfVars);&lt;/P&gt;
&lt;P&gt;dss_solve_real(DSS_handle, solOpt, rhs1 + 2*numOfVars, 1, x1&amp;nbsp;+ 2*numOfVars);&lt;/P&gt;
&lt;P&gt;where numOfVars is the number of variables.&lt;/P&gt;
&lt;P&gt;Then I decided to ask dss_solve_real to solve for all rhs vectors at once, assuming that it would be roughly 3 times faster. So I linked the code with "libmkl_intel_thread.a"&amp;nbsp;and used the following call:&lt;/P&gt;
&lt;P&gt;dss_solve_real(DSS_handle, solOpt, rhs1, 3, x1);&lt;/P&gt;
&lt;P&gt;To my surprise, the timing is very weird. The sequential version takes 0.548 sec, while solving for all rhs vectors at once takes 5.024 sec, which is almost 10 times longer than the sequential version.&lt;/P&gt;
&lt;P&gt;I feel there is something wrong here and that I may need to set some environment variables. Please let me know if you have had a similar experience.&lt;/P&gt;
&lt;P&gt;Any help is appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Sep 2013 23:05:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917788#M12756</guid>
      <dc:creator>Pouya_Z_</dc:creator>
      <dc:date>2013-09-03T23:05:06Z</dc:date>
    </item>
    <item>
      <title>what is the problem size?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917789#M12757</link>
      <description>&lt;P&gt;What is the problem size? And did you measure the dss_solve_real stage only?&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2013 03:27:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917789#M12757</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2013-09-04T03:27:08Z</dc:date>
    </item>
    <item>
      <title>The matrix is 3408 * 3408 and</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917790#M12758</link>
      <description>&lt;P&gt;The matrix is 3408 * 3408 and I am just measuring the execution time of dss_solve_real.&lt;/P&gt;
&lt;P&gt;I set KMP_AFFINITY=verbose to monitor thread creation, and apparently MKL creates 4 threads. Here is the output:&lt;/P&gt;
&lt;P&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}&lt;/P&gt;
&lt;P&gt;Please note that I am calling a shared library from MATLAB, and threads 0...3 are created by MATLAB.&lt;/P&gt;
&lt;P&gt;I tried to bind the threads to specific CPUs, but that seriously degrades performance.&lt;/P&gt;
&lt;P&gt;Please let me know what I should do to fix the problem.&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2013 04:59:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917790#M12758</guid>
      <dc:creator>Pouya_Z_</dc:creator>
      <dc:date>2013-09-04T04:59:41Z</dc:date>
    </item>
    <item>
      <title>As an update, I used export</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917791#M12759</link>
      <description>&lt;P&gt;As an update, I used&amp;nbsp;export KMP_AFFINITY=verbose,granularity=find,proclist=[0,1,2,3],explicit to bind each thread to a different processor. Here is the output:&lt;/P&gt;
&lt;P&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,4}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {1,5}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {2,6}&lt;BR /&gt;OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {3,7}&lt;/P&gt;
&lt;P&gt;The threaded version of LU factorization takes 0.238 sec while the sequential version takes 0.757 sec, so threading definitely works here. However, the threaded version of dss_solve_real takes 6 times longer than the sequential version, which is very weird.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2013 05:41:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/dss-solve-real-takes-more-time-to-solve-a-linear-system/m-p/917791#M12759</guid>
      <dc:creator>Pouya_Z_</dc:creator>
      <dc:date>2013-09-04T05:41:09Z</dc:date>
    </item>
  </channel>
</rss>

