<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Time domain simulation parallelisation in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</link>
    <description>There's nothing visibly wrong with your first version of OpenMP parallel do. An Intel OpenMP thread team stays active for the time interval set by KMP_BLOCKTIME (default 0.200 seconds), so it's difficult to see what you expected to accomplish differently with the second version. Your openmp-profile report would shed more light on what is happening in libiomp5; you could simply set LD_PRELOAD=&amp;lt;ifort library path&amp;gt;/libiompprof5.so.&lt;BR /&gt;Did you consider trying Inspector to see whether you have unintended data sharing between threads?&lt;BR /&gt;Do you get any parallel speedup from the OpenMP internal to MKL DGETRS when you turn off OpenMP in your calling code?</description>
    <pubDate>Sun, 08 May 2011 10:39:19 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2011-05-08T10:39:19Z</dc:date>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827373#M1412</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I am developing a time-domain simulation tool for power systems. The heart of the simulator is this code:&lt;BR /&gt;&lt;CODE&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;&lt;BR /&gt;enddo&lt;/CODE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Serial code" src="http://dl.dropbox.com/u/258337/serial.png" height="96" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;The middle loop has no data dependencies and accounts for 35% of my total CPU time in the serial version of the program (measured with Intel VTune). DGETRS is a function in mkl_lapack95. My first parallelisation attempt was to add a &lt;CODE&gt;!$omp parallel do&lt;/CODE&gt; directive right before the middle do-loop and play with the scheduling, number of threads etc. to optimise. I got only a marginal speed-up. The CPU time spent in DGETRS is evenly distributed between the threads, but suddenly I have a huge CPU consumption from libiomp5.so.&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Parallel code 1" src="http://dl.dropbox.com/u/258337/code1.png" height="164" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;I thought this was because the threads are created and killed on each iteration of the do while loop.
So, my second approach was this:&lt;BR /&gt;&lt;CODE&gt;!$omp parallel&lt;BR /&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;!$omp do&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end do&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end parallel&lt;/CODE&gt;&lt;BR /&gt;&lt;BR /&gt;This way I thought the threads would stay alive throughout the whole loop and give a better speed-up. Instead, all the numbers came out worse: more elapsed time, more CPU time and less concurrency. The time attributed to libiomp5.so is now halved, but a lot of time is spent on the two !$omp end single directives.&lt;BR /&gt;&lt;BR /&gt;&lt;IMG alt="Parallel code 2" src="http://dl.dropbox.com/u/258337/code2.png" height="159" width="700" /&gt;&lt;BR /&gt;&lt;BR /&gt;I can provide any screenshots and other info from VTune, or run any profiling you need. I use Fortran 95 with the latest Intel compiler on a Linux (Ubuntu) machine.&lt;BR /&gt;&lt;BR /&gt;Any comments (on the problem or in general) on how to optimise the parallelisation are welcome! The incentive for parallelising is that the middle do-loop is going to become more intensive soon with more detailed models; I expect the work done there to exceed 50% of the total.&lt;BR /&gt;&lt;BR /&gt;Thanks in advance,&lt;BR /&gt;Petros&lt;BR /&gt;</description>
      <pubDate>Sun, 08 May 2011 06:56:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827373#M1412</guid>
      <dc:creator>Petros</dc:creator>
      <dc:date>2011-05-08T06:56:22Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</link>
      <description>There's nothing visibly wrong with your first version of OpenMP parallel do. An Intel OpenMP thread team stays active for the time interval set by KMP_BLOCKTIME (default 0.200 seconds), so it's difficult to see what you expected to accomplish differently with the second version. Your openmp-profile report would shed more light on what is happening in libiomp5; you could simply set LD_PRELOAD=&amp;lt;ifort library path&amp;gt;/libiompprof5.so.&lt;BR /&gt;Did you consider trying Inspector to see whether you have unintended data sharing between threads?&lt;BR /&gt;Do you get any parallel speedup from the OpenMP internal to MKL DGETRS when you turn off OpenMP in your calling code?</description>
      <pubDate>Sun, 08 May 2011 10:39:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827374#M1413</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-08T10:39:19Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827375#M1414</link>
      <description>Petros,&lt;BR /&gt;&lt;BR /&gt;Some portions of code work best when parallelization occurs inside MKL, while others work best when parallelization is performed outside MKL (with MKL restricted to one thread). Performing parallelization both inside and outside MKL usually does not work well. From the MKL docs:&lt;BR /&gt;&lt;B&gt;WARNING.&lt;/B&gt; It is not recommended to simultaneously parallelize your program and employ the Intel MKL internal threading because this will slow down the performance. Note that in case "d" above, DFT computation is automatically initiated in a single-threading mode.&lt;BR /&gt;&lt;BR /&gt;Try the following experiment:&lt;BR /&gt;&lt;BR /&gt;!$omp parallel&lt;BR /&gt;call mkl_set_num_threads(1) ! see note below code snip&lt;BR /&gt;do while(not converged)&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (prepare data)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;!$omp do&lt;BR /&gt;do i=1, "several thousands"&lt;BR /&gt;CALL DGETRS(data of i)&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end do&lt;BR /&gt;&lt;BR /&gt;!$omp single&lt;BR /&gt;CODE THAT HAS TO BE EXECUTED IN SERIAL (check convergence)&lt;BR /&gt;!$omp end single&lt;BR /&gt;&lt;BR /&gt;enddo&lt;BR /&gt;!$omp end parallel&lt;BR /&gt;&lt;BR /&gt;Note (TimP may be able to answer this): what I do not know is whether mkl_set_num_threads(n) is global or thread-local.
Because of that uncertainty, I placed "call mkl_set_num_threads(1)" inside the parallel region (it should be benign to reset it repeatedly).&lt;BR /&gt;&lt;BR /&gt;Also note, if you have other sections of code that work better with MKL internally parallelized, you will have to reset the MKL number of threads. And there are other, more complex, ways of managing threads between your application and MKL.&lt;BR /&gt;&lt;BR /&gt;As to whether DGETRS works best with parallelization inside MKL or outside MKL, I cannot say. The experiment should be easy to run.&lt;BR /&gt;&lt;BR /&gt;From looking at the chart, a significant portion of time is spent inside MKL in start_thread. This would seem to indicate that for your usage of DGETRS, parallelization outside of MKL might be favorable.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Sun, 08 May 2011 15:35:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827375#M1414</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-05-08T15:35:31Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827376#M1415</link>
      <description>I suppose the reason for supporting MKL_NUM_THREADS as well as OMP_NUM_THREADS is to permit independent specification for MKL functions in an application which also has its own OpenMP parallel. I haven't seen documentation, but I suppose either set_num_threads option takes effect for the next parallel region (next MKL function call counts); I doubt that mkl_set_num_threads is thread local, but it's an interesting question.&lt;BR /&gt;As Jim points out, calling threaded MKL inside a parallel region could be problematical, and the MKL team doesn't recommend ways of doing it, not that you couldn't experiment with this form of OMP_NESTED. Among other things, you would want to set omp parallel num_threads and MKL_NUM_THREADS such that the product doesn't oversubscribe the resources. Affinity could be tricky, even with combined use of MKL_AFFINITY and KMP_AFFINITY.&lt;BR /&gt;It may be that extreme MKL thread overhead is due to an excessive total number of threads. By setting MKL_NUM_THREADS=1, you should get nearly as good performance as with mkl_sequential library linkage. As Jim said, it's common for parallelism outside MKL to prove superior to inside MKL, but this depends on your data set, how it fits cache, and I suppose for DGETRS whether the number of right hand sides is sufficient to use all your cores efficiently.</description>
      <pubDate>Sun, 08 May 2011 19:18:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827376#M1415</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-08T19:18:54Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827377#M1416</link>
      <description>The decision whether to use parallelization inside MKL or outside MKL should be taken on a case-by-case basis rather than a function-by-function basis. Take, for example, matrix multiply. If you had many different matrices to multiply, I would postulate that below some size, parallelization outside MKL would be more suitable, and above some size, parallelization inside MKL would be better. Also, on some configurations, say multi-processor/socket, a combination where each socket runs a separate thread that calls MKL, which then parallelizes within the socket, might work best, although I do not know if MKL has that degree of flexibility (it may have).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Mon, 09 May 2011 14:30:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827377#M1416</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-05-09T14:30:19Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827378#M1417</link>
      <description>First of all, thank you very much for devoting the time to help me!
&lt;DIV&gt;I link the project with this:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_lapack95_lp64 -liomp5 -lma37&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;(ma37 is a sparse solver used elsewhere).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The individual problems in DGETRS are not big enough to benefit from the threaded solvers (even in the serial version of the program, running the threaded MKL doesn't offer anything!).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I ran Inspector and it didn't find any data races or other problems.
&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Going into the parallel code, I see that I have a big delay on the parallel directive:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;Line	Source	CPU Time	Overhead Time	Wait Time
653	!$omp parallel do &amp;amp;	12.054s	49.952ms	14.241s&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;But also on a totally innocuous line:&lt;/DIV&gt;&lt;DIV&gt;&lt;CODE&gt;Line	Source	CPU Time	Overhead Time	Wait Time
687	if(dabs(x(a(j)+k-1)) &amp;gt; 1)f(a(j)+k-1)=bt*f(a(j)+k-1)/(x(a(j)+k-1)*bt2)	7.618s&lt;/CODE&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;x, f and a are shared vectors, but each parallelised loop affects a different set of elements of these vectors, i.e. the first loop touches elements 1-3, the second 4-8, etc.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;Do you think this could create the big overhead? Should I break the big vectors into small individual ones?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN style="white-space: pre;"&gt;Petros&lt;/SPAN&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 09 May 2011 14:52:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827378#M1417</guid>
      <dc:creator>Petros</dc:creator>
      <dc:date>2011-05-09T14:52:47Z</dc:date>
    </item>
    <item>
      <title>Time domain simulation parallelisation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827379#M1418</link>
      <description>If different threads are updating data in the same cache line, it can create a large false-sharing overhead. It may be improved if you can arrange (e.g. by setting affinity) for most cache-line sharing to be local to a CPU.&lt;BR /&gt;This is a case where hyperthreading may help: two threads on the same core can tolerate updating elements less than 64 bytes apart.</description>
      <pubDate>Mon, 09 May 2011 18:32:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Time-domain-simulation-parallelisation/m-p/827379#M1418</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-09T18:32:44Z</dc:date>
    </item>
  </channel>
</rss>