<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Visual Fortran and Open MP in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909798#M83126</link>
    <description>You are using the wrong function to measure the efficiency of parallel code. cpu_time measures the cumulative time across all processors. For example, if your program runs for 1 min on a dual-CPU system with both processors at 100% load, cpu_time will return 2 min (1 min on CPU1 + 1 min on CPU2). To see the speedup, measure elapsed wall-clock time instead, for example with system_clock or omp_get_wtime. The documentation on the behavior of cpu_time on multiprocessor systems is vague, and you are not the only one who has been bitten by it.</description>
    <pubDate>Fri, 04 Apr 2008 04:48:16 GMT</pubDate>
    <dc:creator>izryu</dc:creator>
    <dc:date>2008-04-04T04:48:16Z</dc:date>
    <item>
      <title>Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909793#M83121</link>
      <description>&lt;FONT size="4"&gt;Dear All,&lt;BR /&gt;&lt;BR /&gt;I'm trying to use parallel computation with Intel Visual Fortran + OpenMP for matrix multiplication on a 2-core computer. The computation time does not change (it is not divided by about 2)! See the source code below.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;&lt;BR /&gt;Didace   &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt; program prod&lt;BR /&gt;&lt;BR /&gt; use dfport&lt;BR /&gt;&lt;BR /&gt; implicit none&lt;BR /&gt; &lt;BR /&gt; integer :: n, i, j&lt;BR /&gt; parameter (n = 1000)&lt;BR /&gt;&lt;BR /&gt; double precision :: time_end, time_begin&lt;BR /&gt;&lt;BR /&gt; complex*16, dimension(:,:), allocatable :: a, b&lt;BR /&gt;&lt;BR /&gt; allocate(a(n,n))&lt;BR /&gt; allocate(b(n,n))&lt;BR /&gt;&lt;BR /&gt; a = dcmplx(0.d+00,0.d+00)&lt;BR /&gt; b = dcmplx(0.d+00,0.d+00)&lt;BR /&gt;&lt;BR /&gt; do j=1,n&lt;BR /&gt;&lt;BR /&gt;  do i=1,n&lt;BR /&gt;&lt;BR /&gt;   a(i,j) = cmplx(rand(),rand())&lt;BR /&gt;   b(i,j) = cmplx(rand(),rand())&lt;BR /&gt;&lt;BR /&gt;  enddo&lt;BR /&gt;&lt;BR /&gt; enddo&lt;BR /&gt;&lt;BR /&gt; call cpu_time(time_begin)&lt;BR /&gt;&lt;BR /&gt; call prod_mat(a,b,n)&lt;BR /&gt;&lt;BR /&gt; call cpu_time(time_end)&lt;BR /&gt;&lt;BR /&gt; write(*,*)&lt;BR /&gt; write(*,*) ' CPU time : ', time_end -time_begin&lt;BR /&gt; write(*,*)&lt;BR /&gt;&lt;BR /&gt; deallocate(a)&lt;BR /&gt; deallocate(b)&lt;BR /&gt; &lt;BR /&gt; end program prod&lt;BR /&gt;&lt;BR /&gt;!&lt;BR /&gt;! 
---------------------------------------------------- !&lt;BR /&gt;!&lt;BR /&gt;&lt;BR /&gt; subroutine prod_mat (a,b,n)&lt;BR /&gt;&lt;BR /&gt; implicit none&lt;BR /&gt;&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt;&lt;BR /&gt; integer    :: i, j&lt;BR /&gt;&lt;BR /&gt; complex*16, dimension(n,n)    :: a, b&lt;BR /&gt;&lt;BR /&gt; complex*16, dimension( : ), allocatable :: v&lt;BR /&gt;&lt;BR /&gt; allocate(v(n))&lt;BR /&gt;&lt;BR /&gt; do i=1,n&lt;BR /&gt;&lt;BR /&gt;  do j=1,n&lt;BR /&gt;&lt;BR /&gt;   v(j) = a(i,j)&lt;BR /&gt;&lt;BR /&gt;  enddo&lt;BR /&gt;&lt;BR /&gt;! OMP PARALLEL DO SCHEDULE(DYNAMIC,2)&lt;BR /&gt;&lt;BR /&gt;  do j=1,n&lt;BR /&gt;&lt;BR /&gt;   a(i,j) = sum(v(:)*b(:,j))&lt;BR /&gt;&lt;BR /&gt;  enddo&lt;BR /&gt;&lt;BR /&gt;! OMP PARALLEL END DO&lt;BR /&gt;&lt;BR /&gt; enddo&lt;BR /&gt;&lt;BR /&gt; deallocate(v)&lt;BR /&gt;&lt;BR /&gt; end subroutine prod_mat&lt;BR /&gt;&lt;BR /&gt;!&lt;BR /&gt;! ---------------------------------------------------- !&lt;BR /&gt;!&lt;/FONT&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 31 Mar 2008 13:06:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909793#M83121</guid>
      <dc:creator>ekeom</dc:creator>
      <dc:date>2008-03-31T13:06:26Z</dc:date>
    </item>
    <item>
      <title>Re: Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909794#M83122</link>
      <description>The difficulty of coming up with efficient new ways of performing matrix multiplication is among the motivations for using a pre-built library such as Intel MKL (included in ifort Professional). Your code should improve if you push the i loop inside the parallelized j loop. Static scheduling with a large chunk size is probably more appropriate to the situation you describe. With OpenMP, the compiler is not entitled to perform such optimizations for you.&lt;BR /&gt;As you have written it, setting up v(:) is likely to be time-consuming. You also have both threads writing to the same cache line (false sharing).&lt;BR /&gt;If you want the compiler to perform loop-nest optimizations for parallelism, the -Qparallel option may work better than OpenMP.&lt;BR /&gt;Questions about efficient OpenMP programming might be more appropriate for the Threading forum.&lt;BR /&gt;</description>
      <pubDate>Mon, 31 Mar 2008 13:40:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909794#M83122</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-03-31T13:40:25Z</dc:date>
    </item>
    <item>
      <title>Re: Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909795#M83123</link>
      <description>&lt;P&gt;Didace,&lt;/P&gt;
&lt;P&gt;Consider the following change:&lt;/P&gt;&lt;PRE&gt;subroutine prod_mat (a,b,n)&lt;BR /&gt; implicit none&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt; complex*16, dimension(n,n) :: a, b ! must be declared under implicit none&lt;BR /&gt;!$OMP PARALLEL&lt;BR /&gt; call prod_mat_parallel(a,b,n)&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;end subroutine prod_mat&lt;/PRE&gt;&lt;PRE&gt;subroutine prod_mat_parallel (a,b,n)&lt;BR /&gt; use omp_lib&lt;BR /&gt; implicit none&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt; integer :: i, j&lt;BR /&gt; integer :: iNumThreads, iThreadNum&lt;/PRE&gt;&lt;PRE&gt; complex*16, dimension(n,n) :: a, b&lt;/PRE&gt;&lt;PRE&gt; complex*16, dimension( : ), allocatable :: v&lt;/PRE&gt;&lt;PRE&gt; allocate(v(n)) ! separate instance of v per thread&lt;/PRE&gt;&lt;PRE&gt; iNumThreads = OMP_GET_NUM_THREADS()&lt;BR /&gt; if(iNumThreads .eq. 0) then&lt;BR /&gt; iNumThreads = 1&lt;BR /&gt; iThreadNum = 0&lt;BR /&gt; else&lt;BR /&gt; iThreadNum = OMP_GET_THREAD_NUM()&lt;BR /&gt; endif&lt;/PRE&gt;&lt;PRE&gt; do i=1+iThreadNum, n, iNumThreads&lt;BR /&gt; do j=1,n&lt;BR /&gt; v(j) = a(i,j)&lt;BR /&gt; enddo&lt;BR /&gt; do j=1,n&lt;BR /&gt; a(i,j) = sum(v(:)*b(:,j))&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt; deallocate(v)&lt;BR /&gt;end subroutine prod_mat_parallel&lt;BR /&gt;&lt;/PRE&gt;
&lt;P&gt;Note,&lt;/P&gt;
&lt;P&gt;The above code does not optimize cache line usage (or avoid false sharing) but should give better performance. There are examples of tiling available on the internet.&lt;/P&gt;
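&lt;P&gt;For comparison, the round-robin distribution of rows above can also be written with an OpenMP worksharing directive instead of manual thread arithmetic. The following is only a sketch under assumptions: the name prod_mat_do is hypothetical, v is made an automatic array so that PRIVATE gives each thread its own copy, and SCHEDULE(STATIC,1) is chosen to mimic the stride-by-thread loop. Note that the directive sentinel is !$OMP with no space after the exclamation mark; a line starting with "! OMP" is treated as an ordinary comment.&lt;/P&gt;&lt;PRE&gt;subroutine prod_mat_do (a,b,n) ! hypothetical name&lt;BR /&gt; implicit none&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt; complex*16, dimension(n,n) :: a, b&lt;BR /&gt; complex*16, dimension(n) :: v ! automatic array, one private copy per thread&lt;BR /&gt; integer :: i, j&lt;BR /&gt;!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,v) SCHEDULE(STATIC,1)&lt;BR /&gt; do i=1,n&lt;BR /&gt;  v(:) = a(i,:) ! save row i before it is overwritten&lt;BR /&gt;  do j=1,n&lt;BR /&gt;   a(i,j) = sum(v(:)*b(:,j))&lt;BR /&gt;  enddo&lt;BR /&gt; enddo&lt;BR /&gt;!$OMP END PARALLEL DO&lt;BR /&gt;end subroutine prod_mat_do&lt;/PRE&gt;
&lt;P&gt;Each thread still works on whole rows of a, so no two threads write the same element; the cache line caveat in the note above applies to this form as well.&lt;/P&gt;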
&lt;P&gt;Also, as a further optimization, I would suggest moving the allocation of array v outside prod_mat_parallel (pass it in as an argument) and keeping a static array of v arrays (one per thread). You would then allocate or reallocate only when a thread's v is not allocated or is smaller than n.&lt;/P&gt;
&lt;P&gt;Something along the lines of the following untested code:&lt;/P&gt;&lt;PRE&gt;module mod_prod_mat&lt;BR /&gt; type type_mod_prod_mat_v&lt;BR /&gt; complex*16, dimension( : ), allocatable :: v&lt;BR /&gt; end type type_mod_prod_mat_v&lt;/PRE&gt;&lt;PRE&gt; type(type_mod_prod_mat_v), dimension( : ), allocatable :: av&lt;BR /&gt;end module mod_prod_mat&lt;/PRE&gt;&lt;PRE&gt;subroutine prod_mat (a,b,n)&lt;BR /&gt; use omp_lib&lt;BR /&gt; use mod_prod_mat&lt;BR /&gt; implicit none&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt; complex*16, dimension(n,n) :: a, b&lt;BR /&gt; integer :: iNumThreads, iThreadNum&lt;/PRE&gt;&lt;PRE&gt; ! outside a parallel region OMP_GET_NUM_THREADS() returns 1,&lt;BR /&gt; ! so size the per-thread buffers from the maximum team size&lt;BR /&gt; iNumThreads = max(OMP_GET_MAX_THREADS(),1)&lt;/PRE&gt;&lt;PRE&gt; if(allocated(av)) then&lt;BR /&gt; if(size(av) .lt. iNumThreads) then&lt;BR /&gt; do iThreadNum=0, size(av)-1&lt;BR /&gt; if(allocated(av(iThreadNum)%v)) deallocate(av(iThreadNum)%v)&lt;BR /&gt; enddo&lt;BR /&gt; deallocate(av)&lt;BR /&gt; allocate(av(0:iNumThreads-1))&lt;BR /&gt; do iThreadNum=0, iNumThreads-1&lt;BR /&gt;
 allocate(av(iThreadNum)%v(n))&lt;BR /&gt; enddo&lt;BR /&gt; else&lt;BR /&gt; if(size(av(0)%v) .lt. n) then&lt;BR /&gt; do iThreadNum=0, iNumThreads-1&lt;BR /&gt; deallocate(av(iThreadNum)%v)&lt;BR /&gt; allocate(av(iThreadNum)%v(n))&lt;BR /&gt; enddo&lt;BR /&gt; endif&lt;BR /&gt; endif&lt;BR /&gt; else&lt;BR /&gt; allocate(av(0:iNumThreads-1))&lt;BR /&gt; do iThreadNum=0, iNumThreads-1&lt;BR /&gt; allocate(av(iThreadNum)%v(n))&lt;BR /&gt; enddo&lt;BR /&gt; endif&lt;BR /&gt; &lt;BR /&gt;!$OMP PARALLEL default(shared) private(iThreadNum)&lt;BR /&gt; iThreadNum = OMP_GET_THREAD_NUM()&lt;BR /&gt; ! pass the actual team size as the row stride&lt;BR /&gt; call prod_mat_parallel(a,b,n,av(iThreadNum)%v,OMP_GET_NUM_THREADS(), iThreadNum)&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;end subroutine prod_mat&lt;/PRE&gt;&lt;PRE&gt;subroutine prod_mat_parallel (a,b,n,v,iNumThreads, iThreadNum)&lt;BR /&gt; use omp_lib&lt;BR /&gt; implicit none&lt;BR /&gt; integer, intent(in) :: n&lt;BR /&gt; complex*16, dimension(n,n) :: a, b&lt;BR /&gt; complex*16, dimension(n) :: v ! explicit shape, so no explicit interface is needed&lt;BR /&gt; integer :: iNumThreads, iThreadNum&lt;BR /&gt; integer :: i, j&lt;/PRE&gt;&lt;PRE&gt; do i=1+iThreadNum, n, iNumThreads&lt;BR /&gt; do j=1,n&lt;BR /&gt; v(j) = a(i,j)&lt;BR /&gt; enddo&lt;BR /&gt; do j=1,n&lt;BR /&gt; a(i,j) = sum(v(:)*b(:,j))&lt;BR /&gt; enddo&lt;BR /&gt; enddo&lt;BR /&gt;end subroutine prod_mat_parallel&lt;BR /&gt;&lt;/PRE&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2008 16:50:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909795#M83123</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2008-03-31T16:50:47Z</dc:date>
    </item>
    <item>
      <title>Re: Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909796#M83124</link>
      <description>&lt;P&gt;Oops, minor bug. Use:&lt;/P&gt;&lt;PRE&gt; if(size(av(0)%v) .lt. n) then&lt;BR /&gt; do iThreadNum=0, size(av)-1&lt;BR /&gt; deallocate(av(iThreadNum)%v)&lt;BR /&gt; allocate(av(iThreadNum)%v(n))&lt;BR /&gt; enddo&lt;BR /&gt; endif&lt;BR /&gt;&lt;/PRE&gt;
&lt;P&gt;That covers the case where you reduced the number of threads.&lt;/P&gt;
&lt;P&gt;Also note that the example code is not suitable for nested parallel coding, because OMP_GET_THREAD_NUM() returns the thread's number within its team, not a global thread number.&lt;/P&gt;
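&lt;P&gt;To illustrate the team-relative numbering, here is a small sketch, assuming an OpenMP 3.0 or later runtime (omp_set_max_active_levels, omp_get_ancestor_thread_num and omp_get_team_size were introduced in 3.0). The program name nested_ids and the variables iOuter, iInner and iGlobal are hypothetical, and combining the outer and inner numbers this way is only one possible scheme.&lt;/P&gt;&lt;PRE&gt;program nested_ids ! hypothetical name&lt;BR /&gt; use omp_lib&lt;BR /&gt; implicit none&lt;BR /&gt; integer :: iOuter, iInner, iGlobal&lt;BR /&gt; call omp_set_max_active_levels(2) ! allow two active levels of parallelism&lt;BR /&gt;!$OMP PARALLEL NUM_THREADS(2)&lt;BR /&gt;!$OMP PARALLEL NUM_THREADS(2) PRIVATE(iOuter, iInner, iGlobal)&lt;BR /&gt; iInner = omp_get_thread_num() ! number within the inner team only&lt;BR /&gt; iOuter = omp_get_ancestor_thread_num(1) ! number of the parent thread in the outer team&lt;BR /&gt; ! distinct across teams only if all inner teams have the same size&lt;BR /&gt; iGlobal = iOuter*omp_get_team_size(2) + iInner&lt;BR /&gt;!$OMP CRITICAL&lt;BR /&gt; write(*,*) 'outer', iOuter, ' inner', iInner, ' combined', iGlobal&lt;BR /&gt;!$OMP END CRITICAL&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;!$OMP END PARALLEL&lt;BR /&gt;end program nested_ids&lt;/PRE&gt;
&lt;P&gt;With nesting active, the inner number repeats in every inner team, which is why indexing av by OMP_GET_THREAD_NUM() alone would let threads from different teams share a buffer.&lt;/P&gt;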
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 31 Mar 2008 16:57:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909796#M83124</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2008-03-31T16:57:54Z</dc:date>
    </item>
    <item>
      <title>Re: Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909797#M83125</link>
      <description>Thank you very much.&lt;BR /&gt;&lt;BR /&gt;Didace&lt;BR /&gt;</description>
      <pubDate>Mon, 31 Mar 2008 20:14:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909797#M83125</guid>
      <dc:creator>ekeom</dc:creator>
      <dc:date>2008-03-31T20:14:04Z</dc:date>
    </item>
    <item>
      <title>Re: Visual Fortran and Open MP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909798#M83126</link>
      <description>You are using the wrong function to measure the efficiency of parallel code. cpu_time measures the cumulative time across all processors. For example, if your program runs for 1 min on a dual-CPU system with both processors at 100% load, cpu_time will return 2 min (1 min on CPU1 + 1 min on CPU2). To see the speedup, measure elapsed wall-clock time instead, for example with system_clock or omp_get_wtime. The documentation on the behavior of cpu_time on multiprocessor systems is vague, and you are not the only one who has been bitten by it.</description>
      <pubDate>Fri, 04 Apr 2008 04:48:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Visual-Fortran-and-Open-MP/m-p/909798#M83126</guid>
      <dc:creator>izryu</dc:creator>
      <dc:date>2008-04-04T04:48:16Z</dc:date>
    </item>
  </channel>
</rss>

