<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Together with setting OMP in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120672#M131608</link>
    <description>&lt;P&gt;Together with setting OMP_PROC_BIND, you could set KMP_AFFINITY=verbose in order to get some idea of what the Intel OpenMP library is doing to set affinity.&amp;nbsp; It's possible that the topology of the Intel box is recognized better than that of the AMD boxes.&amp;nbsp; You also have the option of specifying the details of pinning threads to cores by number.&lt;/P&gt;</description>
    <pubDate>Tue, 10 May 2016 13:54:27 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2016-05-10T13:54:27Z</dc:date>
    <item>
      <title>OpenMP performance problem ( maybe related to subroutine calls )</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120658#M131594</link>
      <description>&lt;P&gt;Hello everybody,&lt;/P&gt;

&lt;P&gt;I am a physics student and I am seeing some strange performance issues in a Fortran program from my research. Although the original program is complicated and involves a lot of high-level Fortran features, I have found that the hot spot is the same as in the following minimal example:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module test
contains
	subroutine do_it(A,B,C)
		complex(8) :: A(:,:),B(:),C(:)
		integer :: i,j,l
		do j=1,size(A,2)
			do l=1,size(A,1)
				A(l,j)=A(l,j)+B(l)*C(j)
			enddo
		enddo
		do j=1,size(A,1)
			B(j)=A(j,20)
			C(j)=A(j,120)
		enddo
	end subroutine
end module
program main
	use test
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo

	!$omp parallel do firstprivate(A,B,C) schedule(static,1)
	do k=1,8
		do i=1,8000
			call do_it(A,B,C)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end parallel do
end program&lt;/PRE&gt;

&lt;P&gt;The program is compiled with&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;ifort -qopenmp test.f90 -o test.out&lt;/PRE&gt;

&lt;P&gt;using Intel(R) 64, Version 16.0.3.210 Build 20160415, and running on a 48-core AMD Opteron(tm) 6176. While it runs, there are enough free cores for its parallelization.&lt;/P&gt;

&lt;P&gt;In the above code, we run the loop eight times and use eight threads to parallelize it. The performance is quite unstable, ranging from&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;real 0m13.744s user 1m29.912s sys 0m4.447s &lt;/PRE&gt;

&lt;P&gt;to&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;real 0m23.247s user 2m24.685s sys 0m4.186s&lt;/PRE&gt;

&lt;P&gt;The ideal time is&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;real 0m6.537s user 0m6.521s sys 0m0.015s&lt;/PRE&gt;

&lt;P&gt;as we can obtain from doing one single loop.&lt;/P&gt;

&lt;P&gt;This confuses me a lot, as there is no data race and the program can be fully parallelized. My best guess is that it has something to do with the subroutine call. Maybe calling a subroutine allocates some additional memory resource that different threads compete for. If this is the case, is there any way to work around it?&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 07:12:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120658#M131594</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T07:12:25Z</dc:date>
    </item>
    <item>
      <title>For me, your program can't</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120659#M131595</link>
      <description>&lt;P&gt;For me, your program can't run with more than 1 thread.&amp;nbsp; I'm not enough of an OpenMP language lawyer to figure out whether you are in fact allocating overlapping stack, or simply far too much for a normal platform to handle.&amp;nbsp; Intel Inspector does complain about the critical issue of allocating more stack without de-allocating the previous allocation.&lt;/P&gt;

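&lt;P&gt;For scale, a quick sketch of the arithmetic (the OMP_STACKSIZE value below is a suggested setting, not one tested in this thread): each firstprivate copy of A alone is 512*512 complex(8) elements, i.e. 4 MiB per thread, which is already at the typical default OpenMP per-thread stack size.&lt;/P&gt;

```shell
# Each complex(8) element is 16 bytes, so the per-thread firstprivate copy
# of A(512,512) is 512*512*16 bytes = 4 MiB, before counting B and C.
a_bytes=$((512 * 512 * 16))
echo "per-thread copy of A: $((a_bytes / 1024 / 1024)) MiB"
# Hedged suggestion: enlarge the per-thread OpenMP stacks before running.
export OMP_STACKSIZE=16M
```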
&lt;P&gt;Beyond that, you would have memory placement issues on your NUMA platform if you didn't set e.g. OMP_PROC_BIND=close.&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 11:52:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120659#M131595</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-05-08T11:52:36Z</dc:date>
    </item>
    <item>
      <title>You also have to consider the</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120660#M131596</link>
      <description>&lt;P&gt;You also have to consider the "first touch" time. Virtual memory is not allocated when the heap or stack address range is first acquired, but rather the first time it is used. Each time a virtual memory page is first used (page granularity), the resulting page fault traps to the O/S, which must then assign a page-file page, assign physical RAM (which may involve paging out some other process's VM), and, depending on O/S settings, may involve wiping the page. In your sample program, all of this overhead occurs (for each page, for each thread) between the "allocation" of the firstprivate copies and the copying into them.&lt;/P&gt;

&lt;P&gt;Please add something like this to your test program and report the results:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;real(8) :: t1, t2, t3
...
t1 = omp_get_wtime()
!$omp parallel firstprivate(A,B,C)
!$omp barrier
if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
!$omp do schedule(static,1)
do k=1,8
&amp;nbsp; do i=1,8000
&amp;nbsp;&amp;nbsp;&amp;nbsp; call do_it(A,B,C)
&amp;nbsp; enddo
&amp;nbsp; !$omp critical
&amp;nbsp; write(*,*)A(40,130)
&amp;nbsp; !$omp end critical
enddo
!$omp end do
!$omp end parallel
t3 = omp_get_wtime()
print *, 'First touch time = ', t2-t1
print *, 'run time = ' t3 - t1
&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 13:03:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120660#M131596</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-05-08T13:03:56Z</dc:date>
    </item>
    <item>
      <title>I find it is nothing to do</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120661#M131597</link>
      <description>&lt;P&gt;I find it has nothing to do with subroutine calls, as the following code has the same performance problem:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program main
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo


	!$omp parallel do firstprivate(A,B,C)
	do k=1,8
		do i=1,8000
			do j=1,size(A,2)
				do l=1,size(A,1)
					A(l,j)=A(l,j)+B(l)*C(j)
				enddo
			enddo
			B(:)=A(:,20)
			C(:)=A(:,120)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end parallel do
end program
&lt;/PRE&gt;

&lt;BLOCKQUOTE&gt;&lt;P&gt;Tim P. wrote:&lt;/P&gt;

&lt;P&gt;For me, your program can't run with more than 1 thread.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

&lt;P&gt;On my servers, it seems to be running with 8 threads (I use the top command to check the CPU usage).&lt;/P&gt;

&lt;BLOCKQUOTE&gt;&lt;P&gt;Tim P. wrote:&lt;/P&gt;

&lt;P&gt;Beyond that, you would have memory placement issues on your NUMA platform if you didn't set e.g. OMP_PROC_BIND=close&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I tried setting OMP_PROC_BIND to TRUE, FALSE, MASTER, CLOSE and SPREAD, and found that FALSE gives the best performance, although it is still more than two times slower than the ideal time.&lt;/P&gt;

&lt;P&gt;Thank you for the response.&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 13:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120661#M131597</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T13:05:00Z</dc:date>
    </item>
    <item>
      <title>Some additional light on the</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120662#M131598</link>
      <description>&lt;P&gt;Some additional light will be shed on the first touch issue if you incorporate the t1, t2, t3 timing above and additionally add a do loop that iterates the code twice, from A=0d0 through the print of the run time. Presumably, on the second iteration, your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not incur the first-touch overhead.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 13:12:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120662#M131598</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-05-08T13:12:35Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120663#M131599</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;jimdempseyatthecove wrote:&lt;/P&gt;

&lt;P&gt;You also have to consider the "first touch" time. Virtual memory is not allocated when the heap or stack address range is first acquired, but rather the first time it is used. Each time a virtual memory page is first used (page granularity), the resulting page fault traps to the O/S, which must then assign a page-file page, assign physical RAM (which may involve paging out some other process's VM), and, depending on O/S settings, may involve wiping the page. In your sample program, all of this overhead occurs (for each page, for each thread) between the "allocation" of the firstprivate copies and the copying into them.&lt;/P&gt;

&lt;P&gt;Please add something like this to your test program and report the results:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;real(8) :: t1, t2, t3
...
t1 = omp_get_wtime()
!$omp parallel firstprivate(A,B,C)
!$omp barrier
if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
!$omp do schedule(static,1)
do k=1,8
&amp;nbsp; do i=1,8000
&amp;nbsp;&amp;nbsp;&amp;nbsp; call do_it(A,B,C)
&amp;nbsp; enddo
&amp;nbsp; !$omp critical
&amp;nbsp; write(*,*)A(40,130)
&amp;nbsp; !$omp end critical
enddo
!$omp end do
!$omp end parallel
t3 = omp_get_wtime()
print *, 'First touch time = ', t2-t1
print *, 'run time = ' t3 - t1
&lt;/PRE&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;The code is:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;program main
	use omp_lib
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)
	real(8) :: t1, t2, t3

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo

	t1 = omp_get_wtime()
	!$omp parallel firstprivate(A,B,C)
	!$omp barrier
	if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
	!$omp do schedule(static,1)
	do k=1,8
		do i=1,8000
			do j=1,size(A,2)
				do l=1,size(A,1)
					A(l,j)=A(l,j)+B(l)*C(j)
				enddo
			enddo
			B(:)=A(:,20)
			C(:)=A(:,120)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end do
	!$omp end parallel
	t3 = omp_get_wtime()
	write(*,*)'First touch time = ', t2-t1
	write(*,*)'run time = ',t3 - t1

end program&lt;/PRE&gt;

&lt;P&gt;And the result:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt; (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   0.212038040161133     
 run time =    22.0410289764404&lt;/PRE&gt;

</description>
      <pubDate>Sun, 08 May 2016 13:20:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120663#M131599</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T13:20:44Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120664#M131600</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;jimdempseyatthecove wrote:&lt;/P&gt;

&lt;P&gt;Some additional light will be shed on the first touch issue if you incorporate the t1, t2, t3 timing above and additionally add a do loop that iterates the code twice, from A=0d0 through the print of the run time. Presumably, on the second iteration, your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not incur the first-touch overhead.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;If I understand correctly (looping the code twice), here is the result:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt; (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   0.208889961242676     
 run time =    25.6519989967346     
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   3.344821929931641E-002
 run time =    25.6534941196442     
&lt;/PRE&gt;

</description>
      <pubDate>Sun, 08 May 2016 13:34:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120664#M131600</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T13:34:32Z</dc:date>
    </item>
    <item>
      <title>I try to run the program on</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120665#M131601</link>
      <description>&lt;P&gt;I tried running the program on a 24-core Intel(R) Xeon(R) X7542 @ 2.67GHz server and found the performance is excellent (almost equal to the ideal time). So I guess it is related to the CPU; as mentioned before, the slow runs are on AMD CPUs. Any idea how to solve this problem?&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 14:44:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120665#M131601</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T14:44:45Z</dc:date>
    </item>
    <item>
      <title>I suppose the time to</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120666#M131602</link>
      <description>&lt;P&gt;I suppose the time to accomplish the firstprivate allocation will be greater for the CPUs which incur remote access during initialization.&amp;nbsp; Presumably this will be so even if you run again and happen to reallocate the same memory.&lt;/P&gt;

&lt;P&gt;I have learned from my own examples that firstprivate arrays incur even more overhead, and more variable run times, than private arrays.&lt;/P&gt;
      <pubDate>Sun, 08 May 2016 14:49:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120666#M131602</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-05-08T14:49:14Z</dc:date>
    </item>
    <item>
      <title>Quote:Tim P. wrote:</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120667#M131603</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;Tim P. wrote:&lt;/P&gt;

&lt;P&gt;I suppose the time to accomplish the firstprivate allocation will be greater for the CPUs which incur remote access during initialization.&amp;nbsp; Presumably this will be so even if you run again and happen to reallocate the same memory.&lt;/P&gt;

&lt;P&gt;I have learned from my own examples that firstprivate arrays incur even more overhead, and more variable run times, than private arrays.&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I replaced firstprivate with private and initialized the arrays inside the parallel block, and the problem still remains. I also tested the program on another AMD server (AMD Opteron(tm) 6282 SE) and found the performance is much better, although the parallel speedup is not as good as on the Intel machine (a little slower than the ideal time).&lt;/P&gt;

&lt;P&gt;Note: the ifort version may differ across the servers, but the lowest version is Version 15.0 Build 20150121.&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 15:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120667#M131603</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-08T15:04:00Z</dc:date>
    </item>
    <item>
      <title>The Intel 75xx series 4-CPU</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120668#M131604</link>
      <description>&lt;P&gt;The Intel 75xx series 4-CPU machines may have a more expensive QPI setup with direct paths between each pair of CPUs, whereas the E-46xx and AMD 4-CPU machines may have only a ring connection topology, where the odd- and even-numbered CPUs have a 2-hop connection.&amp;nbsp; Normally, the slowest remote memory connection would determine your performance.&amp;nbsp; You didn't say how many cores and CPUs you have on your AMD 6282.&lt;/P&gt;</description>
      <pubDate>Sun, 08 May 2016 16:40:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120668#M131604</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-05-08T16:40:30Z</dc:date>
    </item>
    <item>
      <title>The CPU information of all</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120669#M131605</link>
      <description>&lt;P&gt;The CPU information for all three servers is listed as follows.&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;AMD 6176. Parallel performance: poor&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                48
Thread(s) per core:    1
Core(s) per socket:    12
CPU socket(s):         4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               800.000
Virtualization:        AMD-V
L1d cache:             512K
L1i cache:             512K
L2 cache:              512K
L3 cache:              0K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23
NUMA node4 CPU(s):     24-29
NUMA node5 CPU(s):     30-35
NUMA node6 CPU(s):     36-41
NUMA node7 CPU(s):     42-47
&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;AMD 6282. Parallel performance: good&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               1400.000
BogoMIPS:              5199.95
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;Intel X7542. Parallel performance: excellent&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    6
CPU socket(s):         4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 46
Stepping:              6
CPU MHz:               2659.964
BogoMIPS:              5320.01
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              18432K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23
&lt;/PRE&gt;

</description>
      <pubDate>Mon, 09 May 2016 02:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120669#M131605</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-09T02:25:00Z</dc:date>
    </item>
    <item>
      <title>I do more test and find that</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120670#M131606</link>
      <description>&lt;P&gt;I did more tests and found that with more loop iterations and more threads (e.g. 16 instead of 8), the same performance problem appears on the other two servers too (Intel still beats the AMDs).&lt;/P&gt;</description>
      <pubDate>Mon, 09 May 2016 06:48:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120670#M131606</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-09T06:48:30Z</dc:date>
    </item>
    <item>
      <title>Here comes the most strange</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120671#M131607</link>
      <description>&lt;P&gt;Here comes the strangest part.&lt;/P&gt;

&lt;P&gt;Instead of parallelizing the program with OpenMP, I submitted 8 serial copies at the same time. The problem is still there (see the results below). So I guess this problem is not related to OpenMP either.&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;real    0m7.892s user    0m7.882s sys     0m0.006s
real    0m7.900s user    0m7.888s sys     0m0.010s
real    0m7.993s user    0m7.975s sys     0m0.014s
real    0m8.049s user    0m8.039s sys     0m0.009s
real    0m8.064s user    0m8.050s sys     0m0.012s
real    0m26.645s user    0m26.627s sys     0m0.010s
real    0m26.703s user    0m26.687s sys     0m0.011s
real    0m26.728s user    0m26.712s sys     0m0.010s&lt;/PRE&gt;

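&lt;P&gt;One way to separate memory-placement effects from cache contention would be to pin each serial copy to a different NUMA node (a sketch only; test_serial.out is a hypothetical binary name, and the core numbering follows the lscpu output above, 6 cores per node on the 6176):&lt;/P&gt;

```shell
# Sketch: print one pinned launch command per NUMA node of the Opteron 6176.
# taskset -c fixes the CPU, so no two copies would share a node's cache or
# memory controller (run each command in the background in a real test).
cmds=""
for node in 0 1 2 3 4 5 6 7; do
  core=$((node * 6))                       # first core of NUMA node $node
  cmds="$cmds taskset -c $core ./test_serial.out"
  echo "taskset -c $core ./test_serial.out"
done
```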
&lt;P&gt;Is it a CPU cache problem? Maybe the Intel Fortran compiler optimizes the code so heavily that it makes full use of the CPU cache for speed, and some processes run slowly compared to the others when they cannot get the cache resources.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 02:09:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120671#M131607</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-10T02:09:49Z</dc:date>
    </item>
    <item>
      <title>Together with setting OMP</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120672#M131608</link>
      <description>&lt;P&gt;Together with setting OMP_PROC_BIND, you could set KMP_AFFINITY=verbose in order to get some idea of what the Intel OpenMP library is doing to set affinity.&amp;nbsp; It's possible that the topology of the Intel box is recognized better than that of the AMD boxes.&amp;nbsp; You also have the option of specifying the details of pinning threads to cores by number.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 13:54:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120672#M131608</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-05-10T13:54:27Z</dc:date>
    </item>
    <item>
      <title>The issue (IMHO) appears to</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120673#M131609</link>
      <description>&lt;P&gt;The issue (IMHO) appears to be that the private array initializations are being performed by the main thread and not by the individual threads. This results in the first touch by the master thread placing those data into the master's NUMA node. Modify the code so that the first touch of the private data (and/or slices of shared data) resides consistently within the NUMA node of the thread which will preponderantly process the data.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;module test
contains
&amp;nbsp; subroutine do_it(A,B,C)
&amp;nbsp;&amp;nbsp;&amp;nbsp; complex(8) :: A(:,:),B(:),C(:)
&amp;nbsp;&amp;nbsp;&amp;nbsp; integer :: i,j,l
&amp;nbsp;&amp;nbsp;&amp;nbsp; do j=1,size(A,2)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; do l=1,size(A,1)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; A(l,j)=A(l,j)+B(l)*C(j)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; enddo
&amp;nbsp;&amp;nbsp;&amp;nbsp; enddo
&amp;nbsp;&amp;nbsp;&amp;nbsp; do j=1,size(A,1)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; B(j)=A(j,20)
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; C(j)=A(j,120)
&amp;nbsp;&amp;nbsp;&amp;nbsp; enddo
&amp;nbsp; end subroutine
end module

program main
&amp;nbsp; use test
&amp;nbsp; implicit none
&amp;nbsp; integer :: i,j,k,l
&amp;nbsp; complex(8) :: A(512,512),B(512),C(512)

&amp;nbsp; !$omp parallel private(A,B,C,i,j,k)
&amp;nbsp; A=0d0 ! each thread initializes private (hopefully in pinned NUMA node)
&amp;nbsp; do k=1,size(B,1)
&amp;nbsp;&amp;nbsp;&amp;nbsp; B(k)=1d-5*(mod(k,10)+1)
&amp;nbsp;&amp;nbsp;&amp;nbsp; C(k)=4d-5*(mod(k,10)+1)
&amp;nbsp; enddo

&amp;nbsp; !$omp do schedule(static,1)
&amp;nbsp; do k=1,8
&amp;nbsp;&amp;nbsp;&amp;nbsp; do i=1,8000
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; call do_it(A,B,C)
&amp;nbsp;&amp;nbsp;&amp;nbsp; enddo
&amp;nbsp;&amp;nbsp;&amp;nbsp; !$omp critical
&amp;nbsp;&amp;nbsp;&amp;nbsp; write(*,*)A(40,130)
&amp;nbsp;&amp;nbsp;&amp;nbsp; !$omp end critical
&amp;nbsp; enddo
&amp;nbsp; !$omp end parallel do
end program
&lt;/PRE&gt;

&lt;P&gt;What I find odd about your configuration is that your 4-socket machine shows 8 NUMA nodes; I'd have expected 4. Maybe that is an AMD thing?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 14:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120673#M131609</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-05-10T14:21:00Z</dc:date>
    </item>
    <item>
      <title>I will try those out. But as</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120674#M131610</link>
      <description>&lt;P&gt;I will try those out. But as I mentioned in my last comment, I don't think the performance issue is related to OpenMP.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 14:50:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120674#M131610</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-10T14:50:14Z</dc:date>
    </item>
    <item>
      <title>Hello and sorry for my late</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120675#M131611</link>
      <description>&lt;P&gt;Hello, and sorry for my late reply; I couldn't get enough compute cores in the past few days.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;As you have mentioned, I set `KMP_AFFINITY` to `verbose,granularity=fine,,proclist=[0,6,12,18,24,30,36,42],explicit` and find the performance problem disappeared,&amp;nbsp;&lt;/SPAN&gt;finally&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;The CPU topology of our server is&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;&lt;IMG alt="" src="http://cdn4.snapgram.co/images/2016/05/14/sg1.png" /&gt;&lt;/SPAN&gt;&lt;/P&gt;

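&lt;P&gt;For the record, the working run setup can be sketched like this (OMP_NUM_THREADS=8 is an assumption matching the 8 threads used above; the proclist pins one thread to the first core of each of the 8 NUMA nodes):&lt;/P&gt;

```shell
# Working affinity setup on the 48-core Opteron 6176 (8 NUMA nodes x 6 cores):
# bind thread n to core 6*n, i.e. one thread per NUMA node.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,6,12,18,24,30,36,42],explicit"
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```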
&lt;P&gt;&lt;SPAN style="line-height: 19.512px;"&gt;Again, t&lt;/SPAN&gt;hank you all for taking the time to help me.&lt;/P&gt;</description>
      <pubDate>Sun, 15 May 2016 08:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/OpenMP-performance-problem-maybe-related-to-subroutine-calls/m-p/1120675#M131611</guid>
      <dc:creator>Zuodong_Y_</dc:creator>
      <dc:date>2016-05-15T08:41:00Z</dc:date>
    </item>
  </channel>
</rss>

