Intel® Fortran Compiler

OpenMP performance problem (maybe related to subroutine calls)

Zuodong_Y_
Novice

Hello everybody,

I am a physics student and am running into some strange performance issues in a Fortran program used in my research. Although the original program is complicated and involves a lot of high-level Fortran features, I have dug out the hot spot, which behaves the same as the following minimal example:

module test
contains
	subroutine do_it(A,B,C)
		complex(8) :: A(:,:),B(:),C(:)
		integer :: i,j,l
		do j=1,size(A,2)
			do l=1,size(A,1)
				A(l,j)=A(l,j)+B(l)*C(j)
			enddo
		enddo
		do j=1,size(A,1)
			B(j)=A(j,20)
			C(j)=A(j,120)
		enddo
	end subroutine
end module
program main
	use test
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo

	!$omp parallel do firstprivate(A,B,C) schedule(static,1)
	do k=1,8
		do i=1,8000
			call do_it(A,B,C)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end parallel do
end program

The program is compiled with

ifort -qopenmp test.f90 -o test.out

using Intel(R) 64, Version 16.0.3.210 Build 20160415, and runs on a 48-core AMD Opteron(tm) Processor 6176 machine. While it runs, there are enough free cores available for its parallelization.

In the above code, we run the outer loop eight times and use eight threads to parallelize it. The performance is quite unstable, ranging from

real 0m13.744s user 1m29.912s sys 0m4.447s 

to

real 0m23.247s user 2m24.685s sys 0m4.186s

The ideal time is

real 0m6.537s user 0m6.521s sys 0m0.015s

as we can obtain by doing one single loop serially.
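
Roughly, I measure the ideal time like this (a sketch; test_serial.f90 is an illustrative name for the same source with the k loop reduced to a single iteration, compiled without -qopenmp so the OpenMP directives are ignored):

ifort test_serial.f90 -o test_serial.out
time ./test_serial.out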

This confuses me a lot, as there is no data race and the program can be fully parallelized. To the best of my knowledge, I guess it must have something to do with the subroutine: maybe calling a subroutine allocates some additional memory resource that different threads compete for. If this is the case, is there any way to work around it?

TimP
Honored Contributor III

For me, your program can't run with more than 1 thread. I'm not enough of an OpenMP language lawyer to figure out whether you are in fact allocating overlapping stack, or simply far too much stack for a normal platform to handle. Intel Inspector does complain about the critical issue of allocating more stack without de-allocating the previous allocation.

Beyond that, you would have memory placement issues on your NUMA platform if you didn't set, e.g., OMP_PROC_BIND=close.
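
For example (a sketch; test.out is the binary from your compile line, and the thread count matches your eight-way loop):

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
time ./test.out

Note that each firstprivate copy of A alone is 4 MiB (512*512 complex(8) elements), so you may also need to raise the OpenMP per-thread stack limit, e.g. export OMP_STACKSIZE=16M, before the program will run with multiple threads at all.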

jimdempseyatthecove
Honored Contributor III

You also have to consider the "first touch" time. Virtual memory is not allocated when the heap or stack address range is first acquired, but rather the first time it is used. The first touch (at page granularity) incurs the additional overhead of a page fault: each time a virtual memory page is first used, the fault traps to the O/S, which must then assign a page-file page, assign physical RAM (which may involve paging out some other process's VM), and, depending on O/S settings, may wipe the page. In your sample program, all of this overhead will occur (for each page, for each thread) between the "allocation" of the firstprivate copies and the copy-in of the firstprivate data.

Please add something like this to your test program and report the results:

real(8) :: t1, t2, t3
...
t1 = omp_get_wtime()
!$omp parallel firstprivate(A,B,C)
!$omp barrier
if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
!$omp do schedule(static,1)
do k=1,8
  do i=1,8000
    call do_it(A,B,C)
  enddo
  !$omp critical
  write(*,*)A(40,130)
  !$omp end critical
enddo
!$omp end do
!$omp end parallel
t3 = omp_get_wtime()
print *, 'First touch time = ', t2-t1
print *, 'run time = ', t3 - t1

Jim Dempsey

Zuodong_Y_
Novice

I find it has nothing to do with subroutine calls, as the following code has the same performance problem:

program main
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo


	!$omp parallel do firstprivate(A,B,C)
	do k=1,8
		do i=1,8000
			do j=1,size(A,2)
				do l=1,size(A,1)
					A(l,j)=A(l,j)+B(l)*C(j)
				enddo
			enddo
			B(:)=A(:,20)
			C(:)=A(:,120)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end parallel do
end program

 

Tim P. wrote:

For me, your program can't run with more than 1 thread. 

On my servers, it seems to be running with 8 threads (I use the top command to check the CPU usage).

Tim P. wrote:

Beyond that, you would have memory placement issues on your NUMA platform if you didn't set e.g. OMP_PROC_BIND=close

I tried setting OMP_PROC_BIND to TRUE, FALSE, MASTER, CLOSE and SPREAD, and found that FALSE gives the best performance, although it is still more than two times slower than the ideal time.

Thank you for the response.

jimdempseyatthecove
Honored Contributor III

Some additional light will be shed on the first touch issue if you incorporate the t1, t2, t3 above and additionally add a do loop that iterates twice over the code from A=0d0 through the print of the run time. Presumably, on the second iteration, your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not encounter the first touch overhead.

Jim Dempsey

Zuodong_Y_
Novice

jimdempseyatthecove wrote:

You also have to consider the "first touch" time. [...] Please add something like this to your test program and report the results: [...]

Jim Dempsey

The code is 

program main
	use omp_lib
	implicit none
	integer :: i,j,k,l
	complex(8) :: A(512,512),B(512),C(512)
	real(8) :: t1, t2, t3

	A=0d0
	do k=1,size(B,1)
		B(k)=1d-5*(mod(k,10)+1)
		C(k)=4d-5*(mod(k,10)+1)
	enddo

	t1 = omp_get_wtime()
	!$omp parallel firstprivate(A,B,C)
	!$omp barrier
	if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
	!$omp do schedule(static,1)
	do k=1,8
		do i=1,8000
			do j=1,size(A,2)
				do l=1,size(A,1)
					A(l,j)=A(l,j)+B(l)*C(j)
				enddo
			enddo
			B(:)=A(:,20)
			C(:)=A(:,120)
		enddo
		!$omp critical
		write(*,*)A(40,130)
		!$omp end critical
	enddo
	!$omp end do
	!$omp end parallel
	t3 = omp_get_wtime()
	write(*,*)'First touch time = ', t2-t1
	write(*,*)'run time = ',t3 - t1

end program

And the result:

 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   0.212038040161133     
 run time =    22.0410289764404

 

Zuodong_Y_
Novice

jimdempseyatthecove wrote:

Some additional light on the first touch issue will be shed if you incorporate the t1, t2, t3, above and additionally add a do loop to iterate twice the code from A=0d0 through the print of the run time. Presumably, for the second iteration, your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not encounter the first touch overhead.

Jim Dempsey

If I understand it correctly, this means looping the code twice, as in the sketch below.
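
The doubled version (a sketch of what I ran; the outer pass loop is the only change from the code above):

program main
	use omp_lib
	implicit none
	integer :: i,j,k,l,pass
	complex(8) :: A(512,512),B(512),C(512)
	real(8) :: t1, t2, t3

	do pass=1,2 ! run everything twice; pass 2 should reuse the same VM pages
		A=0d0
		do k=1,size(B,1)
			B(k)=1d-5*(mod(k,10)+1)
			C(k)=4d-5*(mod(k,10)+1)
		enddo

		t1 = omp_get_wtime()
		!$omp parallel firstprivate(A,B,C)
		!$omp barrier
		if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
		!$omp do schedule(static,1)
		do k=1,8
			do i=1,8000
				do j=1,size(A,2)
					do l=1,size(A,1)
						A(l,j)=A(l,j)+B(l)*C(j)
					enddo
				enddo
				B(:)=A(:,20)
				C(:)=A(:,120)
			enddo
			!$omp critical
			write(*,*)A(40,130)
			!$omp end critical
		enddo
		!$omp end do
		!$omp end parallel
		t3 = omp_get_wtime()
		write(*,*)'First touch time = ', t2-t1
		write(*,*)'run time = ',t3 - t1
	enddo
end program

And here is the result: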

 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   0.208889961242676     
 run time =    25.6519989967346     
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 (4.000012798440945E-010,0.000000000000000E+000)
 First touch time =   3.344821929931641E-002
 run time =    25.6534941196442     

 

Zuodong_Y_
Novice

I tried running the program on a 24-core Intel(R) Xeon(R) CPU X7542 @ 2.67GHz server and found the performance is excellent (almost equal to the ideal time). So I guess it is related to the CPU; as mentioned before, I ran the program on an AMD CPU. Any idea how to solve this problem?

TimP
Honored Contributor III

I suppose the time to accomplish the firstprivate allocation will be greater for the CPUs which incur remote access during initialization. Presumably this will be so even if you run again and happen to reallocate the same memory.

I have learned from my own examples that firstprivate arrays incur even more overhead and more variable run times than private arrays.

Zuodong_Y_
Novice

Tim P. wrote:

I suppose the time to accomplish the firstprivate allocation will be greater for the CPUs which incur remote access during initialization. Presumably this will be so even if you run again and happen to reallocate the same memory.

I have learned from my own examples that firstprivate arrays incur even more overhead and more variable run times than private arrays.

I replaced firstprivate with private and initialized the arrays inside the parallel block, and the problem still remains. I also tested the program on another AMD server (AMD Opteron(tm) Processor 6282 SE) and found the performance is much nicer, although the parallel scaling is not as good as on the Intel machine (a little slower than the ideal time).

Note: the ifort version may differ between servers, but the lowest version is Version 15.0 Build 20150121.

TimP
Honored Contributor III

The Intel 75xx series 4-CPU machines may have a more expensive QPI setup with direct paths between each pair of CPUs, whereas the E-46xx and AMD 4-CPU machines may have only a ring connection topology, in which the odd- and even-numbered CPUs have a 2-hop connection. Normally, the slowest remote memory connection determines your performance. You didn't say how many cores and CPUs you have on your AMD 6282.

Zuodong_Y_
Novice

The CPU information for all three servers is listed as follows.

AMD 6176. Parallel performance: poor

Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                48
Thread(s) per core:    1
Core(s) per socket:    12
CPU socket(s):         4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               800.000
Virtualization:        AMD-V
L1d cache:             512K
L1i cache:             512K
L2 cache:              512K
L3 cache:              0K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23
NUMA node4 CPU(s):     24-29
NUMA node5 CPU(s):     30-35
NUMA node6 CPU(s):     36-41
NUMA node7 CPU(s):     42-47

 

AMD 6282. Parallel performance: good

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               1400.000
BogoMIPS:              5199.95
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63

Intel X7542. Parallel performance: excellent

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    6
CPU socket(s):         4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 46
Stepping:              6
CPU MHz:               2659.964
BogoMIPS:              5320.01
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              18432K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23

 

Zuodong_Y_
Novice

I did more tests and found that with more loop iterations and more threads (e.g., 16 instead of 8), the same performance problem appears on the other two servers too (the Intel machine still beats the AMD ones).

Zuodong_Y_
Novice

Here comes the strangest part.

Instead of parallelizing the program with OpenMP, I submitted 8 serial programs at the same time. The problem is still there (see the results below), so I guess this problem is not related to OpenMP either.
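
I launched them roughly like this (a sketch; serial.out, the OpenMP-free build, is an illustrative name):

for i in $(seq 8); do ( time ./serial.out ) & done; wait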

real    0m7.892s user    0m7.882s sys     0m0.006s
real    0m7.900s user    0m7.888s sys     0m0.010s
real    0m7.993s user    0m7.975s sys     0m0.014s
real    0m8.049s user    0m8.039s sys     0m0.009s
real    0m8.064s user    0m8.050s sys     0m0.012s
real    0m26.645s user    0m26.627s sys     0m0.010s
real    0m26.703s user    0m26.687s sys     0m0.011s
real    0m26.728s user    0m26.712s sys     0m0.010s

Is it a CPU cache problem? Maybe the Intel Fortran compiler optimizes the code so aggressively that it makes full use of the CPU cache to speed things up, and some processes run slowly compared to others when they cannot get the CPU cache resource.

TimP
Honored Contributor III

Together with setting OMP_PROC_BIND, you could set KMP_AFFINITY=verbose in order to get some idea of what the Intel OpenMP library is doing to set affinity. It's possible that the topology of the Intel box is recognized better than that of the AMD boxes. You also have the option of specifying the details of pinning threads to cores by number.
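
For example (a sketch, reusing the run line from earlier in the thread):

export OMP_PROC_BIND=close
export KMP_AFFINITY=verbose
time ./test.out

The verbose output prints the machine topology the OpenMP runtime detects and the OS proc each thread ends up bound to, which should show whether your 8 threads land on 8 different NUMA nodes.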

jimdempseyatthecove
Honored Contributor III

The issue (IMHO) appears to be that the private array initializations are being performed by the main thread and not by the individual threads. This results in the first touch by the master thread placing those data into the master's NUMA node. Modify the code such that the first touch of the private data (and/or slices of shared data) resides consistently within the NUMA node of the thread which will preponderantly process the data.

module test
contains
  subroutine do_it(A,B,C)
    complex(8) :: A(:,:),B(:),C(:)
    integer :: i,j,l
    do j=1,size(A,2)
      do l=1,size(A,1)
        A(l,j)=A(l,j)+B(l)*C(j)
      enddo
    enddo
    do j=1,size(A,1)
      B(j)=A(j,20)
      C(j)=A(j,120)
    enddo
  end subroutine
end module

program main
  use test
  implicit none
  integer :: i,j,k,l
  complex(8) :: A(512,512),B(512),C(512)

  !$omp parallel private(A,B,C,i,j,k)
  A=0d0 ! each thread initializes private (hopefully in pinned NUMA node)
  do k=1,size(B,1)
    B(k)=1d-5*(mod(k,10)+1)
    C(k)=4d-5*(mod(k,10)+1)
  enddo

  !$omp do schedule(static,1)
  do k=1,8
    do i=1,8000
      call do_it(A,B,C)
    enddo
    !$omp critical
    write(*,*)A(40,130)
    !$omp end critical
  enddo
  !$omp end do
  !$omp end parallel
end program

What I find odd about your configuration is that your 4-socket machine shows 8 NUMA nodes; I'd have expected 4 NUMA nodes. Maybe that is an AMD thing??

Jim Dempsey

Zuodong_Y_
Novice

I will try those out. But as I mentioned in my last comment, I don't think the performance issue is related to OpenMP. 

Zuodong_Y_
Novice

Hello, and sorry for my late reply; I couldn't get enough compute cores in the past few days.

As you mentioned, I set `KMP_AFFINITY` to `verbose,granularity=fine,proclist=[0,6,12,18,24,30,36,42],explicit` and found that the performance problem finally disappeared.
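
That is, in the job script (a sketch; this binds one thread to the first core of each of the 8 NUMA nodes listed above):

export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,6,12,18,24,30,36,42],explicit"
time ./test.out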

Again, thank you all for taking the time to help me.
