Hello everybody,
I am a physics student and I am seeing strange performance behaviour in a Fortran program from my research. Although the original program is complicated and involves many high-level Fortran features, I have narrowed the hot spot down to the following minimal example:
```fortran
module test
contains
  subroutine do_it(A,B,C)
    complex(8) :: A(:,:),B(:),C(:)
    integer :: i,j,l
    do j=1,size(A,2)
      do l=1,size(A,1)
        A(l,j)=A(l,j)+B(l)*C(j)
      enddo
    enddo
    do j=1,size(A,1)
      B(j)=A(j,20)
      C(j)=A(j,120)
    enddo
  end subroutine
end module

program main
  use test
  implicit none
  integer :: i,j,k,l
  complex(8) :: A(512,512),B(512),C(512)
  A=0d0
  do k=1,size(B,1)
    B(k)=1d-5*(mod(k,10)+1)
    C(k)=4d-5*(mod(k,10)+1)
  enddo
  !$omp parallel do firstprivate(A,B,C) schedule(static,1)
  do k=1,8
    do i=1,8000
      call do_it(A,B,C)
    enddo
    !$omp critical
    write(*,*)A(40,130)
    !$omp end critical
  enddo
  !$omp end parallel do
end program
```
The program is compiled with
```shell
ifort -qopenmp test.f90 -o test.out
```
using Intel(R) 64, Version 16.0.3.210 Build 20160415, running on a 48-core AMD Opteron(tm) Processor 6176. While it runs, there are enough free cores for its parallelization.
In the code above, we run the time loop eight times and use eight threads to parallelize it. The performance is quite unstable, ranging from
```
real    0m13.744s
user    1m29.912s
sys     0m4.447s
```
to
```
real    0m23.247s
user    2m24.685s
sys     0m4.186s
```
The ideal time is
```
real    0m6.537s
user    0m6.521s
sys     0m0.015s
```
which is what we obtain by running a single loop serially.
This confuses me a lot, since there is no data race and the program should parallelize perfectly. My best guess is that it has something to do with the subroutine call: perhaps calling a subroutine allocates some additional memory resource that different threads compete for. If that is the case, is there any way to work around it?
For me, your program can't run with more than 1 thread. I'm not enough of an OpenMP language lawyer to figure out whether you are in fact allocating overlapping stack, or simply far too much for a normal platform to handle. Intel Inspector does complain about the critical issue of allocating more stack without de-allocating the previous allocation.
Beyond that, you would have memory placement issues on your NUMA platform if you didn't set e.g. OMP_PROC_BIND=close
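A minimal sketch of setting that binding when launching the program (the binary name is taken from the compile line earlier in the thread; the thread count is an assumption):

```shell
# Bind the OpenMP threads close to the master thread's place,
# so each thread's firstprivate copies stay on its own NUMA node.
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=8   # one thread per outer-loop iteration (assumed)
./test.out
```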
You also have to consider the "first touch" time. Virtual memory is not actually committed when a heap or stack address range is first acquired, but rather the first time each page is used. The first touch of a virtual memory page incurs the overhead of a page fault: the fault traps to the O/S, which must assign a page-file page, assign physical RAM (which may involve paging out some other process's VM), and, depending on O/S settings, wipe the page. In your sample program, all of this overhead occurs (for each page, for each thread) between the "allocation" of the firstprivate copies and the copy into them.
Please add something like this to your test program and report the results:
```fortran
real(8) :: t1, t2, t3
...
t1 = omp_get_wtime()
!$omp parallel firstprivate(A,B,C)
!$omp barrier
if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
!$omp do schedule(static,1)
do k=1,8
  do i=1,8000
    call do_it(A,B,C)
  enddo
  !$omp critical
  write(*,*)A(40,130)
  !$omp end critical
enddo
!$omp end do
!$omp end parallel
t3 = omp_get_wtime()
print *, 'First touch time = ', t2-t1
print *, 'run time = ', t3-t1
```
Jim Dempsey
I find it has nothing to do with subroutine calls, as the following code has the same performance problem:
```fortran
program main
  implicit none
  integer :: i,j,k,l
  complex(8) :: A(512,512),B(512),C(512)
  A=0d0
  do k=1,size(B,1)
    B(k)=1d-5*(mod(k,10)+1)
    C(k)=4d-5*(mod(k,10)+1)
  enddo
  !$omp parallel do firstprivate(A,B,C)
  do k=1,8
    do i=1,8000
      do j=1,size(A,2)
        do l=1,size(A,1)
          A(l,j)=A(l,j)+B(l)*C(j)
        enddo
      enddo
      B(:)=A(:,20)
      C(:)=A(:,120)
    enddo
    !$omp critical
    write(*,*)A(40,130)
    !$omp end critical
  enddo
  !$omp end parallel do
end program
```
Tim P. wrote:
For me, your program can't run with more than 1 thread.
On my servers it does appear to run with 8 threads (I used the top command to check CPU usage).
Tim P. wrote:
Beyond that, you would have memory placement issues on your NUMA platform if you didn't set e.g. OMP_PROC_BIND=close
I tried setting OMP_PROC_BIND to TRUE, FALSE, MASTER, CLOSE and SPREAD, and found that FALSE gives the best performance, although it is still more than two times slower than the ideal time.
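The sweep over bind policies can be scripted roughly like this (a sketch; `test.out` is the binary from the compile line earlier, and the exact `time` output format depends on the shell):

```shell
# Try each OMP_PROC_BIND policy in turn and time the run.
for bind in true false master close spread; do
    export OMP_PROC_BIND=$bind
    echo "OMP_PROC_BIND=$bind"
    time ./test.out > /dev/null
done
```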
Thank you for the response.
Some additional light will be shed on the first-touch issue if you incorporate the t1, t2, t3 timing above and additionally add a do loop that iterates twice over the code from A=0d0 through the print of the run time. Presumably, on the second iteration your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not incur the first-touch overhead.
Jim Dempsey
jimdempseyatthecove wrote:
You also have to consider the "first touch" time. Virtual memory is not actually committed when a heap or stack address range is first acquired, but rather the first time each page is used. The first touch of a virtual memory page incurs the overhead of a page fault: the fault traps to the O/S, which must assign a page-file page, assign physical RAM (which may involve paging out some other process's VM), and, depending on O/S settings, wipe the page. In your sample program, all of this overhead occurs (for each page, for each thread) between the "allocation" of the firstprivate copies and the copy into them.
Please add something like this to your test program and report the results:
```fortran
real(8) :: t1, t2, t3
...
t1 = omp_get_wtime()
!$omp parallel firstprivate(A,B,C)
!$omp barrier
if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
!$omp do schedule(static,1)
do k=1,8
  do i=1,8000
    call do_it(A,B,C)
  enddo
  !$omp critical
  write(*,*)A(40,130)
  !$omp end critical
enddo
!$omp end do
!$omp end parallel
t3 = omp_get_wtime()
print *, 'First touch time = ', t2-t1
print *, 'run time = ', t3-t1
```

Jim Dempsey
The code is
```fortran
program main
  use omp_lib
  implicit none
  integer :: i,j,k,l
  complex(8) :: A(512,512),B(512),C(512)
  real(8) :: t1, t2, t3
  A=0d0
  do k=1,size(B,1)
    B(k)=1d-5*(mod(k,10)+1)
    C(k)=4d-5*(mod(k,10)+1)
  enddo
  t1 = omp_get_wtime()
  !$omp parallel firstprivate(A,B,C)
  !$omp barrier
  if(omp_get_thread_num() == 0) t2 = omp_get_wtime() ! time after all first touch
  !$omp do schedule(static,1)
  do k=1,8
    do i=1,8000
      do j=1,size(A,2)
        do l=1,size(A,1)
          A(l,j)=A(l,j)+B(l)*C(j)
        enddo
      enddo
      B(:)=A(:,20)
      C(:)=A(:,120)
    enddo
    !$omp critical
    write(*,*)A(40,130)
    !$omp end critical
  enddo
  !$omp end do
  !$omp end parallel
  t3 = omp_get_wtime()
  write(*,*)'First touch time = ', t2-t1
  write(*,*)'run time = ', t3-t1
end program
```
And the result:
```
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
First touch time =   0.212038040161133
run time =   22.0410289764404
```
jimdempseyatthecove wrote:
Some additional light will be shed on the first-touch issue if you incorporate the t1, t2, t3 timing above and additionally add a do loop that iterates twice over the code from A=0d0 through the print of the run time. Presumably, on the second iteration your process's virtual memory for the firstprivate copies of the arrays will reside at the same VM addresses and thus will not incur the first-touch overhead.
Jim Dempsey
If I understand correctly (looping the code twice), here is the result:
```
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
First touch time =   0.208889961242676
run time =   25.6519989967346
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
(4.000012798440945E-010,0.000000000000000E+000)
First touch time =   3.344821929931641E-002
run time =   25.6534941196442
```
I tried running the program on a 24-core Intel(R) Xeon(R) CPU X7542 @ 2.67GHz server and found the performance is excellent (almost equal to the ideal time). So I guess it is related to the CPU; as mentioned before, the slow runs are on the AMD CPU. Any idea how to solve this problem?
I suppose the time to accomplish firstprivate allocation will be greater for the CPUs which incur remote access during initialization. Presumably this will be so even if you run again and happen to reallocate the same memory.
I have learned from my own examples that firstprivate arrays incur even more overhead and more variable run times than private arrays.
Tim P. wrote:
I suppose the time to accomplish firstprivate allocation will be greater for the CPUs which incur remote access during initialization. Presumably this will be so even if you run again and happen to reallocate the same memory.
I have learned from my own examples that firstprivate arrays incur even more overhead and more variable run times than private arrays.
I replaced firstprivate with private and initialized the arrays inside the parallel block, and the problem remains. I also tested the program on another AMD server (AMD Opteron(tm) Processor 6282 SE) and found the performance is much better, although the parallel scaling is not as good as on the Intel machine (a little slower than the ideal time).
Note: the ifort version may differ between servers, but the lowest version is Version 15.0 Build 20150121.
The Intel 75xx series 4-CPU machines may have a more expensive QPI setup with direct paths between each pair of CPUs, whereas the E5-46xx and AMD 4-CPU machines may have only a ring connection topology, where the odd- and even-numbered CPUs have a 2-hop connection. Normally, the slowest remote memory connection determines your performance. You didn't say how many cores and CPUs you have on your AMD 6282.
The CPU information for all three servers is listed below.
AMD 6176. Parallel performance: poor
```
Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                48
Thread(s) per core:    1
Core(s) per socket:    12
CPU socket(s):         4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               800.000
Virtualization:        AMD-V
L1d cache:             512K
L1i cache:             512K
L2 cache:              512K
L3 cache:              0K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23
NUMA node4 CPU(s):     24-29
NUMA node5 CPU(s):     30-35
NUMA node6 CPU(s):     36-41
NUMA node7 CPU(s):     42-47
```
AMD 6282. Parallel performance: good
```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Stepping:              2
CPU MHz:               1400.000
BogoMIPS:              5199.95
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63
```
Intel X7542. Parallel performance: excellent
```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    6
CPU socket(s):         4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 46
Stepping:              6
CPU MHz:               2659.964
BogoMIPS:              5320.01
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              18432K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
NUMA node2 CPU(s):     12-17
NUMA node3 CPU(s):     18-23
```
I did more testing and found that with more loops and more threads (e.g. 16 instead of 8), the same performance problem appears on the other two servers as well (Intel still beats the AMDs).
Here comes the strangest part.
Instead of parallelizing the program with OpenMP, I submitted 8 serial programs at the same time. The problem is still there (see the results below). So I guess this problem is not related to OpenMP either.
```
real 0m7.892s    user 0m7.882s    sys 0m0.006s
real 0m7.900s    user 0m7.888s    sys 0m0.010s
real 0m7.993s    user 0m7.975s    sys 0m0.014s
real 0m8.049s    user 0m8.039s    sys 0m0.009s
real 0m8.064s    user 0m8.050s    sys 0m0.012s
real 0m26.645s   user 0m26.627s   sys 0m0.010s
real 0m26.703s   user 0m26.687s   sys 0m0.011s
real 0m26.728s   user 0m26.712s   sys 0m0.010s
```
Is it a CPU cache problem? Maybe the Intel Fortran compiler optimizes the code to make full use of the CPU cache to speed up, and some processes run slowly compared to others when they cannot get the cache resources.
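One way to separate cache contention from NUMA placement in the eight-serial-runs experiment is to pin each copy to its own NUMA node with numactl (a sketch; it assumes numactl is installed and `test_serial.out` is the hypothetical serial build of the benchmark):

```shell
# Launch one serial copy per NUMA node, with both the CPU and
# its memory allocations bound to that node.
for node in 0 1 2 3 4 5 6 7; do
    numactl --cpunodebind="$node" --membind="$node" ./test_serial.out &
done
wait   # collect all eight background runs
```

If the slow outliers disappear under this pinning, the problem is remote-memory placement rather than cache capacity.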
Together with setting OMP_PROC_BIND, you could set KMP_AFFINITY=verbose in order to get some idea of what the Intel OpenMP library is doing specifically to set affinity. It's possible that the topology of the Intel box is recognized better than that of the AMD boxes. You also have the option of specifying the details of pinning threads to cores by number.
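A sketch of combining the two settings for one diagnostic run (binary name taken from the compile line earlier in the thread):

```shell
# Ask the Intel OpenMP runtime to print the machine topology it
# detects and the thread-to-core mapping it chooses.
export OMP_PROC_BIND=close
export KMP_AFFINITY=verbose
./test.out
```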
The issue (IMHO) appears to be that the private array initializations are performed by the main thread rather than by the individual threads. This results in the master thread's first touch placing those data in the master's NUMA node. Modify the code so that the first touch of the private data (and/or slices of shared data) occurs consistently within the NUMA node of the thread that will predominantly process the data.
```fortran
module test
contains
  subroutine do_it(A,B,C)
    complex(8) :: A(:,:),B(:),C(:)
    integer :: i,j,l
    do j=1,size(A,2)
      do l=1,size(A,1)
        A(l,j)=A(l,j)+B(l)*C(j)
      enddo
    enddo
    do j=1,size(A,1)
      B(j)=A(j,20)
      C(j)=A(j,120)
    enddo
  end subroutine
end module

program main
  use test
  implicit none
  integer :: i,j,k,l
  complex(8) :: A(512,512),B(512),C(512)
  !$omp parallel private(A,B,C,i,j,k)
  A=0d0 ! each thread initializes its private copy (hopefully in its pinned NUMA node)
  do k=1,size(B,1)
    B(k)=1d-5*(mod(k,10)+1)
    C(k)=4d-5*(mod(k,10)+1)
  enddo
  !$omp do schedule(static,1)
  do k=1,8
    do i=1,8000
      call do_it(A,B,C)
    enddo
    !$omp critical
    write(*,*)A(40,130)
    !$omp end critical
  enddo
  !$omp end do
  !$omp end parallel
end program
```
What I find odd about your configuration is that your 4-socket machine shows 8 NUMA nodes; I'd have expected 4. Maybe that is an AMD thing??
Jim Dempsey
I will try those out. But as I mentioned in my last comment, I don't think the performance issue is related to OpenMP.
Hello, and sorry for my late reply; I could not get enough compute cores in the past few days.
As you mentioned, I set `KMP_AFFINITY` to `verbose,granularity=fine,proclist=[0,6,12,18,24,30,36,42],explicit` and the performance problem finally disappeared. The CPU topology of our server is
Again, thank you all for taking the time to help me.
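For reference, the working run can be reproduced roughly as follows, pinning one thread to the first core of each of the eight NUMA nodes shown in the lscpu output earlier (a sketch; the binary name is assumed from the compile line at the top of the thread):

```shell
# One thread per NUMA node on the 48-core Opteron 6176 box:
# cores 0,6,12,...,42 are the first core of NUMA nodes 0..7.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,6,12,18,24,30,36,42],explicit"
./test.out
```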