Topic in Intel® Moderncode for Parallel Architectures

Different performance on PC and cluster of Fortran-OpenMP code

ZT_X_ — Wed, 03 Aug 2016 06:39:09 GMT

Hello! I have run into a problem with my Fortran-OpenMP code. I used a PARALLEL DO directive to parallelize a time-consuming part of the code. However, the code performs differently on the PC (Intel(R) Core(TM) i7-3770 @ 3.4GHz) and on the cluster (Intel(R) Xeon(R) CPU E5-2620 @ 2.10GHz). The performance on the PC was satisfactory, about 74% parallel efficiency for 4 processes; on the cluster, however, it was only 26%.

I have been frustrated by this problem for a long time, and I think there may be something beyond my knowledge here, so I'm looking for help. Any help is appreciated, thank you very much!
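
For readers comparing the two figures: the post does not define "parallel efficiency", so the usual definition is assumed here, and the cluster figure is assumed to refer to the same 4-way run. Under those assumptions the numbers translate into speedups as follows.

```latex
E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}
\qquad\Longrightarrow\qquad
S(p) = p \, E(p)

\text{PC: } S(4) = 4 \times 0.74 = 2.96
\qquad
\text{Cluster: } S(4) = 4 \times 0.26 = 1.04
```

Here T_1 is the single-thread run time and T_p the run time on p threads. Read this way, the cluster run is barely faster than a serial run.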

jimdempseyatthecove — Wed, 03 Aug 2016 16:37:15 GMT

How many iterations does the DO perform?

How many threads are available inside the parallel region?

Jim Dempsey
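
A minimal sketch of how the second question could be checked from inside the program. It uses only standard omp_lib routines; the program name and the variable nthreads are made up for this illustration and do not come from the original code.

```fortran
program team_size
    use omp_lib
    implicit none
    integer :: nthreads

    nthreads = 1
    !$omp parallel
    !$omp single
    ! Size of the team that was actually created for this region
    nthreads = omp_get_num_threads()
    !$omp end single
    !$omp end parallel

    print '(a,i0)', 'omp_get_max_threads():      ', omp_get_max_threads()
    print '(a,i0)', 'threads in parallel region: ', nthreads
    print '(a,i0)', 'omp_get_num_procs():        ', omp_get_num_procs()
end program team_size
```

Running this on both machines with the same settings as the real job shows whether the team size inside the region matches what was requested.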

Gregg_S_Intel — Wed, 03 Aug 2016 22:44:15 GMT

When initializing memory on the presumably 2-socket server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on 1 of the 2 sockets, and 12 of 24 cores have to do remote memory access.
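
The first-touch advice above can be sketched as follows. This is an illustration only, not the original poster's code: the arrays a and b, the size n, and the arithmetic are hypothetical. The point it shows is that the initialization loop uses the same static schedule and bounds as the compute loop, so each thread touches first the pages it will later work on.

```fortran
program first_touch
    use omp_lib
    implicit none
    integer, parameter :: n = 10000000      ! hypothetical problem size
    real, allocatable :: a(:), b(:)
    integer :: i
    double precision :: t0, t1

    ! On typical Linux systems, physical pages are placed when first
    ! written, not when allocated.
    allocate(a(n), b(n))

    ! First touch: same bounds and schedule as the compute loop below
    !$omp parallel do schedule(static)
    do i = 1, n
        a(i) = 0.0
        b(i) = real(i)
    end do
    !$omp end parallel do

    t0 = omp_get_wtime()
    !$omp parallel do schedule(static)
    do i = 1, n
        a(i) = 2.0 * b(i) + 1.0
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()

    print '(a,f8.4,a)', 'compute loop: ', t1 - t0, ' s'
    print *, 'a(n) =', a(n)
    deallocate(a, b)
end program first_touch
```

The placement only stays useful if threads do not migrate between sockets, so this is normally combined with thread pinning, for example through the standard OMP_PROC_BIND and OMP_PLACES environment variables.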

ZT_X_ — Thu, 04 Aug 2016 02:50:31 GMT

jimdempseyatthecove wrote:

How many iterations does the DO perform?

How many threads are available inside the parallel region?

Jim Dempsey

The number of iterations is controlled by "n_vbns" and is many thousands.

The number of threads available is controlled by "n_cpu"; it is up to 8 on the PC and up to 12 on the cluster.

ZT_X_ — Thu, 04 Aug 2016 03:01:04 GMT

Gregg S. (Intel) wrote:

When initializing memory on the presumably 2-socket server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on 1 of the 2 sockets, and 12 of 24 cores have to do remote memory access.

I think this is helpful, but why is the behavior on the PC and the cluster so different? Can you give more details?

jimdempseyatthecove — Thu, 04 Aug 2016 13:46:42 GMT

ZT,

When you post code, please copy to clipboard, then in the forum, click on the button

{...}
code

on the tool bar. This will open a dialog box with a pull-down control and an edit box.
Click the pull-down and select Fortran (or C++ if posting C++ code), then paste the contents of the clipboard into the edit box.

Doing this is quicker than uploading a screenshot, .AND. will permit the readers to copy and paste your source in formulating a response. You can also paste the complete loop.

Gregg's reply #3 describes an effect that acts on your program behind the scenes. In a NUMA setup, the performance gains can only be attained by carefully managing your allocations and deallocations. Preferably, the allocations are reused by the same thread .OR. the same sections of the allocations are first touched and reused by the same thread.

The code you have shown allocates (and presumably deallocates) rele_surf and weight n_va times _outside_ the parallel region on the main thread; then, upon entry to the parallel region, each additional thread of the team allocates them another n_va times. The likelihood that each thread's allocations get the same virtual addresses across all n_va * 6 (* number of threads) allocations is virtually nil.

Consider making rele_surf and weight module-level allocatable arrays that are threadprivate and initially not allocated, then allocate each one once, to the maximum size required. Sketch:

module threadprivate_data
    real, allocatable :: rele_surf(:), weight(:)
    !$omp threadprivate(rele_surf, weight)
end module threadprivate_data
    
program your_program
    use threadprivate_data
    implicit none
    ...    
    ! once only code
    ! after you know the working sizes
    ! compute the largest allocation size
    max_max_grid_len3 = 0
    do i_va = 1, n_va
      max_max_grid_len3 = max(max_max_grid_len3, va(i_va)%max_grid_len3)
    end do
    !$omp parallel
        ! each thread allocates (only once) the working data arrays
        allocate(rele_surf(max_max_grid_len3), stat=status)
        if(status .ne. 0) stop
        allocate(weight(max_max_grid_len3), stat=status)
        if(status .ne. 0) stop
    !$omp end parallel
    ...
    ! your code follows
    ! *** remove the allocation/deallocation
    ! *** remove the private(rele_surf,weight)
    ! *** do not use size(rele_surf) or size(weight)
    ! *** use copy of va(i_va)%max_grid_len3 instead

Jim Dempsey
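
For readers who want to try the pattern in isolation, here is a small self-contained sketch of the same idea: a threadprivate module array allocated once per thread and reused across parallel regions. It is not the original poster's code. The names work_buffers, buf, n_items and max_len and the arithmetic are hypothetical stand-ins for rele_surf, weight and the real computation.

```fortran
module work_buffers
    implicit none
    real, allocatable :: buf(:)         ! hypothetical stand-in for rele_surf / weight
    !$omp threadprivate(buf)
end module work_buffers

program reuse_buffers
    use work_buffers
    implicit none
    integer, parameter :: n_items = 1000, max_len = 64   ! hypothetical sizes
    integer :: i, k, len_i
    real :: total

    ! Each thread allocates and first-touches its own buffer, once
    !$omp parallel
    allocate(buf(max_len))
    buf = 0.0
    !$omp end parallel

    total = 0.0
    !$omp parallel do private(k, len_i) reduction(+:total) schedule(static)
    do i = 1, n_items
        len_i = 1 + mod(i, max_len)     ! working length for this item, at most max_len
        do k = 1, len_i
            buf(k) = 1.0
        end do
        ! Use the tracked length, not size(buf), since buf is oversized
        total = total + sum(buf(1:len_i))
    end do
    !$omp end parallel do

    print *, 'total =', total

    !$omp parallel
    deallocate(buf)
    !$omp end parallel
end program reuse_buffers
```

Threadprivate data is only guaranteed to persist between parallel regions when the number of threads stays the same and dynamic thread adjustment is off, so the sketch assumes a fixed team size for the whole run.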