Hello! I have encountered a problem when programming Fortran OpenMP code. I used a PARALLEL DO clause to parallelize a time-consuming part of my Fortran code. However, the code performed quite differently on my PC (Intel(R) Core(TM) i7-3770 @ 3.4GHz) and on the cluster (Intel(R) Xeon(R) CPU E5-2620 @ 2.10GHz). The performance on the PC was satisfactory, about 74% parallel efficiency with 4 threads; on the cluster, however, it was only 26%.
I have been frustrated by this problem for a long time, and I think something beyond my knowledge may be involved, so I am searching for help here. I would appreciate any help, thank you very much!
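(Assuming parallel efficiency here means speedup divided by the number of threads, 74% on 4 threads corresponds to a speedup of roughly 3x, while 26% means the additional threads are contributing comparatively little.)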
Tags: Parallel Computing
How many iterations does the DO perform?
How many threads are available inside the parallel region?
Jim Dempsey
When initializing memory on the (presumably 2-socket) server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on one of the two sockets, and 12 of 24 cores have to do remote memory access.
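For illustration, a minimal first-touch sketch (the array a, its size n, and the loop body are hypothetical; the point is that the initialization loop uses the same schedule and decomposition as the later compute loop):

program first_touch_demo
   implicit none
   integer, parameter :: n = 10000000      ! hypothetical array size
   real, allocatable  :: a(:)
   integer :: i

   allocate(a(n))

   ! First touch: initialize in parallel with the same static schedule the
   ! compute loop will use, so each page is mapped on the socket of the
   ! thread that will later work on it.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = 0.0
   end do
   !$omp end parallel do

   ! Compute loop: same schedule, so most accesses stay socket-local.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = a(i) + 1.0
   end do
   !$omp end parallel do

   print *, sum(a)
end program first_touch_demo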
jimdempseyatthecove wrote:
How many iterations does the DO perform?
How many threads are available inside the parallel region?
Jim Dempsey
The iterations are controlled by "n_vbns" and number many thousands.
The available threads are controlled by "n_cpu", which is at most 8 on the PC and at most 12 on the cluster.
Gregg S. (Intel) wrote:
When initializing memory on the (presumably 2-socket) server, be sure to first touch it in parallel, the same way it will later be used. Otherwise, all the pages end up on one of the two sockets, and 12 of 24 cores have to do remote memory access.
I think this is helpful, but why are the behaviors on the PC and the cluster so different? Can you give more details?
ZT,
When you post code, please copy it to the clipboard, then in the forum click on the {...} code button on the toolbar. This will open a dialog box with a pull-down control and an edit box.
Click the pull-down and select Fortran (or C++ if posting C++ code), then paste the contents of the clipboard into the edit box.
Doing this is quicker than uploading a screenshot, .AND. it will permit readers to copy and paste your source when formulating a response. You can also paste the complete loop.
Gregg's reply #3 concerns behavior that interacts with your program behind the scenes. In a NUMA setup, the performance gains can only be attained by carefully managing your allocations and deallocations. Preferably, the allocations are reused by the same thread .OR. the same sections of the allocations are first touched and then reused by the same thread.
The code you have shown allocates (and presumably deallocates) rele_surf and weight n_va times _outside_ the parallel region by the main thread; then, upon entry to the parallel region, each of the additional threads of the team allocates an additional n_va times. The likelihood of each of those thread allocations getting the same virtual addresses on each of the n_va * 6 (* number of threads) allocations is virtually nil.
Consider making rele_surf and weight module allocatable arrays declared threadprivate and initially unallocated, then allocating each to the maximum size required. Sketch:
module threadprivate_data
   real, allocatable :: rele_surf(:), weight(:)
   !$omp threadprivate(rele_surf, weight)
end module threadprivate_data

program your_program
   use threadprivate_data
   implicit none
   ...
   ! once only code
   ! after you know the working sizes
   ! compute the largest allocation size
   max_max_grid_len3 = 0
   do i_va = 1, n_va
      max_max_grid_len3 = max(max_max_grid_len3, va(i_va)%max_grid_len3)
   end do
   !$omp parallel
   ! each thread allocates (only once) the working data arrays
   allocate(rele_surf(max_max_grid_len3), stat=status)
   if(status .ne. 0) stop
   allocate(weight(max_max_grid_len3), stat=status)
   if(status .ne. 0) stop
   !$omp end parallel
   ...
   ! your code follows
   ! *** remove the allocation/deallocation
   ! *** remove the private(rele_surf,weight)
   ! *** do not use size(rele_surf) or size(weight)
   ! *** use copy of va(i_va)%max_grid_len3 instead
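As an illustration of those last comments, a hedged sketch of what the modified loop might then look like (n_va, va, and max_grid_len3 are taken from the sketch above; process_vbn and len3 are hypothetical stand-ins for the real loop body and a local length variable):

!$omp parallel do private(len3)
do i_va = 1, n_va
   len3 = va(i_va)%max_grid_len3
   ! work on the first len3 elements of the threadprivate arrays;
   ! no private(rele_surf, weight) clause and no per-iteration allocate
   call process_vbn(va(i_va), rele_surf(1:len3), weight(1:len3))
end do
!$omp end parallel do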
Jim Dempsey