I've written a parallel Fortran program for numerical computing based on OpenMP. I tested it on a workstation with 4 CPUs (48 cores / 96 threads in total), but I found that the time consumption did not change much when I switched the number of threads from 24 to 48. Does anyone know the possible reasons?
The program is very long, so I will just show its outline here:
```fortran
!$ call omp_set_num_threads(Threads_num)

do CurrentTimeStep = StartTimeStep + 1, EndTimeStep

   Call Streaming()

   !$OMP PARALLEL
   !$OMP DO PRIVATE(i) SCHEDULE(guided,4)
   do i = 1, ELE_num
      if ( ELE_PML_mark(i) ) then
         Call Collision_LBGK_PML(i)
      else
         Call Collision_LBGK(i)
      end if
   end do
   !$OMP END DO
   !$OMP END PARALLEL

end do
```
Inside the subroutines Collision_LBGK_PML(i) and Collision_LBGK(i) there are no large DO loops.
Inside the subroutine Streaming() there are several large OMP DO loops:
```fortran
do RK_i = 1, RK_stage

   !$OMP PARALLEL
   !$OMP DO PRIVATE(i,j,tmpi,face,k)
   do i = 1, BOU_num
      do j = 1, BouInfo(i)%FaceNum
         tmpi = BouInfo(i)%ElementID(j)
         face = BouInfo(i)%FaceID(j)
         do k = 1, Nfp
            call boundary_conditions(tmpi,face,k)
         end do
      end do
   end do
   !$OMP END DO
   !$OMP END PARALLEL

   !$OMP PARALLEL
   !$OMP DO PRIVATE(i,j,alpha,F2E,invM_Sx_f,invM_Sy_f,invM_Sz_f,k,invM_R,face...) SCHEDULE(guided,4)
   do i = 1, ELE_num
      do j = 1, Np
         invM_Sx_f = 0.0
         invM_Sy_f = 0.0
         invM_Sz_f = 0.0
         do k = 1, Np
            do alpha = 1, 18
               invM_Sx_f(alpha) = invM_Sx_f(alpha) + ...
               invM_Sy_f(alpha) = invM_Sy_f(alpha) + ...
               invM_Sz_f(alpha) = invM_Sz_f(alpha) + ...
            end do
         end do
         do alpha = 1, 18
            invM_R = 0.0
            do face = 1, 4
               if ( ELE(i)%n_ea(face,alpha) < 0.0 ) then
                  if ( ELE(i)%F2B(face) == 0 ) then
                     do k = 1, Nfp
                        invM_R = invM_R + ...
                     end do
                  else
                     do k = 1, Nfp
                        invM_R = invM_R + ...
                     end do
                  end if
               end if
            end do
            ELE(i)%df(alpha,j) = a(RK_i)*ELE(i)%df(alpha,j)                 &
                               + dt*( invM_R - ea(alpha,1)*invM_Sx_f(alpha) &
                                             - ea(alpha,2)*invM_Sy_f(alpha) &
                                             - ea(alpha,3)*invM_Sz_f(alpha) )
         end do
      end do
   end do
   !$OMP END DO
   !$OMP END PARALLEL

   !$OMP PARALLEL
   !$OMP DO PRIVATE(i) SCHEDULE(guided,4)
   do i = 1, ELE_num
      ELE(i)%f(1:18,:) = ELE(i)%f(1:18,:) + b(RK_i)*ELE(i)%df(1:18,:)
   end do
   !$OMP END DO
   !$OMP END PARALLEL

end do
```
The subroutine boundary_conditions does not contain any large loops.
ELE_num is a large integer (10^5 to 10^6), so all of the loops over i = 1, ELE_num are parallelized.
There are NO read/write (I/O) operations in any of the subroutines.
The program is compiled with the Intel Fortran compiler 2019 using the following settings:
- optimization level: O3
- favor fast code
- parallelization: yes
- threshold for auto-parallelization: 100
- threshold for auto-vectorization: 100
- prefetch insertion: aggressive
- interprocedural optimization: yes
- enable matrix multiply call: yes
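For reference, I believe these IDE settings correspond roughly to the following command line (Windows /Q-style ifort options; this is my guess at the mapping, not copied from the build log, and the source file name is a placeholder):

```
ifort /O3 /Qopenmp /Qparallel /Qpar-threshold:100 /Qvec-threshold:100 ^
      /Qopt-prefetch /Qipo /Qopt-matmul main.f90
```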
Which version of Windows were you running? I'm wondering if this is a processor-group issue: by default a Windows process is confined to a single processor group of at most 64 logical processors, so on a 96-thread machine the extra threads may never be scheduled onto the second group.
This may be obvious, but in my limited experience increasing the number of threads with OpenMP does not automatically result in a speed increase. I believe that's because of bus contention: the memory bus shared by all the cores is often the bottleneck, so once its bandwidth is saturated, additional threads cannot help. Posters with more knowledge can supply a better explanation.
It would be interesting to know how the speed varies with the number of threads as it ranges from 1 to 24.
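If it helps, here is a minimal sketch of how such a sweep could be timed. The triad kernel and array size are stand-ins, not the original code, but a bandwidth-bound loop like this typically shows exactly the plateau described above:

```fortran
! Sketch: time a memory-bound kernel at several thread counts.
! The triad kernel and size n are placeholders, not the poster's code.
program thread_sweep
   use omp_lib
   implicit none
   integer, parameter :: n = 50000000
   integer :: nts(9) = [1, 2, 4, 8, 12, 16, 24, 32, 48]
   real(8), allocatable :: a(:), b(:), c(:)
   integer :: t, nt, i
   real(8) :: t0, t1

   allocate(a(n), b(n), c(n))
   b = 1.0d0
   c = 2.0d0

   do t = 1, size(nts)
      nt = nts(t)
      call omp_set_num_threads(nt)
      t0 = omp_get_wtime()
      !$omp parallel do private(i)
      do i = 1, n
         a(i) = b(i) + 3.0d0*c(i)   ! ~24 bytes of traffic per iteration
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      ! Once the memory bandwidth saturates, this time stops shrinking
      ! even though the thread count keeps growing.
      print '(a,i3,a,f8.4,a)', 'threads=', nt, '  time=', t1 - t0, ' s'
   end do
end program thread_sweep
```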
Running on such a large number of cores is complicated enough that we can't guess much from such extremely limited information. You don't even say whether enabling nested parallelism makes a difference, though first you may want to explore the inner and outer parallel regions separately. You also don't show the results of setting OMP_PLACES=cores, which is almost certainly needed, or of testing with the threads pinned to 1 and 2 CPUs. You probably need to test scaling for thread counts 1, 2, 4, ... 12, and so on. Perhaps you are hinting that it's too much work to investigate, but then asking us isn't productive.
I haven't much experience with it, but I'd guess the (possibly unnecessary) use of GUIDED scheduling may work better with the threads pinned to 1 CPU.
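Concretely, such an affinity experiment might look like this (Windows cmd syntax assumed since a Windows version was asked about; use export instead of set on Linux; OMP_PLACES/OMP_PROC_BIND are the standard OpenMP controls and KMP_AFFINITY is the Intel-runtime alternative):

```
:: One thread per physical core, threads packed close together (1 CPU first)
set OMP_NUM_THREADS=24
set OMP_PLACES=cores
set OMP_PROC_BIND=close

:: ...or spread the same threads across all 4 sockets for comparison
set OMP_PROC_BIND=spread

:: Intel-runtime equivalent (pick one mechanism, not both)
set KMP_AFFINITY=granularity=core,compact
```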
Wow, now that I have an internet service (Viasat) which doesn't go through AT&T or Verizon, I'm not blocked from reaching Intel's login server.
>> the program is compiled using Intel fortran compiler 2019 under the following setup:
>> optimization level: O3
>> favor fast code
>> parallelization: yes **********************
>> threshold for auto-parallelization: 100
>> threshold for auto-vectorization: 100
*** Do not mix auto-parallelization of loops with OpenMP parallelization.
Use either /Qopenmp .OR. /Qparallel, .NOT. both.
Your description sounds like you are using (or attempting to use) nested parallelism. While you can do this, you must be careful about how you nest your parallel regions. Simply enabling nested parallel regions and then parallelizing everything generally results in severe oversubscription of threads.
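As a minimal illustration (hypothetical code, not taken from your program): with nesting enabled, 24 outer threads each spawning a 24-thread inner team asks for 576 threads on 96 hardware threads:

```fortran
! Sketch of accidental nested parallelism (hypothetical example).
program nested_demo
   use omp_lib
   implicit none
   integer :: i, j

   call omp_set_nested(.true.)            ! the setting you do NOT want here
   !$omp parallel do num_threads(24) private(j)
   do i = 1, 24
      !$omp parallel do num_threads(24)   ! each outer thread forks 24 more
      do j = 1, 24
         if (i == 1 .and. j == 1) print *, 'inner team:', omp_get_num_threads()
      end do
      !$omp end parallel do
   end do
   !$omp end parallel do

   ! Remedy: keep one level of parallelism -- compile with /Qopenmp alone
   ! (no /Qparallel) and/or limit nesting explicitly:
   call omp_set_max_active_levels(1)
end program nested_demo
```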
Jim Dempsey
