Intel® Fortran Compiler

Why does my program not run faster on more threads?

Xia__Brian
Beginner

I've written a parallel Fortran program for numerical computing based on OpenMP. I tested it on a workstation with 4 CPUs (48 cores / 96 threads in total), but I found that the time consumption did not change much when I switched the number of threads from 24 to 48. Does anyone know the possible reasons?

The program is very long, so I just show its outline here:

 

!$ call omp_set_num_threads(Threads_num)	
	   do CurrentTimeStep = StartTimeStep + 1, EndTimeStep  

		   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
			Call Streaming()            !
		   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

			!$OMP PARALLEL 
			!$OMP DO PRIVATE(i) SCHEDULE(guided,4)
			do i = 1, ELE_num
				if ( ELE_PML_mark(i) ) then
			   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~							
					 Call Collision_LBGK_PML(i)   !
			   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				else 
			   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~						
					Call Collision_LBGK(i)   	! 
			   !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~				
				end if 
			end do 
			!$OMP END DO      
			!$OMP END PARALLEL

		end do

Inside the subroutines Collision_LBGK_PML(i) and Collision_LBGK(i) there is no large DO loop.

Inside the subroutine Streaming() there are several large OMP DO loops:

		do RK_i = 1, RK_stage
		
			!$OMP PARALLEL			
			!$OMP DO PRIVATE(i,j,tmpi,face,k) 	
			Do i = 1, BOU_num
				Do j = 1, BouInfo(i)%FaceNum
					tmpi = BouInfo(i)%ElementID(j)	! 
					face = BouInfo(i)%FaceID(j)		! 
					do k = 1, Nfp
						call boundary_conditions(tmpi,face,k) !			
					end do
				End do 
			End do 
			!$OMP END DO 
			!$OMP END PARALLEL
	
			!$OMP PARALLEL			
			!$OMP DO PRIVATE(i,j,alpha,F2E,invM_Sx_f,invM_Sy_f,invM_Sz_f,k,invM_R,face...) SCHEDULE(guided,4)     
			do i = 1, ELE_num
				 
				  do j = 1, Np
				  
						invM_Sx_f = 0.0
						invM_Sy_f = 0.0
						invM_Sz_f = 0.0
					  
						do k = 1, Np	
						   do alpha = 1, 18
							  invM_Sx_f(alpha) = invM_Sx_f(alpha) + ...
							  invM_Sy_f(alpha) = invM_Sy_f(alpha) + ...
							  invM_Sz_f(alpha) = invM_Sz_f(alpha) + ...
						   end do 
						end do 
											
												
						do alpha = 1, 18
						
                              invM_R = 0.0
                              
							  do face = 1, 4

									if ( ELE(i)%n_ea(face,alpha) < 0.0 ) then	
										 if ( ELE(i)%F2B(face)==0 ) then		! 
											
											do k = 1, Nfp
												invM_R = invM_R + ...
											end do 
											
										 else									! 
											
											do k = 1, Nfp
												invM_R = invM_R + ...
											end do
											
										end if 
									end if
									
							  end do        
								  
							  ELE(i)%df(alpha,j) = a(RK_i)*ELE(i)%df(alpha,j)&
													+ dt*( invM_R - ea(alpha,1)*invM_Sx_f(alpha) &
																- ea(alpha,2)*invM_Sy_f(alpha) &
																- ea(alpha,3)*invM_Sz_f(alpha) ) 
	
					   end do
					   
				  end do
			end do     
			!$OMP END DO  
			!$OMP END PARALLEL
			
			!$OMP PARALLEL			 
			!$OMP DO PRIVATE(i)  SCHEDULE(guided,4)
			do i = 1, ELE_num 			
				 ELE(i)%f(1:18,:) = ELE(i)%f(1:18,:) + b(RK_i)*ELE(i)%df(1:18,:) 
			end do        
			!$OMP END DO   			
			!$OMP END PARALLEL  
			
		end do      

The subroutine boundary_conditions does not contain a large loop.

ELE_num is a large integer (10^5 to 10^6), so all of the loops over i = 1, ELE_num are parallelized.

There are NO read/write (I/O) operations in any of the subroutines.

 

The program is compiled using the Intel Fortran compiler 2019 with the following settings:

optimization level: O3

favor fast code

parallelization: yes

threshold for auto-parallelization : 100

threshold for auto-vectorization : 100

prefetch insertion: aggressive

interprocedural optimization: yes

enable matrix multiply call : yes

 

Bruce_Weaver
Beginner

Which version of Windows were you running?  I'm wondering if this is a processor group issue.

gib
New Contributor II

This may be obvious, but in my limited experience increasing the number of threads with OpenMP does not automatically result in a speed increase. I believe that's because of bus contention(?): the bus that's shared for memory accesses is often the bottleneck. Posters with more knowledge can supply a better explanation.

It would be interesting to know how the speed varies with number of threads as it ranges from 1 to 24.
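
To make that measurement concrete, below is a minimal, self-contained sketch (not the poster's code) of how one might record wall time versus thread count with omp_get_wtime. The triad-style loop is only a memory-bound stand-in for the real kernels, so the absolute times mean nothing; what matters is whether the time keeps dropping as threads are added or flattens out once memory bandwidth saturates.

	! Scaling-test sketch: times a memory-bound loop for several thread counts.
	program scaling_test
		use omp_lib
		implicit none
		integer, parameter :: n = 20000000
		integer :: counts(7) = (/ 1, 2, 4, 8, 16, 24, 48 /)
		integer :: k, i, rep
		double precision, allocatable :: a(:), b(:), c(:)
		double precision :: t0, t1

		allocate(a(n), b(n), c(n))
		b = 1.0d0
		c = 2.0d0

		do k = 1, size(counts)
			call omp_set_num_threads(counts(k))
			t0 = omp_get_wtime()
			do rep = 1, 10
				!$omp parallel do private(i)
				do i = 1, n
					a(i) = b(i) + 0.5d0*c(i)   ! 2 loads + 1 store per iteration: bandwidth-bound
				end do
				!$omp end parallel do
			end do
			t1 = omp_get_wtime()
			print '(a,i3,a,f8.3,a)', 'threads = ', counts(k), '   wall time = ', t1 - t0, ' s'
		end do
	end program scaling_test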

TimP
Honored Contributor III

Running on such a large number of cores is complicated enough that, with such extremely limited information, we can't guess much. You don't even say whether it makes a difference when you enable nested parallelism; you may first want to explore the inner and outer parallel regions separately. You also don't show the results of setting OMP_PLACES=cores, which is almost certainly needed, or of testing with the threads pinned to 1 and 2 CPUs. Since you probably need to test scaling for thread counts 1, 2, 4, ... 12, ..., perhaps you are hinting that it's too much work to investigate, but then asking us isn't productive.
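
For the OMP_PLACES/OMP_PROC_BIND part of that advice, a small diagnostic sketch like the one below (again, not part of the poster's program) can be used to check where the threads actually land: set the environment variables before running, e.g. OMP_PLACES=cores and OMP_PROC_BIND=close, and compare the reported places for 24 vs. 48 threads.

	! Affinity check: each thread reports the OpenMP place it is running in.
	program show_places
		use omp_lib
		implicit none
		!$omp parallel
		print '(a,i3,a,i4,a,i4)', 'thread ', omp_get_thread_num(), &
			'  in place ', omp_get_place_num(), '  of ', omp_get_num_places()
		!$omp end parallel
	end program show_places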

I don't have much experience with it, but I'd guess that the (possibly unnecessary) use of GUIDED scheduling may work better with the threads pinned to 1 CPU.

Wow, now that I have an internet service (Viasat) which doesn't go through AT&T or Verizon, I'm not blocked from reaching Intel's login server.

jimdempseyatthecove
Honored Contributor III

>>the program is compiled using Intel fortran compiler 2019 under the following setup:

optimization level: O3

favor fast code

parallelization: yes **********************

threshold for auto-parallelization : 100

threshold for auto-vectorization : 100

*** Do not mix auto-parallelization of loops with OpenMP parallelization

Use either /openmp .OR. /parallel .NOT. both

Your description sounds like you are using (or attempting to use) nested parallelism. While you can do this, you must be careful in how you nest your parallel regions. Simply enabling nested parallel regions and then parallelizing everything generally results in severe oversubscription of threads.
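
A minimal sketch of that oversubscription effect (illustrative only, with made-up thread counts): with nesting allowed and no limits, every thread of an outer team that reaches an inner PARALLEL region starts its own team, so an outer team of N threads turns into roughly N*N threads in total.

	! Nested-parallelism demo: counts the implicit tasks created in the inner regions.
	program nested_demo
		use omp_lib
		implicit none
		integer :: total
		total = 0
		call omp_set_nested(.true.)         ! enable nested regions (pre-OpenMP-5.0 style)
		call omp_set_max_active_levels(2)   ! allow two active parallel levels
		call omp_set_num_threads(4)
		!$omp parallel
			!$omp parallel
				!$omp atomic
				total = total + 1           ! one increment per innermost thread
			!$omp end parallel
		!$omp end parallel
		print *, 'threads in innermost regions:', total   ! typically 4*4 = 16
	end program nested_demo

Capping the nesting (for example with call omp_set_max_active_levels(1), or simply leaving nesting disabled) makes each inner region run with a team of one thread, which is usually what you want when the outer loop already uses all the cores.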

Jim Dempsey
