How to optimize this codes to make it as fast as possible in Fortran?

Xia__Brian · ‎06-06-2021

Hi, everyone! I am writing a sort of scientific computing program which contains a part looks as follows:

! derived type
type, public :: block_arrayRR_type			
	real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:) 

! temporary array
real(8)		invM_R(8,maxNp)	! maxNp is the biggest vector size	
real(8) 	invM_Sx_f(8)
real(8) 	invM_Sy_f(8)

...

! main loop
do i = 1, ELE_num

...


	! fetch sub vector and matrix size Np		
	associate( Np => ELE_Ept_num(i) )

	do Ept = 1, Np

		invM_Sx_f = 0.0			
		invM_Sy_f = 0.0

		! fetch sub vector index, pt varis with Ept in unit stride
		! such as Ept=1,2,3..., pt=10,11,12,...,
		associate( pt => ELE_EP2P(Ept,i) )

		do Ept2 = 1, Np

			! fetch sub vector index, pt2 varis with Ept2 in unit stride
			associate( pt2 => ELE_EP2P(Ept2,i) )
	
			invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,Pt2)      
			invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,Pt2)      
			invM_Sx_f(5) = invM_Sx_f(5) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,Pt2)      
			invM_Sx_f(6) = invM_Sx_f(6) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,Pt2)      
			invM_Sx_f(7) = invM_Sx_f(7) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,Pt2)      
			invM_Sx_f(8) = invM_Sx_f(8) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)       

			invM_Sy_f(2) = invM_Sy_f(2) + Dy(i)%matA(Ept2,EPt)*Pt_f(2,pt2)        
			invM_Sy_f(4) = invM_Sy_f(4) + Dy(i)%matA(Ept2,EPt)*Pt_f(4,pt2) 
			invM_Sy_f(5) = invM_Sy_f(5) + Dy(i)%matA(Ept2,EPt)*Pt_f(5,pt2) 
			invM_Sy_f(6) = invM_Sy_f(6) + Dy(i)%matA(Ept2,EPt)*Pt_f(6,pt2) 
			invM_Sy_f(7) = invM_Sy_f(7) + Dy(i)%matA(Ept2,EPt)*Pt_f(7,pt2) 
			invM_Sy_f(8) = invM_Sy_f(8) + Dy(i)%matA(Ept2,EPt)*Pt_f(8,pt2)
	
			end associate
	
		end do ! Ept2	

		Pt_df(1,pt) = RK_a*Pt_df(1,pt) + dt*( invM_R(1,Ept)-ea(1,1)*invM_Sx_f(1)) 
		Pt_df(2,pt) = RK_a*Pt_df(2,pt) + dt*( invM_R(2,Ept)-ea(2,2)*invM_Sy_f(2)) 
		Pt_df(3,pt) = RK_a*Pt_df(3,pt) + dt*( invM_R(3,Ept)-ea(1,3)*invM_Sx_f(3)) 
		Pt_df(4,pt) = RK_a*Pt_df(4,pt) + dt*( invM_R(4,Ept)-ea(2,4)*invM_Sy_f(4)) 
		Pt_df(5,pt) = RK_a*Pt_df(5,pt) + dt*( invM_R(5,Ept)-ea(1,5)*invM_Sx_f(5)&
														   -ea(2,5)*invM_Sy_f(5))
		Pt_df(6,pt) = RK_a*Pt_df(6,pt) + dt*( invM_R(6,Ept)-ea(1,6)*invM_Sx_f(6)&
														   -ea(2,6)*invM_Sy_f(6)) 
		Pt_df(7,pt) = RK_a*Pt_df(7,pt) + dt*( invM_R(7,Ept)-ea(1,7)*invM_Sx_f(7)&
														   -ea(2,7)*invM_Sy_f(7)) 
		Pt_df(8,pt) = RK_a*Pt_df(8,pt) + dt*( invM_R(8,Ept)-ea(1,8)*invM_Sx_f(8)&
														   -ea(2,8)*invM_Sy_f(8)) 

		end associate

	end do! Ept	
	
	end associate

...

end do! i

All variables appear in the code above are intrinsic array or scalar data type (except Dx and Dy).

It is the fastest version I can get so far, but the speed is still not satisfactory yet.

Can anyone provide help? Thanks a lot!

JohnNichols · ‎06-06-2021

Patiently await the response from Jim or Steve or mecej4.

I would suggest that repeating the same operation is not necessarily optimal.

I use multiple threads, but your problem is small in terms of steps to thread.

Once you start run timing profiles to see if it actually helps.

Good luck.

JohnNichols · ‎06-06-2021

Forgot Fortran Fan - it all depends who is watching on a Sunday afternoon when there is no football.

Xia__Brian · ‎06-06-2021

Hello, everyone! I am the oringinal poster. Thanks for all the responses!

I am freshman in forum and I am so careless that I post this thread early before I finish writing.

I forgot to provide some additional info which may help you to better understand the situation :

The core of the code is perform matrix multiplication( Dx * Pt_f , Dy * Pt_f ) and update Pt_df.

It looks a bit weird and lengthy because of the following reasons:

1. Not all 'alpha' in Pt_f (alpha,pt) needs to evaluate because some ea(:,alpha) equals zero exactly.

So I have already omitted them and manually unrolled the alpha loop, as can be seen in the above codes.

It makes the code lengthy but also makes it faster. I guess manually unrolling loop helps compiler to vectorize the codes.

2. For different element ' i ' , the matrix and vector size 'Np' may be different.

To reduce memory consumption, I pack the vectors into a long one( such as Pt_f(alpha, pt) )

and arrange the matrices to structured type( such as Dx(i)%maxA(Np,Np) ).

As you see ELE_EP2P(Ept,i) is used to fetch the sub vector index 'pt'.

The matrix and vetor size Np ranges about 5~200 depending on specific application.

The main loop count ELE_num is a large number.

I have already tried many modification to this code, but the above seems the fastest one I can get.

By VTune profiling, I found the percentage of vectorization Ops is low(<20%) and FLOPS speed is also not ideal.

I do believe there is definitly much better way to perform this task because it is indeed matrix multiplication!!!!

I have tried the intrinsic function 'matmul' and MKL blas subroutine 'gemv', but I can not get performance gains.

Can anyone provide some useful suggestions??

jimdempseyatthecove · ‎06-07-2021

1) I would suggest you change the Ept2 loop

from constructing two non-unit stride temporary arrays (two 8-element arrays containing 6 values not compacted)
to constructing three four-element contiguous arrays

2) Prior to the DO i loop construct three four-element copies of the ea array in the selection order as used after the Ept2 looop

These two changes should permit better AVX2 optimization.

Jim Dempsey

JohnNichols · ‎06-07-2021

Jim Is there no way to break up the operations and use more threads?

Xia__Brian · ‎06-08-2021

Hi Jim

I am not sure whether I get you ideas exactly but I try the following version:

! derived type
type, public :: block_arrayRR_type			
	real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:) 

! temporary array
real(8)		invM_R(8,maxNp)	! maxNp is the biggest vector size	
real(8) 	invM_Sx_f(4)
real(8) 	invM_Sy_f(4)
real(8) 	invM_S1_f(4)


...

! main loop
do i = 1, ELE_num

...


	! fetch sub vector and matrix size Np		
	associate( Np => ELE_Ept_num(i) )

	do Ept = 1, Np

		invM_Sx_f = 0.0			
		invM_Sy_f = 0.0
		invM_S1_f = 0.0

		! fetch sub vector index, pt varis with Ept in unit stride
		! such as Ept=1,2,3..., pt=10,11,12,...,
		associate( pt => ELE_EP2P(Ept,i) )

		do Ept2 = 1, Np

			! fetch sub vector index, pt2 varis with Ept2 in unit stride
			associate( pt2 => ELE_EP2P(Ept2,i) )
	
			invM_S1_f(1) = invM_S1_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,Pt2)
			invM_S1_f(2) = invM_S1_f(2) + Dy(i)%matA(Ept2,EPt)*Pt_f(2,pt2)
			invM_S1_f(3) = invM_S1_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,Pt2) 
			invM_S1_f(4) = invM_S1_f(4) + Dy(i)%matA(Ept2,EPt)*Pt_f(4,pt2)
			
			invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,Pt2)      
			invM_Sx_f(2) = invM_Sx_f(2) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,Pt2)      
			invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,Pt2)      
			invM_Sx_f(4) = invM_Sx_f(4) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)       
			 
			invM_Sy_f(1) = invM_Sy_f(1) + Dy(i)%matA(Ept2,EPt)*Pt_f(5,pt2) 
			invM_Sy_f(2) = invM_Sy_f(2) + Dy(i)%matA(Ept2,EPt)*Pt_f(6,pt2) 
			invM_Sy_f(3) = invM_Sy_f(3) + Dy(i)%matA(Ept2,EPt)*Pt_f(7,pt2) 
			invM_Sy_f(4) = invM_Sy_f(4) + Dy(i)%matA(Ept2,EPt)*Pt_f(8,pt2)
	
			end associate
	
		end do ! Ept2	
								! ea1 = [eaT(1,1),eaT(2,2),eaT(3,1),eaT(4,2)]  
		Pt_df(1,pt) = RK_a*Pt_df(1,pt) + dt*( invM_R(1,Ept)-ea1(1)*invM_S1_f(1))
		Pt_df(2,pt) = RK_a*Pt_df(2,pt) + dt*( invM_R(2,Ept)-ea1(2)*invM_S1_f(2)) 
		Pt_df(3,pt) = RK_a*Pt_df(3,pt) + dt*( invM_R(3,Ept)-ea1(3)*invM_S1_f(3)) 
		Pt_df(4,pt) = RK_a*Pt_df(4,pt) + dt*( invM_R(4,Ept)-ea1(4)*invM_S1_f(4)) 
		Pt_df(5,pt) = RK_a*Pt_df(5,pt) + dt*( invM_R(5,Ept)-eaT(5,1)*invM_Sx_f(1)&
														   -eaT(5,2)*invM_Sy_f(1))
		Pt_df(6,pt) = RK_a*Pt_df(6,pt) + dt*( invM_R(6,Ept)-eaT(6,1)*invM_Sx_f(2)&
														   -eaT(6,2)*invM_Sy_f(2)) 
		Pt_df(7,pt) = RK_a*Pt_df(7,pt) + dt*( invM_R(7,Ept)-eaT(7,1)*invM_Sx_f(3)&
														   -eaT(7,2)*invM_Sy_f(3)) 
		Pt_df(8,pt) = RK_a*Pt_df(8,pt) + dt*( invM_R(8,Ept)-eaT(8,1)*invM_Sx_f(4)&
														   -eaT(8,2)*invM_Sy_f(4)) 

		end associate

	end do! Ept	
	
	end associate

...

end do! i

On standalone tests, the above version did work!!! So do I make the correct change?

It can save 30% time compared with the original one (both compiled with /QxCORE-AVX2 and running on i7-7400)

But when I integrate it into my whole program, I found this part gives little gains... It is weird.( Maybe still my fault!)

So I wonder :

1. Is there anything else that I should pay attention to when targeting vectorization?

2. Apart from the current modification, is it possible to use ' matmul' or BLAS routines to achieve better performance and how?

Thanks

jimdempseyatthecove · ‎06-08-2021

Good job on converting hints to code.

If the post Ep2 loop code is not vectorizing completely then switch from

real(8) :: eaT(5:8,2)

to

real(8) :: eaT1(5:8), eaT2(5:8)

Additional hint:

I find it best to use VTune, and then look at the code bottlenecks in the Disassembly view.

You do not have to fully understand assembly code to assess if your temporaries are scalar or vector and/or registered/RAM.

Jim Dempsey

Xia__Brian · ‎06-08-2021

I don t really understand why switching from 2D array to two 1D array will help?

Memory access should be continuous in both cases I guess?

Unfortunately, I almost know nothing about assembly codes.

When I use VTune, I find its help manual not detailed enough for me(freshman).

I guess I'll spend a long time to learn necessary knowledge before I can pin point the problem in my code using VTune...

Ulrich_M_ · ‎06-08-2021

Are you using OpenMP to parallelize your code? That makes a huge performance difference, and is not difficult to do.

Xia__Brian · ‎06-08-2021

Of course, OpenMP has been used on outmost loop and I have already reached a fairly good parallel efficiency even on NUMA arch.

What I want to discuss here is about vectorization and Cache.

I recently read an article also written by Jim:

Superscalar programming 101 (Matrix Multiply) Part 3 of 5

then and I realize there is stil a lot of potential on single core performance.

However the tricks in that article seem still difficult for me to undertand and use so far.

Ulrich_M_ · ‎06-09-2021

If you use OpenMP, and Dx and Dy are large, then the bottleneck might well be (relatively slow) RAM access due to cache misses. You might want to experiment with the number of parallel threads to empirically find the optimum for your application.

kay-diederichs · ‎06-30-2021

I agree with Ulrich: memory access may be the bottleneck. Just try REAL(4) instead of REAL(8). It may be twice as fast.

Xia__Brian · ‎07-01-2021

Yes, Dx is really a large block matrix and in parallel processing, each thread is trying to access different part of Dx at same time.

This can cause large bandwidth requirement.

Some story behind the codes:

On the old machine with single-channel DDR4, RAM access is once a real problem. To deal with it, I tried buiding the each Dx(i) matrix at runtime (whenever it is used) instead of pre-computing & storing the entire one. It requires fewer data access but apparently increase the computing effort! So you see it is no more than a trade-off.

Later, I upated the computer and add multi-channel RAM. the bandwidth is improved. Then I found the present version seems more profitable. The old version is abandoned.

Even with real(8), the code so far is still not very likely to make full use of AVX256 / AVX512 at all time.

plus the fact that memory access issue is not so severe now, using real(4) can only give about 15% speedup depending on applications.

jimdempseyatthecove · ‎06-08-2021

>>I don t really understand why switching from 2D array to two 1D array will help

The suggestion was: If the...

While in abstract terms the two methods should be equivalent, the compiler optimization code may (potentially) miss the optimization opportunity. IF that is the case, then by simplifying the source code (2D array to two 1D arrays) may yield the desired results.

VTune is an invaluable tool to identify bottlenecks in your code. You do not need to learn "All about VTune" to get started. Start with

Collect Hot Spots
View Bottom-Up (lists hottest procedures in descending order)

Double-Click on top (hottest) procedure to view source

If interested, click the Assembly window to view assembly side-by-side.

While learning all about VTune is complicated, the basics is quite easy.

Jim Dempsey

Xia__Brian · ‎06-08-2021

It seems swiching form 2D to 1D does not make difference in my case here.

I will try more VTune to see if I can get something new. And I'll update if I have problem.

Thanks for your reply! It does help a lot!

jimdempseyatthecove · ‎06-30-2021

While the compiler optimization should have caught this, replace the associate pt and pt2/end associate with the fetch of the index into ipt and ipt2. This might save a little bit of overhead.

I am not sure whether I get you ideas exactly but I try the following version: 

 

! derived type
type, public :: block_arrayRR_type			
	real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:) 

! temporary array
real(8)		invM_R(8,maxNp)	! maxNp is the biggest vector size	
real(8) 	invM_Sx_f(4)
real(8) 	invM_Sy_f(4)
real(8) 	invM_S1_f(4)
integer         iNp, ipt, ipt2

...

! main loop
do i = 1, ELE_num

...


  ! fetch sub vector and matrix size Np		
  iNp = ELE_Ept_num(i)

  do Ept = 1, iNp

    invM_Sx_f = 0.0			
    invM_Sy_f = 0.0
    invM_S1_f = 0.0

    do Ept2 = 1, Np

      ! fetch sub vector index, pt2 varis with Ept2 in unit stride
      ipt2 = ELE_EP2P(Ept2,i)
      block
        real(8) rDxDyDxDy(4)

        rDxDyDxDy(1) = Dx(i)%matA(Ept2,Ept)
        rDxDyDxDy(2) = Dy(i)%matA(Ept2,EPt)
        rDxDyDxDy(3:4) = rDxDyDxDy(1:2)
        invM_S1_f = invM_S1_f + rDxDyDxDy*Pt_f(1:4,iPt2)
        invM_Sx_f = invM_Sx_f + rDxDyDxDy(1)*Pt_f(5:8,iPt2)      
        invM_Sy_f = invM_Sy_f + rDxDyDxDy(2)*Pt_f(5:8,ipt2) 

      end block
    end do ! Ept2	
    ! ea1 = [eaT(1,1),eaT(2,2),eaT(3,1),eaT(4,2)]  
    ! fetch sub vector index, pt varis with Ept in unit stride
    ! such as Ept=1,2,3..., pt=10,11,12,...,
    ipt = ELE_EP2P(Ept,i)

    Pt_df(1:4,ipt) = RK_a*Pt_df(1:4,ipt) &
                     + dt*( invM_R(1:4,Ept)-ea1(1:4)*invM_S1_f(1:4))
 
    Pt_df(5:8,ipt) = RK_a*Pt_df(5:8,ipt) &
                     + dt*( invM_R(5:8,Ept)-eaT(5:8,1)*invM_Sx_f(1:4) - eaT(5:8,2)*invM_Sy_f(1:4))
  end do! Ept	
  ...
end do! i

Something like the above (untested) code.

Jim Dempsey

Xia__Brian · ‎07-01-2021

Hi Jim

Define temperary array within block construct...emmm.... it is really fresh to me!

I have tested your codes but it seems no improvement and even a little bit slower than the previous version.....

invM_S1_f(1) = invM_S1_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,Pt2)
invM_S1_f(2) = invM_S1_f(2) + Dy(i)%matA(Ept2,EPt)*Pt_f(2,pt2)
invM_S1_f(3) = invM_S1_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,Pt2) 
invM_S1_f(4) = invM_S1_f(4) + Dy(i)%matA(Ept2,EPt)*Pt_f(4,pt2)

invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,Pt2)      
invM_Sx_f(2) = invM_Sx_f(2) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,Pt2)      
invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,Pt2)      
invM_Sx_f(4) = invM_Sx_f(4) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)

Adding Block and temperaray storage of Dx Dy in the innermost loop seems harmful.....

By the way, I always compiled with O3 option.

I guess the compiler should have detected the same reference and fetch it from cache rather than RAM after its first reference.

So there is no need to do these things manually.

jimdempseyatthecove · ‎07-01-2021

RE: O3

I assume you also included /QxCORE-AVX2

RE: Block

The intention of block is to declare a local temporary four element array in hope that the compiler will registerize it. From your test it appears that it did not.

RE: fetch it from cach

The compiler has little to do with cache. That is a memory hierarchy issue. In the attempt at using block rDxDyDxDy end block the compiler could have constructed the 4-wide array in register, but apparently did not (resulting in a store, store, store, store, load). I am not faulting the compiler writers as they do an excellent job at optimization. This specific case may be new to them and/or of low priority.

Jim Dempsey

Xia__Brian · ‎07-01-2021

So far, Mannully unrolling the innermost, incontinuous 'alpha' loop gives fairly good performance.

Even though it lookes redundant.

but It's hard to apply this pattern to other similar tasks in my application where the 'alpha' list varies with differerent element 'i'.

for example, the most direct way to implement variable 'alpha' list is to use a array 'alphaList' and a innermost 'ia' loop to access the 'alpha'

!(1) indirect access to alpha:  very slow
do Fpt = 1, Nfp
   do ia= 1, alphaListSize
        FLux(ia)=...
   end do 
   do Ept = 1, Np
       do ia= 1, alphaListSize
            invM_R(alphaList(ia),Ept)= invM_R(alphaList(ia),Ept) + Array(Ept,Fpt)*FLux(ia)
       end do
   end do 
end do

!(2) static unrolled: fastest,  but not general
do Fpt = 1, Nfp
   FLux(1)=...
   FLux(2)=...
   FLux(3)=...
   FLux(4)=...
   do Ept = 1, Np
       invM_R(1,Ept)= invM_R(1,Ept) + Array(Ept,Fpt)*FLux(1)
       invM_R(2,Ept)= invM_R(2,Ept) + Array(Ept,Fpt)*FLux(2)
       invM_R(5,Ept)= invM_R(5,Ept) + Array(Ept,Fpt)*FLux(3)
       invM_R(6,Ept)= invM_R(6,Ept) + Array(Ept,Fpt)*FLux(4)
   end do 
end do

!(3) evaluate all alpha even though some are unnecessary: slower than (2)
do Fpt = 1, Nfp
   FLux(1:8)=...
   do Ept = 1, Np
       invM_R(1:8,Ept)= invM_R(1:8,Ept) + Array(Ept,Fpt)*FLux(1:8)
   end do 
end do

the above is only a sample. In the real task the alphaList may be much more complicated.

Is there any solution to it? Or do I get stuck in the wrong direction?

jimdempseyatthecove · ‎07-01-2021

In your !(2), I would suggest you experiment with exchanging the indices of invM_R (unless the performance order is important elsewhere).

The order you use there results in strided read/modify/write.

Jim Dempsey