Help to speed up code

Dix_Carneiro__Rafael · ‎12-02-2017

Hello everyone,

I am running an algorithm that is taking about 40 seconds per iteration with Intel Fortran -- a bit slower than the same algorithm (quite efficiently) implemented in Matlab 2017. I am surprised that running times are even comparable (let alone that Fortran is slower), and I think I know at what point the lack of performance with Fortran is coming from. Basically, I think that Matlab is much more efficient in computing the maximum value and location of large arrays.

Here is what I have in Fortran (simplifying):

integer ind(60,2000)
real*8 A(2000,2000), B(60,2000)

do k = 1, 60
        
        ! max location across columns
	ind(k,:) = maxloc(A,DIM = 2)
        ! maximum value
        do j = 1, 2000
        	B(k,j) = A(k,ind(k,j))
        end do
    
end do

This portion of code is inside another loop, so I have to go over it dozens of times.

Questions: is there a more efficient way of computing the maximum value (with index location) of an array other than using maxloc? Perhaps from some library? (I checked MKL and did not find anaything)

Note that in Matlab, you get the location and maximum value with just one command, whereas here I am getting the location first, and then I need to go over a loop to get the maximum values....

Many thanks!

Rafael

andrew_4619 · ‎12-02-2017

Your array indexes are the wrong way around so in your loops you are jumping all over the place.

IanH · ‎12-02-2017

What is your matlab equivalent? How are you compiling your Fortran code? Can you show a complete test?

Your choice of the dimension ordering for ind is unfortunate - array(:,x) is a nice contiguous slice of an array while array(x,:) is a reference to array elements that are widely separated in memory. Is there a particularly reason you chose this ordering?

The implied or explicit accesses to A and B have similar contiguity issues. How are your arrays arranged in your Matlab code?

(Edit - which is just a longer version of what andrew has said...)

I assume that you want both the maximum value and index location. If it is just the former then there is no point ind being a two dimensional array.

Dix_Carneiro__Rafael · ‎12-02-2017

Dear Ian and Andrew,

Many thanks for the kind replies. Unfortunately, my code above was not 100% accurate, which I think generated sensible questions from Ian. I tried to simplify it, but failed to be accurate. My apologies. Here is what I was doing:

do k = 1, nZ
    ! for eack k there is a new matrix AA        
    AA = profit_v((k-1)*nE+1:k*nE,:) 
    pol_ind(k,:) = maxloc(AA,DIM=2)
    do j = 1, nE
        BB(k,j) = AA(j,pol_ind(k,j))
    end do
    
end do

I do need a two dimensional index vector with the location of the max -- this will be used later on in the code, as will the max value.

From what I understood from your comments, I slightly changed the above code to:

do k = 1, nZ
    ! for eack k there is a new matrix AA        
    AA = profit_v((k-1)*nE+1:k*nE,:) 
    pol_ind2(:,k) = maxloc(AA,DIM=2)
    do j = 1, nE
        BB(k,j) = AA(j,pol_ind2(j,k))
    end do
    
end do

Unfortunately, computational time did not improve.

Here is my command line:
/nologo /O2 /heap-arrays0 /Qopenmp /module:"Release\\" /object:"Release\\" /Fd"Release\vc150.pdb" /check:none /libs:static /threads /Qmkl:parallel /c

LINKER:
/OUT:"Release\TradeInformality.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Release\TradeInformality.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /IMPLIB:"C:\Users\rd123\Dropbox\TradeAndInformality\RDC_Codes\Fortran\TradeInformality\TradeInformality\Release\TradeInformality.lib"

Finally, my MATLAB code is:

for k = 1 : nZ

AA = profit_v((k-1)*nE+1:k*nE,:) ;
[BB2 , pol_ind(k,:)] = max( AA , [] , 2 ) ;
BB(k,:) = BB2' ;

end

Many thanks for your time.

Best,

Rafael

andrew_4619 · ‎12-03-2017

OK is you look at the BB(k,j) = for example the inner loop is varying J the second index if B was constructed the other way around such is was BB(j,k) you would be writing to sequential memory addresses.T The whole construct seems specifically designed to make it slow and not open to vectorisation / parallelisation in Fortran, from the very first line where AA is constructed from a non=contiguous slice of profit_v.

That said it is hard (for me anyway) to follow exactly what you are doing I assume that some of the arrays are sized with nZ and nE, the size declarations would aid understanding I think and maybe lead to some constructive suggestions for code constructs that will optimise.

Dix_Carneiro__Rafael · ‎12-03-2017

Dear Andrew,

Many thanks, your comments prompted me to do some research and learn more about how Fortran works. For future visitors of this thread, I found this very useful, which is what Andrew is referring to (I believe): http://www.cita.utoronto.ca/~merz/intel_f10b/main_for/mergedProjects/optaps_for/fortran/optaps_prg_arrs_f.htm

I was now able to reduce computational time 4 fourfold, so that is great! However, please could I ask if there is still room for improvement? Please see below. Remember, the lines below are inside a loop, so I really need to do it efficiently!

profitT = transpose(profit)
CT = transpose(C)
V_iniT = transpose(V_ini)
        
do k = 1, nZ
        
    do j = 1, nE
        pol_ind2(j,k:k) = maxloc( profitT(:,k) - CT(:,j) + V_iniT(:,k)/(1+r) )
        BB2(j,k) = profitT(pol_ind2(j,k),k) - C(j,pol_ind2(j,k)) + V_iniT(pol_ind2(j,k),k)/(1+r)
     end do
       
end do
        
BB = transpose(BB2)

In this application nZ = 60 and nE = 2000

I tried to: (1) have the bigger loop inside, (2) sweep rows for a fixed column. To do that, I had to transpose some of previously defined matrices.

Note that the row with BB2(j,k) = profit... is there because I need to save the maximum array value, so this is why I have this odd-looking vector access

profitT(pol_ind2(j,k),k)

In any case, it would be great to know if there is still *obvious* room for improvement.

Many thanks,

Rafael

andrew_4619 · ‎12-03-2017

I think what I wrote below does the same thing and might gain a little more time.

!profitT = transpose(profit)
CT = transpose(C)
!V_iniT = transpose(V_ini)
! do this work as one operation outside loop (fast)
! and thus do less work in the loop
tmp = transpose(C) + transpose(V_ini)/(1+r)
        
do k = 1, nZ    
    do j = 1, nE
        !pol_ind2(j,k:k) = maxloc( profitT(:,k) - CT(:,j) + V_iniT(:,k)/(1+r) )
        pol_ind2(j,k) = maxloc( tmp(:,k) - CT(:,j) )
        !BB2(j,k) = profitT(pol_ind2(j,k),k) - C(j,pol_ind2(j,k)) + V_iniT(pol_ind2(j,k),k)/(1+r)
	BB2(j,k) = tmp(pol_ind2(j,k),k) - C(j,pol_ind2(j,k)) 
     end do
end do
        
BB = transpose(BB2)

I think the inner loop can also be a do concurrent rather than a plain do, this might not make any difference because the compiler might parallelise this anyway without the big hint to do so.

Dix_Carneiro__Rafael · ‎12-03-2017

Dear Andrew,

This is very helpful, many thanks! This reduced computational time an additional twofold, so you helped reduce computational time eight-fold in total! Thank you so much!

Now, about "do concurrent"... Is there any way I can specify the number of threads I would like to allow?

Is it more efficient to use "do concurrent", or maybe manually parallelize parts of the loop with OpenMP?

Many thanks for being so helpful.

All the best,
Rafael

Dix_Carneiro__Rafael · ‎12-03-2017

Oh, and one more question: why did you suggest "do concurrent" for the inner loop as opposed to the outer loop? Using it in the inner or outer did not make much of a difference (probably because it just ended up using a few threads anyway), but I was curious to hear your answer. Thanks again!

andrew_4619 · ‎12-04-2017

I would switch on optimisation reports which show your code after compilation with optimisation notes showing what the compiler has decided to do to speed things up. You might not be able to improve on what you get already as the compiler is often quite smart (but note, sometimes quite stupid).

Edit also look at:

https://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems

Steve_Lionel · ‎12-04-2017

DO CONCURRENT doesn’t have a way to specify number of threads, at least directly. It uses the same logic as auto-parallel. It might be that if you call the OpenMP routine to set thread number, that will work, but I don’t recommend this as it is probably counterproductive.

jimdempseyatthecove · ‎12-04-2017

Line 6 of Andrew's post #7 should read

tmp = CT + transpose(V_ini)/(1+r)

I'd recommend using OpenMP to parallelize the code as opposed to using DO CONCURRENT.

You have more control over where and how many threads to employ.

Also look at

!profitT = transpose(profit)
CT = transpose(C)
!V_iniT = transpose(V_ini)
! do this work as one operation outside loop (fast)
! and thus do less work in the loop
tmp = CT + transpose(V_ini)/(1+r)

!$omp parallel do private(j,k,pol_ind2_jk)        
do k = 1, nZ    
    do j = 1, nE
        ! assuming array pol_ind2 was only used here
        pol_ind2_jk = maxloc( tmp(:,k) - CT(:,j) )
 BB2(j,k) = tmp(pol_ind2_jk,k) - C(j,pol_ind2_jk) 
     end do
end do
        
BB = transpose(BB2)

Jim Dempsey