Thanks for the very quick

ignacio82 · ‎07-28-2015

I wrote this Fortran module and I need its subroutines to run as fast as possible. My hope is to get some advice to optimize this code, and after having the serial version running as fast as it can run I will use MPI to make it parallel.

Module smearingmodule
IMPLICIT NONE
contains
subroutine smearing(i, nMCd, nPeriodT, nPeriods, DF, theta, wm) bind(C, name="smearing_")
    use, intrinsic                         :: iso_c_binding, only : c_double, c_int
    integer(c_int), intent(in)                                 :: i, nMCd, nPeriods
    integer(c_int), intent(in), dimension(nPeriods)            :: nPeriodT
    real(c_double), intent(in), dimension(i,5)        :: DF
    real(c_double), intent(in), dimension(i,nMCd)     :: theta
    real(c_double), dimension(nMCd, nPeriods), intent(out)  :: wm
    integer                                             :: d

    do d=1,nMCd
        CALL subsmearing(d, i, nMCd, nPeriodT, nPeriods, DF, theta, wm(d,:))
    end do

end subroutine smearing


subroutine subsmearing(d, i, nMCd, nPeriodT, nPeriods, DF, theta, wm)
    integer, intent(in)                                 :: d, i, nMCd, nPeriods
    integer, intent(in), dimension(nPeriods)            :: nPeriodT
    double precision, intent(in), dimension(i,5)        :: DF
    double precision, intent(in), dimension(i,nMCd)     :: theta
    double precision, dimension(nPeriods), intent(out)  :: wm
    double precision, allocatable, dimension(:,:)       :: out
    double precision, dimension(i,8)                    :: bigDF
    double precision, dimension(i)                      :: epredC, epredT, diff, A1, A2
    double precision, dimension(i,i)                    :: B1, B2
    integer                                             :: j, Period, jj
    double precision                                    :: sumWeights

    diff = 0.0d0
    wm = 0.0d0
    epredC = exp(DF(:,4) + (theta(:,d) * DF(:,1)))
    epredT = exp(DF(:,4) - (theta(:,d) * (1-DF(:,1))))

    do j=1,i
        B1(:,j)=epredT(j)
        B2(:,j)=epredC(j)
    end do

    A1 = matmul(DF(:,5), B1)
    A2 = matmul(DF(:,5), B2)

    diff = (A1-A2)/DBLE(i)

    bigDF(:,1:5) = DF
    bigDF(:,6) = epredC
    bigDF(:,7) = epredT
    bigDF(:,8) = diff

    do jj =1, nPeriods
        ALLOCATE(out(nPeriodT(jj),7))
        CALL subsetPeriod(bigDF, i, 8, nPeriodT(jj), jj, out) !Filters data
        sumWeights = sum(out(:,2))
        do j=1, nPeriodT(jj)
            wm(jj) = wm(jj) + (out(j,7)*out(j,2)/sumWeights)
        end do
        DEALLOCATE(out)
    end do

end subroutine subsmearing

subroutine subsetPeriod(A, rowA, colA, rowB, Period, B)
    double precision, dimension(rowA, colA), intent(in)    :: A
    integer, intent(in)                                    :: rowA, colA, rowB, Period
    double precision, dimension(rowB,colA-1), intent(out)  :: B
    integer                                                :: i, pos

    pos = 1
    do i = 1, size(A,1)
        if(A(i,2)==Period)then
            B(pos,1) = A(i,1)
            B(pos,2:) = A(i,3:)
            pos = pos+1
        end if
    end do
end subroutine subsetPeriod

end module smearingmodule

Is there anything in my code that could be done in a more efficient way?

Thanks a lot!

TimP · ‎07-28-2015

I swapped source lines 66 and 67 so that ifort could compile. It was a hassle to get that far.

Could you say which are the hot spots of interest? Do you agree with the compiler that matrices are 12x12 maximum? Normally, at -O3, matmul would be implemented by MKL calls, but since it is seen as small, it is optimized with single thread in-line code, with an estimated AVX2 vector speedup of 3.

It seems contradictory that memset and memcpy substitution would be chosen if the loops are so short, and that the compiler says those need improved alignment, but perhaps that doesn't matter. VTune ought to be able to show you. You can use directives such as !dir$ simd to prevent the substitution of those library functions.

Many minor optimizations, including important assignment fusions, appear to depend on -align array32byte. Apparently, the way the matrix is declared, it can't achieve alignments in matmul.

If the data set is so small, it may be difficult to make MPI pay off, even on the assumption that there are several times as many 12x12 matrices to compute in parallel as number of cores. You may need to bunch them up if you have hundreds to perform in parallel.

The loops which are unrolled by 8 might do better with a smaller unrolling, if their length isn't several hundred.

The compiler can't fix up the loop nesting in subsetPeriod to achieve vectorization. Perhaps a rewrite in a form eligible for PACK might help, even though there is not yet any direct Xeon instruction support for PACK (as there is on Intel(r) Xeon Phi).

ignacio82 · ‎07-28-2015

Thanks for the very quick response Tim!

DF is going to be millions by 5, theta millions by 4000. That is why I want to optimize the code and make it parallel.

What is PACK? how would I have to rewrite subsetPeriod? You think that is the only change that I can do to the code to gain speed?

Thanks a lot!

TimP · ‎07-28-2015

If you find the time spent in that procedure is large, you might try rewriting with Fortran pack intrinsic or otherwise swapping inner and outer loops in hope of vectorizing . I can't see it well on this phone.

jimdempseyatthecove · ‎07-28-2015

See if you can exchange the indexes of wm to (nPeriods, nMCd)

This will require a change in the calling C program as well.

Also change the indexing of B1, B2 is subsmearing

And B in subsetPeriod.

The current indexing order results in strided load/store as well as a temporary for wm.

Reorganizing the indexing will produce stride-1 which will increase the L1 cache hit ratio.

Be careful as to the effect on the results from your matmul operations.

Jim Dempsey

ignacio82 · ‎07-28-2015

Thanks for the advice Jim. I have many questions.

You are recommending changing

subroutine smearing(i, nMCd, nPeriodT, nPeriods, DF, theta, wm) bind(C, name="smearing_")
use, intrinsic                         :: iso_c_binding, only : c_double, c_int
integer(c_int), intent(in)                                 :: i, nMCd, nPeriods
integer(c_int), intent(in), dimension(nPeriods)            :: nPeriodT
real(c_double), intent(in), dimension(i,5)        :: DF
real(c_double), intent(in), dimension(i,nMCd)     :: theta
real(c_double), dimension(nMCd, nPeriods), intent(out)  :: wm
integer                                             :: d

do d=1,nMCd
CALL subsmearing(d, i, nMCd, nPeriodT, nPeriods, DF, theta, wm(d,:))
end do

end subroutine smearing

to

subroutine smearing(i, nMCd, nPeriodT, nPeriods, DF, theta, wm) bind(C, name="smearing_")
use, intrinsic                         :: iso_c_binding, only : c_double, c_int
integer(c_int), intent(in)                                 :: i, nMCd, nPeriods
integer(c_int), intent(in), dimension(nPeriods)            :: nPeriodT
real(c_double), intent(in), dimension(i,5)        :: DF
real(c_double), intent(in), dimension(i,nMCd)     :: theta
real(c_double), dimension(nPeriods, nMCd), intent(out)  :: wm
integer                                             :: d

do d=1,nMCd
CALL subsmearing(d, i, nMCd, nPeriodT, nPeriods, DF, theta, wm(:,d))
end do

end subroutine smearing

This seems like a very simple change, am I missing something? If I am not, can you explain why this should make a difference? I will ask you the other questions after I understand this one.

Thanks a lot!

jimdempseyatthecove · ‎07-29-2015

In Fortran

a(i,j,k) = b(i,j,k)

The left most index (for contiguous arrays) as it increments, points to the next location in RAM. Thus b(i,j,k) and b(i+1,j,k) are statistically likely to reside in the same cache line.

The change of the indexing of wm on the outer call and use inside the routine called is not the potential for problems. It is in the other recommendations I made for improving the performance by changing the index order of the some of the other variables (A and B). Swapping the order usually is not a problem except for when these arrays are use with intrinsic function (matmul) where you cannot swap the internal representation of the indices. In this case you may find the result being a transposition of what you want.

Experiment with the wm, if satisfied, incrementally move on to the other variables (testing results as you go).

Jim Dempsey

Serial code optimization advice?