Improve Speed of Looped Matrix Muliplication

ScottBoyce · ‎03-21-2013

I have two matrices A(100K,100K) and B(100K,50) and I want to get C=[B^TAB]

A is a matrix from finite differences, but is stored in a vector to exploit its sparsity. It has two referance variables, S which refers to the location in A where the first value of row I is stored. So the range of values in A for ROW I would be S(I) to S(I+1)-1

The second variable is IDX which says the column location of a value in the vector A. For example S(I)+1 would be the second value stored in noncompressed Row I and IDX(S(I)+1) its column.

Below is my current code where I multiply AB first through loops and then multiply B^TAB through the MKL/BLAS.

[fortran]DO J=1,COL
DO I=1,ROW
    DO IJ= S(I), S(I+1)-1
        AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)
    END DO
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]

Any suggestions how I can improve this code/speed it up would be greatly appreciated. Right now it is a major choke point in a program I have been working on for the past six months.

Thanks in advance!

John_Campbell · ‎03-22-2013

You could try:
DO I=1,ROW
   DO IJ= S(I), S(I+1)-1
    DO J=1,COL
      AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)
    END DO
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)

or the inner loop as

k = idx(IJ)
AB(I,1:ROW)= AB(I,1:ROW) + A(IJ)*B(k,1:ROW)

Alternatively, ignoring the compression, you might be able to work from the attached approach to compute BAB, by having a temporary copy of AB. I'm not sure of IDX in your example above.
Is A and BAB symmetric ?

John

SergeyKostrov · ‎03-22-2013

Scott, I'd like to make a couple of comments. Regarding the 1st part of your calculations, that is multiplication of matricies A and B. Since sizes are big you need to consider parallelization of your processing. Also, from your initial post it is not clear how long does it take to calculate the 1st part and then the 2nd part? It would be nice to give us at least some numbers. Then, did you consider a dense layout for matricies A and B? Take into account that Fortran's MATMUL function does calculations in 8x - 9x times faster on a system with 4 CPUs ( 8 cores ) in that case and /Qparallel and Parallel version of MKL need to be used. Another thing is how much physical and virtual memory do you have on your computer? When all matricies are stored in physical memory only and virtual memory is not used matrix multiplication is CPU bound. As soon as at least some amount of virtual memory is used then matrix multiplication is virtual memory bound and performance affected significantly. For example, I see that my HDD LED never stops blinking when I multiply two dense 64K x 64K matricies on a system with 32GB of physical and 128GB of virtual memory.

TimP · ‎03-22-2013

If the first quoted loop doesn't optimize (what do opt-report or equivalent options for your compiler say?) it would be worth while to put it in the form required for ifort simd reduction. The "vectorized" reduction would be accomplished by "simulated gather." OpenMP parallel should be useful (as Sergey said). Software prefetch on the B array may help, either explicitly prefetching elements for future loop iterations, or looking for compiler automatic software prefetch facility.

I assume you're using MKL parallel and setting affinity if on a multiple CPU platform.

jimdempseyatthecove · ‎03-22-2013

>>I have two matrices A(100K,100K).
>>AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)

A(IJ) has one dimension, is this a type-o or is your comment wrong?

Are IDX(IJ), IDX(IJ+1), IDX(IJ+2), ... sequential numbers?
If yes, then lift IDX(S(I)) outside the DO IJ loop.
If no, then consider performing a gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.

Jim Dempsey

ScottBoyce · ‎03-22-2013

A is not symettric/positive definite and is a sparce square matrix in compressed storage. So in the code its actually a rank 1 (vector) whose index is refered to as IJ. To prevent confusion, I will call the actual square matrix A_s and the compress stored version A. A_s holds by row the nonzero elements of A_s. The location in A of where a new row is stored is in S. For example the elements in row 10 would have the first value stored at S(10) and all the values for row 10 would be A( S(10) ), A(S(10) + 1), A(S(10) + 2),..., A(S(11) - 1). The range from S(10) to S(10)-1 are sequential. IDX is a referance to the column of A_s that an element in A refers to. For example the second value in row 10 would be located at column IDX(S(10) + 1) of A_s.

IDX is not sequential nor in any specific order, it can be 75, 10000, 50, 753, ... and so forth. What is:

gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.

and I am not familar with automatic "software prefetch facility." I am using MKL parallel, but unsure how to set the affinity (thought it was set to max # of cores by default). Are you suggestioning something like:

[fortran]DO J=1,COL

DO I=1,ROW

PF= B(IDX(S(I):S(I+1)-1),J)

AB(I,J)= MATMUL( A(S(I):S(I+1)-1)), PF )

END DO

CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]

My current code uses the array AB as a temporary storage to multiply out the compressed version. Then I use MKL to do B^T*(AB)

The time it current takes to solve AB on an i7-2600K with 16GB ram is 0.33 s and B^T(AB) is 0.06 s. The reason this is a choke point is that B^TAB has to be computed between 10,000 and 100,000 times during the run of the program. This process only resides in physiscal memory, so it is CPU bound.

I am not familar with opt-report or any of the other optimization software. I am just getting into the optimization side of code. I do have Intel Fortran with Visual Studio. I did try converting the I and J outer loops with a single DO CONCURRENT loop, but that resulted in an increase of time to 1 s.

TimP · ‎03-22-2013

The most basic option to suggest to the compiler the automatic addition of software prefetch is /Qopt-prefetch. mm_prefetch intrinsics and the like are in the ifort help file.

MKL chooses by default to place 1 thread on each core, regardless of whether HyperThreading is enabled. You would want the same effect in your code if you add -parallel or OpenMP. You still didn't say whether you are using a multiple CPU platform where dealing with KMP_AFFINITY is more important.

ScottBoyce · ‎03-22-2013

It is just a single CPU system i7-2600K. The final version of the code will be an executable that will run on personal highend desktops with i7's. Also this is one part of an enormous legacy code, so would it impact the rest of the code if I only compile my section with the /Qopt-prefetch and -parallel.

This is my current compile option for the entire program:

/nologo /O3 /QaxAVX /QxAVX /heap-arrays1024 /arch:AVX /real_size:64 /fp:precise /module:"Release_x64\\" /object:"Release_x64\\" /Fd"Release_x64\vc110.pdb" /libs:static /threads /Qmkl:parallel /c

SergeyKostrov · ‎03-22-2013

>>.../fp:precise... If you're not worried about accuracy ( to some degree ) then two floating point options for faster operations could be used, that is /fp:fast and /fp:fast=2.

jimdempseyatthecove · ‎03-22-2013

>>gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.

The above is annotation for forum message. Your

PF= B(IDX(S(I):S(I+1)-1),J)

Is performing a gather as a single statement as opposed to a DO loop as indicated by my post. (iow attaining the same goal of creating a rank 1 array prior to multiplication. The purpose of which is to improve vectorization. The MATMUL, using MKL is likely not inlined by IPO, therefore the gather would likely not be performed in your original code, but will be performed in your revised code. Your code makes better use of language features (re: array sections).

On large PF matricies, you may find it benificial to breakup the load of PF into smaller pieces, such that each piece fits nicely into L1.

Jim Dempsey

ScottBoyce · ‎03-22-2013

PF will actually never be larger than 10, so I will give that a try and see if there is any improvement. Should I use the MATMUL or the MKL/BLAS DDOT() to mulitiply out A(S(I):S(I+1)-1)) and PF?

Also would there be any benifit to prefetching A(S(I):S(I+1)-1)), even though S(I):S(I+1)-1 is sequential?

SergeyKostrov · ‎03-22-2013

>>...Also would there be any benifit to prefetching... You could try the following compiler options: /opt-prefetch[:n] enable levels of prefetch insertion, where 0 disables. n may be 0 through 4 inclusive. Default is 2. /opt-prefetch- disable(DEFAULT) prefetch insertion. Equivalent to /Qopt-prefetch:0

ScottBoyce · ‎03-22-2013

Thanks, I implemented that code and tested all the prefetch options but all the cases actually resulted in slower code. It went from an average of 0.3 s to about 0.55 s.

John_Campbell · ‎03-22-2013

Attached is an option which does not store AB.
It assumes that B is not a sparse matrix, which might be sonething that could be utilised.
I have assumed storage orientation for A and B which might not be correct.
You could compile with /Qparallel to see if you gain from that.

John

ScottBoyce · ‎03-22-2013

Thanks, I will take a look at that and see if it improves the code I have. Right now the best times I have been able to get have been with prefetching and changing the looping. So here is the new code I am using now:

[fortran]DO I=1,ROW
    LB=S(I)
    UB=S(I+1)-1
    U =S(I+1)-S(I)
    PF(1:U)=A(LB:UB)               !PREFETCH VECTOR ON A
DO J=1,COL
    AB(I,J)= SUM( PF(1:U)*B(IDX(LB:UB),J) )
END DO
END DO

CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]

This does not seem intuitive to me why it would be faster compared to prefetching B(IDX(LB:UB), but this reduced the CPU time from 0.33 s to 0.24 s.

Any suggestions for improving upon this would be greatly appreciated.

John_Campbell · ‎03-22-2013

Scott,

I don't like "B(IDX(LB:UB),J)". I'd be interested to find out how well it works.

Also it would be good to know the denstiy of the matrices. You could try something like:

call Report_Density ( 'Ae', Ae, s(row+1)-1 )
call Report_Density ( 'B', B, row*col )
call Report_Density ( 'BAB', BAB, col*col )

subroutine Report_Density ( name, a, num )
character name*(*)
real*8 a(*)
integer*4 num, i, n
!
n = 0
do i = 1,num
if (a(i)/=0) n = n+1
end do
write (*,*) 'Density for ',name,' =', real(n)/real(max(num,1))*100.,' %'
end

ScottBoyce · ‎03-22-2013

I do not like it either, but it is producing the fastest CPU times.The code presented in my first post ran in 0.33 s, the updated version runs in 0.24 s. These are averaged times over 1000 calls. I use the following commands to time my code:

[fortran]CALL SYSTEM_CLOCK(COUNT_RATE=ClockRate)

CALL SYSTEM_CLOCK(COUNT=START1)

'The code being tested'

CALL SYSTEM_CLOCK(COUNT=FINISH1)

TIME=DBLE(FINISH1-START1)/DBLE(ClockRate)

[/fortran]

For the densities, the compressed stored matrix A has no zero values by construction. The matrix B returned 100%, so there are no zero values within it. This is because B is an orthnormal matrix (ie B^TB=I, but BB^T/=I).

TimP · ‎03-23-2013

If you spelled your option /heap-arrays:1024 you should remember that this switches only arrays of a fixed larger size known at compile time to heap. Remember the comment Steve Lionel made yesterday that the numeric option has no useful effect.

Would you be able to use /assume:protect_parens /Qprec-div /Qprec-sqrt instead of /fp:precise?

As Sergey hinted, your /fp: option instructs the compiler not to optimize the SUM intrinsic. You would need the simd reduction if you wish to over-ride and optimize just this construct. It would be clearer to me if you would use dot_product, but the optimization may be the same; neither SUM nor dot_product allow for optimization as an over-ride of /fp:precise.

I'm not sure I understand your objection to parallelizing this loop (by OpenMP if not /Qparallel) when you are already linking MKL parallel, although, if you intend to run only on single CPU platforms, you may be able to saturate memory bandwidth without parallelizing it. For parallelization, you would keep the J loop as outer as you presented it originally.

SergeyKostrov · ‎03-23-2013

>>Thanks, I implemented that code and tested all the prefetch options but all the cases actually resulted in slower code. It went >>from an average of 0.3 s to about 0.55 s. Could you provide some technical details for your system? Thanks in advance. I recently detected a prefetch related issue on Ivy Bridge and I have Not seen even small performance improvements when prefetch was used. I suspect this is due to larger sizes of L3, L2 and L1 cache lines.

ScottBoyce · ‎03-23-2013

Sandy Bridge i7-2600K with 16GB ram, I do not have any performance improvement when I turn on or off prefetching, but when I include that addition vector PF for matrix A, speeds up the code a little bit.

John_Campbell · ‎03-23-2013

Scott,

The following change might improve performance for the example I provided if there are a significant number of empty rows in A

do k = 1,row
if (s(k)==s(k+1)) cycle

This would retain sparsity information for the B'.AB calculation.
{ Special cases of S(k+1)==S(k) (empty row)
or S(k+1)==S(k)+1 (single coefficient) could be coded for BAB calculation if this approach shows promise. }

Also, as B is so dense, the zero test for BAB accumulation probably hinders /Qparallel.

John