I have two matrices, A (100K x 100K) and B (100K x 50), and I want to compute C = B'AB.
A is a matrix from finite differences, but it is stored as a vector to exploit its sparsity. It has two reference arrays. The first, S, gives the location in A where the first value of row I is stored, so the values of row I occupy A(S(I)) through A(S(I+1)-1).
The second array, IDX, gives the column of each value stored in the vector A. For example, A(S(I)+1) is the second value stored in uncompressed row I, and IDX(S(I)+1) is its column.
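For concreteness, here is a toy 3x3 example of this storage scheme (the values are invented for illustration, not from my actual matrix):
[fortran]
! Toy 3x3 matrix stored in the scheme described above:
!
!     | 4 0 1 |
! M = | 0 3 0 |
!     | 2 0 5 |
!
! A   = (/ 4D0, 1D0, 3D0, 2D0, 5D0 /)   ! nonzeros, row by row
! IDX = (/ 1,   3,   2,   1,   3   /)   ! column of each stored value
! S   = (/ 1,   3,   4,   6 /)          ! row I occupies A(S(I):S(I+1)-1)
[/fortran]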
Below is my current code where I multiply AB first through loops and then multiply BTAB through the MKL/BLAS.
[fortran]! AB is assumed to be zero-initialized before this loop nest
DO J = 1, COL
  DO I = 1, ROW
    DO IJ = S(I), S(I+1)-1
      AB(I,J) = AB(I,J) + A(IJ)*B(IDX(IJ),J)
    END DO
  END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]
Any suggestions on how to improve this code or speed it up would be greatly appreciated. Right now it is a major choke point in a program I have been working on for the past six months.
Thanks in advance!
Try:
[fortran]
! PF must be declared large enough to hold the longest row of A
!$OMP PARALLEL DO PRIVATE(PF, LB, UB, U, J)
DO I = 1, ROW
  LB = S(I)
  UB = S(I+1) - 1
  U  = S(I+1) - S(I)
  PF(1:U) = A(LB:UB)   ! prefetch row I of A into a contiguous buffer
  DO J = 1, COL
    AB(I,J) = SUM( PF(1:U)*B(IDX(LB:UB),J) )
  END DO
END DO
!$OMP END PARALLEL DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
[/fortran]
Jim Dempsey
Scott,
Storing B as B'(col,row) could improve the memory access pattern and the solution time. Attached is a derivation of a solution which uses array sections for the inner loop, where the inputs are A(*) and Bt(col,row).
This would suit AVX instructions.
The attached file shows the development of subroutine Generate_BABt (BtAB, Bt, Ae, row, col, S, idx), which might provide a fast solution using AVX instructions. (You might experiment with making col*8 a multiple of 64 to improve AVX alignment, although I have struggled to control this possible improvement.)
I think one of the problems with the other solutions is that B(row,col) addresses a large memory footprint, while addressing columns of B'(col,row) might overcome this (assuming row >> col).
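A minimal sketch of the transposed-storage idea, assuming Bt(col,row) has already been filled with the transpose of B (array names are my own, not from the attachment):
[fortran]
! Sketch: accumulate (AB)' in ABt(col,row), so each gathered slice
! Bt(1:COL,IDX(IJ)) is a contiguous column rather than a strided row of B.
DO I = 1, ROW
  ABt(1:COL,I) = 0D0
  DO IJ = S(I), S(I+1)-1
    ABt(1:COL,I) = ABt(1:COL,I) + A(IJ)*Bt(1:COL,IDX(IJ))
  END DO
END DO
[/fortran]
The final C = B'AB could then be formed from the transposed operands, e.g. with CALL DGEMM('N','T',COL,COL,ROW,1D0,Bt,COL,ABt,COL,0D0,C,COL).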
John
Sergey,
This is an area I have experimented with, and I have not found a definite improvement when adjusting the alignment. My understanding is that 32 bytes (256 bits) is the size of an AVX register. Either my testing has been flawed, or there simply is no significant improvement to be had.
I'll give it another try.
32 bytes implies col should be a multiple of 4 for real*8. If you try a multiple of 2 or 3 and get a different run time with AVX instructions, then that would show the alignment is worth fixing.
Using col=50 rather than row=100,000 as the first dimension of B' should have a much more significant effect. (Perhaps try col=52 and see if it is better than 50?)
I actually do not know why this is a limitation; surely with the L1 and L2 caches this could be handled without the need to align the memory.
John
32-byte alignment may be useful even when using only 128-bit parallel instructions. In the present case, with the indirect access, it would not be surprising to see little effect of alignment. In a packed matrix, to preserve alignment, the leading dimension would need to be a multiple of the alignment size, but then the larger dimensions might need to avoid even sizes which could produce cache mapping conflicts.
There seems to have been confusion in this thread as to an assumption that copying an array section before using it would take the place of optimizing prefetch, and perhaps about the performance loss associated with making unnecessary copies of data. Current compilers have "simulated gather" facilities to be capable of optimizing a dot product with a gather operand, if that optimization is not suppressed by options such as /fp:precise. If you turned on the vectorization report, successful optimization would be reported as VECTORIZED.
The definitions of /fp:precise and the like aren't spelled out fully in the documents. While ifort makes no distinction between /fp:precise and /fp:source, for icl "precise" refers to improved support of exceptions, so "source" is the one which has similar meaning between C and Fortran. /fp:source includes the effects of /assume:protect_parens /Qprec-div /Qprec-sqrt /Qftz- in addition to suppressing optimization of sum reduction (unless over-ridden by !dir$ simd reduction).
A reason for suppressing optimization of sum reduction is that the optimization by vectorization gives rise to slight differences in results with alignment (which could be avoided by aligning operands). The other standards compliance effects of /fp:precise are potentially more important.
The second- and third-generation Core i7 CPUs (mentioned above) should support the ftz- setting without loss of performance.
Tim,
Do you have any comment on the use of B(IDX(LB:UB),J) in "AB(I,J)= SUM( PF(1:U)*B(IDX(LB:UB),J) )" ?
Does the use of IDX in this way improve the performance, or would it be better to, say, use:
PF(1:U) = B(IDX(LB:UB),j)
AB(I,J) = SUM ( A(LB:UB)*PF(1:U) )
Or, should "AB(I,J) = SUM( A(LB:UB)*B(IDX(LB:UB),J) )" perform just as well ?
Would there be a prefetch to assemble the elements of B?
I presume the use of array sections improves the AVX performance in comparison to the F77 coding of:
x = 0
DO ij = S(I), S(I+1)-1
  k = idx(ij)
  x = x + A(ij)*B(k,j)
END DO
AB(I,J) = x
I'd appreciate your comments on the merits of the different possible approaches.
John
If the compiler doesn't optimize away your extra copy of the array section, and if you optimize the dot product in both cases, the extra copy will cost performance.
I think you may still be insisting on interpreting prefetch as making a separate copy of the array section or as a synonym for gather.
Indirect prefetch doesn't happen automatically. If you look it up, e.g. with search-engine terms such as indirect prefetch site:software.intel.com, you will see examples using the prefetch intrinsic for both C and Fortran; the latter is possibly intended for Intel(r) Xeon Phi(tm) but can be tried on a Xeon host. Additional directive options may come out in future compilers.
You really do need to check what opt-report says about the optimization of your source code. The F77 code should work fine and gives you the additional option of applying !dir$ simd reduction(+:x) to promote optimization with simulated gather as an override of your /fp:precise. This may give you an advantage with AVX compilation. Your MKL call will switch automatically into AVX by default.
F77 doesn't require you to introduce the local variable k as F66 did. Not that ifort has many remaining preferences for F77, outside of cases where ifort array assignments introduce extra array copies, which you seem to like anyway.
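For reference, applying the directive mentioned above to the F77-style inner loop might look like this (my sketch; check opt-report to confirm the loop actually vectorizes):
[fortran]
x = 0D0
!DIR$ SIMD REDUCTION(+:x)      ! allow vectorized sum reduction despite /fp:precise
DO ij = S(I), S(I+1)-1
  x = x + A(ij)*B(idx(ij),j)   ! indirect (gather) access to B
END DO
AB(I,J) = x
[/fortran]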