- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have two matrices A(100K,100K) and B(100K,50) and I want to get C=[BTAB]
A is a matrix from finite differences, but is stored in a vector to exploit its sparsity. It has two referance variables, S which refers to the location in A where the first value of row I is stored. So the range of values in A for ROW I would be S(I) to S(I+1)-1
The second variable is IDX which says the column location of a value in the vector A. For example S(I)+1 would be the second value stored in noncompressed Row I and IDX(S(I)+1) its column.
Below is my current code where I multiply AB first through loops and then multiply BTAB through the MKL/BLAS.
[fortran]DO J=1,COL
DO I=1,ROW
DO IJ= S(I), S(I+1)-1
AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)
END DO
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]
Any suggestions how I can improve this code/speed it up would be greatly appreciated. Right now it is a major choke point in a program I have been working on for the past six months.
Thanks in advance!
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You could try:
DO I=1,ROW
DO IJ= S(I), S(I+1)-1
DO J=1,COL
AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)
END DO
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
or the inner loop as
k = idx(IJ)
AB(I,1:ROW)= AB(I,1:ROW) + A(IJ)*B(k,1:ROW)
Alternatively, ignoring the compression, you might be able to work from the attached approach to compute BAB, by having a temporary copy of AB. I'm not sure of IDX in your example above.
Is A and BAB symmetric ?
John
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If the first quoted loop doesn't optimize (what do opt-report or equivalent options for your compiler say?) it would be worth while to put it in the form required for ifort simd reduction. The "vectorized" reduction would be accomplished by "simulated gather." OpenMP parallel should be useful (as Sergey said). Software prefetch on the B array may help, either explicitly prefetching elements for future loop iterations, or looking for compiler automatic software prefetch facility.
I assume you're using MKL parallel and setting affinity if on a multiple CPU platform.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>I have two matrices A(100K,100K).
>>AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)
A(IJ) has one dimension, is this a type-o or is your comment wrong?
Are IDX(IJ), IDX(IJ+1), IDX(IJ+2), ... sequential numbers?
If yes, then lift IDX(S(I)) outside the DO IJ loop.
If no, then consider performing a gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A is not symettric/positive definite and is a sparce square matrix in compressed storage. So in the code its actually a rank 1 (vector) whose index is refered to as IJ. To prevent confusion, I will call the actual square matrix As and the compress stored version A. As holds by row the nonzero elements of As. The location in A of where a new row is stored is in S. For example the elements in row 10 would have the first value stored at S(10) and all the values for row 10 would be A( S(10) ), A(S(10) + 1), A(S(10) + 2),..., A(S(11) - 1). The range from S(10) to S(10)-1 are sequential. IDX is a referance to the column of As that an element in A refers to. For example the second value in row 10 would be located at column IDX(S(10) + 1) of As.
IDX is not sequential nor in any specific order, it can be 75, 10000, 50, 753, ... and so forth. What is:
gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.
and I am not familar with automatic "software prefetch facility." I am using MKL parallel, but unsure how to set the affinity (thought it was set to max # of cores by default). Are you suggestioning something like:
[fortran]DO J=1,COL
DO I=1,ROW
PF= B(IDX(S(I):S(I+1)-1),J)
AB(I,J)= MATMUL( A(S(I):S(I+1)-1)), PF )
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]
My current code uses the array AB as a temporary storage to multiply out the compressed version. Then I use MKL to do BT*(AB)
The time it current takes to solve AB on an i7-2600K with 16GB ram is 0.33 s and BT(AB) is 0.06 s. The reason this is a choke point is that BTAB has to be computed between 10,000 and 100,000 times during the run of the program. This process only resides in physiscal memory, so it is CPU bound.
I am not familar with opt-report or any of the other optimization software. I am just getting into the optimization side of code. I do have Intel Fortran with Visual Studio. I did try converting the I and J outer loops with a single DO CONCURRENT loop, but that resulted in an increase of time to 1 s.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The most basic option to suggest to the compiler the automatic addition of software prefetch is /Qopt-prefetch. mm_prefetch intrinsics and the like are in the ifort help file.
MKL chooses by default to place 1 thread on each core, regardless of whether HyperThreading is enabled. You would want the same effect in your code if you add -parallel or OpenMP. You still didn't say whether you are using a multiple CPU platform where dealing with KMP_AFFINITY is more important.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It is just a single CPU system i7-2600K. The final version of the code will be an executable that will run on personal highend desktops with i7's. Also this is one part of an enormous legacy code, so would it impact the rest of the code if I only compile my section with the /Qopt-prefetch and -parallel.
This is my current compile option for the entire program:
/nologo /O3 /QaxAVX /QxAVX /heap-arrays1024 /arch:AVX /real_size:64 /fp:precise /module:"Release_x64\\" /object:"Release_x64\\" /Fd"Release_x64\vc110.pdb" /libs:static /threads /Qmkl:parallel /c
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.
The above is annotation for forum message. Your
PF= B(IDX(S(I):S(I+1)-1),J)
Is performing a gather as a single statement as opposed to a DO loop as indicated by my post. (iow attaining the same goal of creating a rank 1 array prior to multiplication. The purpose of which is to improve vectorization. The MATMUL, using MKL is likely not inlined by IPO, therefore the gather would likely not be performed in your original code, but will be performed in your revised code. Your code makes better use of language features (re: array sections).
On large PF matricies, you may find it benificial to breakup the load of PF into smaller pieces, such that each piece fits nicely into L1.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
PF will actually never be larger than 10, so I will give that a try and see if there is any improvement. Should I use the MATMUL or the MKL/BLAS DDOT() to mulitiply out A(S(I):S(I+1)-1)) and PF?
Also would there be any benifit to prefetching A(S(I):S(I+1)-1)), even though S(I):S(I+1)-1 is sequential?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, I implemented that code and tested all the prefetch options but all the cases actually resulted in slower code. It went from an average of 0.3 s to about 0.55 s.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, I will take a look at that and see if it improves the code I have. Right now the best times I have been able to get have been with prefetching and changing the looping. So here is the new code I am using now:
[fortran]DO I=1,ROW
LB=S(I)
UB=S(I+1)-1
U =S(I+1)-S(I)
PF(1:U)=A(LB:UB) !PREFETCH VECTOR ON A
DO J=1,COL
AB(I,J)= SUM( PF(1:U)*B(IDX(LB:UB),J) )
END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)[/fortran]
This does not seem intuitive to me why it would be faster compared to prefetching B(IDX(LB:UB), but this reduced the CPU time from 0.33 s to 0.24 s.
Any suggestions for improving upon this would be greatly appreciated.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Scott,
I don't like "B(IDX(LB:UB),J)". I'd be interested to find out how well it works.
Also it would be good to know the denstiy of the matrices. You could try something like:
call Report_Density ( 'Ae', Ae, s(row+1)-1 )
call Report_Density ( 'B', B, row*col )
call Report_Density ( 'BAB', BAB, col*col )
subroutine Report_Density ( name, a, num )
character name*(*)
real*8 a(*)
integer*4 num, i, n
!
n = 0
do i = 1,num
if (a(i)/=0) n = n+1
end do
write (*,*) 'Density for ',name,' =', real(n)/real(max(num,1))*100.,' %'
end
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I do not like it either, but it is producing the fastest CPU times.The code presented in my first post ran in 0.33 s, the updated version runs in 0.24 s. These are averaged times over 1000 calls. I use the following commands to time my code:
[fortran]CALL SYSTEM_CLOCK(COUNT_RATE=ClockRate)
CALL SYSTEM_CLOCK(COUNT=START1)
'The code being tested'
CALL SYSTEM_CLOCK(COUNT=FINISH1)
TIME=DBLE(FINISH1-START1)/DBLE(ClockRate)
[/fortran]
For the densities, the compressed stored matrix A has no zero values by construction. The matrix B returned 100%, so there are no zero values within it. This is because B is an orthnormal matrix (ie BTB=I, but BBT/=I).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you spelled your option /heap-arrays:1024 you should remember that this switches only arrays of a fixed larger size known at compile time to heap. Remember the comment Steve Lionel made yesterday that the numeric option has no useful effect.
Would you be able to use /assume:protect_parens /Qprec-div /Qprec-sqrt instead of /fp:precise?
As Sergey hinted, your /fp: option instructs the compiler not to optimize the SUM intrinsic. You would need the simd reduction if you wish to over-ride and optimize just this construct. It would be clearer to me if you would use dot_product, but the optimization may be the same; neither SUM nor dot_product allow for optimization as an over-ride of /fp:precise.
I'm not sure I understand your objection to parallelizing this loop (by OpenMP if not /Qparallel) when you are already linking MKL parallel, although, if you intend to run only on single CPU platforms, you may be able to saturate memory bandwidth without parallelizing it. For parallelization, you would keep the J loop as outer as you presented it originally.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sandy Bridge i7-2600K with 16GB ram, I do not have any performance improvement when I turn on or off prefetching, but when I include that addition vector PF for matrix A, speeds up the code a little bit.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Scott,
The following change might improve performance for the example I provided if there are a significant number of empty rows in A
do k = 1,row
if (s(k)==s(k+1)) cycle
This would retain sparsity information for the B'.AB calculation.
{ Special cases of S(k+1)==S(k) (empty row)
or S(k+1)==S(k)+1 (single coefficient) could be coded for BAB calculation if this approach shows promise. }
Also, as B is so dense, the zero test for BAB accumulation probably hinders /Qparallel.
John

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page