Intel® Fortran Compiler

matrix multiplication

roddur
Beginner
In the code shown below (somehow I can't "quote" the code):
do j=1,lorbit
   do l=1,lorbit
      sumk=0.0d0
      do k=1,lorbit
         sumk=sumk+srl(j,k,i)*ap5(k,l)
      enddo
      ap61(j,l)=sumk
   enddo
enddo
lorbit varies from 9 to 17; i runs to more than 70000 but, as you can see, it is not involved in these loops.
I know this way of doing matrix multiplication is very slow, but given the size of my arrays does it really matter? If it matters, what is my best bet: LAPACK or MATMUL?
6 Replies
bubin
New Contributor I
Quoting - roddur

I think what might speed up your calculations a little is to copy t(1:lorbit,1:lorbit)=srl(1:lorbit,1:lorbit,i) and then use that temporary (two-dimensional) array t in the matrix multiplication. If you call an external matrix-matrix multiplication routine you will actually need to do such copying anyway.
For an external matrix-matrix multiplication routine your best bet is a BLAS tuned for your machine architecture. However, since the matrices are rather small, you had better test what actually gives the best performance. If you don't use an external multiplication routine, you can try forcing loop unrolling when compiling this particular fragment of code (or source file).
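A rough sketch of what such a DGEMM call might look like (my illustration only, not roddur's code -- the wrapper routine, the dummy-argument declarations, and the assumption that ap5 and ap61 have leading dimension lorbit are all mine):

subroutine mult_with_dgemm(srl, ap5, ap61, lorbit, i, imax)
   ! Copy the i-th slice of srl into a contiguous 2-D temporary,
   ! then let BLAS DGEMM compute ap61 = t * ap5.
   implicit none
   integer, intent(in)           :: lorbit, i, imax
   double precision, intent(in)  :: srl(lorbit,lorbit,imax), ap5(lorbit,lorbit)
   double precision, intent(out) :: ap61(lorbit,lorbit)
   double precision :: t(lorbit,lorbit)

   t = srl(:,:,i)                       ! the temporary copy mentioned above

   ! C = alpha*A*B + beta*C with alpha=1, beta=0, no transposes
   call dgemm('N', 'N', lorbit, lorbit, lorbit, 1.0d0, t, lorbit, &
              ap5, lorbit, 0.0d0, ap61, lorbit)
end subroutine mult_with_dgemm

If the real arrays are declared with larger bounds than lorbit, the leading-dimension arguments (LDA, LDB, LDC) must be the declared sizes rather than lorbit.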
TimP
Honored Contributor III
If the OP is in doubt about whether the last few percent of performance matters, why not start with MATMUL? The compiler would automatically create any temporaries that have a chance of being useful and would vectorize the operations. Given the quoted sizes, MATMUL or the original code at -O3 would likely be faster than a tuned BLAS. The main point in avoiding MATMUL would be to avoid creating temporaries, which appears possible to accomplish with BLAS.
Attempting matmul by eye (untested):
ap61(:lorbit,:lorbit) = matmul(srl(:lorbit,:lorbit,i), ap5(:lorbit,:lorbit))
The short answer is that we can't say whether it matters; you would probably have to test it, even if you gave more information.
Writing the original code with dot_product might improve readability without necessarily affecting the generated code. You may as well dictate a more suitable loop nesting, so as not to depend on -O3, although it may not matter at such small sizes:
do l=1,lorbit
   do j=1,lorbit
      ap61(j,l)=dot_product(srl(j,:lorbit,i),ap5(:lorbit,l))
   enddo
enddo
bubin
New Contributor I
I suspect that using dot_product may not be the best way to do matrix multiplication when lorbit is small, as there will be a significant overhead penalty (dot_product has to be evaluated lorbit**2 times, once per element of the result, rather than once as with MATMUL or a similar BLAS routine). But again, only actual testing can give the answer.
TimP
Honored Contributor III
Quoting - bubin
dot_product generates inline code in all satisfactory compilers. If you are using an obsolete gfortran or g95 where that doesn't happen, please upgrade.
I assumed this thread was about ifort, where both dot_product and matmul generate inline code. Thus, -O3 could have a significant effect.
The quoted problem size is small enough that any external call such as BLAS would be expected to reduce performance. If you used the external-blas option in gfortran, you would still be under the size threshold for that choice to be taken at run time, and gfortran would use its own matmul library function, possibly not as fast as inline code.
bubin
New Contributor I
tim18,

First of all, I would not call gfortran an unsatisfactory compiler. It is an excellent product (if one disregards the numerous bugs in its very first versions, like 4.0-4.1), and it simply cannot be called obsolete because not that much time has passed since it was first included in the GNU Compiler Collection. As far as I am concerned, I am using a relatively recent version of it (4.3.2). In fact, gfortran has some nice features that ifort, unfortunately, does not offer. I have seen many cases where it even generates faster code, although, in general, I would give ifort the edge here. Anyway, I also assumed this thread was about ifort.

I actually went ahead and did some rough tests on my desktop computer (for some reason this became interesting to me). It is by no means scientific testing, so I am not claiming it is extremely accurate; after all, hardware differences will probably cause some variation in the results. I tried five different ways of doing the multiplication. lorbit was declared as a parameter, i_max was 70000, and all arrays were static. I tested the cases lorbit=9, lorbit=12, and lorbit=17; timings are given below. I ran every case four times and took the best time. I only tried the -O2 and -O3 flags (I did not want to spend an entire day playing with it). The BLAS DGEMM was taken from the Intel MKL library; it is obviously optimized for x86, but certainly not fine-tuned for my particular machine (Core 2 Duo 6700, 2.66 GHz). The three numbers correspond to the best timings for lorbit=9, 12, and 17 respectively. I used ifort 11.0 and gfortran 4.3.2 to build the binaries.
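A minimal sketch of the kind of timing harness described above, for the "original code" variant (a reconstruction for illustration only -- the initialization and exact structure are assumptions, not the code that produced the numbers below; the other variants simply replace the inner block):

program mm_bench
   implicit none
   integer, parameter :: lorbit = 9, i_max = 70000
   double precision, save :: srl(lorbit,lorbit,i_max)            ! static arrays
   double precision, save :: ap5(lorbit,lorbit), ap61(lorbit,lorbit)
   double precision :: sumk, t0, t1
   integer :: i, j, k, l

   call random_number(srl)
   call random_number(ap5)

   call cpu_time(t0)
   do i = 1, i_max
      ! original triple loop; the precopy, dot_product, matmul and
      ! DGEMM variants replace this block
      do j = 1, lorbit
         do l = 1, lorbit
            sumk = 0.0d0
            do k = 1, lorbit
               sumk = sumk + srl(j,k,i)*ap5(k,l)
            enddo
            ap61(j,l) = sumk
         enddo
      enddo
   enddo
   call cpu_time(t1)

   print *, ap61(1,1)                    ! keep the result live
   print *, 'lorbit =', lorbit, '  time (s) =', t1 - t0
end program mm_bench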

original code
ifort -O2 4.01 11.00 40.35
ifort -O3 4.01 11.02 36.79
gfortran -O2 8.13 18.46 48.13
gfortran -O3 3.54 9.10 28.58

original code + precopying t=srl(:,:,i)
ifort -O2 3.97 10.47 33.85
ifort -O3 3.98 10.46 33.84
gfortran -O2 11.84 17.13 49.96
gfortran -O3 3.63 9.12 28.74

dot_product code
ifort -O2 4.08 10.73 32.83
ifort -O3 4.08 10.74 33.29
gfortran -O2 8.32 19.34 50.90
gfortran -O3 4.84 10.77 35.01

matmul code
ifort -O2 5.50 8.42 29.87
ifort -O3 5.68 7.42 30.64
gfortran -O2 8.57 ??.? 42.72
gfortran -O3 9.34 ??.? 49.76

BLAS DGEMM code + precopying t=srl(:,:,i)
ifort -O2 8.32 8.48 25.86
ifort -O3 8.44 8.52 26.98

??.? - there was some glitch -- I am not sure what's going on there

Looking at these data, one can conclude that it does make sense to use BLAS for lorbit=17, as it may provide a 20-30% performance improvement. For the smallest lorbit value, the fastest way is the original code with the additional copy t=srl(:,:,i) that I mentioned in a post above. Even though you claim that dot_product is inlined when compiled with ifort -O3, it is still not the fastest way to do the multiplication. It should also be mentioned that the differences between many of the timings are marginal and sometimes even smaller than the statistical error.

Sergiy

P.S. The "obsolete" gfortran is actually overperforming ifort a little when the original code (do loops) are used.
TimP
Honored Contributor III
Thanks for showing the results.
I mentioned gfortran on account of its built-in feature for choosing between its own matmul and BLAS with a compile option to set the size for switching over. You verified the expectation that BLAS would not be efficient for the smallest cases, but it looks like gfortran's own library function doesn't improve on it. If I recall correctly, there is only one case vectorized in gfortran matmul.
If dot_product weren't inlined, it would have taken much longer than the other versions you tested. You could easily check for inlining, for example by looking at the symbol table of your object file. It looks to me as though it didn't perform much differently from your original code, which is the expectation I intended to convey.
It may not be surprising if the ifort matmul is favoring multiples of 4; that would be moderately good news, as ifort vectorization often favors only multiples of 8 with aligned data.