Hi,
I have a function that computes the matrix product A^T B, which I have placed in a MODULE. When doing some simple timing (using CPU_TIME) I found that making the function internal is about twice as fast as calling the MODULE procedure. Is there a simple explanation for this?
Also, I found that MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer (which is why I wrote my own in the first place). Is there a reason for not optimizing away the transposition in that case?
PS: the matrices I used for the timing were 600x600 in double precision
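(The function itself wasn't posted; the following is only a minimal sketch of what such an internal A^T B routine might look like — the name and exact loop order are assumptions:)

```fortran
! Hypothetical sketch of the kind of function described above: C = A**T * B.
! Because C(i,j) = sum_k A(k,i)*B(k,j), the inner loop runs down the
! columns of both A and B, i.e. stride-1 in Fortran's column-major
! storage -- which is why a hand-written loop can beat
! MATMUL(TRANSPOSE(A),B) when the transpose is not optimized away.
function atb(a, b) result(c)
  real(8), intent(in) :: a(:,:), b(:,:)
  real(8) :: c(size(a,2), size(b,2))
  integer :: i, j, k
  do j = 1, size(b, 2)
     do i = 1, size(a, 2)
        c(i, j) = 0.0d0
        do k = 1, size(a, 1)
           c(i, j) = c(i, j) + a(k, i) * b(k, j)
        end do
     end do
  end do
end function atb
```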
Interprocedural optimization happens by default for internal procedures. You might try turning on whole-program interprocedural optimization, which will likely erase the difference.
Thanks for the quick response, Steve!
How do I do that?
Espen M. wrote:
> Also I found that computing MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer.

For a problem of that size, you would want to be using /Qopt-matmul (or the equivalent at -O3), and setting OpenMP affinity environment variables if you are allowing it to go multi-threaded. The effect of TRANSPOSE in the function call would then be expected to be taken care of by adjustments in the library calls (analogous to the DGEMM transposition arguments). If the opt-matmul scheme doesn't match your requirements, you would want to use DGEMM directly (or via lapack95).
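(For reference, a sketch of the direct DGEMM route mentioned above — the transposition is passed as an argument rather than by forming TRANSPOSE(A); the array names n, a, b, c here are placeholders:)

```fortran
! C = 1.0 * A**T * B + 0.0 * C for n-by-n double-precision matrices.
! The first argument 'T' tells DGEMM to operate on A transposed, so no
! transposed copy of A is ever materialized.
call dgemm('T', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
```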
I agree with Tim that you should be looking at MKL's xGEMM implementation, which would likely outperform anything you could code yourself, as long as the array sizes are large enough. If you want to try whole-program optimization and you are using Visual Studio, set the project property Fortran > Optimization > Interprocedural Optimization to "Multi-File (/Qipo)".
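(On the command line, the corresponding options would be roughly as follows — exact spellings vary by compiler version, and the file names are placeholders:)

```shell
# Windows (ifort): multi-file IPO plus the MATMUL-to-library rewrite
ifort /Qipo /Qopt-matmul mymodule.f90 main.f90

# Linux equivalents
ifort -ipo -qopt-matmul mymodule.f90 main.f90
```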
The Interprocedural Optimization did the trick! Thanks :)
Normally the matrices are not as large as 600x600; they just had to be that big for CPU_TIME to give non-zero measurements. I also thought that enlarging the arrays would give more meaningful figures for the algorithm's efficiency than making multiple calls in a loop, since both the function calls and the loop itself bring some overhead (any comments on that rationale?). In practice the matrices will be smaller than 20x20, so I doubt LAPACK would perform much better, right?
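(On the timing question: a common middle ground is to keep the repetition loop but time the whole batch, so the per-call and loop overhead is amortized across many calls. A minimal sketch — atb is a stand-in for whatever routine is being measured, and nrep is chosen ad hoc:)

```fortran
! Time nrep calls as one batch and report the average per call.
real(8) :: t0, t1
integer :: rep
integer, parameter :: nrep = 1000
call cpu_time(t0)
do rep = 1, nrep
   c = atb(a, b)          ! or MATMUL(TRANSPOSE(a), b)
end do
call cpu_time(t1)
print '(a, es12.4, a)', 'average per call: ', (t1 - t0) / nrep, ' s'
```

SYSTEM_CLOCK (wall-clock time) may also offer finer resolution than CPU_TIME on some platforms.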
Any comments on the MATMUL(TRANSPOSE(A),B) case, where I easily outperform the intrinsic (by a factor of 15 with the 600x600 matrices!) simply by writing a trivial function?
Tim, if you're referring to my question about the MATMUL(TRANSPOSE(A),B) case: it's not that I'm against using /Qopt-matmul, say, I just find it peculiar that it doesn't get the same optimization as MATMUL(A,TRANSPOSE(B)) without any fiddling with compiler options etc.
I thought that LAPACK routines were mainly targeted at larger arrays, e.g. size 50 and above; is this a misconception on my part?
And the library routines that /Qopt-matmul invokes, are they different from the LAPACK routines? If so, why aren't they used by default?
Can I find information on this somewhere?
Thank you for all the information!