Hi,
I have a function that computes the matrix product A^T B, which I have placed in a MODULE. When doing some simple timing (using CPU_TIME) I found that making the function internal is about twice as fast as calling the MODULE procedure. Is there a simple explanation for this?
Also, I found that MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer (which is why I wrote my own in the first place). Is there a reason for not optimizing away the transposition in that case?
PS: the matrices I used for the timing were 600x600 in double precision
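(The function itself wasn't posted; the following is only a minimal sketch of what such an internal A^T B routine might look like — the name and exact loop order are assumptions:)

```fortran
! Hypothetical sketch of the kind of function described above: C = A**T * B.
! Because C(i,j) = sum_k A(k,i)*B(k,j), the inner loop runs down the
! columns of both A and B, i.e. stride-1 in Fortran's column-major
! storage -- which is why a hand-written loop can beat
! MATMUL(TRANSPOSE(A),B) when the transpose is not optimized away.
function atb(a, b) result(c)
  real(8), intent(in) :: a(:,:), b(:,:)
  real(8) :: c(size(a,2), size(b,2))
  integer :: i, j, k
  do j = 1, size(b, 2)
     do i = 1, size(a, 2)
        c(i, j) = 0.0d0
        do k = 1, size(a, 1)
           c(i, j) = c(i, j) + a(k, i) * b(k, j)
        end do
     end do
  end do
end function atb
```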
Interprocedural optimization happens by default for internal procedures. You might try turning on whole-program interprocedural optimization, which will likely erase the difference.
Thanks for the quick response, Steve!
How do I do that?
Espen M. wrote:
> Also I found that computing MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer.

For a problem of that size, you would want to be using /Qopt-matmul (or the equivalent at -O3), and setting OpenMP affinity environment variables if you are allowing it to go multi-threaded. The effect of TRANSPOSE in the function call would then be expected to be taken care of by adjustments in the library calls (analogous to the DGEMM transposition arguments). If the opt-matmul scheme doesn't match your requirements, you would want to use DGEMM directly (or via lapack95).
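(For reference, a sketch of the direct DGEMM route mentioned above — the transposition is passed as an argument rather than by forming TRANSPOSE(A); the array names n, a, b, c here are placeholders:)

```fortran
! C = 1.0 * A**T * B + 0.0 * C for n-by-n double-precision matrices.
! The first argument 'T' tells DGEMM to operate on A transposed, so no
! transposed copy of A is ever materialized.
call dgemm('T', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
```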
I agree with Tim that you should be looking at MKL's xGEMM implementation, which would likely outperform anything you could code yourself, as long as the array sizes are large enough. If you want to try whole-program optimization and you are using Visual Studio, set the project property Fortran > Optimization > Interprocedural Optimization to "Multi-File (/Qipo)".
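(On the command line, the corresponding options would be roughly as follows — exact spellings vary by compiler version, and the file names are placeholders:)

```shell
# Windows (ifort): multi-file IPO plus the MATMUL-to-library rewrite
ifort /Qipo /Qopt-matmul mymodule.f90 main.f90

# Linux equivalents
ifort -ipo -qopt-matmul mymodule.f90 main.f90
```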
The Interprocedural Optimization did the trick! Thanks :)
Normally the matrices are not as large as 600x600; they just had to be that big for CPU_TIME to give non-zero measurements. I also thought that enlarging the arrays would give more meaningful figures for the algorithm's efficiency than making multiple calls in a loop, since both the function calls and the loop itself bring some overhead (any comments on that rationale?). In practice the matrices will be smaller than 20x20, so I doubt LAPACK would perform much better, right?
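(On the timing question: a common middle ground is to keep the repetition loop but time the whole batch, so the per-call and loop overhead is amortized across many calls. A minimal sketch — atb is a stand-in for whatever routine is being measured, and nrep is chosen ad hoc:)

```fortran
! Time nrep calls as one batch and report the average per call.
real(8) :: t0, t1
integer :: rep
integer, parameter :: nrep = 1000
call cpu_time(t0)
do rep = 1, nrep
   c = atb(a, b)          ! or MATMUL(TRANSPOSE(a), b)
end do
call cpu_time(t1)
print '(a, es12.4, a)', 'average per call: ', (t1 - t0) / nrep, ' s'
```

SYSTEM_CLOCK (wall-clock time) may also offer finer resolution than CPU_TIME on some platforms.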
Any comments on the MATMUL(TRANSPOSE(A),B) case, where I easily outperform the intrinsic (by a factor of 15 with the 600x600 matrices!) simply by writing a trivial function?
Tim, if you're referring to my question about the MATMUL(TRANSPOSE(A),B) case: it's not that I'm against using /Qopt-matmul, say, I just find it peculiar that it doesn't get the same optimization as MATMUL(A,TRANSPOSE(B)) without any fiddling with compiler options etc.
I thought that LAPACK routines were mainly targeted at larger arrays, e.g. size 50 and above; is this a misconception on my part?
And the library routines that /Qopt-matmul invokes, are they different from the LAPACK routines? If so, why aren't they used by default?
Can I find information on this somewhere?
Thank you for all the information!