- Intel Community
- Software Development Tools (Compilers, Debuggers, Profilers & Analyzers)
- Intel® Fortran Compiler
- internal procedures faster than MODULE procedures??

Espen_M_

Beginner

04-28-2014 10:12 AM

internal procedures faster than MODULE procedures??

Hi

I have a function that computes the matrix product A^T B, which I have placed in a MODULE. Doing some simple timing (using CPU_TIME), I found that making the function internal is about twice as fast as the MODULE procedure. Is there a simple explanation for this?

Also I found that computing MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer (which is the reason I wrote my own in the first place). Is there a reason for not optimizing away the transposition in that case?

PS: the matrices I used for the timing were 600x600 in double precision
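A minimal sketch of the experiment described above, with the same A^T B loop nest once as a MODULE procedure and once as an internal procedure, timed with CPU_TIME. The names and the loop order are my own guesses, not the poster's actual code:

```fortran
module matops
  implicit none
contains
  function atb_mod(a, b) result(c)
    double precision, intent(in) :: a(:,:), b(:,:)
    double precision :: c(size(a,2), size(b,2))
    integer :: i, j, k
    c = 0d0
    do j = 1, size(b, 2)
      do i = 1, size(a, 2)
        do k = 1, size(a, 1)   ! innermost k walks columns of A and B contiguously
          c(i,j) = c(i,j) + a(k,i) * b(k,j)
        end do
      end do
    end do
  end function atb_mod
end module matops

program time_atb
  use matops
  implicit none
  integer, parameter :: n = 600
  double precision :: a(n,n), b(n,n), c(n,n), t0, t1
  call random_number(a)
  call random_number(b)

  call cpu_time(t0)
  c = atb_mod(a, b)                    ! MODULE procedure
  call cpu_time(t1)
  print '(a,f8.3,a)', 'module   : ', t1 - t0, ' s'

  call cpu_time(t0)
  c = atb_int(a, b)                    ! internal procedure, identical body
  call cpu_time(t1)
  print '(a,f8.3,a)', 'internal : ', t1 - t0, ' s'
  print *, sum(c)                      ! use the result so neither call is dead code

contains
  function atb_int(a, b) result(c)
    double precision, intent(in) :: a(:,:), b(:,:)
    double precision :: c(size(a,2), size(b,2))
    integer :: i, j, k
    c = 0d0
    do j = 1, size(b, 2)
      do i = 1, size(a, 2)
        do k = 1, size(a, 1)
          c(i,j) = c(i,j) + a(k,i) * b(k,j)
        end do
      end do
    end do
  end function atb_int
end program time_atb
```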

8 Replies

Steven_L_Intel1

Employee

04-28-2014 10:41 AM

Interprocedural optimization would happen by default for internal procedures. You might try turning on whole-program interprocedural optimization, which will likely erase the difference.


Espen_M_

Beginner

04-28-2014 10:57 AM

Thanks for the quick response, Steve!

How do I do that?


TimP

Black Belt

04-28-2014 11:11 AM

Espen M. wrote:

    Hi

    Also I found that computing MATMUL(A,B) takes the same time as MATMUL(A,TRANSPOSE(B)), implying that the TRANSPOSE operation is optimized away, whereas MATMUL(TRANSPOSE(A),B) takes a lot longer (which is the reason I wrote my own in the first place). Is there a reason for not optimizing away the transposition in that case?

    PS: the matrices I used for the timing were 600x600 in double precision

For a problem of that size, you would want to be using /Qopt-matmul (or the equivalent at -O3), and setting OpenMP affinity environment variables if you are allowing it to go multi-threaded. Then the effect of TRANSPOSE in the function call would be expected to be taken care of by adjustments in the library calls (analogous to DGEMM transposition arguments). If the opt-matmul scheme doesn't match your requirements, you would want to use DGEMM directly (or via lapack95).
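Tim's DGEMM suggestion might look like the sketch below (it assumes MKL or another BLAS library is linked in). The 'T' argument asks DGEMM to treat A as transposed, so no temporary TRANSPOSE(A) array is ever formed:

```fortran
program atb_dgemm
  implicit none
  integer, parameter :: n = 600
  double precision :: a(n,n), b(n,n), c(n,n)
  external :: dgemm                    ! BLAS level-3 routine
  call random_number(a)
  call random_number(b)
  ! C = 1.0 * A**T * B + 0.0 * C ; the transposition is handled inside
  ! the library by its choice of loop order, not by copying A.
  call dgemm('T', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, sum(c)
end program atb_dgemm
```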


Steven_L_Intel1

Employee

04-28-2014 11:36 AM

I agree with Tim that you should be looking at MKL's xGEMM implementation, which would likely outperform anything you could code, as long as the array sizes are large enough. If you want to try whole-program optimization and you are using Visual Studio, set the project property Fortran > Optimization > Interprocedural Optimization to "Multi-File (/Qipo)".


Espen_M_

Beginner

04-29-2014 01:46 AM

The Interprocedural Optimization did the trick! Thanks :)

Normally the matrices are not as large as 600x600; they just had to be that large for CPU_TIME to give non-zero measurements. I thought that enlarging the arrays would give more meaningful figures for the algorithm's efficiency than making multiple calls in a loop, since both the function calls and the loop itself would add some overhead (any comments on that rationale?). Normally the matrices will be smaller than 20x20, so I doubt LAPACK would perform much better, right?

Any comments on the MATMUL(TRANSPOSE(A),B) case, where I easily outperform the intrinsic (by a factor of 15 with the 600x600 matrices!) simply by writing a trivial function?
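On the loop-overhead concern: a common alternative for matrices this small is to time many repetitions with SYSTEM_CLOCK, which usually has much finer resolution than CPU_TIME, and divide by the repetition count; the loop itself costs almost nothing next to even a 20x20 product. A sketch of that approach (my own illustration, not code from the thread):

```fortran
program time_small
  implicit none
  integer, parameter :: n = 20, reps = 100000
  double precision :: a(n,n), b(n,n), c(n,n), s
  integer(8) :: t0, t1, rate
  integer :: r
  call random_number(a)
  call random_number(b)
  s = 0d0
  call system_clock(t0, rate)
  do r = 1, reps
    c = matmul(transpose(a), b)
    s = s + c(1,1)          ! use the result so the loop cannot be optimized away
  end do
  call system_clock(t1)
  print '(a,es10.3,a)', 'per call: ', dble(t1 - t0) / dble(rate) / dble(reps), ' s'
  print *, s
end program time_small
```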


TimP

Black Belt

04-29-2014 03:05 AM

Matrix size 20 poses more choices than you apparently want to consider (opt-matmul, DGEMM, -O3 with /Qopt-matmul-, as well as varying the number of threads). If you want to optimize CPU time per thread, you can hold it to one thread. Certainly, you won't need many threads for this small case.


Espen_M_

Beginner

04-29-2014 05:38 AM

Tim, if you're referring to my question about the MATMUL(TRANSPOSE(A),B) case: it's not that I'm against using /Qopt-matmul, say; I just find it peculiar that it does not get the same optimization as MATMUL(A,TRANSPOSE(B)) without any fiddling with compiler options etc.

I thought that LAPACK routines were mainly targeted towards larger arrays, e.g. size 50 and above; is this a misconception on my side?

And the library routines that /Qopt-matmul invokes, are they different from the LAPACK routines? If so, why aren't they used by default?

Can I find information on this somewhere?

Thank you for all the information!


TimP

Black Belt

04-29-2014 06:09 AM

Opt-matmul has different entry points in MKL than DGEMM. I suppose there is common code. I can't guess from here how your comparisons would come out, nor how they would relate to what you want to know.
