springinhetveld

Beginner

11-18-2011
12:09 AM

174 Views

GEMM performance, linking wrong library?

I seem to have a problem with the performance of the DGEMM routine. A search through the forum turned up some earlier posts on this topic, and based on those I have the impression that I am doing something wrong when linking and thus end up with a "not-so-optimized" version of the GEMM (DGEMM) routine.

I just got the latest ifort yesterday for my Linux machine and thus the version I have is called:

composer_xe_sp1.7.256.

The ifort version is 12.1.0.

I cannot figure out the MKL version, but it was included with the composer package I downloaded.

I have compiled exactly as given by the examples in the mkl part of the installation, i.e., the examples found in:

.../mkl/example/blas

.../mkl/example/blas95

I have tried both the standard BLAS and the BLAS95 interface. Both work fine, but the performance is significantly slower than when I compile with gfortran. At the same time, the LAPACK routines are much faster with ifort than with gfortran.

Anyway, for the normal BLAS linking the libraries I link with are (as per the example):

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group -lpthread

In the case of using the BLAS95 the linking becomes

$(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group -lpthread

So just the extra blas95 library and, of course, the extra "use blas95 and mkl_precision" in the source, plus the changed calls to GEMM (rather than DGEMM) using the F95 interface.

In both cases everything works just fine except for the speed!

Since LAPACK works fine, and the DPOTRx subroutines I use from LAPACK call DGEMM, it is clear that there is a good DGEMM routine in the MKL. But somehow calling it directly from my routine does not reach that (optimized) DGEMM version.

To give you an idea of the performance issue:

DGEMM gfortran: 1 minute

DGEMM ifort: 4 minutes

DPOTRF gfortran: 3 minutes (and this calls DGEMM, amongst others)

DPOTRF ifort: 0.5 minutes

So please help me and tell me what I am doing wrong! How can I access the "fast" DGEMM in the MKL?

Many thanks in advance!

Cheers,

Tim

12 Replies

barragan_villanueva_

Valued Contributor I

11-18-2011
01:03 AM

I can see that the MKL sequential library is being used so far.

If your data volume is large enough, then it is reasonable to use threading.

Gennady_F_Intel

Moderator

11-18-2011
01:30 AM

In other words, please try linking this way:

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/**libmkl_intel_thread.a**

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group **-liomp5** -lpthread
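The sequential and threaded link lines in this thread differ only in the threading layer and the OpenMP runtime. The following is a minimal Python sketch of that difference, assuming that exactly one threading layer (`libmkl_sequential.a` or `libmkl_intel_thread.a`) is linked at a time. The library names are copied from the posts above; `mkl_static_link_args` is a hypothetical helper for illustration, not part of MKL or the compiler.

```python
# Library names are taken from the link lines quoted in this thread.
# Assumption: exactly one threading layer is linked at a time.
MKL_LIB_DIR = "$(MKLROOT)/lib/intel64"


def mkl_static_link_args(threaded):
    """Return static-link arguments for the LP64 interface (hypothetical helper)."""
    layer = "libmkl_intel_thread.a" if threaded else "libmkl_sequential.a"
    args = [
        MKL_LIB_DIR + "/libmkl_intel_lp64.a",
        "-Wl,--start-group",
        MKL_LIB_DIR + "/" + layer,
        MKL_LIB_DIR + "/libmkl_core.a",
        "-Wl,--end-group",
    ]
    if threaded:
        # OpenMP runtime used by the threaded layer, as in the post above.
        args.append("-liomp5")
    args.append("-lpthread")
    return args


print(" ".join(mkl_static_link_args(threaded=False)))
print(" ".join(mkl_static_link_args(threaded=True)))
```

Printing both variants side by side makes it easy to confirm that only the threading layer and `-liomp5` change between the two builds.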

springinhetveld

Beginner

11-18-2011
01:34 AM

Could that be or is that assumption plain nonsense?

Tim

mecej4

Black Belt

11-18-2011
04:02 AM

If you showed a simplified documented example, with source code, where you perform the "same" calculation (i) using Lapack and (ii) using BLAS, it would help.

springinhetveld

Beginner

11-19-2011
04:51 AM

Cheers,

Tim

springinhetveld

Beginner

11-21-2011
01:04 AM

I found some time to make an example. Since the input file is rather large (>300MB), I have put everything on an FTP server. The example nicely shows the issue I am having.

You may find it at: ftp://ftp.positim.com (username: web87f2 and pwd: ifort12)

The example provides:

- speed.f90: A small program that reads some matrices and does the work I want it to do with DGEMM

- make.ifort: Compilation commands I use to build the binary using ifort

- make.gfortran: Compilation commands I use to build the binary using gfortran

- input.dat: Binary input file

Note that the make.* are not real make files but just simple commands. You can use them, e.g. under csh, by typing:

- source make.ifort

I ran the resulting ifort and gfortran binaries and looked at the CPU time that the program reports; I also ran them under the "time" command and watched "top" with thread display turned on ("H").

I ran each binary 3 times, with the following results (all three runs were very similar, as the machine was completely free for this task):

ifort:

time command: 218.95 seconds 99.8%

fortran cpu_time: 3.63 min

gfortran:

time command: 29.26 seconds 99.9%

fortran cpu_time: 0.45 min

Both ifort and gfortran use only one single core of the CPU, at least according to top (showing threads).

So obviously I am getting terrible DGEMM performance from ifort with my compilation method. Since I use LAPACK subroutines that call DGEMM, I am sure there is a well-performing DGEMM in the MKL that ships with ifort, but somehow my link command does not find it. So please help me figure out how to compile, or rather link, against the correct library to get proper DGEMM performance. The difference right now is a factor of about 7.5 compared to the gfortran-based BLAS library. And the gfortran libraries are not well tuned for my Core i7, as the LAPACK routines perform poorly.
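As a quick check of the numbers above, the slowdown can be computed from either timing source. This is only a small Python sketch using the figures quoted in this post; the variable and function names are made up for illustration.

```python
# Timings quoted in the post above.
ifort_time_s = 218.95     # "time" command, ifort build
gfortran_time_s = 29.26   # "time" command, gfortran build
ifort_cpu_min = 3.63      # Fortran cpu_time, ifort build
gfortran_cpu_min = 0.45   # Fortran cpu_time, gfortran build


def slowdown(slow, fast):
    """Ratio of the slower timing to the faster one."""
    return slow / fast


print(f"time command ratio: {slowdown(ifort_time_s, gfortran_time_s):.2f}")
print(f"cpu_time ratio:     {slowdown(ifort_cpu_min, gfortran_cpu_min):.2f}")
# The two sources agree once minutes are converted to seconds.
print(f"ifort cpu_time in seconds: {ifort_cpu_min * 60:.1f}")
```

Both sources put the gap between roughly 7.5x and 8x, depending on whether the "time" command or the in-program cpu_time value is used.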

And yes, the gfortran DGEMM does the correct thing and gives the right answers (this test example does not test that but it is a 1 to 1 copy of what I normally do where I have tested that extensively)!

For completeness I also tried the "threaded" compilation advised above. For ifort this gives:

time command: 253.84 seconds 392.5% (using all 4 cores of my core-i7 processor)

fortran cpu_time: 4.22 min

So even in this case the DGEMM I link with does not seem to be any faster. The elapsed (real) time is of course significantly reduced, but this is basically the same DGEMM with the same poor performance...
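To compare the threaded run with the single-threaded ones, the elapsed time can be estimated from the reported figures. The sketch below assumes that the seconds printed by "time" are total CPU seconds and that the percentage is CPU time divided by elapsed time; the post does not state this explicitly, although the threaded figure does match the cpu_time value of 4.22 min. The function name is hypothetical.

```python
def estimated_elapsed_seconds(cpu_seconds, cpu_percent):
    """Elapsed time implied by total CPU time and average CPU utilisation.

    Assumption: cpu_percent = 100 * cpu_seconds / elapsed_seconds.
    """
    return cpu_seconds / (cpu_percent / 100.0)


# (CPU seconds, CPU percentage) as quoted in the post above.
runs = {
    "ifort sequential": (218.95, 99.8),
    "ifort threaded": (253.84, 392.5),
    "gfortran": (29.26, 99.9),
}

for name, (cpu_s, pct) in runs.items():
    print(f"{name}: about {estimated_elapsed_seconds(cpu_s, pct):.1f} s elapsed")
```

Under that assumption the threaded ifort build needs about 65 s of elapsed time on four cores, which is still more than twice the roughly 29 s of the single-threaded gfortran build.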

Looking forward to your answers...

Cheers,

Tim

TimP

Black Belt

11-21-2011
08:01 AM

Not having had time to dig into it, I would guess that the operand sizes in your dgemm usage don't match well with MKL, and you may not have heeded the advice pertaining to alignments for MKL.

The default gfortran 64-bit options for building dgemm evidently do work well on core-I7; it would not surprise me to see better performance than ifort -xSSE4.2 if you have a significant rate of misaligned access.

Since you're running on linux, it should be no difficulty to link the gfortran build with MKL and the ifort with the linux distribution blas libraries, to verify if the performance is accounted for by blas.

TimP

Black Belt

11-21-2011
09:37 AM

I suppose it may be necessary to time each MKL call separately, as well as displaying array sizes, to get an idea what is going on.

On this machine, the performance margin in favor of the generic linux (Centos6) blas installation is larger, just as you report, even though that one runs just 1 thread.

springinhetveld

Beginner

11-21-2011
02:50 PM

However, I am still amazed (if not shocked) by the difference between the gfortran and ifort DGEMM library performance... Maybe I should try it (just for fun) with MATMUL and see how that performs (it must be awful). I will try the "cross linking", i.e., gfortran with MKL and ifort with -lblas.

Anyway, thanks for the support so far! It has at least made it clear that I am not doing something stupid. But otherwise I am even more confused than I was before... Dare I hope for some more feedback and clarification on this issue? What should I do to make sure I use DGEMM properly? Some simple pointers would be nice.

By the way, to explain my persistence on this matter: I am looking to speed up my software by a factor of two. The example I made here is the part where I spend the most time and thus the part where I may gain the most. At the moment I have a dedicated subroutine for this operation which is very fast and efficient, but it does not use any BLAS/LAPACK tricks. So I am sure it can be improved (more or less) significantly. My implementation makes use of the "sparseness" of the matrices. However, I learned that truly "sparse" matrices have around 5% non-zeros. My matrices have about 20-30% non-zeros and are thus not really sparse. So BLAS/LAPACK-type tricks should (or at least might) give better performance than my "sparse" approach.

So the simplest first attempt was to use DGEMM (and DGEMV). I was very happy with the gfortran performance, which managed things in about 0.5 minutes compared to my smart implementation, which manages it in 0.1 minutes. I figured that with the ifort MKL I would gain at least another factor of 3 compared to gfortran, which would bring the "brute force" DGEMM performance very close to my "smart sparse" approach. The big advantage then would be that DGEMM could run on several threads, which my approach does not support.

So the poor performance of the ifort DGEMM was (and actually still is) really a bit of a shock.

By the way my "smart sparse" method is included in the example I made. It is the subroutine "ReduceNormalEquation" compared to the "ReduceNormalEquation2" which uses the DGEMM. That process actually runs a bit faster with ifort compared to gfortran (6 vs 8 seconds).

Anyway, hoping for some better "news" on this...

Cheers,

Tim

TimP

Black Belt

11-21-2011
06:33 PM

I apologize if my earlier reply appeared ready to make such jumps. You have supplied a reproducer which both demonstrates the possibility that gnu tools may optimize well, and that there may be exceptions to the rule where MKL brings automatic performance gains.

springinhetveld

Beginner

11-21-2011
10:50 PM

Just a short reply. The subroutine used in the example I provided is the one called ReduceNormalEquation2. It makes 2 calls to DGEMM and 1 call to DGEMV. No other BLAS functions and/or subroutines are called! And for "clarity" and/or debugging you can turn off the call to DGEMV (and you can also turn off the second call to DGEMM). In that case only DGEMM gets used, nothing else!

From the comments in the source you can see that:

The first DGEMM call does: help = a12 * a22

Where:

- help is an n1 by n2 matrix

- a12 is passed as a21 with the "transposed" flag, and a21 is an n2 by n1 matrix

- a22 is an n2 by n2 matrix

The second DGEMM call does: a11 = a11 - help * a21

- a11 is an n1 by n1 matrix

- help is the n1 by n2 matrix from the previous call

- a21 is the same n2 by n1 matrix from the previous call

The DGEMV call performs: b1 = b1 - help * b2

- b1 is a vector of size n1

- help is the n1 by n2 matrix from before

- b2 is a vector of size n2
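The three updates listed above can be written out as a small reference implementation. This is a plain-Python sketch on nested lists, meant only to make the shapes and the order of operations concrete; it is not the Fortran code from the example, and `reduce_normal_equation`, `matmul` and `transpose` are illustrative names.

```python
def transpose(m):
    """Transpose a row-major nested-list matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Naive product of two row-major nested-list matrices."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def reduce_normal_equation(a11, a21, a22, b1, b2):
    """Apply the three updates; returns the new a11 and b1."""
    help_ = matmul(transpose(a21), a22)   # help = a12 * a22   (n1 x n2)
    update = matmul(help_, a21)           # help * a21         (n1 x n1)
    new_a11 = [[x - u for x, u in zip(row, urow)]
               for row, urow in zip(a11, update)]
    new_b1 = [x - sum(h * y for h, y in zip(hrow, b2))
              for x, hrow in zip(b1, help_)]
    return new_a11, new_b1


# Tiny case with n1 = 2 and n2 = 1.
a11 = [[4.0, 0.0], [0.0, 4.0]]
a21 = [[1.0, 2.0]]          # n2 x n1
a22 = [[0.5]]               # n2 x n2
b1 = [1.0, 1.0]
b2 = [2.0]
new_a11, new_b1 = reduce_normal_equation(a11, a21, a22, b1, b2)
print(new_a11)  # [[3.5, -1.0], [-1.0, 2.0]]
print(new_b1)   # [0.0, -1.0]
```

A reference like this is far too slow for real sizes, but it can be handy for checking that a BLAS-based version is called with the right transpose flags and dimensions.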

In the example I provided n1 = 6761 and n2 = 107.

Clearly the second DGEMM call is the one that is most likely to be the most time consuming part as it has to loop over all n1 x n1 elements.
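A rough operation count supports this. The sketch below uses the usual 2*m*n*k estimate for a dense (m x k) by (k x n) product and the sizes given above; it counts a single pass through the three calls and says nothing about how often the example program repeats them.

```python
def gemm_flops(m, n, k):
    """Approximate flop count for an (m x k) by (k x n) matrix product."""
    return 2 * m * n * k


n1, n2 = 6761, 107

first_dgemm = gemm_flops(n1, n2, n2)    # help = a12 * a22
second_dgemm = gemm_flops(n1, n1, n2)   # a11 = a11 - help * a21
dgemv = 2 * n1 * n2                     # b1 = b1 - help * b2

print(f"first DGEMM:  {first_dgemm:.3e} flops")
print(f"second DGEMM: {second_dgemm:.3e} flops")
print(f"DGEMV:        {dgemv:.3e} flops")
print(f"second/first: {second_dgemm / first_dgemm:.1f}")
```

With these sizes the second DGEMM does about n1/n2, or roughly 63 times, the work of the first, and the DGEMV is negligible next to either.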

TimP

Black Belt

11-22-2011
05:40 AM

| threads | time |
|---------|------|
| 1       | 3.6  |
| 2       | 1.8  |
| 4       | 0.95 |
| 6       | 0.64 |
| 8       | 0.48 |
| 12      | 0.34 |
| 24      | 1.9  |
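The scaling in these figures can be summarised as speedup and parallel efficiency relative to the 1-thread run. This is a small Python sketch over the numbers above; the time unit is not stated in the post, and it cancels out in the ratios anyway.

```python
# (threads, time) pairs from the table above; the time unit is not stated.
timings = [(1, 3.6), (2, 1.8), (4, 0.95), (6, 0.64),
           (8, 0.48), (12, 0.34), (24, 1.9)]


def scaling(rows):
    """Return (threads, speedup, efficiency) relative to the 1-thread time."""
    base = dict(rows)[1]
    return [(n, base / t, base / t / n) for n, t in rows]


for threads, speedup, efficiency in scaling(timings):
    print(f"{threads:2d} threads: speedup {speedup:5.2f}, "
          f"efficiency {efficiency:4.2f}")
```

In these numbers the speedup grows up to 12 threads and then collapses at 24 threads, where the run is slower than with 2 threads.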

One could repeat the exercise of optimizing the BLAS source with OpenMP and both gfortran and ifort; it's an open-ended task.

It seems to be difficult to optimize performance for square matrices without losing some performance in other cases.
