GEMM performance, linking wrong library?

springinhetveld · ‎11-18-2011

Dear Forum,

I seem to have a problem with the performance of the DGEMM routine. A search through the forum has showed some earlier posts on this topics and based on that I have the impression that I am doing something wrong when linking and thus end up with a "not-so-optimized" version of the GEMM (DGEMM) routine.

I just got the latest ifort yesterday for my Linux machine and thus the version I have is called:

composer_xe_sp1.7.256.

The ifort version is 12.1.0.

The MKL version I can not figure out but it was included with the composer package I downloaded.

I have compiled exactly as given by the examples in the mkl part of the installation, i.e., the examples found in:

.../mkl/example/blas

.../mkl/example/blas95

I have tried both the standard blas and the blas95. Both work fine but the peformance is significantly slower compared when I compile with gfortran. At the same time the usage of LAPACK routines is much faster with the ifort then with the gfortran.

Anyway, for the normal BLAS linking the libraries I link with are (as per the example):

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group -lpthread

In the case of using the BLAS95 the linking becomes

$(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group -lpthread

So just the extra blas95, and of course in the source the extra "use blas95 and mkl_precision" and also the changed calls to GEMM (rather then DGEMM) using the F95 interface.

In both cases everything works just fine except for the speed!

Since LAPACK works fine and from LAPACK I use the DPOTRx subroutines which call DGEMM it is clear that there is a good DGEMM routine in the MKL. But somehow directly calling it from my routine does not access that (optimized) DGEMM version.

To give you an idea of the perfomance issue:

DGEMM gfortran: 1 minutes

DGEMM ifort: 4 minutes

DPOTRF gfortran: 3 minutes (and this calls DGEMM, amongst others)

DPOTRF ifort: 0.5 minutes

So please help me and tell me what I am doing wrong! How can I access the "fast" DGEMM in the MKL!

Many thanks in advance!

Cheers,

Tim

barragan_villanueva_ · ‎11-18-2011

Tim,

I can see that MKL sequential library is used so far.
If your data volumeis large enough then it's reasonable to use threading.

Gennady_F_Intel · ‎11-18-2011

by another words, please try to link by this way:

$(MKLROOT)/lib/intel64/libmkl_intel_lp64.a

-Wl,--start-group

$(MKLROOT)/lib/intel64/libmkl_sequential.a

$(MKLROOT)/lib/intel64/libmkl_intel_thread.a

$(MKLROOT)/lib/intel64/libmkl_core.a

-Wl,--end-group -liomp5-lpthread

springinhetveld · ‎11-18-2011

Yes, I know about that. But the gfortran dgemm is also running on a single thread! And clearly when using DGEMM "indirectly (by calling the DPOTRF) it seems to perform just fine. So my guess is that there are multiple DGEMMs in the MKL library and somehow when calling it directly I end up using a "no so optimal" version.

Could that be or is that assumption plain nonsense?

Tim

mecej4 · ‎11-18-2011

There are many possible combinations that are suggested by your description, so it is not clear what you are comparing when you say that Gfortran is faster.

If you showed a simplified documented example, with source code, where you perform the "same" calculation (i) using Lapack and (ii) using BLAS, it would help.

springinhetveld · ‎11-19-2011

OK, fair enough and a good idea. I will make a simple test case and compile and run that with ifort and gfortran and post the source+makefile here and the results. Not sure when I will have that ready.... Maybe I find time to do that this weekend....

Cheers,

Tim

springinhetveld · ‎11-21-2011

Hi Forum,

I found some time to make an example. Since the input file is rather large (>300MB) I have put everything on an FTP server.The example nicely shows the issue I am having.

You may find it at: ftp://ftp.positim.com (username: web87f2 and pwd: ifort12)

The example profides:

- speed.f90: A small program that read some matrices and does the work I want it to do with DGEMM

- make.ifort: Compilation commands I use to build the binary using ifort

- make.gfortran: Compilation commands I use to build the binary using gfortran

- input.dat: Binary input file

Note that the make.* are not real make files but just simple commands. You can use them, e.g. under csh, by typing:

- source make.ifort

I have run the resulting ifort and gfortran binary and looked at the CPU time that the program determines, but also I did run them using the "time" command and looked in "top" with the show threads on ("H").

I did run this 3 times for both binaries with the following results (all three runs were very similar as the machine was completely free for just this task):

ifort:

time command: 218.95 seconds 99.8%

fortran cpu_time: 3.63 min

gfortran:

time command: 29.26 seconds 99.9%

fortran cpu_time: 0.45 min

Both ifort and gfortran use only one single core of the CPU, at least according to top (showing threads).

So obviously I am getting a terrible DGEMM performance using the ifort with my compilation method. Since I do use LAPACK subroutines, that do call DGEMM, I am sure there is a good performing DGEMM in the ifort MKL library. But somehow my linking command does not find it! So please help me to figure out how to compile, or rather link, with the correct library giving me proper DGEMM performance. The difference right now is almost a factor of 7 compared to the gfortran based BLAS library. And the gfortran libraries are not well tuned for my core-i7 as the lapack routines perform poorly.

And yes, the gfortran DGEMM does the correct thing and gives the right answers (this test example does not test that but it is a 1 to 1 copy of what I normally do where I have tested that extensively)!

For completeness I also tried the "threaded" compilation adviced above. This gives for theifort:

time command: 253.84 seconds 392.5% (using all 4 cores of my core-i7 processor)

fortran cpu_time: 4.22 min

So even in this case the DGEMM I link with does not seem to be any faster. The real-time spend of course gets significantly reduced but this is basically the same DGEMM and the same poor performance...

Looking forward to your answers...

Cheers,

Tim

TimP · ‎11-21-2011

On my aging laptop, my gfortran build quotes 1.2 minutes, while ifort/MKL quotes 6.4 minutes single threaded (5.2 mins CPU time/3 mins elapsed with 2 threads).
Not having had time to dig into it, I would guess that the operand sizes in your dgemm usage don't match well with MKL, and you may not have heeded the advice pertaining to alignments for MKL.
The default gfortran 64-bit options for building dgemm evidently do work well on core-I7; it would not surprise me to see better performance than ifort -xSSE4.2 if you have a significant rate of misaligned access.
Since you're running on linux, it should be no difficulty to link the gfortran build with MKL and the ifort with the linux distribution blas libraries, to verify if the performance is accounted for by blas.

TimP · ‎11-21-2011

On my linux server, building and running the MKL with -pg set shows 90% of the time spent in MKL internal function LN8_M4_Kgas_1, which seems unexpectedly difficult to track down.
I suppose it may be necessary to time each MKL call separately, as well as displaying array sizes, to get an idea what is going on.
On this machine, the performance margin in favor of the generic linux (Centos6) blas installation is larger, just as you report, even though that one runs just 1 thread.

springinhetveld · ‎11-21-2011

Hmm, so I am getting the "right" DGEMM but the performance is pretty poor. To get better performance I should "tune" the way I am calling DGEMM?! Pff, sounds complicated and also not very "portable" to other platforms. I thought the whole idea of BLAS/LAPACK (and also MKL) was to enable having one source code that runs very well on different architectures... But I guess that was just one of those "perfect world" dreams people have and then they wake up....

However, I am still amazed (if not shocked) about the difference between the gfortran and ifort DGEMM library performance.... Maybe I should try it (just for fun) with MATMUL and see how that performs (must be aweful). I will try the "cross linking", i.e., the gfortran with MKL and ifort with the -lblas.

Anyway, thanks for the support so far! It has at least made it clear that I am not doing something stupid. But for the rest I am even more confused then I was before.... Dare I hope for some more feedback and clarifications on this issue? What should I do to make sure I use DGEMM properly? Some simple pointers would be nice.

By the way, to explain my persistence on this matter, I am looking for speeding up my software with a factor of two. The example I made here is the part where I spend most time and thus is the part where I may gain most time. At the moment I have a dedicated subroutine for this operation which is very fast and efficient but it does not use any BLAS/LAPACK tricks. So I am sure it can be improved (more or less) significantly. My implementation makes use of the "sparseness" of the matrices. However, I learned that true "sparse" matrices have around 5% non-zeros. My matrices have about 20-30% non-zero and are thus not really sparse. So BLAS/LAPACK type of tricks should (or at least might) give better performance then my "sparse" approach.

So the simplest first attempt I did was to use DGEMM (and DGEMV). I was very happy with the gfortran performace which did manage things in about 0.5 minutes compared to my smart implementation that manages it in 0.1 minutes. I figured that with the ifort MKL I would gain at least an other factor of 3 compared to gfortran which would bring the "brute force" DGEMM performance very close to my "smart sparse" approach. The big advantage then being that DGEMM could be run on several threads which my approach does not support.

So the poor performance of the ifort DGEMM was (and actualyl still is) really a bit of a shock.

By the way my "smart sparse" method is included in the example I made. It is the subroutine "ReduceNormalEquation" compared to the "ReduceNormalEquation2" which uses the DGEMM. That process actually runs a bit faster with ifort compared to gfortran (6 vs 8 seconds).

Anyway, hopeing for some better "news" on this....

Cheers,

Tim

TimP · ‎11-21-2011

Your example calls several BLAS functions in addition to dgemm. I don't expect the other MKL BLAS functions to be thoroughly optimized over a variety of cases; anyway I'm not ready to jump to conclusions about where the big slowdown occurs.
I apologize if my earlier reply appeared ready to make such jumps. You have supplied a reproducer which both demonstrates the possibility that gnu tools may optimize well, and that there may be exceptions to the rule where MKL brings automatic performance gains.

springinhetveld · ‎11-21-2011

Tim,

Just as a short reply. The subroutine that gets used in the example I provided is the one called ReduceNormalEquation2. It makes 2 calls to DGEMM and 1 call to DGEMV. No other BLAS functions and or subroutines are called! And for "clarity" and/or debugging you can turn the call to DGEMV off (and you can also turn the second call to DGEMM off). In that case only DGEMM gets used, nothing else!

From the comments in the source you can see that:

The first DGEMM call does: help = a12 * a22

Where:

- help is a n1 by n2 matrix

- a12 is passes as a21 with the flag "transposed" and a21 is n2 by n1 matrix

- a22 is a n2 by n2 matrix

The second DGEMM call does: a11 = a11 - help * a21

- a11 is a n1 by n1 matrix

- help is the n1 by n2 matrix from the previous call

- a21 is the same n2 by n1 matrix from the previous call

The DGEMV call performs: b1 = b1 - help * b2

- b1 is a vector of size n1

- help is the n1 by n2 matrix from before

- b2 is a vector of size n2

In the example I provided n1 = 6761 and n2 = 107.

Clearly the second DGEMM call is the one that is most likely to be the most time consuming part as it has to loop over all n1 x n1 elements.

TimP · ‎11-22-2011

As MKL doesn't appear to be competitive for single thread execution of your example, I thought I'd try multi-threaded performance (2 Westmere CPUs):
threads time
1 3.6
2 1.8
4 0.95
6 0.64
8 0.48
12 0.34
24 1.9

One could repeat the exercise of optimizing the BLAS source with OpenMP and both gfortran and ifort; it's an open-ended task.
It seems to be difficult to optimize performance for square matrices without losing some performance in other cases.