Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

C++ and Intel MKL for Scientific programming

Petros
Beginner
Hi everybody,

I need your expertise...

I am about to start writing a scientific program (a numerical simulator for a Differential Algebraic Equation system). The default language most people in Power Engineering use is Fortran, and we currently have a very fast piece of software written in Fortran.

The software, though, has grown so big, with so many features, that it is very hard to maintain in its current form (a lot of code is duplicated with minor changes all over the program). Also, many of the features that will be implemented in the future involve threading, running the same algorithms on different data sets concurrently, running on a cluster, etc.

I want to rewrite the software in an object-oriented form in order to make it more maintainable, expandable and reusable. I'm also interested in trying Intel TBB for the threading needs, and a GUI library for setting up a basic interface for the program.

My concern is: what about the speed of the program? Can C++ compete with Fortran? Which mathematical libraries should I use?

The current Fortran implementation uses:
- elementary mathematical functions (pow, exp, multiplication, etc.)
- dgetrf and dgetrs from BLAS95 (MKL library) are called millions of times during a simulation with matrices of order 10-30
- a sparse solver (ma37, from the HSL mathematical software library) is called several times on a large matrix (30000+ elements)
I currently use the newest version of Intel Parallel Studio 2011 with MKL. Also, my university owns the Intel cluster toolkit.

Many people advise against C++ for scientific computing on performance grounds (creation of temporary copies, etc.). I also found several libraries (Blitz++, GotoBLAS, A++, ATLAS, Eigen, ...) promising near-Fortran performance.

Does anyone have experience with this kind of numerically intensive implementation in C++? If I use the tools offered by MKL (pow, exp, matrix multiplication, BLAS functions, sparse solvers, etc.), will I get good performance?

I am not looking for a definitive answer, since I know it's a difficult question. Any kind of help or insight is welcome.


Petros
7 Replies
TimP
Honored Contributor III
If you are successful in structuring your problem so that most of the time is spent in MKL solvers, performance will not necessarily differ between Fortran and C++.
The threading libraries of TBB and MKL (OpenMP) aren't compatible. With the current setup, under TBB you would be limited to the sequential MKL library, with threading under the control of TBB.
The "work-stealing" capabilities of TBB, Cilk+, ArBB threading libraries (allowing multiple applications to coexist effectively) may come at a price. With OpenMP, for full performance, you designate a group of cores dedicated to your task and the OpenMP library expects to hog those cores.
By following the ancient adage "a Fortran programmer can write Fortran in any language," it is usually possible to achieve similar performance with Intel C++ and Fortran. Both Intel compilers share the compiler auto-vectorization, taking advantage of a single "short vector math library" for math functions.
Many OOP idioms don't leave room for "Fortran in any language" style, and so you may have better performance with the recently added OOP capabilities of Fortran than C++.
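To make that concrete, here is a minimal sketch of running many independent small factorizations under TBB while MKL itself stays sequential inside each task. It assumes MKL's LAPACKE C interface and TBB's lambda form of parallel_for; the block size and layout are invented for the example.

// Sketch: many independent small dense solves run in parallel under TBB,
// with MKL kept sequential inside each task (link against mkl_sequential).
#include <vector>
#include <mkl_lapacke.h>          // LAPACKE_dgetrf / LAPACKE_dgetrs (MKL's C interface)
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

const int N = 30;                 // order of each small system (assumed for the example)

struct Block {                    // one independent N x N system with one right-hand side
    std::vector<double> A;        // column-major N x N matrix
    std::vector<double> b;        // right-hand side, overwritten with the solution
    std::vector<lapack_int> ipiv; // pivot indices from the factorization
    Block() : A(N * N), b(N), ipiv(N) {}
};

void solve_all(std::vector<Block>& blocks) {
    // Each block touches only its own data, so the iterations are independent
    // and TBB can schedule them freely across cores.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, blocks.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i) {
                Block& blk = blocks[i];
                LAPACKE_dgetrf(LAPACK_COL_MAJOR, N, N,
                               blk.A.data(), N, blk.ipiv.data());    // factor once
                LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', N, 1,
                               blk.A.data(), N, blk.ipiv.data(),
                               blk.b.data(), N);                     // back-substitute
            }
        });
}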
Petros
Beginner
Thanks for the prompt answer!
I'll try to answer a few remarks:
-I already use the sequential MKL library under Fortran, since the matrices I solve are only 10 to 30 elements, so the threaded library gives me no benefit. The sparse solver I use is so fast that I never considered using a threaded one.
-The parallelization I need is not at the numerical methods' level but at a higher level. For example, not solving a single (30x30) matrix with the threaded MKL library, but solving 6000 (30x30) blocks in parallel (each one affecting a different data set), using the sequential library for each.
-I programmed in C for 4 years and Java for 2 years before moving to Fortran 95 in the last 6 months (the program predates my arrival at this job).
I understand the performance of the MKL libraries, but what about the performance of elementary mathematical functions? For example, raising to real powers, multiplying small/big vectors, etc.?
Again thanks for the help.
mecej4
Honored Contributor III
> dgetrf and dgetrs from BLAS95 (MKL library) are called millions of times during a simulation with matrices of order 10-30

There is something inconsistent in that statement.

First of all, the factorization and solution routines are from LAPACK, not BLAS. However, if you are calling MKL, which contains BLAS, LAPACK and other routines, the distinction gets blurred.

Secondly, if you were calling through the Fortran 95 interfaces, you would have been calling the generic names GETRF and GETRS. In that case, the overhead of allocating, populating and deallocating work arrays millions of times could and should be avoided, by putting in the programming effort to call the Fortran-77 routines.

This is something that you can check by examining the MKL calls in your source code.

The usual advice applies: factorize a matrix only once when solving a sequence of problems with the same matrix but different right-hand sides.
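As a minimal sketch of that pattern, assuming MKL's LAPACKE C interface (names and sizes are only illustrative; the same idea applies to the Fortran-77 calls):

// Sketch: factor the matrix once, then reuse the factorization for many right-hand sides.
#include <vector>
#include <mkl_lapacke.h>

void factor_once_solve_many(int n, std::vector<double>& A,           // column-major n x n
                            std::vector<std::vector<double> >& rhs)  // each of length n
{
    std::vector<lapack_int> ipiv(n);
    // One LU factorization (the expensive O(n^3) step) ...
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A.data(), n, ipiv.data());
    // ... followed by cheap O(n^2) triangular solves, one per right-hand side.
    for (size_t k = 0; k < rhs.size(); ++k)
        LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, 1,
                       A.data(), n, ipiv.data(), rhs[k].data(), n);
}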
Petros
Beginner
Hi, sorry for that, my mistake. It's the LAPACK95 library. I use DGETRF, with the call CALL DGETRF(M, N, A, LDA, IPIV, INFO), to factorize once, and then DGETRS, with the call CALL DGETRS(TRANS, N, NRHS, A, LDA, IPIV, B, LDB, INFO), for solving several times with different RHS.
I link with -lmkl_intel_lp64 -lmkl_lapack95_lp64 -lmkl_sequential -lmkl_core -lpthread -lma37 -liomp5.
When you say call the Fortran-77 routines, what do you mean?
Thanks
TimP
Honored Contributor III
I already suggested using the svml auto-vectorization of the Intel compilers, which should call the same functions regardless of whether you use Fortran or C++.
The "12.0 update 4" compilers implemented an arch-consistency option for the svml, which gives up some performance in return for running the same code across a variety of platforms, yielding more consistently accurate results. There's still a significant gain from the vectorization.
For real-to-real powers, as long as you take care to avoid gratuitous data type mixtures, the Intel compilers still auto-vectorize; e.g. use powf(float, float) or pow(double, double) in order to get svml vectorization.
For small vectors, you are better off letting the compiler auto-vectorize, rather than calling an MKL function. If small vector means size < 50, it will be difficult to find any choice which performs consistently well.
Don't switch from Fortran to C or C++ and expect performance of vector operations unless you are willing to decorate your code with appropriate *restrict definitions and even check correctness of results with #pragma simd reduction() and the like. The MKL functions might be more attractive to someone who is saddled with rules against vectorizable source syntax (e.g. using Microsoft compilers).
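As a rough sketch of what that looks like in source (function names invented for the example; __restrict and #pragma simd are the Intel/compiler-specific spellings of the hints above):

// Sketch: loops the Intel compiler can auto-vectorize, calling the svml pow
// and using restrict-qualified pointers plus a simd reduction hint.
#include <cmath>

// pow(double, double): consistent argument types, so the scalar call can be
// replaced by its svml vector counterpart during auto-vectorization.
void pow_all(int n, const double* __restrict x, double* __restrict y, double e)
{
    for (int i = 0; i < n; ++i)
        y[i] = std::pow(x[i], e);
}

// A reduction: the pragma asserts that the loop is safe to vectorize and that
// 's' is accumulated as a reduction (Intel-specific pragma of that era).
double dot(int n, const double* __restrict a, const double* __restrict b)
{
    double s = 0.0;
#pragma simd reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}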
MKL level 3 BLAS, e.g. matrix multiplication, might be expected to be well optimized for any case where the smallest dimension is 16 or more (recognizing that only larger problems will benefit from threading at MKL level).
I agree, if you are using MKL matrix multiply in a block solver with such small blocks, you will need to exploit parallelism at a higher level.
mecej4
Honored Contributor III
You are calling the Fortran-77 routines directly. Unless LAPACK95 routines are being called from somewhere else, you could leave out -lmkl_lapack95_lp64 and see no effect.

>When you say call with fortran-77 routines what do you mean?

See the MKL documentation.
Daniel_Dopico
New Contributor I

I read this post by chance. I suppose that, at the present time, the C++ library is ready, but I have to say that I faced exactly the same problem. I had a Multibody Dynamics code in Fortran 77, not general at all, and suddenly I needed to convert it into a general code. I moved to Fortran 90, but as the functionality grew, more and more complexity was needed.

Nowadays, I have a Fortran 2008 library using Fortran object-oriented features (inheritance, procedure and data polymorphism, information hiding, etc.) and I don't find any good reason for rewriting my code in C++ (I would rather do the opposite). Moreover, for some features, like 3D graphics rendering or communication with devices, I use C interoperability to call the Fortran library from my C++ main program, and to call C or C++ functions from my Fortran library or program.

Of course it is always a matter of preference, but I programmed in C++ for years, so I have no prejudice against either of these two languages.
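As a rough sketch of that kind of mixed-language call (all names invented for the example), the C++ side only needs the C-compatible signature that a bind(c) Fortran procedure exposes:

// Sketch: calling a Fortran routine from C++ through C interoperability.
// The Fortran side (assumed) would look roughly like:
//   subroutine step_dynamics(n, q, dt) bind(c, name="step_dynamics")
//     use iso_c_binding
//     integer(c_int), value :: n
//     real(c_double)        :: q(n)
//     real(c_double), value :: dt
//   end subroutine
#include <vector>

extern "C" void step_dynamics(int n, double* q, double dt);  // C-compatible signature

int main() {
    std::vector<double> q(100, 0.0);                          // state vector (size invented)
    step_dynamics(static_cast<int>(q.size()), q.data(), 1.0e-3);
    return 0;
}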
