Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Improving performance

Francesco_B_4
Beginner
2,279 Views

I developed a Fortran 77 program, over 10,000 lines long, which performs general engineering calculations and matrix computations. As an example, for a specific problem I am working on, the executable runs in about 5.5 hours. Being an incremental analysis, if I double the input number of increments (i.e. the accuracy of the calculation), the run time doubles (i.e. 11 hours), and so on.

My machine has an Intel Core i7 CPU 920 @ 2.67GHz with 6GB RAM (running MS Windows 7 64-bit), and I currently generate my executable with Intel Visual Fortran 11.1 (with MS Visual Studio 2008). While running the analysis, I noticed that CPU usage is around 12%. I understand that this low CPU usage is mainly because the analysis is not using all the processors.

Apart from modifying/rewriting the code so that it makes more efficient use of the multiple processors (I assume by using the parallelization features of a more modern compiler such as Intel Parallel Studio XE), I am interested in finding a quick way to decrease the run time of my analyses. I have tried a similar Fortran 90 version of the code and also some compiler options (such as /fast and/or /Qparallel), but the run time remains the same.

My questions are:

1) Would Intel Parallel Studio XE 2015 produce faster executables than Intel Visual Fortran 11.1, even if I keep my current code unchanged (i.e. without using the parallelization features)?

2) Would a new machine significantly decrease the run time (again keeping my current code unchanged)? If so, which machine would you suggest? I am wondering whether a machine with a single more powerful processor would be faster than a multi-processor one.

Any help would be much appreciated.

13 Replies
jimdempseyatthecove
Honored Contributor III

If you desire to .NOT. take the effort to parallelize your code, then you need to do whatever you can to improve vectorization. This may require you to look at your data layout, but it may be as simple as using the appropriate compiler switches, and possibly adding a few compiler directives (to improve vectorization). You should perform the vectorization improvements first, as any effort spent here will remain valid when you decide to go parallel. The second non-parallel option you have is to replace your CPU with a newer processor (and motherboard) with faster everything (clock, QPI, RAM, AVX2, more cache). A typical example:

http://www.newegg.com/Product/Product.aspx?Item=N82E16819117369

Core i7-4790K
LGA 1150
Quad-Core/8 Threads
4.0 GHz (4.4 GHz Max Turbo)
L2 4x256KB
L3 8MB
~$340

or newer series value/performance

http://www.newegg.com/Product/Product.aspx?Item=N82E16819117559

I'd suggest the second one. But please look for reviews first. Your old CPU came out in 2008, and it may be time to retire it.

You really should consider looking at parallelizing your code. It is not all that difficult. And there are plenty of readers on this forum that will help you to learn to parallelize. Do the parallelization after you address vectorization issues.
The CPU you have now supports the SSE 4.2 instruction set, whose vector registers are 16 bytes wide (4 floats or 2 doubles). The newer CPUs support AVX and/or AVX2, whose vectors are 32 bytes wide (8 floats or 4 doubles). If (a strong if) your code is vectorizable, you can get more computations done in fewer clock ticks.
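To make that concrete, here is an illustrative sketch of the kind of loop that vectorizes well (the subroutine name is made up; !DIR$ VECTOR ALWAYS is the Intel Fortran hint asking the compiler to vectorize):

```fortran
! Illustrative only: a unit-stride loop with no cross-iteration
! dependencies. Under SSE 4.2 each vector instruction handles
! 2 doubles; under AVX, 4 doubles.
subroutine saxpy_like(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
!DIR$ VECTOR ALWAYS
  do i = 1, n
    y(i) = y(i) + a*x(i)   ! contiguous access, independent iterations
  end do
end subroutine saxpy_like
```

Check the /Qopt-report output to confirm the compiler actually vectorized the loop rather than taking it on faith.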

Jim Dempsey


TimP
Honored Contributor III

ifort XE2016 has improved the auto-parallel option /Qparallel, but the gains there still depend strongly on the characteristics of your application.  Once you know where your application spends its time, the information in /Qopt-report should help you figure out whether those parts of your application are adequately optimized by the compiler.  The Advisor application in Parallel Studio XE2016 is a convenient way to present this information, but you must be prepared to work at using VTune and Advisor to justify the extra price.

"12% utilization" probably corresponds with using only 1 of 8 hardware threads.  You aren't likely to get it up to "50%" (leaving HyperThreading enabled) even if your application is perfectly parallelizable.

If you spend significant time in operations which can be handled by the MKL performance library, that is the quickest route toward using your 4 cores.  If you use the equivalent of MATMUL, you could write it explicitly and try the relevant compile options /Qopt-matmul (invoking threaded MKL support) and "/Qopt-matmul- -O3" (invoking single-threaded in-line optimization).

In order to get a significant gain in performance per core from a new platform, you need to make effective use of AVX2 vectorization.  If you don't see a gain now from /Qvec- to the default /Qvec, this will not happen unless you can identify important code sections and vectorize them. Typical gain in performance per thread from SSE2-vectorized code to AVX2 is about 40%.  Remembering Amdahl's law, if you speed up parts of your application which currently take 50% of the time by 4x (certainly possible if they are vectorizable or parallelizable), you save less than 40% of the time you spend now.
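As a worked check of that figure: by Amdahl's law, speeding up a fraction p = 0.5 of the run time by a factor s = 4 gives

```latex
T_{\text{new}} = T_{\text{old}}\left[(1-p) + \frac{p}{s}\right]
              = T_{\text{old}}\left[0.5 + \frac{0.5}{4}\right]
              = 0.625\,T_{\text{old}},
```

i.e. a saving of 37.5% of the total time, just under 40%.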

jimdempseyatthecove
Honored Contributor III

You could also consider: http://www.newegg.com/Product/Product.aspx?Item=N82E16819117499

E5-1650 v3

While it is more expensive, it has almost twice as much L3 cache (15MB).

Jim Dempsey

Francesco_B_4
Beginner

Thank you Tim and Jim for your feedback. I have just one question for Jim: the latest Intel processor you suggest (E5-1650 v3) is actually a server part; could you suggest a corresponding/similar desktop processor?

I understand that a frequency-optimized processor is ideal in my case, given that mine is (currently) a non-parallel, single-threaded application, so I need a CPU with the highest clock speed (sacrificing the number of cores in order to get the highest frequencies).

Thank you.


andrew_4619
Honored Contributor III

You said you tried /Qparallel; if you haven't done so already, I suggest:

/Qparallel /Qipo /Qopt-report:3, and in VS look at the source with the optimisation reports overlaid for some key areas of the code. This will tell you the degree to which you get auto-parallelisation and vectorisation, and might show some easy gains that can be made....
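A hypothetical command-line invocation with those flags (the source and executable names are made up; adjust to your project, or set the equivalent options under Project Properties in VS):

```shell
ifort /Qparallel /Qipo /Qopt-report:3 mysolver.f /exe:mysolver.exe
```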

jimdempseyatthecove
Honored Contributor III

Don't be fixated on Clock Speed alone.

               E5-1650 v3  Core i7-6700K
Clock          3.5/3.8      4.0/4.2
L3             15 MB        8 MB
Mem Channels   4            2
Mem Bandwidth  68 GB/s      34.1 GB/s

While the clock is about 14% faster on the Core i7-6700K, its memory bandwidth is half that of the E5-1650 v3, and its L3 cache is about half as well.

If your application is memory-bandwidth limited on your existing system, then it will be memory-bandwidth limited on the i7-6700K (similar L2 and L3 cache sizes). While it is true the i7-6700K will be significantly faster than your existing system, if your application is memory-bandwidth limited, then the E5-1650 v3 will outperform the i7-6700K. Only when your entire working set fits within the 8MB L3 of the i7-6700K will it outperform the E5-1650 v3.

Consider that you will eventually migrate to using parallel coding (when you get tired of waiting 2 hours for your program to finish), having the extra 2 cores will help then.

Jim Dempsey

Francesco_B_4
Beginner
I am pleased to report that I get nearly a 3x speed-up with my new machine, which has an Intel Core i7-6700K 8M Skylake Quad-Core 4.0 GHz, compared with my old machine with the Intel Core i7 CPU 920 @ 2.67GHz. What used to run in over 13.5 minutes now runs in 5.0 minutes. Thank you all.
John_Campbell
New Contributor II

Even though your code is over 10,000 lines long, you should profile where the computation time is mostly being spent. You will probably find that there is one routine where most of the time goes. This is where you should focus your attention. As others have recommended, your key options are to:

  1. Ensure you select SSE or AVX vector instructions for these inner loops,
  2. Try to enclose these loops in /Qparallel auto-parallel regions or !$OMP directives, and
  3. Review the memory access pattern of the arrays you are addressing, to keep as much of each array in the processor cache for as long as possible.

It is this third point that is important, as Jim has noted with regard to memory bandwidth limits. Poor memory access patterns for problems larger than the cache size can nullify vectorization or parallelisation. Try to address memory sequentially, or use MKL routines, which are optimised for this. You can even partition the matrices into cache-sized blocks to improve performance.

/Qparallel (or !$OMP) works best if each thread changes different values, so this may require some changes to the arrays you use, or selecting the outer DO loop to suit the parallel computation.

Vectorise the inner loops and enclose the outer loop in !$OMP
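As a hypothetical sketch of that advice (the subroutine is made up): parallelize the outer loop over columns so each thread writes different values, and keep the inner loop on the first (fastest-varying) index so the accesses are unit-stride and vectorizable. Compile with /Qopenmp.

```fortran
! Illustrative only: outer loop split across OpenMP threads, inner loop
! unit-stride in Fortran's column-major layout so it can vectorize.
subroutine scaled_add(n, alpha, a, c)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: alpha, a(n,n)
  real(8), intent(inout) :: c(n,n)
  integer :: i, j
!$OMP PARALLEL DO PRIVATE(i)
  do j = 1, n            ! each thread owns a distinct set of columns
    do i = 1, n          ! first index varies fastest: contiguous access
      c(i,j) = c(i,j) + alpha*a(i,j)
    end do
  end do
!$OMP END PARALLEL DO
end subroutine scaled_add
```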

Udit_P_Intel
Employee

You might even try VTune to profile your code and identify hotspots - https://software.intel.com/en-us/intel-vtune-amplifier-xe/ 

Francesco_B_4
Beginner

Good news - I found that 95% of the running time was taken by the routine performing the LU factorization of a large matrix (in this case a 6000x6000 matrix). I therefore replaced it with the routine SGESV provided in the MKL library (this routine performs the LU factorization and then also computes the solution) and got nearly a 50x speed-up in performance! CPU usage increased from 12% to around 50%, so the four cores are used much more efficiently. I wish I had thought of this before! Thank you all for your very helpful suggestions!
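For readers following along, a minimal sketch of how SGESV is called (standard LAPACK interface, as shipped with MKL; the 4x4 test system and diagonal boost are made up just to keep the example well-conditioned). Link against MKL, e.g. with /Qmkl.

```fortran
program sgesv_demo
  implicit none
  integer, parameter :: n = 4
  integer  :: ipiv(n), info, i
  real(4)  :: a(n,n), b(n)
  call random_number(a)
  call random_number(b)
  do i = 1, n
    a(i,i) = a(i,i) + n    ! diagonally dominant => well-conditioned
  end do
  ! Solves A*x = b; A is overwritten by its LU factors, b by x.
  call sgesv(n, 1, a, n, ipiv, b, n, info)
  if (info /= 0) stop 'SGESV failed'
  print *, b
end program sgesv_demo
```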

Another question... In my case, routine SGESV (single precision) is about 2x faster than routine DGESV (double precision). Which one should I use if the matrix gets larger, say 20,000x20,000, or even larger, say 50,000x50,000?

TimP
Honored Contributor III

You may want to check your larger cases (e.g. by comparing with dgesv) to see whether sgesv gives satisfactory accuracy.

jimdempseyatthecove
Honored Contributor III

The degree of error will depend upon the magnitude and direction of the errors in the products and sums of products. Best case is no error, but this would require no round-off error in any of the products and any of the steps of the summation. Worst case (for each output cell), you might get half a bit for the product plus half a bit for the partial sum (say 1 bit) times 50,000 cells in a row/column, or a worst case of ~16 bits of error. The mantissa of single precision is 23 bits, leaving a worst-case precision of (23-16) ~7 bits, or about 2 decimal digits.

I agree with Tim. Run a significant number of test cases (with representative data) to determine what is required.
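One hypothetical way to run such a check (the program and test matrix are made up): solve the same system in both precisions and compare the solutions. Link against MKL or another LAPACK.

```fortran
program precision_check
  implicit none
  integer, parameter :: n = 1000
  integer :: ipiv(n), info, i
  real(4) :: as(n,n), bs(n)
  real(8) :: ad(n,n), bd(n)
  call random_number(ad)
  call random_number(bd)
  do i = 1, n
    ad(i,i) = ad(i,i) + n          ! keep the test system well-conditioned
  end do
  as = real(ad, 4)                 ! same data, demoted to single
  bs = real(bd, 4)
  call sgesv(n, 1, as, n, ipiv, bs, n, info)
  call dgesv(n, 1, ad, n, ipiv, bd, n, info)
  print *, 'max relative difference:', &
           maxval(abs(real(bs,8) - bd)/max(abs(bd), 1d-30))
end program precision_check
```

If the printed difference is acceptable for representative problem sizes and data, SGESV's 2x speed advantage is worth keeping; otherwise switch to DGESV.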

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Tim,

Is there an MKL function that takes single-precision input and performs a matrix multiplication by computing the dot products in double precision (converting into temporary arrays), then converting the result back to single precision? While this would be slower than a single-precision matrix multiply, it would certainly be better optimized inside MKL than it would be by having the user convert the single-precision arrays into temporary doubles.

Jim Dempsey
