In a project I was assigned to investigate performance problems of a Fortran 90 program with OpenMP.
The initial situation at the company was that the computation time of the programs doubled after the change from Intel Parallel Studio XE 2015 to XE 2018. The hardware was changed at the same time; at the moment I do not know the details of that hardware difference.
I have now tried to optimize the computing time via compiler settings, and at the same time we switched to the latest oneAPI compiler (2021.1).
Now I see an effect that I cannot interpret:
I made comparisons between the server and my HP laptop.

Laptop:
- Intel® Core™ i7-8750H processor
- 12 logical processors
- base clock frequency 2.20 GHz
- 12 GB RAM

Server:
- Intel® Xeon® E5-2670 v3 processor
- 48 logical processors
- base clock frequency 2.30 GHz
- 128 GB RAM
In both cases the RAM is considerably larger than what the program needs.
Both machines run Windows 10, and the program uses all logical processors.
Computing time is 19 min for the laptop and 22 min for the server.
I actually expected the computation time to be significantly less on the server.
Question: does anyone have an idea where the cause might lie?
Thanks for your support.
Rather than guess, or have others guess, fire up VTune and run the program through that. It should highlight where time is being spent. It could be that the time is dominated by I/O.
Thanks for the reply, and sorry for the unclear question. I just wanted to ask whether anyone has an explanation for the small performance difference between the two computers. I had assumed that the server's processor would be considerably better than my laptop's.
Here are the compiler options I use. They are the same on both computers.
SET "Compiler_Options= /O3 /Qopenmp-simd /Qvec-threshold0 /Qparallel /Qopenmp /Qautodouble /reentrancy:threaded /QxHost /assume:buffered_io /align:dcommons /object:Test2.obj /MT /exe:FTN_END02_20Kerne.exe"
Thanks for any hints.
My reply is the same - the difference may not relate to the CPU speed or RAM amount. Knowing nothing at all about your application, I refuse to guess, and you should as well. The tools are there to be used.
The other issue is that one should never assume the execution time of a program is constant, even on the same computer. It is drawn from a distribution with many variables, and for short processes (< 30 seconds) that distribution is clearly not Gaussian. If you are interested, the easiest way is to add a MySQL component and record the cycle times. I have data sets of 5 million points with this data embedded; I never assume anything, which is why I know it is not Gaussian. There is an old saying in physics: never trust a theoretical engineer, and always trust someone who measures the data, preferably someone who can analyze the data with Fortran rather than, say, MATLAB or Python.
Hi Steve and John,
First I have to say something about my questions:
My assignment is to reduce the computation time by about 50% purely by optimizing the compiler options; the company says the program used to be that fast. For now I am not supposed to touch the source code at all, and I am still far away from the 50%.
I have now looked at the program execution with VTune, as Steve advised.
Here are my results:
I interpret the results as showing that the program is poorly parallelized: most of the CPU time is spent in kmp_barrier.cpp, which belongs to the OpenMP runtime and which I cannot change.
So I don't think much can be achieved with compiler options here.
At the moment I am analyzing the program with Intel Advisor.
@John: for scientific, complex numerical computations, Fortran is still my first choice!
The PassMark scores of the two processors are close, so there is not likely to be a statistically significant difference between 19 and 22 minutes, and one wonders whether compiler options will help; the answer is: not likely.
Rewriting it in anything other than Fortran is unlikely to improve its speed and is rarely worth the effort, although going the other way, into Fortran, can be.
If I were doing it, I would:
1. Get a better CPU (both of yours are fairly old) and better SSDs; this will help if your current drives are old, poor ones. A day of your time is worth 1000 USD, and a better desktop computer costs about 2000 USD.
2. Look at the code and break it into genuinely distinct modules, then run them on different threads explicitly. This only works if your program has independent parts; it works on my large analysis program: separate out the data stages, slice the program, and run each piece on its own thread on a different core.
3. Praying, I find, sometimes helps.
Your VTune reports indicate that the generated code is predominantly scalar (meaning it does not use SIMD vector instructions). Whether your problem is amenable to vectorization (with code changes) is unknown, since we know nothing of the problem itself. The large percentage of time spent in _kmp_hyper_barrier_release could, without further investigation, potentially be caused by:
a) excessive use of ATOMIC and/or CRITICAL
b) unbalanced workloads amongst threads
Both of these can result in: more threads == poorer performance
On the server, I suggest using the OpenMP environment variable settings to experiment with:
a) Selecting 1 thread per core
b) Reducing the thread count
Manipulate both to find the best spot.
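On Windows, these experiments can be run from the command prompt before launching the program. The thread count of 24 below is just the server's physical core count (2 sockets × 12 cores on the E5-2670 v3) and is a starting point to vary, not a recommendation:

```bat
:: Try one thread per physical core instead of one per logical processor
set OMP_NUM_THREADS=24
:: Pin threads and spread them across cores (Intel OpenMP runtime)
set KMP_AFFINITY=granularity=core,scatter
FTN_END02_20Kerne.exe
```

Repeat with smaller values of OMP_NUM_THREADS (12, 8, 4, ...) and compare run times; with heavy barrier overhead, fewer threads often win.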
You really need to consider reworking the code for better efficiency (keep the current code for generating reference data).
If this code is a money maker for your company, you should consider getting professional help.
you should consider getting professional help
What Jim is trying to say is this board can be considered to fall into the category of the Pros from Dover, most able to shoot 15 over par at Goose River Golf course on their best day, but otherwise average 20 over.
But he is correct, that is a challenging project and good luck.