So I just upgraded this week from Intel Visual Fortran 2017 to 2018. I have each on different computers and the compiler settings are identical (as far as I can tell).
I am compiling a large numerical simulation software package and have two simulation domains I run it on. The first takes the same amount of time to complete the simulation with the 2017 and 2018 versions. The second takes about 3 times longer with 2018. Is there a default optimization that has changed (such as the loop unroll count), or any way that I can get the same speed with 2018 as I do with 2017?
A ratio of 3x slower can be accounted for by:
code that was vectorized before is no longer vectorized
You've enabled run-time checking such as array bounds checking
Or possibly v17 is using fast intrinsic functions (sqrt, sin, cos, tan, ...) while v18 is using more precise versions
Tim P's suggestion may lead you to the cause; knowing the cause, you may be able to fix it (e.g. adding SIMD directives when safe, or using different options to override changed defaults).
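For example, if the optimization report shows a loop that 2017 vectorized but 2018 does not, and you know the iterations are independent, a directive can restore vectorization. A minimal sketch (the subroutine and names here are illustrative, not from your code):

```fortran
! Sketch: forcing vectorization of a loop the newer compiler refuses.
! Only safe when the iterations are truly independent.
subroutine axpy(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
  !$omp simd               ! or !DIR$ SIMD with Intel-specific directives
  do i = 1, n
     y(i) = y(i) + a * x(i)
  end do
end subroutine axpy
```

The compiler will then report in the opt-report whether the directive took effect, which also tells you why it refused in the first place.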
OK, I will try the opt-report (level 4). The problem is there are about 100 source files and the program itself is about 300,000 lines. Is there an easy way of comparing the two compilers' output other than a separate report for each source file?
I suspect that loops which were being vectorized or unrolled no longer are. I just wish there were a readme that explained what changes were made to /O2.
Another issue I have is that sometimes when I use IPO (single-file or multi-file) I end up getting compiler errors. Is it worth using? From my preliminary tests, it does not seem to improve speed much.
Regrettably, my weakest area is optimization.
Here is one more factor to consider. If the simulation involves an iterative procedure, and the number of iterations required to satisfy a convergence criterion is not the same for the compile-link-runs from the two compiler versions, look for issues at the algorithmic level rather than at the compiler optimizations level.
Similarly, if arrays are being overrun or undefined variables are being used, and the calculation speed is affected by these errors, the problem may be blamed on the compiler version instead.
Good idea, I had the same thought, but the solver does the same number of iterations, they just take longer.
Thankfully I have 2017 on another computer, so for that model I am just compiling the exe with 2017 till I figure this out.
Not sure why it would be slower; I wish there were a simpler way to find out which optimizations are not being used.
It is surprising that you have two simulation "domains" on the one computer, but only one exhibits the problem.
Presumably, both simulations were rebuilt using 2018 on the one computer, but how different are the "domains" to explain the source of the problem?
I am assuming that the simulations are related and so have similar Fortran coding approaches.
That one simulation confirms 2017 and 2018 can have similar performance on the computer; this must limit the possible causes.
Is the problem the domain settings and not the compiler?
I will be interested to know what is the cause.
You might be misunderstanding. I am doing all the simulations on the same computer; I just have both Intel Fortran 2017 and 2018 installed. I have a Visual Studio project with the same settings (e.g. /O2, no IPO). What is also strange is that compiling with /O3 hangs on Intel 2018, but compiles fine on 2017.
When I run the simulation software on two different model inputs (they simulate different hydrologic settings in the United States), call them Model A and Model B, the total run time differs.
For example, Model A takes 15 hours to run with 2017, but with 2018 it takes 15.5 hours... which is acceptable.
Model B takes 9 hours to run with Intel 2017, but takes about 28 hours with 2018, which is not acceptable for our field of simulation models.
There are different simulation features that are used between Model A and B (for example one may have precipitation while the other does not) and other features such as one has a more complicated stream network compared to the other.
Overall I cannot figure out why it takes longer or how to fix it. Currently I am just making a special compilation with Intel 2017 for Model B, but that is not a long-term solution.
From time to time there are performance regressions in new versions (That's the price we sometimes pay for cutting edge Fortran capability).
I had a similar experience going from 2017 to 2018: a regression in the stream I/O runtime caused codes doing significant I/O through streams to become slower, and switching to another file mode resolved the issue.
Also, I recently found that codes with a significant amount of on-the-fly output have become slower due to the (default) "unbuffered" file mode. Switching those output files to "buffered" (i.e. open(..., buffered='yes')) improved performance significantly in 2018 (vs. 2017).
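For reference, switching an output file to buffered mode looks like this. BUFFERED= and BLOCKSIZE= are Intel Fortran extensions to OPEN; the unit number, file name, and buffer size below are placeholders:

```fortran
! Sketch: opening a frequently written output file in buffered mode.
! BUFFERED='YES' enables runtime buffering of writes; BLOCKSIZE sets
! the internal buffer size in bytes (both Intel extensions).
open(unit=30, file='run.log', form='formatted', status='replace', &
     buffered='yes', blocksize=32768)
```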
As others have suggested, use VTune on each version, preferably on the same system, one after the other. Then look at the bar charts for differences (e.g. percent of total runtime).
Because the model takes 15+ hours to run, set the VTune collection time to a reasonable number such that any initialization time is "washed out"; perhaps 3 minutes would be enough. You should be able to notice which subroutines/functions move up a level in V18, or, lacking movement, whose percent of total runtime increases in V18.
Is there some reason why you haven't run VTune?
Suggested reading: Peel the Onion
Although this article doesn't specifically address your code, it does exemplify the significant returns that programming effort can yield.
You do have the option of using V2017 until some future version improves the situation; however, I suspect that there is something fundamentally imprecise in your source code that accidentally allows V17 to vectorize some code but forces V18 to generate scalar code instead. Using VTune to identify the cause may not only help you get V18 running at equivalent performance, but may also improve the performance of V17.
Jim raises a good point about VTune.
If you have access to VTune, you should start there. Accurate profiling will tell you where the time is going (and where it is not), and the results can often surprise you; some performance issues defy intuition.
The stream IO issue I ran into first manifested itself as an unexpected difference between wall time and CPU time. (CPU time was < 50% of wall time). This "out of process" time loss was something I had never seen, and only after performing a "locks and waits" analysis with VTune did I realize the time was lost in the runtime library (for stream IO).
I will see what I can do with VTune. I have not used it because I do not know how to use it.
Also, the code itself is at least 500,000 lines, so I would rather not have to dig too deeply into altering sections that are 20+ years old.
Previously it has relied on regular optimizations and ran fine.
There is a lot of unformatted stream I/O output, but it all uses the BUFFER= option (I think it's set to 32 KB). The total output after a simulation is around 100 GB of both binary and text files.
I will see what I can figure out with VTune
Okay - VTune is a great tool, but it does take a little practice. If you are ultimately responsible for the performance of your application (sounds like you are), then I highly recommend putting VTune on your learning list - it will serve you well in the long run.
For now, perhaps the stream I/O you are using is being affected by this regression. My issue was with formatted stream I/O, but I imagine unformatted is affected by this 2018 performance regression as well.
I don't think the writing of data at the end of the run is necessarily the primary issue, but if you can add code to time the writing of the output, you may see some correlation; 100 GB is a lot of data, so that stage could be affected. Check out the SECNDS() function in the Intel docs: just call it twice, once before the write and once after, and that will give you the elapsed time.
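A minimal timing sketch with SECNDS (an Intel Fortran intrinsic returning single-precision seconds since midnight, minus its argument; the unit number and array name here are placeholders):

```fortran
! Sketch: bracketing a large output write with SECNDS to measure
! elapsed wall-clock time for the I/O stage.
real(4) :: t0, elapsed
t0 = secnds(0.0)              ! time-of-day before the write
write (20) results            ! the large output write (placeholder)
elapsed = secnds(t0)          ! elapsed seconds since t0
print '(a, f10.2, a)', 'output stage took ', elapsed, ' s'
```

Printing one such line per output stage makes it easy to compare the 2017 and 2018 builds stage by stage.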
If you also write (to stream I/O) during the run, then maybe you could test by temporarily disabling all of the write routines during the run to see whether the overall wall time drops significantly.