Hello,
After a system freeze corrupted some files, I decided to move from parallel_studio_xe_2016_update2 to the current versions of the Fortran compiler:
1. l_BaseKit_p_2024.1.0.596_offline
2. l_HPCKit_p_2024.1.0.560_offline.
The problem is that the same code (F90; ZHEEVD is the most expensive part), on the same machine (i7-12650H), with the same flags, now takes ~20% longer to complete the same computation than it did with the 2016 compiler this morning.
The flags I use are:
ifort -heap-arrays -xHost -O3 -falias -pad -unroll3 -funroll-loops -parallel -i8 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -mcmodel=large -traceback -C
This is quite surprising: I expected the computation time to shorten with the current compiler, which should take better advantage of a modern CPU architecture.
Do you have any hints on how I can achieve at least the same performance that I had with the 2016 compiler?
Thanks!
I suggest you run VTune on the program. We do not know if the issue is related to the code generation of your source files or within the MKL runtime system.
If it is in your code, then analysis of the slow sections may point to a simple resolution.
Also, if you can run VTune on your old setup and compare the performance data this would help.
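As a rough sketch of the profiling step suggested here, a hotspots collection from the command line could look like the following. The executable name `./my_program` and the result directory name are placeholders, not taken from the original post. The script assumes the `vtune` command of the current toolkit is on the PATH; the 2016 suite shipped its command-line collector under the name `amplxe-cl`, so the old setup would need that name instead.

```shell
#!/bin/sh
# Sketch only: collect and summarize a hotspots profile with VTune.
APP=./my_program              # hypothetical executable name
RESULT=vtune_hotspots_new     # hypothetical result directory name

if command -v vtune >/dev/null 2>&1 && [ -x "$APP" ]; then
    # Collect a hotspots profile of the application.
    vtune -collect hotspots -result-dir "$RESULT" -- "$APP"
    # Print a text summary of the collected result.
    vtune -report summary -result-dir "$RESULT"
else
    echo "vtune or $APP not found; skipping collection"
fi
```

Running the same collection on both installations and comparing the top hotspots should show whether the extra time is spent inside MKL (ZHEEVD) or in the compiled Fortran code.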
An unfounded guess at a possible cause: using "-parallel" to auto-parallelize some loops may be causing oversubscription with the MKL threaded library. When (if) you use threading in your application (either auto-parallelization with -parallel or OpenMP directives) together with the MKL threaded library, you (usually) must be very careful with thread placement through the use of environment variables. Failure to do so can result in oversubscription and/or other conflicts. (VTune can expose oversubscription issues.)
Jim Dempsey
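One way to test the oversubscription guess above is to keep the compiler-parallelized loops serial while MKL stays threaded, and compare the run time against the default. The sketch below uses the standard `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `MKL_DYNAMIC` and `KMP_AFFINITY` environment variables; the thread count of 6, the affinity policy and the executable name `./my_program` are assumptions for illustration and should be adapted to the machine.

```shell
#!/bin/sh
# Sketch only: separate the thread count of the application's own
# parallel loops from the thread count used inside MKL.
NTHREADS=6                          # assumption: number of threads MKL may use

export OMP_NUM_THREADS=1            # -parallel / OpenMP regions in your own code run serially
export MKL_NUM_THREADS="$NTHREADS"  # takes precedence over OMP_NUM_THREADS for MKL calls
export MKL_DYNAMIC=TRUE             # let MKL lower its thread count when it sees fit
export KMP_AFFINITY=granularity=fine,compact   # assumption: one possible pinning policy

# Hypothetical executable name; run it only if it exists.
if [ -x ./my_program ]; then
    ./my_program
fi
```

If the run time with these settings matches or beats the default run, the auto-parallelized loops are a likely suspect, and VTune's threading data can confirm it.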
You have a number of optimization options. The newer compiler with -xHost will probably target AVX-512 if the CPU reports support for it, and AVX-512 code can run slower than AVX2 or older SIMD instruction sets. I'd peel back the options and start with a series of tests:
-heap-arrays -O2 -falias -pad -i8
then
-heap-arrays -O2 -xsse3 -falias -pad -i8
then
-heap-arrays -O2 -xcore-avx2 -falias -pad -i8
then
-heap-arrays -O2 -xcore-avx512 -falias -pad -i8
Then you can play with -parallel, and then the loop unrolling, which I doubt helps, but you can prove me wrong.
And are you sure you need -falias?
As for -pad, I'd lose that and use -align array64byte instead. Try that.
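The test series above can be scripted so that each flag set is built and timed the same way. This is a minimal sketch: the source file name `prog.f90` and the `test_<name>` output names are hypothetical, the MKL link flags are copied from the original post, and timing uses whole seconds from `date`, which is coarse but portable. Note that a binary built with an `-x` option checks the CPU at startup, so the `-xcore-avx512` build will refuse to run on a processor without AVX-512.

```shell
#!/bin/sh
# Sketch only: build and time one binary per architecture flag.
SRC=prog.f90    # hypothetical source file name
COMMON="-heap-arrays -O2 -falias -pad -i8"
MKL="-lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread"  # from the original post

# name:flag pairs; an empty flag means "no -x option".
VARIANTS="baseline: sse3:-xsse3 avx2:-xcore-avx2 avx512:-xcore-avx512"

build_cmd() {
    # $1 = variant name, $2 = optional architecture flag
    echo "ifort $COMMON${2:+ $2} $SRC $MKL -o test_$1"
}

for v in $VARIANTS; do
    name=${v%%:*}
    arch=${v#*:}
    cmd=$(build_cmd "$name" "$arch")
    echo "$cmd"
    # Compile and run only when the compiler and the source are present.
    if command -v ifort >/dev/null 2>&1 && [ -f "$SRC" ]; then
        if $cmd; then
            start=$(date +%s)
            "./test_$name"
            end=$(date +%s)
            echo "$name: $((end - start)) s"
        fi
    fi
done
```

Once the fastest instruction-set variant is known, the remaining options (-parallel, unrolling, -align array64byte in place of -pad) can be added back one at a time by extending `COMMON`.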