What are the optimal Fortran XE2020 update 4 compiler settings for AMD Threadripper 3990x on Win10?

rivkin__steve · ‎11-11-2020

What are the optimal Fortran Parallel Studio Pro XE2020 update 4 compiler settings for AMD Threadripper 3990x on Win10?

Steve_Lionel · ‎11-12-2020

Start with /QxHost /O3 /Qipo. This is general advice for any processor. Since you have a high-core-count processor, you'll also want to look at parallelization. You could start with /Qparallel but would probably do better to work in OpenMP. Of course, a lot depends on what your application is doing.

I should add that there is no single "optimal" set of options. You'll have to try different things and see what works best for your application.

rivkin__steve · ‎11-12-2020

I usually start with what works for most of my programs, which is a superset of your starting point (see below), including /QxHost. For this program and many others (though not all), the /QxHost option results in a crash at runtime on AMD Threadrippers.

Also, the /Qipo option resulted in Error 10014: problem during multi-file optimization compilation (code 3).

The program is proprietary and consists of over 100 subroutines/functions/modules.

/nologo /debug:full /MP /O3 /QxHost /Qparallel /heap-arrays1000 /Qopenmp /convert:native /fpscomp:general /stand:f18 /Qdiag-disable:5268,10182,5462 /warn:all /Qtrapuv /fpe:0 /fpconstant /module:"x64\Release/" /object:"x64\Release/" /Fd"x64\Release\vc160.pdb" /traceback /check:none /libs:qwin /Qmkl:parallel /c

below

Steve_Lionel · ‎11-12-2020

Don't use /Qtrapuv - it's worthless. Don't bother setting a value on /heap-arrays, just use the default. /convert:native is the default - why are you specifying it? Using both /Qparallel and /Qopenmp seems problematic to me. And I see no point in /debug:full with /O3, but I suppose it's somewhat harmless.

/QxHost queries the CPU and asks it which instruction sets it supports, and then chooses the best available /arch option. If the CPU is misrepresenting which instructions it supports, that can be a problem. What kind of "crash" is it? I don't know what the Threadripper CPU supports, and AMD's web site isn't helping me find it. You might try experimenting with /arch instead of /QxHost - start with /arch:SSE3, see if it likes that, then try SSE4.1, SSE4.2 or AVX.

I don't know what IPO is complaining about, but try without that for now.

rivkin__steve · ‎11-13-2020

Hi Steve,

We first started communicating back in the late 1990s with DEC's DVF, which got bought out by Compaq and became CVF, which got bought out by HP, after which Intel took over the compiler as IVF. Back around 2003-2005 I helped with the beta testing going from 32-bit to 64-bit and reported about 50 bugs. And in 2008 I helped with ironing out OpenMP (with Richard)... so I'm not sure why the system lists me as "Beginner", but that's OK.

I generally use a prototype project as the starting point for new projects and just edit the program names. I find this approach best especially with a long list of compiler settings and libraries to link in. So it is possible that I migrate settings that have become defaults over the years, etc.

I just ran a dozen experiments with the options you listed and here are my results: when I turn off heap-arrays my program gets a stack overflow error, so I set it back. I removed /Qtrapuv and tried /arch:SSE3, SSE4.1, SSE4.2, AVX, AVX2, and /QaxCOMMON-AVX512 with no speed improvement, but it didn't hang (previously when I mentioned it crashed I meant hung). I then tried /QxHost and now that did not hang, but without any speed increase. As a verification test I added /Qtrapuv back in and it hung again. So /Qtrapuv is causing the problem, and turning off Heap Arrays creates a new problem.

I tried combinations of /Qparallel and /Qopenmp. Without OpenMP the program runs many times slower, probably close to 64 times, which would have taken an hour to run, so I killed it after 5 minutes. /Qparallel doesn't seem to make a difference in runtime.

Also I tried the IPO option again and got the same error. The link time increased from a few seconds to a few minutes using IPO and it consumed 128 GB of RAM.

Steve_Lionel · ‎11-13-2020

You can take HP out of your timeline - HP bought Compaq after the Fortran team moved to Intel.

The reason the link time increased is that's actually when the optimizing happens. What matters more is the run-time. IPO works for many applications, but not all.

Don't use /Qax when you know you'll be running on a non-Intel processor, as you'll get only the "generic" path. What your experiments with /arch tell me is that your application doesn't take advantage of the newer instruction sets. Stick with /QxHost.

The forum software changed recently and it labels as "Beginner" people with few posts, and you have only four. I don't care for the level names the new forum uses.

rivkin__steve · ‎11-16-2020

Hi Steve,

I turned off /Qtrapuv as suggested (which works), but do you know why it resulted in /QxHost hanging with it, but not without it?

Any suggestions on what is causing IPO to exit with error code 3?

Without setting Heap Arrays I get a stack overflow error. I'm using /heap-arrays1000 as a workaround. Is there a better solution?

jimdempseyatthecove · ‎11-16-2020

I would experiment with a /QxHost build, then a /QxAVX2 build. Depending on how "friendly" the Intel compiler is to non-Intel CPUs, /QxHost could potentially fall back to /SSEnnn.

(please report your findings back here)

Jim Dempsey

rivkin__steve · ‎11-16-2020

/QxAVX2 results in my program immediately aborting. /QxHost works but gives no speed improvement over not using it. Full debug doesn't seem to slow anything down, and keeping it means one fewer change when using VTune.

jimdempseyatthecove · ‎11-16-2020

What does /QxAVX do?

rivkin__steve · ‎11-16-2020

It also immediately aborts my program.

jimdempseyatthecove · ‎11-16-2020

If you run in (under) the debugger, what is the instruction at the failure point?

You should get a GP halt at an address, and in the debugger you can open a Disassembly window and go to that address.

Jim Dempsey

Steve_Lionel · ‎11-16-2020

He cannot use ANY of the /Qx options other than /QxHost on a non-Intel CPU - it will immediately fail. Use /arch instead of /Qx.

I don't know why /Qtrapuv causes errors, but I do know that it doesn't do anything useful - it doesn't even do what it says in the manual. Use /Qinit:snan,arrays if you want variables initialized to a NaN.

rivkin__steve · ‎11-16-2020

The suggestion to run in Debug mode was beneficial. I started developing this code about 15 years ago and initially used Debug mode but haven't used it in 10 years. The code has been running fine on Intel CPUs. I uncovered 3 latent bugs: two format/variable-type mismatches and one shape mismatch. I normally develop using Release mode and switch runtime options off or on. I fixed all the bugs. The program no longer aborts, and runs to completion in both Debug mode and Release mode.

Interestingly the Debug mode is faster by about 50%. Here are the compiler switches for Release and Debug modes:

 /nologo /debug:full /MP /O3 /Qparallel /heap-arrays1000 /arch:AVX /Qopenmp /fpscomp:general /stand:f18 /Qdiag-disable:5268,10182,5462 /warn:all /fpe:0 /fpconstant /module:"x64\Release/" /object:"x64\Release/" /Fd"x64\Release\vc160.pdb" /traceback /check:none /libs:qwin /Qmkl:parallel /c

 /nologo /debug:full /MP /Od /Qparallel /heap-arrays1000 /arch:AVX /Qopenmp /stand:f18 /warn:all /fpe:0 /fpconstant /module:"x64\Debug/" /object:"x64\Debug/" /Fd"x64\Debug\vc160.pdb" /traceback /check:pointer /check:bounds /check:shape /check:uninit /check:format /check:output_conversion /check:stack /libs:qwin /dbglibs /Qmkl:parallel /c

jimdempseyatthecove · ‎11-17-2020

You should be using either

/Qparallel (auto parallelization)

or

/Qopenmp (OpenMP directive parallelization)

not both.

Also, if your program is OpenMP, you generally should link with the serial version of MKL.

The parallel version of MKL uses OpenMP internally. Each application thread (in an app using OpenMP) that calls the parallel version of MKL will cause MKL to instantiate an independent OpenMP thread pool for that calling thread, resulting in an over-subscription of n**2 threads.
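To put a number on that over-subscription, here is a minimal Python sketch of the arithmetic. The function name is hypothetical, and it assumes the worst case described above, where every application thread calls threaded MKL at the same time and each pool is sized to the full 64 hardware threads.

```python
def total_threads(app_threads, mkl_threads_per_caller):
    """Worst-case software thread count when every app thread calls MKL at once."""
    return app_threads * mkl_threads_per_caller

hardware_threads = 64  # assumption: one software thread per hardware thread by default

# Parallel MKL called from inside a 64-thread OpenMP region: n**2 software threads.
nested = total_threads(hardware_threads, hardware_threads)

# Sequential MKL called from the same region: one software thread per caller.
serial_mkl = total_threads(hardware_threads, 1)

print(f"parallel MKL: {nested} threads, "
      f"{nested // hardware_threads} per hardware thread")
print(f"sequential MKL: {serial_mkl} threads")
```

Under these assumptions the nested case asks for 4096 software threads on 64 hardware threads, which is why the serial MKL library is the usual starting point for an OpenMP application.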

There are cases where MKL is called only from the serial portion of the application, and you may then wish to use the parallel version of MKL. In those cases, experiment with setting the environment variable KMP_BLOCKTIME=0.

Jim Dempsey

rivkin__steve · ‎11-17-2020

I have been experimenting with many combinations of options, including turning off /Qparallel, MKL lib parallel to sequential, matmul on/off, prefetch insertion level 3 on/off, /Od, /O2, /O3, runtime checking on/off. The program is large with over 100 subprograms. It also contains lots of stochastic processes which makes timing inconsistent. Example: t= 47, 57, 46, 47 sec.
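With timings this noisy, it can help to summarize repeated runs before comparing option sets. The following is only a sketch using the four timings quoted above; how many runs are enough is a judgment call and is not something this thread establishes.

```python
import statistics

times = [47, 57, 46, 47]  # seconds, the four runs quoted above

mean = statistics.mean(times)
median = statistics.median(times)
stdev = statistics.stdev(times)  # sample standard deviation
spread = (max(times) - min(times)) / min(times)  # slowest vs. fastest run

print(f"mean={mean:.2f}s median={median:.1f}s "
      f"stdev={stdev:.2f}s spread={spread:.0%}")
```

The slowest of these four runs is roughly 24% slower than the fastest, so a compiler option that changes the runtime by a few percent cannot be told apart from noise without more runs. The median is less sensitive to a single outlier such as the 57-second run.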

With that said, the best results I got were using the settings from yesterday with both /Qparallel and /Qopenmp, and MKL parallel. I parallelized certain sections of code using OpenMP, coarse-grained where I could. I know I could do more, with more programming time. Would it be beneficial for Intel to enhance the compiler so that it determines the extent of OpenMP sections during semantic analysis, and then blocks those sections off while it auto-parallelizes the rest of the program?

I'm still not seeing the speedup I expected going from an 8-core Intel CPU to a 64-core AMD CPU at similar clock speeds. I know certain parts of the code are serial, so I don't expect much difference there. Maybe it has to do with microcode optimizations, or with problems in the parallelization such as race conditions, deadlocks, etc. I have the Pro edition of XE2020 with VTune, Advisor, etc., but the tutorials with code examples are mostly written for C++ programmers, not Fortran programmers.
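As a rough guide to how much the serial parts limit the gain from 8 to 64 cores, here is a small sketch of Amdahl's law. The parallel fractions are hypothetical examples, not measurements of this program, and the model ignores memory bandwidth, synchronization cost, and clock differences.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only parallel_fraction of the runtime scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for p in (0.50, 0.90, 0.99):  # hypothetical parallel fractions, not measured
    gain = amdahl_speedup(p, 64) / amdahl_speedup(p, 8)
    print(f"parallel fraction {p:.2f}: 8 -> 64 cores gives {gain:.2f}x")
```

Even with 90% of the runtime parallelized, this model predicts less than a 2x gain from the extra cores, so a profile showing where the serial time goes is worth having before tuning compiler options further.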

jimdempseyatthecove · ‎11-17-2020

By 64-core AMD system, do you mean 128 hardware threads (2 threads per core)?

If this be the case, (unless Intel fixed this), on Windows affinity pinning is different than on Linux. Windows uses processor groups of up to 64 hardware threads. Each processor group has but a 64-bit bit mask of logical processors within that group. I haven't tested the latest version of Intel's OpenMP on Windows with a system with more than 64 logical processors. It may be the case that your application is using only one processor group and with 64 threads running in app as well as 64 threads in MKL ... in the same processor group.
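The group arithmetic can be sketched as follows. This only computes the minimum number of groups implied by the 64-logical-processor limit; the actual layout is decided by Windows and should be queried on the real machine. The function and constant names are hypothetical.

```python
import math

GROUP_LIMIT = 64  # maximum logical processors per Windows processor group

def group_count(logical_processors, group_limit=GROUP_LIMIT):
    """Minimum number of processor groups needed to hold all logical processors."""
    return math.ceil(logical_processors / group_limit)

for label, logical in (("2 threads per core", 128), ("1 thread per core", 64)):
    print(f"64 cores, {label}: {logical} logical processors, "
          f"at least {group_count(logical)} group(s)")
```

With 128 logical processors there must be at least two groups, so a runtime that only uses one group would see half the machine. With 64 logical processors a single group can hold them all.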

In any event, have you tried setting the environment variable KMP_BLOCKTIME=0?

IFF (if and only if) Windows OpenMP uses only one processor group, then as a hack (Windows only):

1) leave KMP_BLOCKTIME undefined (default is 200ms)
2) set KMP_AFFINITY=compact, and OMP_NUM_THREADS=64 (or # threads per processor group)
3) At program start prior to first parallel region and prior to first call to threaded MKL...
4) Assuming 2 processor groups (query to test # groups and # threads/group) set the main thread to the lowest numbered processor group.
5) Enter an OpenMP parallel region (it doesn't have to do much) and exit the region. This will establish the application's OpenMP thread pool in the lowest processor group.
6) Set the main thread's processor group to the next processor group.
7) make a call to MKL, this doesn't have to do anything meaningful except to create its OpenMP thread pool. This will establish the MKL OpenMP thread pool in the 2nd processor group. Note, you may also need to set MKL_NUM_THREADS=64 (or #threads in processor group)

You may want to experiment with steps 1-7 above, and also with adding a step that resets the main thread's processor group back to the first group. Note, processor groups are not necessarily 0-based, nor contiguous.

BTW 100 subprograms isn't large.

Jim Dempsey

rivkin__steve · ‎11-17-2020

AMD Threadripper 3990X is 64-cores, 128-threads. KMP_BLOCKTIME=0 did not improve the performance.

rivkin__steve · ‎11-17-2020

Also I have hyper-threading turned off in the BIOS. I was informed by Intel a long time ago to do that when using OpenMP.

jimdempseyatthecove · ‎11-17-2020

Run VTune on your program to see what and where the hangups are.

Note, for any timed section of code, enclose it within a loop of 2 or a few iterations. The initial pass may exhibit more overhead than the 2nd and later passes, due to:

a) Thread pool creation
b) "first touch" of allocated arrays
c) cache preloading
d) ...
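The timing-loop idea above can be sketched like this. It is written in Python only for illustration; in the Fortran program the same structure would wrap the timed section, for example with omp_get_wtime. The names time_passes and sample_work are hypothetical, and sample_work is just a stand-in for the real timed section.

```python
import time

def time_passes(work, passes=3):
    """Run work() several times and return the wall-clock time of each pass."""
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        work()
        timings.append(time.perf_counter() - start)
    return timings

def sample_work():  # hypothetical stand-in for the timed section
    return sum(i * i for i in range(200_000))

timings = time_passes(sample_work)
first, later = timings[0], timings[1:]
print(f"first pass: {first:.4f}s, best later pass: {min(later):.4f}s")
```

Reporting the first pass separately from the later ones keeps one-time costs such as thread pool creation and first touch out of the steady-state number.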

As an experiment:

Turn off (disable) OpenMP and MKL affinity settings (we will work on enabling these later)
Set OMP_NUM_THREADS=32
Set MKL_NUM_THREADS=32

IOW let both thread pools float, and each pool at half the number of available hardware threads.
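One way to set up that experiment without touching the system-wide environment is to build a per-run environment. This is a sketch only: split_thread_env is a hypothetical helper, and it assumes KMP_AFFINITY is the only affinity variable that needs clearing.

```python
import os

def split_thread_env(hardware_threads, base_env=None):
    """Return a copy of the environment giving OpenMP and MKL half the threads each."""
    env = dict(os.environ if base_env is None else base_env)
    half = str(hardware_threads // 2)
    env["OMP_NUM_THREADS"] = half
    env["MKL_NUM_THREADS"] = half
    # Drop the affinity setting so both pools float, as suggested above.
    env.pop("KMP_AFFINITY", None)
    return env

env = split_thread_env(64, base_env={"KMP_AFFINITY": "compact"})
print(env)
# To launch a run with these settings (program name is hypothetical):
# subprocess.run(["myprog.exe"], env=env, check=True)
```

Passing the dictionary to the launched process leaves the shell and Visual Studio settings unchanged, which makes it easier to go back to the previous configuration.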

Make your test run in VTune (if the runtime is longer than 30 seconds or so, use the Stop button to collect the results). Then paste the Bottom-Up screenshot (Alt-PrintScreen).

You may need multiple screenshots to get the threads listing at the bottom.

Jim Dempsey

rivkin__steve · ‎11-18-2020

In my current project, from within Visual Studio, I clicked on "Open VTune Profiler", and in Performance Snapshot I get two error messages stating that VTune cannot recognize the microprocessor. Does anyone in the VTune development group have an AMD Threadripper 39nnX to test this on, possibly one with a lot fewer cores if cost is an issue?