Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

What are the optimal Fortran Parallel Studio Pro XE2020 update 4 compiler settings for AMD Threadripper 3990x on Win10?



Black Belt

The last message was intended for Steve...

*** 32-bit or 64-bit was immaterial

The sample code, though it was compiled as 32-bit (default project is 32-bit and I was too lazy to make another run), does illustrate a way for you to correct the warnings.

The sample code illustrates that (at least when not requesting standards conformance, which I did not test) you can use SEQUENCE followed by UNION (MAP...END MAP) within a user-defined TYPE.

The warning message you see is a result of the module (source available) using a STRUCTURE as opposed to a TYPE, where the comment reads "because UNION not supported by TYPE". My code example illustrates that this is not the case (I did not request standards warnings/enforcement).

This means you can make a copy of the offending source(s) and replace the STRUCTURE with a TYPE and thus eliminate the warning about the argument not being a TYPE.

You can do that .OR. simply ignore the warning.

Jim Dempsey

Black Belt Retired Employee

The version I attached earlier should compile without warnings if you don't ask for standards checking. I consider it a compiler bug that standards checking complains.

I'm not planning to do more here. Jim had done a test with a 32-bit compile and was limited to 32 threads. If you're building for 64-bit, Jim verified that you could get 256 threads.

Black Belt

I am also constructing an example (from your test code) to illustrate how to incorporate the multi-threaded MKL library together with an OpenMP program. Please be patient, as I intend the example to be constructive. IOW it will not be targeted to a specific machine (e.g. yours), but rather be informative such that the reader can grasp how not to do things, as well as how to investigate the best means for their system together with the problem at hand. I will use your test program and figure out how to configure a multi-tier OpenMP app together with multi-threaded MKL.

If you hunt on this forum, you will find that a few years ago I assisted someone in doing just this for his application.

The exploratory example I intend to write will be tested on my KNL system running Windows 10 Pro. The KNL is configured with 64 cores, 256 HW threads, and memory configured as 4 NUMA nodes. IOW, while this is a single-socket system, it will appear as 4 "sockets", each with approximately 16 cores and 4x that many HW threads. I say approximately because the KNL series CPU is (was) etched with more than 64 cores, with (I am guessing here) 18 cores per quadrant, meaning there are 2 spares per quadrant. Some well-binned CPUs have 72 cores. It is conceivable that a 64-core KNL could have quadrants with 14, 18, 16, and 16 cores.

Jim Dempsey


Thanks Jim. I wasn't sure if you guys wanted me to re-run ProcessorInfo, since it crapped out before generating any meaningful results. I was expecting you guys to fix it, along with the IVF compiler complaints, so I could get you meaningful results. I don't plan on doing any more with ProcessorInfo.

In your test program, could you include coarse-grain OpenMP parallelization, whereby one creates a parallel do-loop that loops on a subprogram that contains allocatable arrays, allocates them, and then deallocates them at the end? Back in 2008 Martyn at IVF suggested I change all of my static arrays to allocatable arrays. Doesn't the compiler have to internally perform a lock to determine if the memory address is occupied by any thread during allocation? With 64 threads (an order N**2 process) this seems like it could be a huge bottleneck that will only get worse as core counts continue to increase.
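To be concrete, the pattern I have in mind is something like this minimal sketch (the subroutine name, loop count, and array size are hypothetical, just to show where the per-thread allocate/deallocate would occur):

program coarse_grain
    use omp_lib
    implicit none
    integer :: i
    !$omp parallel do schedule(dynamic)
    do i = 1, 64
        call worker(i)
    end do
    !$omp end parallel do
contains
    subroutine worker(i)
        integer, intent(in) :: i
        real, allocatable :: a(:)     ! local, so private to each thread
        allocate(a(1000))             ! every thread hits the heap here
        a = real(i)
        ! ... do the real work here ...
        deallocate(a)                 ! and again here at the end
    end subroutine worker
end program coarse_grain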

Black Belt

>> I was expecting you guys to fix it along with IVF compiler ...

I am not an Intel employee (neither is Steve). You have too high expectations of this forum.

Your problem is that MKL is designed for use in 3 ways:

Single threaded app + single threaded MKL
Single threaded app + multi-threaded MKL
Multi-threaded app + single threaded MKL

What you want is:

Multi-threaded app + multi-threaded MKL

In order to do this, you must make additional effort to assure that the several thread pools do not conflict, and that they are optimally placed. The "standard" environment variable settings (as well as pre-first-parallel-region omp_... runtime calls) are not set up to do what you want them to do. To get what you want will take some finesse.

Allocatable versus static allocation is not the issue. At issue is where the threads are placed so they do not conflict.

Taking your system (assuming HT disabled, though it would be preferable to have it enabled) with 64 HW threads, one possible configuration would be:

App with 4 OpenMP threads + 4 instances of MKL (one per app thread), each with 16 threads. OR
App with 8 OpenMP threads + 8 instances of MKL (one per app thread), each with 8 threads.

IOW the total number of active threads should not exceed the total number of hardware threads.

Two problems with "standard" approaches:

No affinity pinning: threads may get scheduled on the same hardware thread (not good).

Affinity pinning (KMP_AFFINITY=scatter) may end up placing the app's OpenMP threads scattered as desired, *** but may restrict each of those threads to a single hardware thread (as opposed to 1/4th or 1/8th of the HW threads). Thus, when each MKL instance initializes its OpenMP thread pool (to 16 or 8 threads), the threads of each MKL pool will be constricted to the permitted (pinned) HW threads of the parent thread, in this case to 1 thread. This is much worse than "(not good)": e.g. running 64 software threads on 4 hardware threads (or 64 SW threads on 8 HW threads).

What I've said above is not to say that there is no official way of doing this; but, not knowing of one, I will have to resort to some programming gymnastics to get what I (you) want.
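As a sketch of the kind of gymnastics I mean (untested here; the 4x16 partitioning and the environment settings are illustrative, not a recipe):

! Environment (illustrative):
!   set OMP_PLACES=cores
!   set OMP_PROC_BIND=spread,close
program nested_mkl
    use omp_lib
    implicit none
    integer :: old
    integer, external :: mkl_set_num_threads_local
    call omp_set_max_active_levels(2)       ! permit the nested (MKL) level
    !$omp parallel num_threads(4) private(old)
        ! give each app thread its own 16-thread MKL pool;
        ! mkl_set_num_threads_local affects only the calling thread
        old = mkl_set_num_threads_local(16)
        ! ... call threaded MKL routines here ...
    !$omp end parallel
end program nested_mkl

The point is that the MKL thread count is set per app thread, inside the parallel region, so 4 app threads x 16 MKL threads = 64 total, matching the hardware.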

Jim Dempsey


1. Switching from Quickwin to Multithreaded did not increase the performance.

2. I normally use parallel MKL in programs that do not use OpenMP. MKL sequential or parallel does not affect the performance of most of my programs. Over the past 40 years I've created my own custom libraries of thousands of subprograms that cover a wider range of numerical algorithms than MKL, such as those in IMSL, NAg, MATH77 (from JPL-NASA supercomputer group where I used to work), and other US govt libraries.

3. I do have other programs that show 100% CPU utilization in the Win10 TaskManager using OpenMP, on the 64-core, 128-thread AMD 3990X (hyper-threading turned off in BIOS as recommended by Intel).

Valued Contributor II

What do you use a $6000 CPU for?

That is a lot of processor power that is going to be hard to control in threading -- 

Black Belt

>>hyper-threading turned off in BIOS as recommended by Intel

I disagree

When you (or someone) are testing an algorithm for scalability (nice pretty charts), you would turn off Hyper-Threading .AND. turn off Turbo Boost. This provides for reasonable (non-conflicting) scalability charts.

This said, in a production environment the desired behavior is the shortest runtime possible. To attain this, one would enable Turbo Boost .AND. Hyper-Threading.

Note, some applications may perform better with or without HT. You (they) will have to test this.

Consider the following scenario:

The system has a heavy-duty simulation job (high degree of vectorization floating point), together with some non-simulation scalar processes (SQL, Internet, O/S miscellaneous tasks, etc...). The additional (HT) threads can run these just fine. If you are worried about cache conflicts, you can specify 1 thread per core for your OpenMP app, and let the other stuff use 2 threads per core (plus random HT siblings).

The tuning is for throughput .NOT. scaling.

Note, a program that scales well without Turbo Boost and without HT will (most often) perform better with Turbo Boost and HT.

Jim Dempsey


I'm surprised you are disagreeing with Intel. By "nice pretty charts" are you referring to Intel's Vtune graphs?

It is common sense that one does not get something for nothing. If there were a chip that had a clock frequency of 4GHz and had 64 hardware threads without hyper-threading, then turning on HT would effectively create a chip with 128 cores at 2GHz, but it must also have overhead associated with creating the synthetic logical cores. In a thought experiment of extrapolating by 1000x, there would be 64,000 cores running at only 4MHz (0.004GHz) with massive overhead. Every real-world program contains serial portions of code, typically in a preprocessor stage. This serial code would then run at 1/1000th the speed. Conversely, if somehow the chip could run as a single core but with the clock speed increased 64x, this would be the ideal situation, without the need for any parallel sections of code anywhere. See Amdahl's law.
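Amdahl's law makes the serial-fraction point numerically. A quick sketch (the 95% parallel fraction is an assumed figure for illustration):

program amdahl
    implicit none
    real :: p, n
    p = 0.95                               ! assumed parallel fraction of the work
    n = 64.0                               ! core count
    ! Amdahl's law: speedup = 1 / ((1-p) + p/n)
    print *, 1.0 / ((1.0 - p) + p / n)     ! roughly 15x, far below the ideal 64x
end program amdahl

Even with 95% of the work parallelized, 64 cores buy only about a 15x speedup; the serial 5% dominates.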

Over the decades I have literally fried Intel chips (motherboards on fire, flames coming out of the case) when running 24/7 for weeks at a time, at 100% CPU and overclocking. Overclocking is fine for Gamers who only run games for a short while. I've learned to respect the Intel suggested clock speeds.

Black Belt

>>I'm surprised you are disagreeing with Intel. By "nice pretty charts" are you referring to Intel's Vtune graphs?

No. What I mean is that when a student or academic is writing a paper illustrating how good a programmer they are by achieving nice clean scaling charts, they do not want to be embarrassed with outlier data.

When a scientist is running a simulation, they want to know what the per-core scaling is for purchasing decisions; however, during production runs their interest is in reduced run times (on the system so configured). Hyper-Threading (application dependent) can provide a 15%-25% boost in performance.

After you have done all that you think you can do optimizing for 1 thread per core, wouldn't you be interested in that extra performance?

>>In a thought experiment...

Why do a thought experiment when you can quite easily do the test?

Jim Dempsey


I ran a number of experiments in duplicate and recorded the wall clock time of the section of current interest and the total program time. The experiments consisted of compiling in Release mode, Debug mode, with hyper-threading off, then using the best results with hyper-threading turned on in BIOS.

1. HT off, Release mode, /O3, /arch:AVX, multi-threaded lib, 45 sec (64 cores)

2. HT off, Debug mode, /Od, /arch:AVX, multi-threaded lib, 14 sec (64 cores)

3. HT on, Debug mode, /Od, /arch:AVX, multi-threaded lib, 111 sec (128 cores)

HT on is slower by a factor of 7.93X, as I expected. And surprisingly, Debug mode is faster than Release mode.

I also turned on compiler report level 5. Some of the computations are in the complex domain including complex divide, sqrt, **2, exp, log. I got a compiler suggestion to turn on Limit Complex Range, which I did, but that did not improve the results so I turned it off.

Black Belt

I am in the middle of rebuilding my KNL system and will get back to this thread later.

Jim Dempsey

Black Belt

I've used this since Intel Visual Fortran Version 8:

!  TestUnion.f90 
Module foo

    type TypeSaveAsAll
        sequence
        union
            map
                integer(1) :: AllData(32)
            end map
            map
                ! Selection filter follows
                union
                    map
                        integer(1) :: AllReals(16)
                    end map
                    map
                        integer(1) :: TVXNUL ! TVXNUL(3)
                        integer(1) :: TVXUNT ! TVXUNT(3)
                        integer(1) :: TMXNUL ! TMXNUL(3,3)
                        integer(1) :: TMXUNT ! TMXUNT(3,3)
                        integer(1) :: TVXUP1 ! TVXUP1(3)
                        integer(1) :: TVXUP2 ! TVXUP2(3)
                        integer(1) :: TVXUP3 ! TVXUP3(3)
                        ! Expansion follows
                        integer(1) :: ExpandRealsHere(8:16)
                    end map
                end union
                union
                    map
                        integer(1) :: AllIntegers(16)
                    end map
                    map
                        integer(1) :: NJTOSS
                        integer(1) :: NLLOSS
                        integer(1) :: ICSTAG
                        integer(1) :: JINTEG
                        integer(1) :: LRYEAR
                        integer(1) :: LRMON
                        integer(1) :: LRDAY
                        integer(1) :: LRHOUR
                        integer(1) :: LRMIN
                        integer(1) :: LRSEC
                        ! logicals in with integers
                        integer(1) :: PROCED
                        ! Expansion follows
                        integer(1) :: ExpandIntegersHere(12:16)
                    end map
                end union
            end map
        end union
    end type TypeSaveAsAll

end Module foo

program TestUnion
    use foo
    implicit none
    type(TypeSaveAsAll) :: sample
    sample%PROCED = 1
end program TestUnion
1>------ Rebuild All started: Project: TestUnion, Configuration: Debug Win32 ------
1>Deleting intermediate files and output files for project 'TestUnion', configuration 'Debug|Win32'.
1>Compiling with Intel(R) Visual Fortran Compiler [IA-32]...
1>Embedding manifest...
1>Build log written to  "file://C:\test\TestUnion\TestUnion\Debug\BuildLog.htm"
1>TestUnion - 0 error(s), 0 warning(s)
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

Jim Dempsey

Black Belt

>> And I see no point in /debug:full with /O3,

When using VTune on optimized code (e.g. /O3), you need the debug symbol table.

Jim Dempsey