
Alexander_S_2 · ‎11-01-2017

I hope that this is a simple question:

Which compiler flags should I use to get the best performance out of an AMD Epyc processor (particularly for MPI and OpenMP codes)? I know which instruction sets it is theoretically capable of. But since there were "problems" in the past where the Intel compiler would choose slower execution paths for non-Intel CPUs, and since there is this disclaimer, I feel that I should ask an expert first instead of blindly trusting the software to do its best.

Steve_Lionel · ‎11-01-2017

Since I no longer work for Intel, I think I can say that I disagree with the premise of the statement. The only part of this I consider remotely true is that if you use the auto-CPU dispatch option -ax, then non-Intel processors take the "generic" path, whatever you have set that to. The -x options (-xHost excepted), as the disclaimer notes, reserve some optimizations for Intel processors and add a check at program start that gives an error if the CPU type doesn't match. The -m or -arch options omit this check. You are unlikely to find any compiler that consistently outperforms Intel's on an AMD CPU (for many years, AMD would use Intel compilers for their SPEC submissions).

I would recommend the use of -xHost. This will select the best option for the processor you're compiling on, Intel or non-Intel. (I wrote the initial code that does this determination.)

Alexander_S_2 · ‎11-02-2017

I did not want to start a discussion about the premise...

If I understand correctly, the fact that the auto-CPU dispatch options will choose a slower path for non-Intel CPUs makes this option unusable for non-Intel CPUs. Will I need a separate binary for every type of CPU?

Concerning -xHost: what if our development workstations all have different generations of Intel CPUs, but the code may or may not be run on AMD EPYC CPUs? In this case I need something different.

jimdempseyatthecove · ‎11-02-2017

This may be of help:

IVF documentation Index | OPTIMIZATION_PARAMETER | ATTRIBUTES OPTIMIZATION_PARAMETER

You can use:

!DIR$ ATTRIBUTES OPTIMIZATION_PARAMETER: string::{ procedure-name | named-main-program}

string

Is a character constant that is passed to the optimizer. The constant must be delimited by apostrophes or quotation marks, and it may have one of the following values:

TARGET_ARCH= cpu

Tells the compiler to generate code specialized for a particular processor. For the list of cpus you can specify, see option x.

...

The characters in string can appear in any combination of uppercase and lowercase. The following rules also apply to string:

If string does not contain an equal sign (=), then the entire value of string is converted to lowercase before being passed to the optimizer.
If string contains an equal sign, then all characters to the left of the equal sign are converted to lowercase before all of string is passed to the optimizer.

Characters to the right of the equal sign are not converted to lowercase since their value may be case sensitive to the optimizer, for example “target_arch=AVX”.

You can specify multiple ATTRIBUTES OPTIMIZATION_PARAMETER directives for one procedure or one main program.

For the named procedure or main program, the values specified for ATTRIBUTES OPTIMIZATION_PARAMETER override any settings specified for the following compiler options:

-x, -m, and /arch
...

This isn't as elegant as using an auto-dispatcher, since you may have to perform the dispatch explicitly.

Note 2:

The Fortran code is likely calling (once) the C/C++ code to determine the CPU architecture and then load a bitmask of supported features. You could step into this (probably best using a C/C++ exploratory program) to locate the address (and hopefully a global symbol name). Once located, you can jam in whatever bits you want. I do not know what EPYC supports; perhaps it supports AVX2 and/or FMA.

Note 3:

You can search the Intel C++ documentation for cpu_dispatch and _allow_cpu_features. This may help you craft your own dispatcher.
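A hand-written dispatcher can be sketched in plain C along these lines. This is a minimal sketch under stated assumptions, not the Intel cpu_dispatch mechanism: it swaps in the GCC/Clang `target` function attribute and the `__builtin_cpu_init` / `__builtin_cpu_supports` builtins, and I have not verified that the Intel compiler accepts them. The names `update_fn`, `update_baseline`, `update_avx`, and `select_update` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every variant of the (hypothetical) kernel v[i] += a * t. */
typedef void (*update_fn)(float *v, float a, float t, size_t n);

/* Baseline variant: compiled for the file's default target architecture. */
static void update_baseline(float *v, float a, float t, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        v[i] += a * t;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same source; the compiler may use AVX instructions in this function only. */
__attribute__((target("avx")))
static void update_avx(float *v, float a, float t, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        v[i] += a * t;
}
#endif

/* Picks a variant from the features the running CPU reports, not from its vendor. */
update_fn select_update(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                 /* initialise the runtime's CPU feature data */
    if (__builtin_cpu_supports("avx"))
        return update_avx;
#endif
    return update_baseline;               /* safe fallback on any other CPU or compiler */
}
```

The caller would fetch the pointer once at start-up (`update_fn f = select_update();`) and then call `f(v, a, t, n)` in the hot path, so the feature test is not repeated on every call.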

Jim Dempsey

jimdempseyatthecove · ‎11-02-2017

Possible guide:

1) create a Fortran project that is .NOT. to be built
2) in this project, insert the bodies of your multi-build subroutines and functions .WITHOUT. SUBROUTINE or FUNCTION declaration.

Example: on my Windows system (you can adapt this for Linux), I have a Fortran project folder named NoBuild containing foo_body.f90:

! subroutine foo(v, a, t, n)
    implicit none
    integer, intent(in) :: n
    real, intent(inout) :: v(n)
    real, intent(in) :: a,t
    v = v + a * t
! end subroutine foo

3) Construct your program and use the multi-generated function like this:

module mod_dispatch
    integer :: yourDispatchCode = 0
contains
    subroutine foo(v, a, t, n)
        implicit none
        integer, intent(in) :: n
        real, intent(inout) :: v(n)
        real, intent(in) :: a,t
        select case(yourDispatchCode)
        case (0)
            call foo_SSE2(v, a, t, n)
        case(1)
            call foo_AVX(v, a, t, n)
        case default
            call foo_SSE2(v, a, t, n)
        end select 
    end subroutine foo
    
    subroutine foo_SSE2(v, a, t, n)
    !dir$ attributes optimization_parameter: "target_arch=SSE2" :: foo_SSE2
        include "..\NoBuild\foo_body.f90"
    end subroutine foo_SSE2
        
    subroutine foo_AVX(v, a, t, n)
    !dir$ attributes optimization_parameter: "target_arch=AVX" :: foo_AVX
        include "..\NoBuild\foo_body.f90"
    end subroutine foo_AVX
 
end module mod_dispatch
    
program Dispatch
    use mod_dispatch
    implicit none
    integer, parameter :: n = 100
    real :: v(n)
    real :: a, t
    v = 0.0
    a = 9.1
    t = 0.1
    yourDispatchCode = 1 ! you determine CPU supported features and set value here
    call foo(v, a, t, n)
    print *,v
end program Dispatch
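The step "you determine CPU supported features and set value here" could be filled in by a small C helper, sketched below. It is only a sketch under assumptions: it relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` (whether the Intel compiler accepts them is not something I have checked), and the name `detect_dispatch_code` and the two enum constants are hypothetical. It returns the same codes the select case above expects.

```c
#include <assert.h>

/* Dispatch codes matching the select case in foo: 0 = SSE2 path, 1 = AVX path. */
enum { DISPATCH_SSE2 = 0, DISPATCH_AVX = 1 };

/* Hypothetical helper (not an Intel API): returns the dispatch code for the
   CPU the program is running on, based on reported features, not on vendor. */
int detect_dispatch_code(void)
{
    int code = DISPATCH_SSE2;             /* safe baseline path */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                 /* initialise the runtime's CPU feature data */
    if (__builtin_cpu_supports("avx"))
        code = DISPATCH_AVX;
#endif
    assert(code == DISPATCH_SSE2 || code == DISPATCH_AVX);
    return code;
}
```

From Fortran, this would presumably be reached through an interface block declared with bind(C, name="detect_dispatch_code") and its result assigned to yourDispatchCode before the first call to foo.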

I hope this helps.

Jim Dempsey

TimP · ‎11-02-2017

Alexander S. wrote:

I did not want to start a discussion about the premise...

If I understand correctly, the fact that the auto-CPU dispatch options will choose a slower path for non-Intel CPUs makes this option unusable for non-Intel CPUs. Will I need a separate binary for every type of CPU?

Concerning -xHost: what if our development workstations all have different generations of Intel CPUs, but the code may or may not be run on AMD EPYC CPUs? In this case I need something different.

If you wish to make a multi-architecture binary, you should set the default architecture to the oldest architecture you intend to support, for example -msse3. If you have complex arithmetic, this could be much faster than the default. For example, -axAVX -msse3 should generate both AVX and SSE3 execution paths (for those cases where the compiler sees an advantage for AVX).

Steve_Lionel · ‎11-02-2017

As I wrote earlier, auto-CPU dispatch will select the "generic" code path on non-Intel CPUs. This is not necessarily the slowest - it depends on whether you specified a non-default generic instruction set and what the application does. As of the 18.0 version, the Intel compiler supports -m as high as -mavx. I don't know which of the Intel instruction sets the EPYC processors support.

Tim's advice is what I'd recommend, since now you're saying that the application may run on a variety of different processor generations. See https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-m for the various choices.

Alexander_S_2 · ‎11-02-2017

Thanks for all your valuable input. I am beginning to get a better understanding of how it should be done correctly - and why there are some salty people on the internet who strongly disagree with the way the Intel compiler handles non-Intel architectures by default.

I might get back to this topic once our AMD test workstation is deployed.

Matthew2 · ‎12-21-2017

Unfortunately, either by design or because of a bug, -xHost does not work on AMD EPYC processors. Very disappointing :(

cpu family : 23
model : 1
model name : AMD EPYC 7401P 24-Core Processor
stepping : 2
microcode : 0x8001207
cpu MHz : 1996.236
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 61
initial apicid : 61
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload overflow_recov succor smca
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 3992.47
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

user@a-compute-01:~$ cat test.c
/* Hello World program */
#include<stdio.h>
main()
{
printf("Hello World");
}
user@a-compute-01:~$ icc test.c -o test
user@a-compute-01:~$ ./test
Hello World
user@a-compute-01:~$ icc -xHost test.c -o test
user@a-compute-01:~$ ./test

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.

user@a-compute-01:~$
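The feature list in that error message can be compared mechanically with the flags line from /proc/cpuinfo. The sketch below is illustrative only: `has_flag` and `count_missing` are hypothetical names, and it assumes the usual Linux spellings, where fpu stands for x87, fxsr for FXSAVE, and pni for SSE3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A token boundary is whitespace or the end of the string. */
static int is_sep(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\0';
}

/* Returns 1 if `name` appears as a whole whitespace-separated token in `flags`,
   so that "avx" does not match inside "avx2". */
int has_flag(const char *flags, const char *name)
{
    assert(flags != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return 0;
    for (const char *p = strstr(flags, name); p != NULL; p = strstr(p + 1, name)) {
        int starts = (p == flags) || is_sep(p[-1]);
        if (starts && is_sep(p[len]))
            return 1;
    }
    return 0;
}

/* Counts how many features named in the error message are absent from `flags`. */
int count_missing(const char *flags)
{
    static const char *required[] = {
        "fpu", "cmov", "mmx", "fxsr", "sse", "sse2", "pni",
        "ssse3", "sse4_1", "sse4_2", "popcnt", "avx"
    };
    int missing = 0;
    for (size_t i = 0; i < sizeof required / sizeof required[0]; ++i)
        if (!has_flag(flags, required[i]))
            ++missing;
    return missing;
}
```

Every one of the twelve names appears in the flags line posted above, so `count_missing` on that line should return 0. That suggests the start-up check is not failing because an instruction set is actually missing on the EPYC.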

Compiler flags for AMD Epyc processors