Solved: Thank you both Jim and Tim.

Mick_Pont · ‎09-29-2015

By using the -Qax compiler flag I can get the compiler to generate code to run efficiently on different kinds of processor, for example

-QaxAVX,SSE2

generates code for AVX and SSE2. At run time, a suitable code path is chosen depending on what hardware I'm running on.

I would like to be able to test the different code paths on a single machine. In my case, my processor has AVX instructions available, but I would like to test the SSE2 path as well. Is there any way I can do this, or must I just run on a machine which has SSE2 but not AVX?

Apart from testing purposes, another reason for wanting to do this is that users of the library software that I help build sometimes want to force the library to avoid AVX instructions so that they get more reproducible results across different machines.

What I'd like to do is similar to what you can do when using MKL:

https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

which allows you to choose CPU via a library call or setting of an environment variable.

I guess what I'm trying to do is the equivalent of intercepting whatever compiler run-time routine determines the current hardware capabilities, so that I can return a result different from the actual hardware.

Mick Pont

TimP · ‎09-29-2015

MKL conditional numerical reproducibility resembles the compile option /Qimf-arch-consistency:true. This replaces math library calls whose results may depend on the platform architecture with functions that are meant to produce consistent results across architectures, and avoids mistakes due to unexpected architecture selections.

/Qfast-transcendentals controls, at compile time, whether the compiler may generate calls to architecture-dependent math functions, including SVML (unless /fp:strict is set, which disallows them entirely). As I read it, the documentation for this option is self-contradictory.

As far as I know, neither arch-consistency nor SVML added environment variable controls to select the ISA, as MKL_CBWR does.

Mick_Pont · ‎09-30-2015

Thanks Tim - useful information but not really what I need. I don't just want math library calls to be consistent regardless of the machine I'm running on, I'd like some way of forcing the program to go down the alternate code paths generated by -Qax. Otherwise, how can I ever be confident that the code generated by the compiler is correct for all paths?

jimdempseyatthecove · ‎09-30-2015

Look in the C++ reference for _may_i_use_cpu_feature (you can call C++ code from Fortran). This can be used to determine which features the CPU supports at run time. You might also be able to use other available functions. In the C++ reference, click on Search, then in All Words, enter "intrinsic CPU feature".

With these, you could create your own libraries, compiled for different architectures, as differently named .DLLs. Then call _may_i_use_cpu_feature and, depending on the feature(s) reported, select one of the .DLLs you have produced (using LoadLibrary or LoadLibraryEx to load the desired .DLL).

Jim Dempsey
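
A minimal sketch of the selection step described above, assuming three hypothetical builds of the same library, each compiled for one target and each exporting the same entry point. The file names, pick_library_name, load_foo and the exported name "foo" are illustrative and not taken from any real product. In real use the two flags would come from _may_i_use_cpu_feature; here they are plain arguments so the logic can be exercised on any machine.

```c
#include <stddef.h>

/* Hypothetical file names: one build of the same library per target ISA. */
const char *pick_library_name(int has_avx, int has_sse42)
{
    if (has_avx)   return "mylib_avx.dll";
    if (has_sse42) return "mylib_sse42.dll";
    return "mylib_generic.dll";
}

#ifdef _WIN32
#include <windows.h>

typedef int (*foo_fn)(void);

/* Load the chosen build and look up its entry point; NULL on failure. */
foo_fn load_foo(int has_avx, int has_sse42)
{
    HMODULE lib = LoadLibraryA(pick_library_name(has_avx, has_sse42));
    if (lib == NULL) {
        return NULL;
    }
    return (foo_fn)GetProcAddress(lib, "foo");
}
#endif
```

Passing has_avx = 0 on an AVX machine would load the SSE4.2 build, which is one way to exercise the older path on newer hardware.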

Kevin_D_Intel · ‎10-01-2015

I see Jim has already replied (Thank you Jim), and his is the same advice I received from Development regarding your interest. A small example being:

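#include <immintrin.h>

/* One variant per target, each compiled separately for its architecture. */
extern int foo_avx();
extern int foo_sse42();
extern int foo_generic();

/* Call the most capable variant that the running CPU supports. */
int ddd() {
    if (_may_i_use_cpu_feature(_FEATURE_AVX)) {
        return foo_avx();
    }
    else if (_may_i_use_cpu_feature(_FEATURE_SSE4_2)) {
        return foo_sse42();
    }
    else {
        return foo_generic();
    }
}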
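
Editorial sketch, not from the thread: one way to make a hand-written dispatcher like the one above testable on a single machine is to let the caller cap the chosen path below what the hardware supports. The enum, the function names and the MYLIB_FORCE_ISA environment variable are all hypothetical. In real use hw_best would be derived from _may_i_use_cpu_feature; it is a parameter here so the selection logic does not depend on the machine it runs on.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Code paths, ordered from least to most capable. */
typedef enum { PATH_GENERIC = 0, PATH_SSE42 = 1, PATH_AVX = 2 } code_path;

/*
 * Choose a code path. `hw_best` is the best path the hardware supports.
 * `override` is an optional request such as "SSE4_2"; NULL or an unknown
 * string means "no override". A request can lower the path but never
 * raise it above `hw_best`, so the program cannot be forced onto
 * instructions the CPU lacks.
 */
code_path choose_path(code_path hw_best, const char *override)
{
    code_path wanted = hw_best;

    assert(hw_best >= PATH_GENERIC && hw_best <= PATH_AVX);

    if (override != NULL) {
        if (strcmp(override, "GENERIC") == 0) {
            wanted = PATH_GENERIC;
        } else if (strcmp(override, "SSE4_2") == 0) {
            wanted = PATH_SSE42;
        } else if (strcmp(override, "AVX") == 0) {
            wanted = PATH_AVX;
        }
    }
    return wanted < hw_best ? wanted : hw_best;
}

/* Hypothetical wrapper: take the override from an environment variable. */
code_path choose_path_from_env(code_path hw_best)
{
    return choose_path(hw_best, getenv("MYLIB_FORCE_ISA"));
}
```

A dispatcher such as ddd() could then switch on the returned value instead of testing CPU features directly. Note that this only steers hand-written dispatch; it does not affect the paths the compiler generates for -Qax.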

Mick_Pont · ‎10-01-2015

Thank you both Jim and Kevin. It's not quite as easy as I hoped, but it is a way forward.

Mick

TimP · ‎10-01-2015

I'm not seeing how the C++ code dispatching feature addresses the original question. It appears to leave entirely open the question of what the math libraries are doing with run-time ISA sensing.

A related approach to testing the various paths on a single platform would be to make a test build using the default setting of /arch: and to use the arch-consistency option in an attempt to eliminate differences between the test platform and other customer platforms. Testing sum reductions, for example, with both an SSE2 and an AVX build, may give an idea of the range of results. AVX2 might give a further increase in accuracy of dot product reduction over AVX, although I've never verified it myself.

I was somewhat curious as to why MKL doesn't support an SSE3 setting, which would seem important for support of complex data types across various platforms. It leaves me to guess that the MKL team didn't have the resources to test any older ISA other than SSSE3. As the original post in this thread indicates, there may be issues in attempting verification on a wide variety of run-time platforms. The arch-consistency option is among the means for dealing with this.

jimdempseyatthecove · ‎10-01-2015

Tim,

In Kevin's response, foo_avx, foo_sse42 and foo_generic (presumably SSE or FPU) would be compiled separately, each for its targeted architecture. IOW with no compiler-generated dispatching. And in the case of using different DLLs together with LoadLibrary, the entry point names can all be the same (e.g. foo). In the .DLL method, the run-time architecture test need only be performed once (for those libraries). MKL and other libraries are a separate issue.

A reason why a programmer may want to do this is to handle the boundary conditions where the overhead of adding the auto-dispatch is estimated to be higher (or unknown) than the potential time reclaimed by using the "newer" instructions.

The following is complete speculation on my part. It should be possible for the compiler to generate a .DLL-style dispatch table not only for .DLLs but also for static library and .obj code. Then the compiler optimization could generate code for each routine in multiple architectures. Each entry in the dispatch table would initially point to the architecture test, which would then patch the dispatch table such that all subsequent calls (via the dispatch table) go directly to the appropriate routine.

Jim Dempsey
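
Editorial sketch of the self-patching dispatch table speculated about above, written by hand in C rather than generated by a compiler. All names are hypothetical: foo_avx and foo_generic stand in for separately compiled variants, and cpu_has_avx stands in for the real hardware test (it always reports "no AVX" here and counts its calls, to show that the test runs only once).

```c
#include <assert.h>
#include <stddef.h>

typedef int (*foo_fn)(void);

/* Stand-ins for variants compiled for different architectures. */
static int foo_avx(void)     { return 2; }
static int foo_generic(void) { return 0; }

/* Stand-in for the hardware test; counts how often it is called. */
int detect_calls = 0;
static int cpu_has_avx(void) { detect_calls++; return 0; }

static int foo_resolve(void);

/* One-entry dispatch table; it starts out pointing at the resolver. */
static foo_fn foo_entry = foo_resolve;

/* First call lands here: test the hardware once, then patch the table. */
static int foo_resolve(void)
{
    foo_entry = cpu_has_avx() ? foo_avx : foo_generic;
    assert(foo_entry != foo_resolve); /* guard against endless recursion */
    return foo_entry();               /* forward the first call */
}

/* Public entry point: always calls through the table. */
int foo(void)
{
    return foo_entry();
}
```

After the first call to foo(), later calls cost one indirect call and no feature test. As written the patching is not thread-safe; a real implementation would need to resolve the table before any threads start, or patch the pointer atomically.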

Choosing between AVX and SSE2 at run time