Software Archive
Read-only legacy content

[Fortran] Regarding Implementation of fast-transcendentals on Xeon / Xeon Phi

Srinivasan_R_1
Beginner
1,223 Views

Hello,

I have been experimenting with the following compiler flags / options.

1. -fp-model [strict, source, fast=2, etc.]

2. -fast-transcendentals

3. -fimf-precision=high/low/medium

I have been comparing the performance of Xeon vs. Xeon Phi for my application using various combinations of these flags. I obtained the best performance with -fp-model fast=2 on both the Xeon and the Xeon Phi. What I noticed, however, was that the improvement on the Xeon Phi from -fp-model fast=2 over -fp-model strict was far greater than the improvement on the Xeon (owing to FMA, in addition to other optimizations). Even with these flags, the Xeon Phi is barely keeping up with the Xeon (computation time alone).

My application's hotspot contains a lot of pow (**) and log (base 10 and base e) calls, and doesn't contain any other transcendentals as far as I can see. This hotspot is not vectorizable, but it is compute-intensive due to the power and log functions.

I am interested in knowing how these two functions are implemented (mathematical details like series expansion, etc), and how the implementation varies with precision (high, medium, low). 

The reason I ask for the above information is that I am interested in implementing a more "crude" form of pow and log, to see if I can further improve performance on the Xeon Phi while keeping an eye on correctness. Please note that both the base and the exponent are real numbers (not integers).

Compiler Used: 2015 Intel Fortran Compiler

Regards,

Srinivasan Ramesh

4 Replies
Andrey_Vladimirov
New Contributor III

Hi Srinivasan,

Xeon and Xeon Phi have hardware support for transcendentals; however, in the case of Xeon Phi, some transcendentals, such as sqrt, have low-precision implementations in hardware. When you call "sqrt()" in your code, the compiler links to the Intel Math Library where, depending on the required precision, the call translates either into the hardware implementation alone, or into the hardware implementation plus some analytical iteration on top of it to improve precision.

-fp-model fast=2 sets a bunch of other settings to basically go as fast as possible, including allowing non-value-safe optimizations, setting low precision, and excluding special number domains from consideration by transcendental functions.

-fimf-precision is just one of the settings that -fp-model fast=2 controls.

Regarding going as fast as possible beyond compiler flags, there are various things you can do with algebraic transformations. For example, my experience shows that "log" and "exp" on the Intel architecture are less efficient than "log2" and "exp2". Sometimes you can transform expressions to use "log2" instead of "log" everywhere. For example, the power-law interpolation "y=y0*exp(log(x/x0)*log(y1/y0)/log(x1/x0))" may be more efficiently computed as "y=y0*exp2(log2(x/x0)*log2(y1/y0)/log2(x1/x0))". In other cases, you can try "exp2(x*1.4426...)" instead of "exp(x)", since 1.4426... = log2(e).

See if this video tutorial gives you any other hints or pointers to resources: https://www.youtube.com/watch?v=DHG9xhAgTM0

It is a part of a 5-hour video course on performance optimization, which is available for free here: colfaxresearch.com/cdt-v01/

Andrey

Srinivasan_R_1
Beginner

Hi Andrey,

Thanks much for your optimization tips. I am currently looking at the videos on Colfax website to see if anything would be applicable.

With regard to the optimization of log and exp that you have mentioned, I would like to point out that the application we are working with is written in Fortran 90. The application uses the Fortran intrinsics for power (**), log(), and log10(). We are not aware of a log2() intrinsic in Fortran. Also, our exponentials do not have "e" as the base; both the base and the exponent in the power operation are real numbers.

Would you happen to know the equivalent optimization in Fortran? (I presume you were referring to C code in your reply)

Regards,

Srinivasan Ramesh

Andrey_Vladimirov
New Contributor III

Regarding integer versus floating-point - yes, my comment about log2/exp2 being faster than log/exp applies to floating-point numbers. Hardware support for base-2 log and exp is, apparently, easier to implement for floats, too - you can see, for example, in the Intel Intrinsics Guide, that Xeon Phi has log2 and exp2 intrinsics but not log/exp. Note that the word intrinsics in this context has a different meaning from Fortran - these are compiler intrinsics, functions giving direct access to processor instructions.

Regarding how to use log2/exp2 in Fortran, that's a good question! I compiled this test program to see what the compiler does:

program transc

  real :: x, a
  read(*,*) x
  read(*,*) a
  write(*,*) exp(x)
  write(*,*) a**x
  write(*,*) 2.0**a
  write(*,*) log(x)

end program

 

First, with the default optimization level, compiling for Xeon Phi in single precision (i.e., using "-fp-model fast=1 -mmic"). According to the assembly listing produced by the Intel compiler (use the argument -Fa):

  • statement exp(x) is translated into a call to SVML function "exp"
  • statement a**x is translated into a call to SVML function "pow"
  • statement 2.0**a is translated into a call to SVML function "exp2"
  • statement log(x) is translated into a call to SVML function "log"

Second, the highest optimization level, still compiling for Xeon Phi in single precision (i.e., "-fp-model fast=2 -mmic"), gives the following:

  • statement exp(x) is translated directly into the fast processor instruction vexp223ps - base-2 exponential (probably using the identity exp(x)=2.0**(x*log2(e)) under the hood)
  • statement a**x is translated into a call to SVML function "pow" (who knows what it is doing under the hood)
  • statement 2.0**a is translated directly into the processor instruction vexp223ps
  • statement log(x) is translated directly into the processor instruction vlog2ps (probably using the identity log(x)=log2(x)/log2(e) under the hood)

So, that's good news. The compiler is smart enough so that, if you use -fp-model fast=2, your code will actually use the efficient hardware-implemented log2/exp2 functions wherever you call log and exp. SVML functions probably also call exp2/log2 under the hood, but may do some iteration on top of that to improve accuracy.

Of course, because you called log/exp, but the Xeon Phi knows only how to compute log2/exp2, the compiler will have to do some extra math: (a**x = 2.0**(x*log2(a))) or (log(x)=log2(x)/log2(e)). In C, sometimes I can avoid this extra math by explicitly calling log2/exp2. In Fortran, you can explicitly call exp2 by writing 2.0**x, but there is no way to explicitly call log2. I guess you can hope that the compiler is able to simplify math expressions so that, for example, log(a)/log(b) is translated behind the scenes into log2(a)/log2(b) rather than into (log2(a)/log2(e))/(log2(b)/log2(e)); however, I have not tried to confirm this.

To me, the conclusion of this exercise is that I have another reason to choose C over Fortran, but perhaps compiler developers can correct me.

Srinivasan_R_1
Beginner

Hi Andrey,

Your example was very informative. Thank you for the detailed reply. I replicated the code you provided and looked into the assembly. We are running our code in offload mode, so I modified this sample code to include an offload statement just before Line 6. I tried compiling with different options, and could see the SVML functions being generated when I used -fp-model fast=2, or when I used -fp-model source -fast-transcendentals.

It appears that the SVML function "pow" is indeed using a composition of vlog2ps and vexp223ps under the hood. Please see: https://books.google.co.in/books?id=MUZ0CAAAQBAJ&pg=PA135&lpg=PA135&dq=svml+expf&source=bl&ots=nBWQUP99ko&sig=lWH2dtXGnEEiEusfZbHZfY0qAIo&hl=en&sa=X&ved=0ahUKEwik_tGNv67KAhXQjo4KHaC7DwwQ6AEINDAE#v=onepage&q=svml%20expf&f=false (Google search "svml intel" :) )

A couple of things are still not clear to me, though:

1. In our application's assembly for the Xeon Phi, generated with -fp-model source -fast-transcendentals, we see that _svml_log8 is generated in one place and a plain log call in another. Why this difference in code generation, when both places in the source code call the "log" Fortran intrinsic?

2. Would you happen to know the difference between _svml_log8 and _svml_log16? (Something to do with precision?)

Srinivasan Ramesh
