Beginner

Erf is 5 times faster on Windows than on Linux!

Hi,

With Intel Parallel Studio XE 2017 Update 1, the following program runs 5 times faster on Windows than on Linux. It used to run much faster on Linux (in March, with Parallel Studio XE 2016). Am I missing something?

program main
  integer, parameter :: sp = 4
  integer, parameter :: n = 100000000
  real(sp) :: sum = 0.0_sp
  integer :: k

  do k = 1, n
    sum = sum + erf(sum)
  end do

  write (*,*) sum
end program main

Note that the vectorized version of Erf seems fine on Linux.

0 Kudos
14 Replies
Black Belt

A possible explanation might be use of scalar erf() from glibc vs. an Intel SSE serial (as well as svml) library version.

0 Kudos
Beginner

Thanks for the hint Tim. This is most likely the case. Here is the result of ldd on Linux:

linux-vdso.so.1 =>  (0x00007ffe55bed000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fceb52b9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fceb509c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fceb4cd3000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fceb4abc000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fceb48b8000)
/lib64/ld-linux-x86-64.so.2 (0x000056373324d000)

Should it be considered a bug? It was not the case with Parallel Studio XE 2016.

0 Kudos
Beginner

Also, I've found something quite unexpected:

- If erf(x) is not vectorized, its execution time depends strongly on x: the larger x is, the shorter the execution time.

- If erf(x(i)) is vectorized, its execution time barely depends on x, even though I used the same value for all the x(i).

This behavior seems quite strange to me. Any explanation on that point?

0 Kudos
Black Belt

SVML (Intel short vector math library) generally advertises possible errors up to 4 ULP (4 units in the last place).  Intel compilers generally advertise better accuracy for scalar functions (e.g. <1 ULP).  Since a 4-ULP error in single precision means you don't get 6 significant decimal digits, such inaccuracy might be noticeable in real applications.  SVML takes advantage of the looser limits, e.g. by avoiding some of the special-case treatments which would impede vectorization.

If someone found that the Intel scalar SSE erf didn't always maintain as much accuracy as the current glibc, that might have been a reason for changing.  It may be that single precision speedup isn't achieved without exceeding the error bounds goal.  You may be seeing the penalties associated with switching into x87, with rational polynomial approximations good to 15-17 decimal digits, as well as different cases for various ranges of values.  Dropping back to glibc would avoid any questions about why gfortran might show better accuracy than ifort.

That said, there have been isolated incidents in the past where an Intel function was inadvertently dropped from the library, resulting in falling through to glibc math library. 

Your ldd map shows that glibc libm is on path, but that by itself doesn't prove that your erf references are linked to glibc.  You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library.  Advisor or VTune also would show where the library reference goes.

The netlib/fdlibm source code for erf() is much saner than some of the others; it should give you an idea what is involved in glibc.

0 Kudos

I don't see that this program vectorizes the ERF call in either compiler. The vectorization report gripes about a dependence in the loop, though it is really a reduction and I think it should realize this. Which options did you use to get it to vectorize the ERF itself?

Can you show the options on both platforms and the assembly it generates? There are too many variables here.

0 Kudos
Black Belt

The loop shown isn't vectorizable.  Choosing the name sum is bad optics; it displaces the intrinsic sum.  There is a clear dependency, as each erf call uses the value of sum from the previous iteration.  Typical vectorizable usage would involve an array argument for erf. For a sum reduction, it would be something like

sum(erf(array))

which should be vectorizable at /fp:fast.

 

0 Kudos
Beginner

Tim P. wrote:

SVML (Intel short vector math library) generally advertises possible errors up to 4 ULP (4 units in the last place).  Intel compilers generally advertise better accuracy for scalar functions (e.g. <1 ULP).  Since a 4-ULP error in single precision means you don't get 6 significant decimal digits, such inaccuracy might be noticeable in real applications.  SVML takes advantage of the looser limits, e.g. by avoiding some of the special-case treatments which would impede vectorization.

Thanks for this explanation of the difference between the SVML and the scalar versions.

I can't find anything in the documentation about the accuracy of the transcendental functions. Is there anything on the web about that?

0 Kudos
Beginner

Steve Lionel (Intel) wrote:

I don't see that this program vectorizes the ERF call in either compiler. The vectorization report gripes about a dependence in the loop, though it is really a reduction and I think it should realize this. Which options did you use to get it to vectorize the ERF itself?

Can you show the options on both platforms and the assembly it generates? There are too many variables here.

Hi Steve. There is a misunderstanding. The program was only there to show that it runs 5 times faster on Windows than on Linux. It was deliberately designed not to be vectorizable.

But if you write vectorized code using erf, my claim is that you won't see this performance difference between Windows and Linux.

0 Kudos

But you also used different versions of the compiler and didn't show compile options on both platforms. I am very skeptical that there is really a difference based on OS, unless the ERF entry point is being found in the gcc libm rather than the Intel libm.

0 Kudos
Beginner

Steve Lionel (Intel) wrote:

But you also used different versions of the compiler and didn't show compile options on both platforms. I am very skeptical that there is really a difference based on OS, unless the ERF entry point is being found in the gcc libm rather than the Intel libm.

I have used Parallel Studio XE 2017 on both Linux and Windows on the same machine (dual boot). The compilation options are "-g -O3 -xHost" and their equivalent on Windows.

In March 2016, with the then-latest compiler, I used to get the same performance on Linux as I get on Windows today.

0 Kudos
Beginner

Tim P. wrote:

You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library.  Advisor or VTune also would show where the library reference goes.

I have compiled my program with

ifort -g -O3 -xHost main.f90 -o main -Wl,-M > link-map.txt

and I get this:

.text          0x0000000000478b70      0x1f0 /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a(erff.o)            
  1                 0x0000000000478b70                __libm_erff_ex

I suppose that it proves that I am calling the version from the Intel library.

0 Kudos
Black Belt

velvia wrote:

Tim P. wrote:

You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library.  Advisor or VTune also would show where the library reference goes.

I have compiled my program with

ifort -g -O3 -xHost main.f90 -o main -Wl,-M > link-map.txt

and I get this:

.text          0x0000000000478b70      0x1f0 /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a(erff.o)
  1                 0x0000000000478b70                __libm_erff_ex

I suppose that it proves that I am calling the version from the Intel library.

You may need to track down that __libm_erff_ex to see if it is defined in libimf or leads into the libm.so.

0 Kudos
Beginner

Tim P. wrote:

You may need to track down that __libm_erff_ex to see if it is defined in libimf or leads into the libm.so.

I have tried

nm -D /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.so
nm -D -C /lib/x86_64-linux-gnu/libm.so.6 

None of them contains __libm_erff_ex. But the static library contains it:

nm -A /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a  | grep __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff_iface_c99.o:                 U __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff_iface_disp.o:                 U __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff.o:0000000000000000 T __libm_erff_ex

So it seems the problem is not related to the GNU library.

I have tried tonight on my MacBook Pro with Windows, macOS, and Linux. All come with Intel Parallel Studio XE 2017 with compiler 17.0.0. The execution times are:

Windows: 2s
macOS: 2s
Linux: 6s

On macOS and Linux, I use the same command line:

ifort -g -O3 -xHost main.f90 -o main

On Windows, I use what seem to be the equivalent options.

0 Kudos
Beginner

Running VTune shows that the Linux version spends a lot of time in a movssl line (marked **** below) that is not present in the Windows version:

From what I understand of assembly language, it seems that the parameter x of erf(x) is passed on the stack on Linux, and through registers on Windows. Could anyone confirm this?

Linux:

Function Range / Basic Block / Address,Source Line,Assembly,Effective Time by Utilization
0x478b70,0,Function range 0x478b70-0x478d60,4.18514
 0x478b70,0,Block 1,2.54656
  0x478b70,,"movd %xmm0, %edx",0.0260569
  **** 0x478b74,,"movssl  %xmm0, -0x10(%rsp)",2.14468 ****
  0x478b7a,,"mov %edx, %eax",0.341746
  0x478b7c,,"and $0x7fffffff, %edx",
  0x478b82,,"and $0x80000000, %eax",
  ...
 0x478d58,,Block 10,
  0x478d58,,retq  ,
 0x478d59,,Block 11,
  0x478d59,,"nopl  %eax, (%rax)",

 

Windows:

Address,Source Line,Assembly,CPU Time: Total
0x1800a6c40,,Block 1:,
0x1800a6c40,,push rbp,
0x1800a6c41,,"sub rsp, 0x40",
0x1800a6c45,,"lea rbp, ptr [rsp+0x20]",0.0392572
0x1800a6c4a,,"lea rcx, ptr [rip-0xa6c51]",
0x1800a6c51,,"movd edx, xmm0",
...
0x1800a6e52,,pop rbp,
0x1800a6e53,,ret ,
0x1800a6e54,,Block 11:,
0x1800a6e54,,"nop dword ptr [rax+rax*1], eax",
0x1800a6e59,,"nop dword ptr [rax], eax",

 

0 Kudos