Performance difference between Windows and Linux using the Intel compiler: looking at the assembly

richardson__josh · 11-03-2018

I am running a program on both Windows and Linux (x86-64). It is compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, yet the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which resolves to the Intel math library in both cases (by default it is linked dynamically on Windows and statically on Linux, but dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

#include <cmath>
#include <cstdio>

int main() {
  int n = 100000000;
  float sum = 1.0f;

  for (int k = 0; k < n; k++) {
    sum += std::erf(sum);
  }

  std::printf("%7.2f\n", sum);
}

When I profile this program using VTune, I find that the assembly differs slightly between the Windows and the Linux versions. Here is the call site (the loop) on Windows:

Block 3:
  vmovaps xmm0, xmm6
  call 0x1400023e0 <erff>
Block 4:
  inc ebx
  vaddss xmm6, xmm6, xmm0
  cmp ebx, 0x5f5e100
  jl 0x14000103f <Block 3>

And here is the beginning of the erf function called on Windows:

Block 1:
  push rbp
  sub rsp, 0x40
  lea rbp, ptr [rsp+0x20]
  lea rcx, ptr [rip-0xa6c81]
  movd edx, xmm0
  movups xmmword ptr [rbp+0x10], xmm6
  movss dword ptr [rbp+0x30], xmm0
  mov eax, edx
  and edx, 0x7fffffff
  and eax, 0x80000000
  add eax, 0x3f800000
  mov dword ptr [rbp], eax
  movss xmm6, dword ptr [rbp]
  cmp edx, 0x7f800000
  ...

On Linux, the code is a bit different. The call site is:

Block 3:
  vmovaps %xmm1, %xmm0
  vmovssl %xmm1, (%rsp)
  callq 0x400bc0 <erff>
Block 4:
  inc %r12d
  vmovssl (%rsp), %xmm1
  vaddss %xmm0, %xmm1, %xmm1    <-------- hotspot here
  cmp $0x5f5e100, %r12d
  jl 0x400b6b <Block 3>

and the beginning of the called function (erf) is:

  movd %xmm0, %edx
  movssl %xmm0, -0x10(%rsp)     <-------- hotspot here
  mov %edx, %eax
  and $0x7fffffff, %edx
  and $0x80000000, %eax
  add $0x3f800000, %eax
  movl %eax, -0x18(%rsp)
  movssl -0x18(%rsp), %xmm0
  cmp $0x7f800000, %edx
  jnl 0x400dac <Block 8>
  ...

I have marked the two points where the time is lost on Linux.

Does anyone understand assembly well enough to explain the difference between the two listings and why the Linux version is 3 times slower?