Hi,
With Intel Parallel Studio XE 2017 Update 1, the following program runs 5 times faster on Windows than on Linux. It used to run much faster on Linux (back in March, with Parallel Studio XE 2016). Am I missing something?
program main
  integer, parameter :: sp = 4
  integer, parameter :: n = 100000000
  real(sp) :: sum = 0.0_sp
  integer :: k
  do k = 1, n
    sum = sum + erf(sum)
  end do
  write (*,*) sum
end program main
Note that the vectorized version of Erf seems fine on Linux.
A possible explanation might be use of scalar erf() from glibc vs. an Intel SSE serial (as well as svml) library version.
Thanks for the hint Tim. This is most likely the case. Here is the result of ldd on Linux:
linux-vdso.so.1 => (0x00007ffe55bed000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fceb52b9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fceb509c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fceb4cd3000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fceb4abc000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fceb48b8000)
/lib64/ld-linux-x86-64.so.2 (0x000056373324d000)
Should this be considered a bug? It was not the case with Parallel Studio XE 2016.
Also, I've found something quite unexpected:
- If erf(x) is not vectorized, its execution time depends strongly on x: the larger x is, the shorter the execution time.
- If erf(x(i)) is vectorized, its execution time barely depends on x, even though I used the same value for all the x(i).
This behavior seems quite strange to me. Any explanation on that point?
SVML (the Intel short vector math library) generally advertises possible errors up to 4 ULP (4 units in the last place). Intel compilers generally advertise better accuracy for scalar functions (e.g. <1 ULP). Since a 4-ULP error in single precision means you don't get 6 significant decimals, such inaccuracy might be noticeable in real applications. SVML takes advantage of the looser limits, e.g. by avoiding some of the special-case treatments which would impede vectorization.
If someone found that the Intel scalar SSE erf didn't always maintain as much accuracy as the current glibc, that might have been a reason for changing. It may be that single precision speedup isn't achieved without exceeding the error bounds goal. You may be seeing the penalties associated with switching into x87, with rational polynomial approximations good to 15-17 decimal digits, as well as different cases for various ranges of values. Dropping back to glibc would avoid any questions about why gfortran might show better accuracy than ifort.
That said, there have been isolated incidents in the past where an Intel function was inadvertently dropped from the library, resulting in falling through to glibc math library.
Your ldd map shows that glibc libm is on path, but that by itself doesn't prove that your erf references are linked to glibc. You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library. Advisor or VTune also would show where the library reference goes.
The netlib/fdlibm source code for erf() is much saner than some of the others; it should give you an idea what is involved in glibc.
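To put the 4-ULP figure in perspective, here is a minimal sketch (the argument value 0.5 is arbitrary) that uses the SPACING intrinsic to estimate the relative error a 4-ULP bound permits for single-precision erf:

```fortran
program ulp_bound
  implicit none
  integer, parameter :: sp = kind(1.0)
  real(sp) :: y, rel
  y = erf(0.5_sp)
  ! SPACING(y) is the size of 1 ULP at y; a 4-ULP error bound
  ! therefore allows a relative error of roughly:
  rel = 4.0_sp * spacing(y) / y
  write (*,*) rel   ! on the order of 5e-7, i.e. fewer than 7 reliable decimals
end program ulp_bound
```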
I don't see that this program vectorizes the ERF call in either compiler. The vectorization report gripes about a dependence in the loop, though it is really a reduction and I think it should realize this. Which options did you use to get it to vectorize the ERF itself?
Can you show the options on both platforms and the assembly it generates? There are too many variables here.
The loop shown isn't vectorizable: there is a clear dependency, as each erf call uses the value of sum from the previous iteration. Choosing the name sum is also unfortunate, since it shadows the SUM intrinsic. Typical vectorizable usage would involve an array argument for erf. For a sum reduction, it would be something like
sum(erf(array))
which should be vectorizable at /fp:fast.
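A minimal sketch of that vectorizable form (the array size and contents are illustrative):

```fortran
program erf_reduction
  implicit none
  integer, parameter :: sp = kind(1.0)
  integer, parameter :: n = 1000000
  real(sp) :: x(n), total
  integer :: k
  ! fill the array with arbitrary values in (0, 1]
  do k = 1, n
    x(k) = real(k, sp) / real(n, sp)
  end do
  ! erf applied elementwise to an array, reduced with SUM:
  ! there is no cross-iteration dependence, so the compiler
  ! is free to call the SVML vector erf here.
  total = sum(erf(x))
  write (*,*) total
end program erf_reduction
```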
Tim P. wrote:
SVML (Intel short vector math library) generally advertises possible errors up to 4 ULP (4 of the least significant bit). Intel compilers generally advertise better accuracy for scalar functions (e.g. <1 ULP). As 4 ULPs error in single precision means you don't get 6 significant decimals, such inaccuracy might be noticeable in real applications. SVML takes advantage of the looser limits, e.g. by avoiding some of the special case treatments which would impede vectorization.
Thanks for the explanation of the difference between the SVML and scalar versions.
I can't find anything in the documentation about the accuracy of the transcendental functions. Is there anything on the web about that?
Steve Lionel (Intel) wrote:
I don't see that this program vectorizes the ERF call in either compiler. The vectorization report gripes about a dependence in the loop, though it is really a reduction and I think it should realize this. Which options did you use to get it to vectorize the ERF itself?
Can you show the options on both platforms and the assembly it generates? There are too many variables here.
Hi Steve. There is a misunderstanding: the program was only meant to show that it runs 5 times faster on Windows than on Linux. It was designed on purpose not to be vectorizable.
But if you write vectorized code using erf, my claim is that you won't see this performance difference between Windows and Linux.
But you also used different versions of the compiler and didn't show compile options on both platforms. I am very skeptical that there is really a difference based on OS, unless the ERF entry point is being found in the gcc libm rather than the Intel libm.
Steve Lionel (Intel) wrote:
But you also used different versions of the compiler and didn't show compile options on both platforms. I am very skeptical that there is really a difference based on OS, unless the ERF entry point is being found in the gcc libm rather than the Intel libm.
I have used Parallel Studio XE 2017 on both Linux and Windows on the same machine (dual boot). The compilation options are "-g -O3 -xHost" and their equivalent on Windows.
In March 2016, with the latest compiler at the time, I used to get the same performance on Linux as I get on Windows today.
Tim P. wrote:
You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library. Advisor or VTune also would show where the library reference goes.
I have compiled my program with
ifort -g -O3 -xHost main.f90 -o main -Wl,-M > link-map.txt
and I get this:
.text  0x0000000000478b70  0x1f0  /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a(erff.o)
       0x0000000000478b70  __libm_erff_ex
I suppose this shows that I am calling the version from the Intel library.
velvia wrote:
Tim P. wrote:
You should be able to find out by creating and examining your link map, or by running under gdb-ia and stepping into the library. Advisor or VTune also would show where the library reference goes.
I have compiled my program with
ifort -g -O3 -xHost main.f90 -o main -Wl,-M > link-map.txt
and I get this:
.text  0x0000000000478b70  0x1f0  /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a(erff.o)
       0x0000000000478b70  __libm_erff_ex
I suppose this shows that I am calling the version from the Intel library.
You may need to track down that __libm_erff_ex to see if it is defined in libimf or leads into the libm.so.
Tim P. wrote:
You may need to track down that __libm_erff_ex to see if it is defined in libimf or leads into the libm.so.
I have tried
nm -D /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.so
nm -D -C /lib/x86_64-linux-gnu/libm.so.6
None of them contains __libm_erff_ex. But the static library contains it:
nm -A /opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a | grep __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff_iface_c99.o: U __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff_iface_disp.o: U __libm_erff_ex
/opt/intel/compilers_and_libraries_2017.0.098/linux/compiler/lib/intel64/libimf.a:erff.o:0000000000000000 T __libm_erff_ex
So it does not seem to be related to the GNU library.
I have tried tonight on my MacBook Pro with Windows, macOS, and Linux, all with Intel Parallel Studio XE 2017 (compiler 17.0.0). The execution times are:
Windows: 2 s
macOS: 2 s
Linux: 6 s
On macOS and Linux, I use the same command line :
ifort -g -O3 -xHost main.f90 -o main
On Windows, I use what should be the equivalent options.
Running VTune shows that the Linux version spends a lot of time on a movssl instruction (marked **** below) which is not present in the Windows version.
From what I understand of assembly language, it seems that the parameter x of erf(x) is passed on the stack on Linux, and through registers on Windows. Could anyone confirm this?
Linux:
Function Range / Basic Block / Address, Source Line, Assembly, Effective Time by Utilization
0x478b70,0,Function range 0x478b70-0x478d60,4.18514
0x478b70,0,Block 1,2.54656
0x478b70,,"movd %xmm0, %edx",0.0260569
**** 0x478b74,,"movssl %xmm0, -0x10(%rsp)",2.14468 ****
0x478b7a,,"mov %edx, %eax",0.341746
0x478b7c,,"and $0x7fffffff, %edx",
0x478b82,,"and $0x80000000, %eax",
...
0x478d58,,Block 10,
0x478d58,,retq,
0x478d59,,Block 11,
0x478d59,,"nopl %eax, (%rax)",
Windows:
Address,Source Line,Assembly,CPU Time: Total
0x1800a6c40,,Block 1:,
0x1800a6c40,,push rbp,
0x1800a6c41,,"sub rsp, 0x40",
0x1800a6c45,,"lea rbp, ptr [rsp+0x20]",0.0392572
0x1800a6c4a,,"lea rcx, ptr [rip-0xa6c51]",
0x1800a6c51,,"movd edx, xmm0",
...
0x1800a6e52,,pop rbp,
0x1800a6e53,,ret,
0x1800a6e54,,Block 11:,
0x1800a6e54,,"nop dword ptr [rax+rax*1], eax",
0x1800a6e59,,"nop dword ptr [rax], eax",