Auto-vectorization of loops with math functions works well with ICPX under Linux, but not under Windows.
The Windows version only has double-precision vector math functions, none wider than 256 bits. The Linux version has everything.
// example
const int size = 256;
float a[size];
float b[size];
...
for (int i = 0; i < size; i++) {
    b[i] = exp(a[i]);
}
ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16).
ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4).
__svml_expf16 is actually present in svml_dispmt.lib and can be called as _mm512_exp_ps.
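Since `__svml_expf16` can be reached through `_mm512_exp_ps`, here is a minimal sketch of calling that intrinsic directly for the float loop in the example. Assumptions: `_mm512_exp_ps` is an SVML intrinsic provided by Intel's compilers (not by GCC or upstream Clang), so it is guarded behind `__INTEL_LLVM_COMPILER` and `__AVX512F__`; the function name `exp_f32` is hypothetical; every other build, and the leftover elements, fall back to the scalar `std::exp(float)` overload.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__INTEL_LLVM_COMPILER) && defined(__AVX512F__)
#include <immintrin.h>  // assumed to declare the SVML intrinsic _mm512_exp_ps under icx/icpx
#endif

// Hypothetical helper: out[i] = exp(in[i]), computed in single precision only.
// Assumption: the intrinsic path is compiled only with the Intel oneAPI compiler
// for an AVX-512 target; otherwise only the scalar loop below is used.
void exp_f32(const float* in, float* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__INTEL_LLVM_COMPILER) && defined(__AVX512F__)
    for (; i + 16 <= n; i += 16) {                 // 16 floats per 512-bit vector
        __m512 v = _mm512_loadu_ps(in + i);        // unaligned load
        _mm512_storeu_ps(out + i, _mm512_exp_ps(v));
    }
#endif
    for (; i < n; ++i) {
        out[i] = std::exp(in[i]);                  // float overload: no promotion to double
    }
}
```

Calling `exp_f32(aaaa, bbbb, size)` would stand in for the `exp` loop; whether it beats what the auto-vectorizer emits for `expf` has not been measured here.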
Hi,
Thanks for posting in Intel Communities.
Could you please provide the following information so that we can investigate your issue further?
- The complete reproducer code, and the steps you followed on the Linux and Windows machines, so that we can reproduce the issue on our end.
- Please confirm whether you are using the Command Prompt or Visual Studio to run your code on Windows.
- How are you identifying that "ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)" and "ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)"?
- Are you using any intrinsics in your code?
- Also, which icpx version are you using?
Thanks & Regards,
Hemanth
Thank you for your reply.
Steps to reproduce: Compile the code below in Visual Studio 2022.1 in Release mode with /arch:AVX512 and /fp:fast.
(Intel® oneAPI DPC++ Compiler Package ID: w_oneAPI_2022.1.0.256, Intel® oneAPI DPC++ Compiler – toolkit version: 2022.2.0, extension version 22.0.0.17, Package ID: w_oneAPI_2022.1.0.256).
#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>
const int size = 256;
float aaaa[size] = {0};
float bbbb[size] = {0};
volatile int k = 1;
int main() {
    // prevent optimizing whole loop away:
    aaaa[k] = 5.64;
    for (int i = 0; i < size; i++) {
        bbbb[i] = exp(aaaa[i]);
    }
    // prevent optimizing whole loop away:
    aaaa[k] = bbbb[k+1];
    for (int i = 0; i < size; i++) {
        bbbb[i] = cos(aaaa[i]);
    }
    printf("\ncos(0) = %f\n", bbbb[0]);
    return int(bbbb[k]);
}
The debugger shows this disassembly:
bbbb[i] = exp(aaaa[i]);
00007FF714E5BE2E vcvtps2pd zmm16,ymmword ptr [aaaa+20h (07FF714E94720h)]
00007FF714E5BE38 vcvtps2pd zmm0,ymmword ptr [aaaa (07FF714E94700h)]
00007FF714E5BE42 call __svml_exp8_z0 (07FF714E5C7F0h)
00007FF714E5BE47 vmovaps zmm17,zmm0
00007FF714E5BE4D vmovaps zmm0,zmm16
00007FF714E5BE53 call __svml_exp8_z0 (07FF714E5C7F0h)
00007FF714E5BE58 vcvtpd2ps ymm1,zmm17
00007FF714E5BE5E vcvtpd2ps ymm0,zmm0
00007FF714E5BE64 vinsertf64x4 zmm0,zmm1,ymm0,1
00007FF714E5BE6B vmovaps zmmword ptr [bbbb (07FF714E94B00h)],zmm0
After further testing, I found that the problem goes away when I change exp to expf and cos to cosf. So the problem is not poor auto-vectorization; rather, the Windows version fails to optimize exp to expf and cos to cosf, even under /fp:fast.
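A minimal sketch of that workaround applied to the reproducer's two loops (the function name `exp_then_cos` is hypothetical). It calls `expf`/`cosf` explicitly; in C++, the `std::exp`/`std::cos` overloads from `<cmath>` also take and return `float`, which the `static_assert`s check at compile time. Whether icx then emits a single-precision SVML call is what the post reports, not something this code guarantees.

```cpp
#include <math.h>      // expf, cosf (C single-precision functions)
#include <cmath>       // std::exp / std::cos float overloads
#include <cstddef>
#include <type_traits>

// With float arguments, the <cmath> overloads stay in single precision.
static_assert(std::is_same<decltype(std::exp(1.0f)), float>::value,
              "std::exp(float) returns float");
static_assert(std::is_same<decltype(std::cos(1.0f)), float>::value,
              "std::cos(float) returns float");

// Hypothetical helper mirroring the reproducer's loops, but calling the
// single-precision functions explicitly so no float->double promotion occurs.
void exp_then_cos(const float* a, float* b_exp, float* b_cos, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        b_exp[i] = expf(a[i]);
    }
    for (std::size_t i = 0; i < n; ++i) {
        b_cos[i] = cosf(a[i]);
    }
}
```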
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
>> ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)
I tried your test case on Linux with icpx (Version 2022.1.0 Build 20220316) but couldn't see __svml_expf16 being called.
$ icpx -march=common-avx512 -fp-model=fast c05540193.cpp
$ objdump -d a.out |grep svml_expf16
$
>> ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)
I tried the test case on Windows, but didn't see __svml_exp4 being called at all.
> icx /arch:AVX512 /fp:fast c05540193.cpp /O2
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2022.1.0 Build 20220316
> dumpbin /DISASM c05540193.exe >dump.txt
> notepad dump.txt (no matches found for "svml"). dump.txt is attached.
Can you provide steps to reproduce what you have observed?
Please provide instructions and a test case to reproduce the issue.
Thanks,
Viet H.
Your own dump shows the same thing as mine. It converts single to double precision (vcvtps2pd), then calls the double-precision version of the vector function (call 00000001400019E0), then converts the result back to single precision (vcvtpd2ps). As I wrote, this shows that it fails to optimize exp to expf and cos to cosf, although it does vectorize. Your dump just does not show the function names.
It looks like auto-vectorization is no longer the issue.
Does your assembly dump show the function names? Can you attach the assembly file and point out where it fails to optimize exp to expf and cos to cosf?
Thanks,
It is seen in your own dump and in the disassembly that I have posted above. It converts single precision to double precision (vcvtps2pd) before calling the double-precision version of the vector function (__svml_exp8_z0), then converts the result back to single precision (vcvtpd2ps).
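As a scalar model of what that sequence computes per element (function names are hypothetical), widening, calling the double-precision `exp`, and narrowing is the same as evaluating `exp` in double and rounding the result to `float`. This is exactly what C and C++ require for `bbbb[i] = exp(aaaa[i]);` when `exp` resolves to `double exp(double)`, so switching to a single-precision `exp` is a value-changing substitution the compiler must decide to make; this sketch does not show what `/fp:fast` permits.

```cpp
#include <cmath>

// Hypothetical model of the disassembly: float -> double -> exp -> float.
float exp_via_double(float x) {
    double wide = static_cast<double>(x);   // corresponds to vcvtps2pd
    double r = std::exp(wide);              // double-precision exp (__svml_exp8_z0 when vectorized)
    return static_cast<float>(r);           // corresponds to vcvtpd2ps
}

// Hypothetical model of the expected code: exp evaluated in single precision.
float exp_single(float x) {
    return std::exp(x);                     // float overload, no widening
}
```

The two results typically agree to within a few float ulps; the double path simply does more work to get there.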
Thanks, I'll look into this.
Does this workaround, changing exp() and cos() to expf() and cosf(), which are the correct math library functions for float, work for you?
Thanks,
Could you let us know whether the workaround works for you, so that we can close this issue?
Thanks,
We haven't heard from you, and since there is a workaround, we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.