ICPX auto-vectorization of math functions

AFog · ‎07-24-2022

Auto-vectorization of loops with math functions works well with ICPX under Linux, but not under Windows.

The Windows version has only double precision vector math functions, and no bigger than 256 bits. The Linux version has everything.

// example
const int size = 256;
float a[size];
float b[size];
...
for (int i=0; i<size; i++) {
    b[i] = exp(a[i]);
}

ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16).

ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4).

__svml_expf16 is actually present in svml_dispmt.lib and can be called as _mm512_exp_ps.

HemanthCH_Intel · ‎07-26-2022

Hi,

Thanks for posting in Intel Communities.

Could you please provide us with the following information so that we can investigate your issue further?

Could you share the complete reproducer code and the steps you followed on Linux and Windows, so that we can reproduce the issue on our end?
Please confirm whether you are using the command prompt or Visual Studio to run your code on Windows.
How are you identifying that "ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)" and that "ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)"?
Are you using any intrinsics in your code?
Also, could you please let us know the icpx version?

Thanks & Regards,

Hemanth

AFog · ‎07-26-2022

Thank you for your reply.

Steps to reproduce: Compile the code below in Visual Studio 2022.1, release mode, /arch:AVX512, /fp:fast.
(Intel® oneAPI DPC++ Compiler Package ID: w_oneAPI_2022.1.0.256, Intel® oneAPI DPC++ Compiler – toolkit version: 2022.2.0, extension version 22.0.0.17, Package ID: w_oneAPI_2022.1.0.256).

#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>

const int size = 256;
float aaaa[size] = {0};
float bbbb[size] = {0};
volatile int k = 1;

int main () {
    // prevent optimizing whole loop away:
    aaaa[k] = 5.64;
    for (int i=0; i<size; i++) {
        bbbb[i] = exp(aaaa[i]);    
    }
    // prevent optimizing whole loop away:
    aaaa[k] = bbbb[k+1];
    for (int i=0; i<size; i++) {
        bbbb[i] = cos(aaaa[i]);     
    }
    printf("\ncos(0) = %f\n", bbbb[0]);
    return int(bbbb[k]);
}

The debugger shows this disassembly:

        bbbb[i] = exp(aaaa[i]);    
00007FF714E5BE2E  vcvtps2pd   zmm16,ymmword ptr [aaaa+20h (07FF714E94720h)]  
00007FF714E5BE38  vcvtps2pd   zmm0,ymmword ptr [aaaa (07FF714E94700h)]  
00007FF714E5BE42  call        __svml_exp8_z0 (07FF714E5C7F0h)  
00007FF714E5BE47  vmovaps     zmm17,zmm0  
00007FF714E5BE4D  vmovaps     zmm0,zmm16  
00007FF714E5BE53  call        __svml_exp8_z0 (07FF714E5C7F0h)  
00007FF714E5BE58  vcvtpd2ps   ymm1,zmm17  
00007FF714E5BE5E  vcvtpd2ps   ymm0,zmm0  
00007FF714E5BE64  vinsertf64x4 zmm0,zmm1,ymm0,1  
00007FF714E5BE6B  vmovaps     zmmword ptr [bbbb (07FF714E94B00h)],zmm0

After further testing, I found that the problem is solved when I change exp to expf and cos to cosf. So the diagnosis is not poor auto-vectorization; rather, the Windows version fails to optimize exp to expf and cos to cosf, even under /fp:fast.
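For readers who want to try the change, here is a minimal sketch of the loops rewritten with explicit single-precision calls. The helper names exp_array_f and cos_array_f are hypothetical and not from the reproducer. The static_assert lines only check that the float overloads of std::exp and std::cos in <cmath> return float. Whether a given compiler version then picks a single-precision vector function is something to confirm in the disassembly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <math.h>
#include <type_traits>

// <cmath> overloads std::exp and std::cos for float, so a float argument
// gives a float result without widening to double.
static_assert(std::is_same<decltype(std::exp(1.0f)), float>::value,
              "std::exp(float) should return float");
static_assert(std::is_same<decltype(std::cos(1.0f)), float>::value,
              "std::cos(float) should return float");

// Hypothetical helper (name is an assumption): element-wise expf over n floats.
void exp_array_f(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; i++) {
        out[i] = expf(in[i]);  // explicit single-precision call
    }
}

// Hypothetical helper (name is an assumption): element-wise cosf over n floats.
void cos_array_f(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; i++) {
        out[i] = cosf(in[i]);  // explicit single-precision call
    }
}
```

Calling exp_array_f(aaaa, bbbb, size) in place of the first loop, and cos_array_f(aaaa, bbbb, size) in place of the second, keeps the arithmetic in float from start to finish.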

HemanthCH_Intel · ‎07-28-2022

Hi,

We are working on this internally and will get back to you soon.

Thanks & Regards,

Hemanth

Viet_H_Intel · ‎08-01-2022

>> ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)

I tried your test case on Linux with icpx (Version 2022.1.0 Build 20220316) but couldn't see __svml_expf16 being called.

$ icpx -march=common-avx512 -fp-model=fast c05540193.cpp

$ objdump -d a.out |grep svml_expf16

$

>> ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)

I tried the test case on Windows, but didn't see __svml_exp4 being called at all.

> icx /arch:AVX512 /fp:fast c05540193.cpp /O2

Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2022.1.0 Build 20220316

> dumpbin /DISASM c05540193.exe >dump.txt

> notepad dump.txt (no matches found for svml). dump.txt is attached.

Can you provide steps to reproduce what you have observed?

Viet_H_Intel · ‎08-01-2022

dump.txt

Viet_H_Intel · ‎08-10-2022

Please provide us with instructions and a test case to reproduce the issue.

Thanks,

AFog · ‎08-10-2022

Viet H.

Your own dump shows the same as mine. It is converting single to double precision at vcvtps2pd, then it is calling the double version of the vector function at call 00000001400019E0. Then converting the result back to single precision at vcvtpd2ps. This shows, as I wrote, that it fails to optimize exp to expf and cos to cosf, but it does vectorize. Your dump does not show the function names.

Viet_H_Intel · ‎08-11-2022

It looks like auto-vectorization is no longer the issue.

Does your assembly dump show the function names? Can you attach the assembly file and point out where it fails to optimize exp to expf and cos to cosf?

Thanks,

AFog · ‎08-11-2022

It is seen in your own dump and in the disassembly that I have posted above. It converts single precision to double precision (vcvtps2pd) before calling the double-precision version of the vector function (__svml_exp8_z0), then converts the result back to single precision (vcvtpd2ps).
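At source level, the instruction sequence described above corresponds roughly to the first function in the sketch below, while the second function is the single-precision call one would hope to get. The function names are hypothetical, and the instruction names in the comments refer to the disassembly posted earlier in this thread. This is a scalar illustration only, not the compiler's actual output.

```cpp
#include <cassert>
#include <math.h>

// Scalar picture of the generated code: widen, call double exp, narrow.
float exp_via_double(float x) {
    double wide = static_cast<double>(x);  // vcvtps2pd in the disassembly
    double result = exp(wide);             // double-precision call (__svml_exp8_z0 when vectorized)
    return static_cast<float>(result);     // vcvtpd2ps in the disassembly
}

// The single-precision path: no conversions around the call.
float exp_single(float x) {
    return expf(x);
}

// Both paths should agree up to float rounding. The tolerance of 1e-5
// relative is an assumption chosen to be loose, not a documented bound.
void check_agreement(float x) {
    float a = exp_via_double(x);
    float b = exp_single(x);
    assert(fabsf(a - b) <= 1e-5f * fabsf(a));
}
```

The two paths give nearly the same values, so the difference is cost rather than correctness: the double path handles half as many elements per vector register and adds two conversions per call.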

Viet_H_Intel · ‎08-15-2022

Thanks, I'll look into this.

Viet_H_Intel · ‎02-01-2024

Does the workaround of changing exp() and cos() to expf() and cosf(), which are the correct math library functions for float, work for you?

Thanks,

Viet_H_Intel · ‎04-05-2024

Would you let us know whether the workaround works for you, so that we can close this issue?

Thanks,

Viet_H_Intel · ‎04-09-2024

We haven't heard from you, and since there is a workaround, we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.