Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

ICPX auto-vectorization of math functions

AFog
Beginner
969 Views

Auto-vectorization of loops with math functions works well with ICPX under Linux, but not under Windows.

 

The Windows version has only double precision vector math functions, and no bigger than 256 bits. The Linux version has everything.

 

// example
const int size = 256;
float a[size];
float b[size];
...
for (int i=0; i<size; i++) {
    b[i] = exp(a[i]);
}

ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16).

ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4).

__svml_expf16 is actually present in svml_dispmt.lib and can be called as _mm512_exp_ps.

 

0 Kudos
10 Replies
HemanthCH_Intel
Moderator
940 Views

Hi,

 

Thanks for posting in Intel Communities.

 

Could you please provide the following information so we can investigate your issue further?

  1. The complete reproducer code and the steps you followed on the Linux and Windows machines, so that we can reproduce the issue at our end.
  2. Are you using the command prompt or Visual Studio to run your code on Windows?
  3. How did you determine that "ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)" and that "ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)"?
  4. Are you using any intrinsics in your code?
  5. Which icpx version are you using?

 

Thanks & Regards,

Hemanth

 

0 Kudos
AFog
Beginner
929 Views

Thank you for your reply.

Steps to reproduce: Compile the below code in Visual Studio 2022.1, release mode, /arch:AVX512, /fp:fast.
(Intel® oneAPI DPC++ Compiler Package ID: w_oneAPI_2022.1.0.256, Intel® oneAPI DPC++ Compiler – toolkit version: 2022.2.0, extension version 22.0.0.17, Package ID: w_oneAPI_2022.1.0.256).

 

#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>
#include <math.h>

const int size = 256;
float aaaa[size] = {0};
float bbbb[size] = {0};
volatile int k = 1;

int main () {
    // prevent optimizing whole loop away:
    aaaa[k] = 5.64;
    for (int i=0; i<size; i++) {
        bbbb[i] = exp(aaaa[i]);    
    }
    // prevent optimizing whole loop away:
    aaaa[k] = bbbb[k+1];
    for (int i=0; i<size; i++) {
        bbbb[i] = cos(aaaa[i]);     
    }
    printf("\ncos(0) = %f\n", bbbb[0]);
    return int(bbbb[k]);
}

 

 The debugger shows this disassembly:

 

        bbbb[i] = exp(aaaa[i]);    
00007FF714E5BE2E  vcvtps2pd   zmm16,ymmword ptr [aaaa+20h (07FF714E94720h)]  
00007FF714E5BE38  vcvtps2pd   zmm0,ymmword ptr [aaaa (07FF714E94700h)]  
00007FF714E5BE42  call        __svml_exp8_z0 (07FF714E5C7F0h)  
00007FF714E5BE47  vmovaps     zmm17,zmm0  
00007FF714E5BE4D  vmovaps     zmm0,zmm16  
00007FF714E5BE53  call        __svml_exp8_z0 (07FF714E5C7F0h)  
00007FF714E5BE58  vcvtpd2ps   ymm1,zmm17  
00007FF714E5BE5E  vcvtpd2ps   ymm0,zmm0  
00007FF714E5BE64  vinsertf64x4 zmm0,zmm1,ymm0,1  
00007FF714E5BE6B  vmovaps     zmmword ptr [bbbb (07FF714E94B00h)],zmm0  

After further testing, I found that the problem goes away when I change exp to expf and cos to cosf. So the diagnosis is not poor auto-vectorization: the Windows version does vectorize the loop, but it fails to optimize exp to expf and cos to cosf, even under /fp:fast.
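A minimal sketch of that workaround, for anyone hitting the same thing: call the single-precision functions explicitly so that no float-to-double round trip is required. The helper names `exp_all` and `cos_all` are made up for illustration and are not part of the reproducer above. Whether the loops then compile to single-precision SVML calls still depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <math.h>

// Hypothetical helper: element-wise exp in single precision.
// expf takes and returns float, so the result does not depend on the
// compiler narrowing a double-precision exp call back to float.
void exp_all(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; i++) {
        out[i] = expf(in[i]);
    }
}

// Hypothetical helper: element-wise cos in single precision.
void cos_all(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; i++) {
        out[i] = cosf(in[i]);
    }
}
```

In the reproducer this would amount to replacing `exp(aaaa[i])` with `expf(aaaa[i])` and `cos(aaaa[i])` with `cosf(aaaa[i])`.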

 

 

 

0 Kudos
HemanthCH_Intel
Moderator
883 Views

Hi,


We are working on this internally and will get back to you soon.


Thanks & Regards,

Hemanth


0 Kudos
Viet_H_Intel
Moderator
858 Views

>> ICPX under Linux is using a 512-bit single precision function in the SVML library (__svml_expf16)

I tried your test case on Linux with icpx (Version 2022.1.0 Build 20220316) but couldn't see __svml_expf16 being called.

$ icpx -march=common-avx512 -fp-model=fast c05540193.cpp

$ objdump -d a.out |grep svml_expf16

$

>> ICPX under Windows is converting to double and using a 256-bit double precision function in the SVML library (__svml_exp4)

I tried the test case on Windows, but didn't see __svml_exp4 being called at all.

> icx /arch:AVX512 /fp:fast c05540193.cpp /O2

Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2022.1.0 Build 20220316

> dumpbin /DISASM c05540193.exe >dump.txt

> notepad dump.txt (no matches found for svml). dump.txt is attached.


Can you provide steps to reproduce what you have observed?




0 Kudos
Viet_H_Intel
Moderator
804 Views

Please provide instructions and a test case to reproduce the issue.

Thanks,


0 Kudos
AFog
Beginner
795 Views

Viet H.

Your own dump shows the same as mine. It is converting single to double precision at vcvtps2pd, then it is calling the double version of the vector function at call 00000001400019E0. Then converting the result back to single precision at vcvtpd2ps. This shows, as I wrote, that it fails to optimize exp to expf and cos to cosf, but it does vectorize. Your dump does not show the function names.

0 Kudos
Viet_H_Intel
Moderator
787 Views

It looks like auto-vectorization is no longer the issue.

Does your assembly dump show the function names? Can you attach the assembly file and point out where it fails to optimize exp to expf and cos to cosf?


Thanks,


0 Kudos
AFog
Beginner
774 Views

It is seen in your own dump and in the disassembly that I have posted above. It converts single precision to double precision (vcvtps2pd) before calling the double-precision version of the vector function (__svml_exp8_z0), then converts the result back to single precision (vcvtpd2ps).
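In scalar terms, the pattern described here corresponds roughly to the first helper below, while the second stays in single precision throughout. This is only an illustrative sketch: the names `exp_via_double`, `exp_single` and `check_agreement` are made up, and the generated code uses vector routines rather than scalar calls.

```cpp
#include <cassert>
#include <math.h>

// Roughly the per-element effect of the disassembly: widen to double
// (vcvtps2pd), call the double-precision exp, narrow back (vcvtpd2ps).
inline float exp_via_double(float x) {
    double wide = static_cast<double>(x);
    double result = exp(wide);          // double-precision function
    return static_cast<float>(result);
}

// Stays in single precision: no conversions around the call.
inline float exp_single(float x) {
    return expf(x);
}

// Sanity check: both paths agree to within float rounding, so the
// difference is in cost (two conversions plus a double-width call),
// not in the stored result.
inline void check_agreement(float x) {
    float a = exp_via_double(x);
    float b = exp_single(x);
    assert(fabsf(a - b) <= 1e-5f * fabsf(b));
}
```

The tolerance in `check_agreement` is an arbitrary choice for illustration, not a statement about SVML accuracy.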

0 Kudos
Viet_H_Intel
Moderator
743 Views

Thanks, I'll look into this.


0 Kudos