Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Missing AVX-optimization of atan2f

Lars_Rosenboom__BIAS
  • Using Intel CC 14.0 under Visual Studio 2013SP2
    • atan2f()
      • with AVX: 3.915 sec.
      • with SSE2: 0.800 sec.
    • atanf() is not affected
      • with AVX: 0.475 sec.
      • with SSE2: 0.626 sec.
  • atan2() is widely used when calculating with complex numbers (to get the phase).
  • Double precision seems to be affected too, but the numbers are not as clear as with single precision.
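For readers unfamiliar with the phase use case mentioned above, here is a minimal sketch of how atan2 yields the phase of a complex number. The helper names `phase` and `phase_std` are made up for illustration and are not part of the original project.

```cpp
#include <cmath>
#include <complex>

// Phase (argument) of the complex number re + i*im, in radians.
// Hypothetical helper; note that atan2 takes the imaginary part first.
inline float phase(float re, float im) {
    return std::atan2(im, re);
}

// The same quantity through the standard library's std::arg.
inline float phase_std(std::complex<float> z) {
    return std::arg(z);
}
```

Applied to every element of a complex signal buffer, a call like this is where the atan2f cost shows up.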

 

Simplified example code:

const int iterations = 100000;
const int size = 2048;
float* a = new float[size];
float* b = new float[size];
// Initialize both input arrays with constants.
for (int i = 0; i < size; ++i) {
  a[i] = 1.1f;
  b[i] = 2.2f;
}

// Timed loop 1: atan2f
for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atan2f(a[i], b[i]);
  }
}
// Timed loop 2: atanf
for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atanf(b[i]);
  }
}

 

Options (simplified from a real-world project):

  • using SSE:
    /GS /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch"
  • using AVX:
    /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /arch:AVX /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch" 
TimP
Honored Contributor III

Your IPS support account is the way to submit issues where you require security, or wish to be able to track the response without depending on a volunteer from the Intel team.  As this appears to be a library issue, it may not be the direct responsibility of Intel people who monitor this site regularly.

I'm still not getting a response from the SAVE step at IPS, and the site is scheduled for downtime at the end of the week.  I thought my input might be helpful, since I set it up to verify with VTune.

There seems to have been some sort of spam attack on Intel sites the last few days; why it's so important to some people to deny us the use of the sites beats me, if in fact there's a connection.

Lars_Rosenboom__BIAS

I have to say, I cannot reproduce the timing results for the SSE2 case anymore. Maybe that was a mistake of mine.

With 64-bit code and AVX, the comparison between Intel CC and VC++ is interesting:

  • atan2f using VC++ is twice as fast as when using ICC (the missing optimization noted in my first post).
  • atan2f using VC++ is twice as fast as atanf when using VC++ (?! - didn't notice that before, maybe related to SP3)

 

Using Intel CC 14.0, 64-bit, AVX (calling __svml_atanf8/__svml_atan2f8):

  • ATan:           0.443 GFLOPS ( 0.462 sec.)
  • ATan2:          0.052 GFLOPS ( 3.912 sec.)

Using VS2013SP3, 64-bit, AVX (calling atanf/atan2f):

  • ATan:           0.051 GFLOPS ( 3.991 sec.)
  • ATan2:          0.111 GFLOPS ( 1.847 sec.)

 

 

Lars_Rosenboom__BIAS

OK, I needed to properly rebuild everything (I knew that but forgot).

Here are the results when using 32-bit with SSE2 (calling __svml_atanf4/__svml_atan2f4):

  • ATan:            0.333 GFLOPS ( 0.615 sec.)
  • ATan2:          0.280 GFLOPS ( 0.731 sec.)

 

Bernard
Valued Contributor I

 >>>Optimizing away 100000 iterations of atan with known initial input is possible at compile time, but seems very unlikely to me. If that is happening (say, if the timing results are implausible), one may add some rand() initialization and print the average of the resulting array "a".>>>

I suppose that ICC could optimize away the inner for-loop by removing the call statements from the run-time code. Of course, I do not expect further optimization such as compile-time array filling, which could eliminate the inner for-loop from the run-time code.
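The suggestion in the quote (random initialization plus printing the average) could be sketched roughly as follows. This is only an illustration: the helper names `fill_random`, `time_atan2` and `average` are made up, the value range and seeds are arbitrary, and `<random>` is used here in place of the rand() mentioned in the quote.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Fill v with pseudo-random values in [0.1, 10) so the inputs are not
// compile-time constants. Range and seed are arbitrary choices.
inline void fill_random(std::vector<float>& v, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.1f, 10.0f);
    for (float& x : v) x = dist(gen);
}

// Run the atan2 loop from the first post and return the elapsed seconds.
inline double time_atan2(std::vector<float>& a, const std::vector<float>& b,
                         int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < iterations; ++j)
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = std::atan2(a[i], b[i]);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Average of the results; printing this value makes the stores observable,
// so the compiler cannot discard the loop as dead code.
inline double average(const std::vector<float>& v) {
    double sum = 0.0;
    for (float x : v) sum += x;
    return v.empty() ? 0.0 : sum / static_cast<double>(v.size());
}
```

To use it, create two std::vector<float> of size 2048, fill each with fill_random, call time_atan2 with 100000 iterations, and print both the returned time and average(a).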

Lars_Rosenboom__BIAS

Filed as issue 6000062158: "Missing AVX-optimization of atan2f (__svml_atan2f8)"
(Intel C++ Compiler for Windows, Medium, 08/11/2014)

Bernard
Valued Contributor I

>>>I Think that compiler could went further in its optimization efforts and simply eliminate inner loop by calculating atan2 values and filling array in compile time>>>

I made a mistake in the quoted sentence. Of course the compiler will not fill a dynamically allocated array at compile time, because of the new operator.

TimP
Honored Contributor III

My Intel Premier account is blocked.  I've been getting some help from the support team but still can't file this new issue.  The site is scheduled to be down tonight, so we will be waiting another week or two on this.

TimP
Honored Contributor III

Submitted as Intel Premier issue 6000063006 during today's uptime between IPS site modifications.

The issue was reported closed as a duplicate of another submission, without further comment.

Lars_Rosenboom__BIAS

Hello,

This issue is fixed with version 16.0.110.

AVX:
atan2(): 0.864195 seconds
atan(): 0.33743 seconds

SSE2:
atan2(): 0.93485 seconds
atan(): 0.457738 seconds

Bye,
Lars
