
Missing AVX-optimization of atan2f

  • Using Intel CC 14.0 under Visual Studio 2013SP2
    • atan2f()
      • with AVX: 3.915 sec.
      • with SSE2: 0.800 sec.
    • atanf() is not affected
      • with AVX: 0.475 sec.
      • with SSE2: 0.626 sec.
  • atan2() is widely used when calculating with complex numbers (to get the phase).
  • Double precision seems to be affected too, but the numbers are not as clear as with single precision.
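
To illustrate the phase use case mentioned above, here is a minimal sketch, not taken from the original project. The helper name `phases` is hypothetical, and the float overload of `std::atan2` is assumed to stand in for `atan2f`:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): returns the phase of
// each complex sample, computed as atan2(imag, real). With float arguments
// std::atan2 selects the single-precision overload, i.e. the atan2f case.
std::vector<float> phases(const std::vector<std::complex<float>>& z) {
    std::vector<float> out(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) {
        out[i] = std::atan2(z[i].imag(), z[i].real());
    }
    assert(out.size() == z.size());
    return out;
}
```

A loop of this shape is the kind a vectorizing compiler may turn into a vector math library call, which is why a slow vector atan2f would matter for complex-number code.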

 

Simplified example code:

const int iterations = 100000;
const int size = 2048;
float* a = new float[size];
float* b = new float[size];
// Fill both arrays with constant inputs.
for (int i = 0; i < size; ++i) {
  a[i] = 1.1f;
  b[i] = 2.2f;
}

// Loop 1: atan2f (the slow case with /arch:AVX).
for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atan2f(a[i], b[i]);
  }
}
// Loop 2: atanf (not affected).
for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atanf(b[i]);
  }
}

 

Options (simplified from real world project)

  • using SSE:
    /GS /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch"
  • using AVX:
    /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /arch:AVX /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch" 
29 Replies
TimP
Black Belt
244 Views

I see a significant increase in run time when adding /arch:CORE-AVX2 or /arch:AVX to the build options, in the 15.0 beta compiler as well (using VS2012).  I don't know if the differences are buried in the svml library, __svml_atan2f4 vs. __svml_atan2f8. A plain AVX box (which I don't have available) may be needed to see whether an svml performance regression appears with AVX as opposed to AVX2.

It would be preferable to try the cases with 32-byte alignment; otherwise results may not be consistent (even with SSE).  I didn't see this possible issue having an effect in my attempt.
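
One way to get 32-byte-aligned test buffers is sketched below. It assumes the toolchain accepts the C++11 `alignas` specifier; older compilers may need `__declspec(align(32))` or `_mm_malloc` instead. The names `a_aligned`, `b_aligned`, and `is_aligned` are hypothetical and not from the original test case:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kSize = 2048;

// Static buffers aligned to 32 bytes, the width of one AVX register.
alignas(32) float a_aligned[kSize];
alignas(32) float b_aligned[kSize];

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Checking `is_aligned(a_aligned, 32)` before the timed loops confirms that both the SSE and the AVX build start from the same alignment conditions.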

The AVX (or AVX2) version appears to perform about the same as with the Microsoft compiler.  So I will try to figure out whether the time is actually spent in __svml_atan2f8 or maybe it fails to enter the vectorized loop version.

g++ (where there is no vectorization of atan) appears to run the case much faster, but I suspect it may be short-cutting timing loops.

TimP
Black Belt
244 Views

I'm finding Windows 8.1 decidedly unfriendly as to how I might view the VTune screenshot.  Possibly it's due to my having to remove NotePad due to other issues with that.

It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s

which appears to confirm that it frequently takes a non-vector branch inside svml_atan2f8.  By contrast, the SSE version shows

_svml_atan2f4_h9     0.8s

_svml_atanf4_h9       0.5s

Bernard
Black Belt
244 Views

@Tim

>>>It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s>>>

Are these functions ordered in some caller-callee relationship?

Bernard
Black Belt
244 Views

I wonder whether the compiler optimized the second loop (probably by removing it and calculating the atan2f values only once), since atan2f was operating on arrays filled with identical values.

 
Marián__VooDooMan__M
New Contributor II
244 Views

Couldn't this be related to my report @ https://software.intel.com/en-us/forums/topic/516011 by chance?

TimP
Black Belt
244 Views

iliyapolak wrote:

@Tim

>>>It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s>>>

Are these functions ordered in some caller-callee relationship?

I believe those atan2 functions are called somewhere inside svml library, as the reference in the compiled .obj is to the top level svml entry point atan2f8().  There is no significant time spent in that entry point function.

Bernard
Black Belt
244 Views

I think that the bulk of the computation is done by the "_svml_atan2f8_l9" function.

TimP
Black Belt
244 Views

iliyapolak wrote:

I wonder if compiler optimized the second loop (probably by removing it and only once calculating atan2f values) where atan2f was operating on the arrays filled with identical values.

 

I think ICL is not short-cutting anything, but I do believe g++ does short-cut, as it runs fast (unless I set -O0) even though there is no vector math library.  By "second loop" you must not mean the atanf loop, which is certainly expected to be faster than atan2f.  As the original post implied, the AVX svml version of atan2f ought to be at least as fast as the SSE, where the AVX atanf could reasonably be close to double SSE speed and does show a good improvement.

TimP
Black Belt
244 Views

Marián "VooDooMan" Meravý wrote:

Couldn't this be related to my report @ https://software.intel.com/en-us/forums/topic/516011 by chance?

In this case, the compiler reports vectorization and builds in a call to the svml library, both for SSE2 and for AVX compilation, but the one AVX svml function turns out not to be vectorized effectively internally.  You could at least show us whether the compiler attempts to make a vectorized fmod() function call in your case; if so, does that fail to give a performance improvement?

TimP
Black Belt
244 Views

iliyapolak wrote:

I think that bulk of the computation is done by "_svml_atan2f8_l9 _" function.

That may be, but the overall effect shows no speedup over Microsoft scalar function when the compiler builds in call to the AVX vector function, while the SSE vector function works as expected.

Bernard
Black Belt
244 Views

Tim Prince wrote:

Quote:

iliyapolak wrote:

I wonder if compiler optimized the second loop (probably by removing it and only once calculating atan2f values) where atan2f was operating on the arrays filled with identical values.

 

I think ICL is not short-cutting anything, but I do believe g++ does short-cut, as it runs fast (unless I set -O0) even though there is no vector math library.  By "second loop" you must not mean the atanf loop, which is certainly expected to be faster than atan2f.  As the original post implied, the AVX svml version of atan2f ought to be at least as fast as the SSE, where the AVX atanf could reasonably be close to double SSE speed and does show a good improvement.

Sorry, I should have formulated my answer differently. My assumption was that the compiler would evaluate the atan2f() call only once, at compile time, probably by analysing the arguments of the called function and recognizing that their values do not change during the n loop iterations. I think the compiler could go even further in its optimization efforts and simply eliminate the inner loop by calculating the atan2 values and filling the array at compile time.

I suppose that only atan2f() calculation was done at compile time.

TimP
Black Belt
244 Views

It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not.
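
A common way to reduce the risk of such short-cutting is to time the loop explicitly and then consume the results. The sketch below is an assumption about how that could look, not the harness used in this thread; `time_atan2f` and its parameters are hypothetical names:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical harness: applies atan2 element-wise `iterations` times,
// returns the elapsed wall-clock seconds, and writes the sum of the final
// array to *checksum so the computed values are actually used.
double time_atan2f(std::vector<float>& a, const std::vector<float>& b,
                   int iterations, float* checksum) {
    assert(a.size() == b.size());
    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < iterations; ++j) {
        for (std::size_t i = 0; i < a.size(); ++i) {
            a[i] = std::atan2(a[i], b[i]);  // float overload, i.e. atan2f
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    float sum = 0.0f;
    for (float v : a) {
        sum += v;
    }
    *checksum = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

Printing the checksum after the call makes the results observable, which discourages dead-code elimination, although it does not strictly guarantee that a compiler leaves every iteration in place.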

The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), which I think should be directed at library implementation.  I hope the IPS web site may be available for a few days next week; the messages I received indicated no scheduled down time during the next 6 days.

Bernard
Black Belt
244 Views

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at assembly  code level?

Bernard
Black Belt
244 Views

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes I agree with you. I think that I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved in runtime?

TimP
Black Belt
244 Views

iliyapolak wrote:

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at assembly  code level?

I wouldn't suggest that.  I'll watch for the IPS premier web site to become available the next few days.

TimP
Black Belt
244 Views

iliyapolak wrote:

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes I agree with you. I think that I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved in runtime?

Yes, it appears that all the calls to svml functions are made, and the relative timings with icc should be meaningful.

Bernard
Black Belt
244 Views

Tim Prince wrote:

Quote:

iliyapolak wrote:

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at assembly  code level?

 

I wouldn't suggest that.  I'll watch for the IPS premier web site to become available the next few days.

Ok it seems more reasonable thing to do.

Bernard
Black Belt
244 Views

Tim Prince wrote:

Quote:

iliyapolak wrote:

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes I agree with you. I think that I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved in runtime?

 

Yes, it appears that all the calls to svml functions are made, and the relative timings with icc should be meaningful.

I wonder if the function calls are present because of the array data type?

Bernard
Black Belt
244 Views

IIRC there is a second or even a third case where ICC did not remove function calls with constant arguments when an array type was present.

111 Views

First, thanks to all for the quick confirmation.

Changing the second loop (the atanf at line 31) to be "self-referencing"

a[i] = atanf(a[i]);

should disable even gcc's loop-eliding optimizations. Optimizing away 100000 iterations of atan with known initial input is possible at compile time, but seems very unlikely to me. If that is happening (say, if the timing results are implausible), one may add some rand() initialization and print the average of the resulting array "a".

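	// Needs <cmath>, <cstdlib> and <iostream>; cout/endl assume "using namespace std".
	const int iterations = 100000;
	const int size = 2048;
	float* a = new float[size];
	float* b = new float[size];

	// Random inputs, so the results cannot be precomputed at compile time.
	// b is initialized here too, because the atan2f loop below reads it.
	for (int i = 0; i < size; ++i) {
		a[i] = static_cast<float>(rand());
		b[i] = static_cast<float>(rand());
	}

	for (int j = 0; j < iterations; ++j) {
		for (int i = 0; i < size; ++i) {
			a[i] = atan2f(a[i], b[i]);
		}
	}

	// Printing the average keeps the results live.
	float averageA = 0.0f;
	for (int i = 0; i < size; ++i) {
		averageA += a[i];
	}
	averageA /= size;
	cout << "Average of array a: " << averageA << endl;


	for (int i = 0; i < size; ++i) {
		a[i] = static_cast<float>(rand());
		b[i] = static_cast<float>(rand());
	}

	for (int j = 0; j < iterations; ++j) {
		for (int i = 0; i < size; ++i) {
			a[i] = atanf(a[i]);
		}
	}

	averageA = 0.0f;
	for (int i = 0; i < size; ++i) {
		averageA += a[i];
	}
	averageA /= size;
	cout << "Average of array a: " << averageA << endl;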

Using Intel CC 14.0, both loops (atanf and atan2f) are calling the SVML functions (__svml_atanf8 and __svml_atan2f8, respectively).

Thanks for pointing out IPS; I was not aware of it. I thought this forum would be the appropriate place to file a bug.
