10.1 + VML vzabs seems slow

AndrewC · ‎11-06-2009

I am using vzabs (MKL 10.1) to take absolute values, abs=sqrt(r*r + i*i), of vectors of 10000 double-precision complex values. Surprisingly, a single call to vzabs is slower by a factor of 1.5 than a simple C++ 'for' loop implementation.
Switching to 10.2 shows that vzabs takes advantage of threading, but it only just matches the single-threaded C++ implementation in wall-clock time.

TimP · ‎11-06-2009

If you are writing it out in the form you quote, with no protection against over/underflow, your own vectorized code should give full performance, and your test loop may be long enough to show a gain with threaded parallelization.

Ilya_B_Intel · ‎11-09-2009

vasci_intel, we would like to reproduce this result. Can you please specify some details?
What kind of loop did you use for comparison:
1) for(;;) { sqrt(Im(z)*Im(z) + Re(z)*Re(z)); }
2) for(;;) { cabs(z); }
3) Something else
Was this loop vectorized by the compiler?
What version of the compiler was used?
And what system do you use (IA32/Intel 64, CPU)?
We will try to look deeper into the case.

AndrewC · ‎11-10-2009

I am using the Intel 10.1.025 C++ Windows x86_64 compiler on a Pentium D (Intel 64). My code is actually a fairly complete C++ matrix library using a custom "Complex" class (not std::complex). The core routine is simply sqrt(re*re + im*im). It appears the Intel compiler is doing a very good job of inlining the various C++ calls.

Given the above, the sample code is non-trivial to generate, but I will take the time to do so and submit it here.

AndrewC · ‎11-10-2009

I have done some more investigation on this...
The environment is actually
Intel C++ 10.1.025, ia32, MKL 10.2 Update 3, Windows XP, Pentium D, 3.00 GHz. Timing is done for size=50000, with enough repetitions of each loop to get meaningful numbers...

typedef std::complex<double> dcomplex;
dcomplex * a = new dcomplex[size];
double * b = new double[size];

for(int i=0; i<size; i++) {
    a[i]=dcomplex(4.0,3.0);
}

test 1, using std::complex
for(int i=0; i<size; i++) {
    b[i]=abs(a[i]);
}
CPU Time : 21.57 s ( wallclock = 21.81)

test 2 using inline naive abs() function (more similar to my custom C++ Complex library)
for(int i=0; i<size; i++) {
    const dcomplex &aa=a[i];
    double r=aa.real();
    double im=aa.imag();
    b[i]=sqrt(r*r +im*im);
}
CPU Time: 0.6875 ( wallclock = 0.687)

test 3 using VML
vzabs( &size, (MKL_Complex16 *)a,b );
CPU Time: 1.13 ( wallclock = 0.566)

Looking at the code for std::complex::abs, I had not realized that there is, in general, a more numerically stable way to compute abs() than the naive implementation. std::complex::abs does this, and is evidently painfully slow as a result. If vzabs is doing something similar, then vzabs is clearly very fast by comparison (and it is threaded efficiently).
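For readers who want to repeat the comparison of tests 1 and 2 without MKL, here is a minimal sketch of a timing harness. It assumes a C++11 compiler; the names abs_std, abs_naive and time_seconds are hypothetical helpers introduced for illustration, not part of the original test program, and the vzabs call of test 3 is deliberately left out.

```cpp
#include <chrono>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> dcomplex;

// Test 1: library abs(), which may guard against overflow/underflow.
void abs_std(const std::vector<dcomplex>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        b[i] = std::abs(a[i]);
    }
}

// Test 2: naive sqrt(re*re + im*im), with no range protection.
void abs_naive(const std::vector<dcomplex>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double r = a[i].real();
        const double im = a[i].imag();
        b[i] = std::sqrt(r * r + im * im);
    }
}

// Hypothetical helper: runs `kernel` `repeats` times and returns the
// elapsed wall-clock time in seconds.
template <typename Kernel>
double time_seconds(Kernel kernel, const std::vector<dcomplex>& a,
                    std::vector<double>& b, int repeats) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int k = 0; k < repeats; ++k) {
        kernel(a, b);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling time_seconds(abs_std, a, b, repeats) and time_seconds(abs_naive, a, b, repeats) on vectors of length 50000 filled with dcomplex(4.0,3.0) mirrors the setup above. An optimizing compiler may hoist or drop repeated identical work, so the numbers should be read with that caveat in mind.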

Ilya_B_Intel · ‎11-11-2009

vasci_intel,

Thank you for your answers.

Yes, vzAbs implements a numerically enhanced algorithm that gives accurate results for all valid arguments, while the naive implementation sqrt(r*r +im*im) shows total accuracy loss for about half of the representable floating-point numbers.

Let us consider A=max(abs(r),abs(im)). If A<2^-538 (roughly), then sqrt(r*r+im*im) will give zero as a result, but the correct result should not be smaller than A. If A>2^+512 (roughly), then sqrt(r*r+im*im) will give infinity, while the correct result should not be larger than A*sqrt(2). Together these two ranges cover about half of the valid FP numbers.
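The two failure ranges described above can be reproduced with a short sketch. It compares the naive formula with one common scaling technique, which divides by the larger component before squaring. This is only an illustration of the idea: it is not the algorithm used inside vzAbs, the function names naive_abs and scaled_abs are hypothetical, and infinities and NaNs are not handled.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Naive magnitude: r*r or im*im can underflow to zero or overflow to
// infinity even when the true result is representable.
inline double naive_abs(const std::complex<double>& z) {
    const double r = z.real();
    const double im = z.imag();
    return std::sqrt(r * r + im * im);
}

// Scaled magnitude (illustrative, not the vzAbs algorithm): factor out
// the larger component so that only a ratio in [0, 1] is squared.
// Assumes finite inputs; infinities and NaNs are not treated specially.
inline double scaled_abs(const std::complex<double>& z) {
    const double r = std::fabs(z.real());
    const double im = std::fabs(z.imag());
    const double hi = (r > im) ? r : im;
    const double lo = (r > im) ? im : r;
    if (hi == 0.0) {
        return 0.0;  // both components are zero
    }
    const double q = lo / hi;  // 0 <= q <= 1, so q*q cannot overflow
    return hi * std::sqrt(1.0 + q * q);
}
```

With double precision arithmetic carried out in 64-bit registers, naive_abs returns zero for an input such as (3e-200, 4e-200) and infinity for (3e200, 4e200), while scaled_abs stays close to 5e-200 and 5e200 respectively. The extra division is part of the reason a range-safe abs costs more than the naive loop.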

Still, we appreciate your input. If the naive implementation is sufficient for your needs, then we can consider more optimization effort in the VML relaxed-accuracy modes for vzAbs. Do you know if vzAbs is a hotspot in your application?