Switching to MKL 10.2 shows that vzabs takes advantage of threading, but it only just matches the single-threaded C++ implementation in wall-clock time.
vasci_intel, we would like to reproduce this result. Can you please specify some details?
What kind of loop did you use for the comparison:
1) for(;;) { sqrt(Im(z)*Im(z) + Re(z)*Re(z)); }
2) for(;;) { cabs(z); }
3) Something else
Was this loop vectorized by the compiler?
What version of the compiler was used?
And what system do you use (IA32/Intel 64, CPU)?
We will try to look deeper into the case.
Given the above, the sample code is non-trivial to generate, but I will take the time to do so and submit it here.
I have done some more investigation on this...
The environment is actually:
Intel C++ 10.1.025, IA-32, MKL 10.2 Update 3, Windows XP, Pentium D 3.00 GHz. Timing is done for size=50000, with enough loop repetitions to get meaningful numbers.
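typedef std::complex<double> dcomplex;
dcomplex * a = new dcomplex[size];
double * b = new double[size];
for(int i=0; i<size; i++) {
    // ... (body that fills a[i] was not preserved in the original post)
}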
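test 1, using std::complex
for(int i=0; i<size; i++) {
    b[i] = std::abs(a[i]);   // loop body reconstructed from context; the original line was lost
}
CPU Time: 21.57 s (wallclock = 21.81 s)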
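test 2, using an inline naive abs() function (more similar to my custom C++ Complex library)
for(int i=0; i<size; i++) {
    dcomplex aa = a[i];      // declaration reconstructed from context; the original line was truncated
    double r=aa.real();
    double im=aa.imag();
    b[i]=sqrt(r*r + im*im);
}
CPU Time: 0.6875 s (wallclock = 0.687 s)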
test 3, using VML
vzabs( &size, (MKL_Complex16 *)a, b );
CPU Time: 1.13 s (wallclock = 0.566 s)
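For anyone wanting to reproduce figures like the ones above, here is a minimal sketch of how CPU time and wall-clock time could be measured side by side. It is not the harness used for the numbers in this post. The names Timing and time_naive_abs are hypothetical, it assumes a C++11 compiler (newer than the Intel C++ 10.1 mentioned above), and it times only the naive loop from test 2. Note that std::clock is specified to return processor time, but the Microsoft C runtime reports elapsed time instead, so on Windows a different CPU-time source would be needed.

```cpp
#include <chrono>
#include <cmath>
#include <complex>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical result type: both clocks, in seconds.
struct Timing {
    double cpu_seconds;   // from std::clock (processor time on most platforms)
    double wall_seconds;  // from std::chrono::steady_clock (elapsed time)
};

// Runs the naive abs loop `repeats` times over `a`, writing results into `b`.
// Assumes b.size() >= a.size().
inline Timing time_naive_abs(const std::vector<std::complex<double>>& a,
                             std::vector<double>& b, int repeats) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < repeats; ++rep) {
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double r = a[i].real();
            const double im = a[i].imag();
            b[i] = std::sqrt(r * r + im * im);
        }
    }
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return { static_cast<double>(c1 - c0) / CLOCKS_PER_SEC,
             std::chrono::duration<double>(w1 - w0).count() };
}
```

A CPU time of roughly twice the wall-clock time, as in test 3, is what one would expect when the work is split across two cores.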
Looking at the code for std::complex::abs, I had not realized that there is, in general, a more numerically stable way to compute abs() than the naive implementation. std::complex::abs does this, and it is obviously painfully slow. If vzabs is doing something similar, then vzabs is clearly very fast by comparison (and it is threaded efficiently).
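To illustrate the kind of "more numerically stable" approach referred to above, here is a minimal sketch of the classic scaling trick. It is an assumption that std::abs or vzabs do something along these lines; their actual implementations are not shown in this thread and may well differ. The function names naive_abs and scaled_abs are made up for the example, and NaN inputs are not treated specially.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>

// Naive magnitude: the squares can overflow or underflow before the sqrt.
inline double naive_abs(const std::complex<double>& z) {
    const double r = z.real();
    const double im = z.imag();
    return std::sqrt(r * r + im * im);
}

// Scaled magnitude (illustrative only): divide by the larger component so
// that the squared ratio stays in [0, 1] and cannot overflow.
inline double scaled_abs(const std::complex<double>& z) {
    const double r = std::fabs(z.real());
    const double im = std::fabs(z.imag());
    const double hi = std::max(r, im);
    const double lo = std::min(r, im);
    if (hi == 0.0) return 0.0;       // both parts are zero; avoid 0/0
    if (std::isinf(hi)) return hi;   // avoid inf/inf producing NaN
    const double q = lo / hi;        // 0 <= q <= 1
    return hi * std::sqrt(1.0 + q * q);
}
```

The extra division, comparisons and branches are one plausible reason a careful abs() costs more per element than the naive formula.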
vasci_intel,
Thank you for your answers.
Yes, vzAbs implements a numerically enhanced algorithm that gives accurate answers for all valid arguments, while the naive implementation sqrt(r*r + im*im) shows total accuracy loss for about half of the representable floating-point numbers.
Let us consider A = max(abs(r), abs(im)). If A < 2^-538 (roughly), then sqrt(r*r + im*im) will give zero as a result, but the correct result should not be smaller than A. If A > 2^+512 (roughly), then sqrt(r*r + im*im) will give infinity, while the correct result should not be larger than A*sqrt(2). Together these two ranges cover about half of the valid FP numbers.
Still, we appreciate your input, and if the naive implementation is sufficient for your needs, then we can consider more optimization effort in the VML relaxed accuracy modes for vzAbs. Do you know if vzAbs is a hotspot in your application?
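The two thresholds quoted above can be checked with a few lines of C++. This is only a sketch under stated assumptions: IEEE 754 double arithmetic with round-to-nearest, no fast-math options, and no extended-precision intermediates (x87 code on IA-32 may keep extra exponent range and behave differently). The names naive_magnitude and naive_result_is_lost are hypothetical, and std::hypot (C++11) is used as the reference, not vzAbs.

```cpp
#include <cmath>

// Naive magnitude of r + i*im: squares first, then takes the square root.
inline double naive_magnitude(double r, double im) {
    return std::sqrt(r * r + im * im);
}

// True when the naive formula has lost the result entirely (zero or infinity)
// although std::hypot still returns a finite, nonzero magnitude.
inline bool naive_result_is_lost(double r, double im) {
    const double naive = naive_magnitude(r, im);
    const double ref = std::hypot(r, im);
    const bool ref_ok = std::isfinite(ref) && ref > 0.0;
    return ref_ok && (naive == 0.0 || std::isinf(naive));
}
```

Under those assumptions, naive_result_is_lost(std::ldexp(1.0, -538), 0.0) and naive_result_is_lost(std::ldexp(1.0, 512), 0.0) both return true, while moderate inputs such as (1.0, 1.0) return false.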
