Switching to 10.2 shows that vzabs takes advantage of threading, but it only just matches the single-threaded C++ implementation in wall-clock time.
vasci_intel, we would like to reproduce this result. Could you please provide some details?
What kind of loop did you use for the comparison?
1) for(;;) { sqrt(Im(z)*Im(z) + Re(z)*Re(z)); }
2) for(;;) { cabs(z); }
3) Something else
Was this loop vectorized by the compiler?
What version of the compiler was used?
And what system do you use (IA32/Intel 64, CPU)?
We will try to look deeper into this case.
Given the above, the sample code is non-trivial to generate, but I will take the time to do so and submit it here.
I have done some more investigation on this...
The environment is actually:
Intel C++ 10.1.025, ia32, MKL 10.2 Update 3, Windows XP, Pentium D 3.00 GHz. Timing is done for size=50000, with enough loop iterations to get meaningful numbers.
typedef std::complex<double> dcomplex;
dcomplex * a = new dcomplex[size];
double * b = new double[size];
for(int i=0; i<size; i++) {
    ...   // fill a[i] with input data
}
Test 1, using std::complex:
for(int i=0; i<size; i++) {
    b[i] = std::abs(a[i]);
}
CPU time: 21.57 s (wall clock = 21.81 s)
Test 2, using an inline naive abs() function (more similar to my custom C++ Complex library):
for(int i=0; i<size; i++) {
    const dcomplex & aa = a[i];
    double r=aa.real();
    double im=aa.imag();
    b[i]=sqrt(r*r +im*im);
}
CPU time: 0.6875 s (wall clock = 0.687 s)
Test 3, using VML:
vzabs( &size, (MKL_Complex16 *)a, b );
CPU time: 1.13 s (wall clock = 0.566 s)
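For anyone who wants to reproduce numbers of this kind, here is a minimal sketch of a helper that reports CPU time and wall-clock time separately, which is what makes the threaded case visible (CPU time larger than wall-clock time). The names `Timing` and `time_it` are hypothetical and are not taken from the original test program. One caveat: std::clock() returns process CPU time on POSIX systems, but the Microsoft C runtime returns elapsed wall time, so on Windows a different CPU-time source would be needed.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper: CPU time and wall-clock time for a repeated workload.
struct Timing {
    double cpu_seconds;   // process CPU time as reported by std::clock()
    double wall_seconds;  // elapsed time from a monotonic clock
};

template <typename F>
Timing time_it(F&& work, int repeats) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    for (int k = 0; k < repeats; ++k) {
        work();  // e.g. one pass over all `size` elements
    }
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return { static_cast<double>(c1 - c0) / CLOCKS_PER_SEC,
             std::chrono::duration<double>(w1 - w0).count() };
}
```

A call such as `time_it([&] { for (int i = 0; i < size; i++) b[i] = std::abs(a[i]); }, loops)` would wrap one of the test loops, where `loops` is whatever repeat count gives meaningful numbers.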
Looking at the code for std::complex::abs, I had not realized that there is, in general, a more numerically stable way to compute abs() than the naive implementation. std::complex::abs does this, and it is evidently painfully slow. If vzabs does something similar, then vzabs is clearly very fast by comparison (and it is threaded efficiently).
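To make the difference concrete, here is a minimal sketch of one common stabilization technique: dividing by the larger component before squaring. This is an illustration only. It is not claimed to be what std::complex::abs or vzabs actually does internally, and the names `naive_abs` and `scaled_abs` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>

// Naive magnitude: squares the parts directly, so r*r or im*im can
// underflow to zero or overflow to infinity before the sqrt is taken.
inline double naive_abs(const std::complex<double>& z) {
    const double r = z.real();
    const double im = z.imag();
    return std::sqrt(r * r + im * im);
}

// Scaled magnitude (illustrative sketch, NaN inputs not handled):
// factor out the larger part so the squared ratio stays in [0, 1].
inline double scaled_abs(const std::complex<double>& z) {
    const double x = std::fabs(z.real());
    const double y = std::fabs(z.imag());
    const double hi = std::max(x, y);
    const double lo = std::min(x, y);
    if (hi == 0.0) return 0.0;        // avoid 0/0
    if (std::isinf(hi)) return hi;    // avoid inf/inf
    const double t = lo / hi;         // 0 <= t <= 1
    return hi * std::sqrt(1.0 + t * t);
}
```

The extra comparisons and the division are the price of the wider valid range, which is consistent with a stable abs() being slower than the naive loop.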
vasci_intel,
Thank you for your answers.
Yes, vzAbs implements a numerically enhanced algorithm that gives accurate answers for all valid arguments, while the naive implementation sqrt(r*r + im*im) shows total accuracy loss for about half of the representable floating-point numbers.
Let us consider A = max(abs(r), abs(im)). If A < 2^-538 (roughly), then sqrt(r*r + im*im) will give zero as a result, but the correct result should not be smaller than A. If A > 2^+512 (roughly), then sqrt(r*r + im*im) will give infinity, while the correct result should not be larger than A*sqrt(2). Together these two ranges cover about half of the valid FP numbers.
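The two failure ranges can be checked with a small probe. The sketch below assumes IEEE double arithmetic without extended-precision intermediates (for example SSE2 code generation, and no fast-math options), and uses std::hypot from C++11 as the reference. `Probe`, `probe`, and `naive_magnitude` are hypothetical names introduced for this illustration.

```cpp
#include <cmath>

// Naive magnitude of r + i*im, as in the sqrt(r*r + im*im) loop.
inline double naive_magnitude(double r, double im) {
    return std::sqrt(r * r + im * im);
}

struct Probe {
    double naive;      // result of the naive formula
    double reference;  // result of std::hypot, which avoids spurious under/overflow
};

// Evaluate both formulas at r = im = 2^exponent.
inline Probe probe(int exponent) {
    const double a = std::ldexp(1.0, exponent);  // exactly 2^exponent
    return { naive_magnitude(a, a), std::hypot(a, a) };
}
```

Under those assumptions, probe(-540) should return zero from the naive formula while the reference stays at about 2^-540 * sqrt(2), and probe(600) should return infinity from the naive formula while the reference stays finite.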
Still, we appreciate your input, and if the naive implementation is sufficient for your needs, then we can consider more optimization effort in the VML relaxed-accuracy modes for vzAbs. Do you know whether vzAbs is a hotspot in your application?