I am interested in a fast Cartesian to polar conversion.
MKL's documentation suggests the following method
...
vdHypot(nelements, re, im, magnitude);
vdAtan2(nelements, re, im, phase);
....
However, my tests indicate that the performance of the above method is about two times worse than that of an IPP function:
....
ippsCartToPolar_64f(re, im, magnitude, phase, nelements);
....
Such a difference seems a little strange. Is there anything that I am missing here?
Thank you.
5 Replies
Which versions of MKL and IPP? Are we, by some chance, comparing speeds of 80x87 instruction sequences and SSE2 instruction sequences?
Did you check how this performance result depends on the input size? Please note that all VML functions are highly optimized for large vector sizes, say 1K and above.
--Gennady
I am using MKL v10.3.1 and IPP v7.0.1.
Both routines are called from MATLAB as MEX files (just another name for a shared library), and the performance difference is a function of how large the input is. As is evident from the table below, MKL lags behind IPP up to a certain input size.
n (input size) | MKL (time) | IPP (time)
---------------+------------+------------
         10000 |   0.0064   | 5.6987e-04
         20000 |   0.0237   | 0.0013
         40000 |   0.0722   | 0.0022      <---- STRANGE! (MKL's result is too bad)
         80000 |   0.0264   | 0.0040
        160000 |   0.0411   | 0.0081
        320000 |   0.0618   | 0.0155
        640000 |   0.1297   | 0.0310
       1280000 |   0.2620   | 0.0613
       2560000 |   0.3692   | 0.1221
       5120000 |   0.4053   | 0.2740
      10240000 |   0.6073   | 0.5592
      20480000 |   1.0382   | 1.1212
P.S.
There is an error in Intel's documentation that comes with MKL. In the following code (taken from "FFT: Auxiliary Data Transformations"), one has to swap re and im in the call to vdAtan2():
[cpp]
// Cartesian->polar conversion of complex data
// Cartesian representation: z = re + I*im
// Polar representation:     z = r * exp( I*phi )
#include "mkl.h"

void variant1_Cartesian2Polar(int n, const double *re, const double *im,
                              double *r, double *phi)
{
    vdHypot(n, re, im, r);   // compute radii r[]
    vdAtan2(n, re, im, phi); // compute phases phi[] -- re and im should be swapped here
}
[/cpp]
Hi eliosh,
First of all, thank you for pointing out the mistake.
That IPP performs faster than MKL for large arrays is understood: with the provided MKL implementation of Cartesian2Polar, the whole data set travels a couple of times to and from cache. Blocking the data set for better cache utilization would improve the function's performance, of course, but that was out of scope for this example.
On smaller arrays you are likely seeing performance of threaded IPP (the MKL example is not threaded).
And you are welcome to submit a feature request for the functionality and performance you need at http://premier.intel.com
Thanks
Dima
Hi, eliosh!
MKL transcendental math functions have three accuracy levels: VML_HA (most accurate), VML_LA (in the middle), and VML_EP (fastest). The default level is VML_HA, which is the most precise but also the slowest.
The IPP function ippsCartToPolar_64f does not have such strict accuracy requirements.
In order to make a fair comparison, you can set a lower accuracy requirement:
vmlSetMode(VML_LA)
or
vmlSetMode(VML_EP)
In both cases MKL will likely give better results.
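A minimal sketch of that setup, assuming MKL is installed and linked (vmlSetMode changes a process-global setting and returns the previous one, so it can be restored afterwards; the function name cart2polar_la is illustrative):

```c
#include "mkl.h"  /* vmlSetMode, VML_LA, vdHypot, vdAtan2 */

/* Cartesian->polar at reduced VML accuracy, for a fairer timing
 * comparison against ippsCartToPolar_64f. */
void cart2polar_la(int n, const double *re, const double *im,
                   double *r, double *phi)
{
    unsigned int old_mode = vmlSetMode(VML_LA); /* or VML_EP for the fastest mode */
    vdHypot(n, re, im, r);    /* radii:  r[i]   = hypot(re[i], im[i]) */
    vdAtan2(n, im, re, phi);  /* phases: phi[i] = atan2(im[i], re[i]) */
    vmlSetMode(old_mode);     /* restore the previous accuracy mode */
}
```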
Another effect might be found in your timing system: while I qualitatively reproduced your results for larger n, for smaller n I do not see that much difference between MKL and IPP. One possible explanation is warm vs. cold cache. To check this, you may try to swap your MKL and IPP measurements (measure IPP first, and then MKL). If the results for smaller n change significantly, this is probably the case. It is hard to say more without the code.
Thanks,
Ilya