Hi,
I am using the Intel Fortran compiler version 11.1 release 6 and the accompanying release of MKL. I have an array of around two million elements, which I want to raise to the power 0.23. When I use the "**" syntax, my program runs in about half the time it takes when I use the vdpow function from the VML library. Does anyone have ideas on how I can speed up the evaluation in VML?
Thanks,
Ted
9 Replies
It is necessary to segment a vector.
Are you possibly comparing single precision svml against double precision VML, or possibly threading the Fortran? VML probably has lower precision choices as well.
I tried the different precisions on VML and even EP precision was significantly slower.
Hi Ted,
Please ensure you use vdPowx (rather than vdPow). Powx is intended for raising vector elements to a constant power (0.23 in your case). That should significantly reduce both the memory footprint and the pressure on the memory subsystem.
I think high pressure on the memory subsystem is the main reason you see worse performance. There was a good suggestion to segment the input/output vectors into chunks so that the results fit into the cache. (Chunks of a few thousand elements should be fine.)
The default math library accuracy in the compiler is equivalent to MKL VML_LA. If you use vdPowx and VML_LA plus vector blocking, then I would expect the MKL VML performance to be at least on par with what you see in Fortran.
Regards,
Sergey
Thanks for the suggestions (and to the other poster for the chunking suggestion). However, while these changes led to a small performance improvement, I am still getting about 2x better performance from the default Fortran (SVML) function.
I ran some tests using gprof. When I use vml, a huge amount of processor time is used by the function powc_scalar, while when I use svml, that function is not run at all. (Running that function basically accounts for the difference in run time). If anyone has suggestions about what that function is and why it is taking so much time, I would appreciate it.
Thanks.
Hi Ted,
I now see where such a difference may come from. Having powc_scalar in the hotspot suggests that you are probably evaluating the power function on very large arguments. Is that the case?
Relatively recent optimizations in MKL VML and the Fortran compiler's SVML improved power function performance on typical (not very large) arguments at the cost of performance on large arguments. In earlier versions of MKL and the Fortran compiler (including 11.1), very large arguments performed better, but at the cost of slower performance on reasonable arguments. So if you're using a relatively new MKL with an old Fortran compiler, that may be an explanation.
Assuming that my hypothesis is correct, the real question is whether the power function arguments in your test case represent a real-life workload or whether it is just a synthetic test case. Can you please clarify a bit?
Sergey
Hi,
Thanks again for the help.
The numbers I'm dealing with are not particularly large -- the largest base is about 20.
Since you indicated this might be a compiler/MKL-dependent issue, I tried this with the 12.0 compiler. In that case I am getting comparable speeds between svml and vml; however, this is still slower than the 11.1 compiler with svml.
Thanks.
OK, Ted, to narrow down the discussion, could you please give us the exact test example you used, so that we can check the problem on our side? Please also let us know the CPU type you are running this example on.
--Gennady
Thank you all for your help. In the end I reworked my code to eliminate the need for such a large vector.
For archival purposes: I am using a Core i7 860 processor.