Intel® oneAPI Math Kernel Library

Slow function evaluation for large vector

Ted_Rosenbaum
Beginner
Hi,
I am using the Intel Fortran compiler version 11.1 release 6 and the accompanying release of MKL. I have an array of around two million elements, which I want to raise to the power 0.23. When I use the "**" operator, my program runs in about half the time it takes when I use the vdpow function from the VML library. Does anyone have ideas for how I can speed up the evaluation in VML?
Thanks,
Ted
9 Replies
yuriisig
Beginner
It is necessary to segment a vector.
TimP
Honored Contributor III
Are you possibly comparing single precision svml against double precision VML, or possibly threading the Fortran? VML probably has lower precision choices as well.
Ted_Rosenbaum
Beginner
I tried the different precisions on VML and even EP precision was significantly slower.
Sergey_M_Intel2
Employee

Hi Ted,

Please ensure you use vdPowx (vs. vdPow). Powx is intended for raising vector elements to a constant power (0.23 in your case). That should significantly reduce both memory footprint and pressure on the memory subsystem.

I think high pressure on the memory subsystem is the main reason you see worse performance. There was a good suggestion to segment the input/output vectors into chunks so that the results fit in the cache. (Chunks of a few thousand elements should be fine.)

The default math library accuracy in the compiler is equivalent to MKL VML_LA. If you use vdPowx and VML_LA plus vector blocking, then I would expect MKL VML performance to be at least on par with what you see in Fortran.

Regards,

Sergey

Ted_Rosenbaum
Beginner
Thanks for the suggestions (and to the other poster for the chunking suggestion). However, while these changes led to a small performance improvement, I am still getting ~2X better performance from the default Fortran (svml) function.
I ran some tests using gprof. When I use vml, a huge amount of processor time is spent in the function powc_scalar, while when I use svml, that function is not run at all. (The time spent in that function basically accounts for the difference in run time.) If anyone has suggestions about what that function is and why it is taking so much time, I would appreciate it.
Thanks.
Sergey_M_Intel2
Employee
Hi Ted,

I now see where such a difference may come from. Having powc_scalar in the hotspot suggests that you are probably evaluating the power function on very large arguments. Is that the case?

Relatively recent optimizations in MKL VML and the Fortran compiler's SVML improved power function performance on typical (not very large) arguments at the cost of performance on large arguments. In earlier versions of MKL and the Fortran compiler (including 11.1), very large arguments performed better, but at the cost of slower performance on reasonable arguments. So if you're using a relatively new MKL and an old Fortran compiler, that may be an explanation.

Assuming that I'm correct with this hypothesis, a real question is whether your test case arguments for the power function represent some real-life workload or are just a synthetic test case. Can you please clarify a bit?

Sergey
Ted_Rosenbaum
Beginner
Hi,
Thanks again for the help.
The numbers I'm dealing with are not particularly large -- the largest base is about 20.
Since you indicated this might be a compiler/MKL-dependent issue, I tried this with the 12.0 compiler. In that case I am getting comparable speeds between svml and vml; however, this is still slower than the 11.1 compiler with svml.
Thanks.
Gennady_F_Intel
Moderator
OK, Ted, to shorten the discussion, would you please give us the exact test example which you used, so that we can check the problem on our side? Please also let us know the CPU type you are running this example on.
--Gennady
Ted_Rosenbaum
Beginner
Thank you all for your help. In the end I reworked my code to eliminate the need for such a large vector.
For archival purposes, I am using a Core i7 860 processor.