Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Slow function evaluation for large vector

Ted_Rosenbaum
Beginner
1,266 views
Hi,
I am using the Intel Fortran compiler version 11.1 release 6 and the accompanying release of MKL. I have an array of around two million elements, which I want to raise to the power 0.23. When I use the "**" syntax, my program runs in about half the time it takes when I use the vdPow function from the VML library. Does anyone have ideas for how I can speed up the evaluation in VML?
Thanks,
Ted
9 Replies
yuriisig
Beginner
The vector needs to be split into segments.
TimP
Honored Contributor III
Are you possibly comparing single-precision SVML against double-precision VML, or possibly threading the Fortran? VML probably has lower-precision choices as well.
Ted_Rosenbaum
Beginner
I tried the different accuracy modes in VML, and even EP mode was significantly slower.
Sergey_M_Intel2

Hi Ted,

Please make sure you use vdPowx rather than vdPow. Powx is intended for raising vector elements to a constant power (0.23 in your case). Because the exponent is passed as a single scalar instead of a second vector, that should significantly reduce both memory footprint and pressure on the memory subsystem.

I think high pressure on the memory subsystem is the main reason you see worse performance. There was a good suggestion to split the input/output vectors into chunks so that the data being worked on fits into the cache. Chunks of a few thousand elements should be fine.

The default math library accuracy in the compiler is equivalent to MKL's VML_LA. If you use vdPowx and VML_LA plus vector blocking, then I would expect MKL VML performance to be at least on par with what you see in Fortran.

Regards,

Sergey

Ted_Rosenbaum
Beginner
Thanks for the suggestions (and to the other poster for the chunking suggestion). However, while these changes led to a small performance improvement, I am still getting roughly 2x better performance from the default Fortran (SVML) function.
I ran some tests using gprof. When I use VML, a huge amount of processor time is spent in the function powc_scalar, while when I use SVML, that function is not called at all. (That function basically accounts for the difference in run time.) If anyone has suggestions about what that function is and why it is taking so much time, I would appreciate it.
Thanks.
Sergey_M_Intel2
Hi Ted,

I now see where such a difference may come from. Having powc_scalar in the hotspot suggests that you are probably evaluating the power function on very large arguments. Is that the case?

Relatively recent optimizations in MKL VML and the Fortran compiler's SVML improved power function performance on typical (not very large) arguments at the cost of performance on large arguments. In earlier versions of MKL and the Fortran compiler (including 11.1), very large arguments performed better, but at the cost of slower performance on reasonable arguments. So if you're using a relatively new MKL and an old Fortran compiler, that may be the explanation.

Assuming that I'm correct with this hypothesis, the real question is whether your test case arguments for the power function represent a real-life workload or just a synthetic test case. Can you please clarify a bit?
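
One quick way to check the argument-range hypothesis is to scan the input vector for its smallest and largest values before calling the power function. The sketch below is in C; the helper name value_range is hypothetical, and it assumes at least one element and no NaN values.

```c
#include <stddef.h>

/* Hypothetical helper: report the smallest and largest value in a[0..n-1].
   Assumes n >= 1 and that the data contains no NaNs. */
void value_range(size_t n, const double *a, double *lo, double *hi)
{
    *lo = a[0];
    *hi = a[0];
    for (size_t i = 1; i < n; ++i) {
        if (a[i] < *lo) *lo = a[i];
        if (a[i] > *hi) *hi = a[i];
    }
}
```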

Sergey
Ted_Rosenbaum
Beginner
Hi,
Thanks again for the help.
The numbers I'm dealing with are not particularly large -- the largest base is about 20.
Since you indicated this might be a compiler/MKL-dependent issue, I tried this with the 12.0 compiler. In that case I get comparable speeds between SVML and VML; however, this is still slower than the 11.1 compiler with SVML.
Thanks.
Gennady_F_Intel
Moderator
OK, Ted, to shorten the discussion, would you please give us the exact test example which you used, so we can check the problem on our side? Please also let us know the CPU type you are running this example on.
--Gennady
Ted_Rosenbaum
Beginner
Thank you all for your help. In the end I reworked my code to eliminate the need for such a large vector.
For archival purposes, I am using a Core i7 860 processor.