Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

VML Performance software

sgwood
Beginner
1,944 Views
To Intel,
On your web site you list the performance, in terms of clocks per element (CPE), for each of the VML functions. Do you provide the benchmarking source code that was used to generate the performance data?

Thanks,

-Simon
3 Replies
Chao_Y_Intel
Moderator
1,944 Views

Hello Simon,

We have no benchmark code to provide at the moment, but it is easy to write a small test to measure the performance yourself. For example, something like:

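mkl_get_cpu_clocks( &start_clocks );
for( i = 0; i < REPTIMES; i++ ) {
    /* call the VML function(s) under test */
}
mkl_get_cpu_clocks( &end_clocks );

/* convert to double first so the division is not truncated */
cpe = (double)(end_clocks - start_clocks) / total_data_elements / REPTIMES;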
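
The CPE arithmetic in the last line can be sketched as a small C helper. This is only an illustration: the name clocks_per_element and its parameters are hypothetical, and uint64_t stands in for the unsigned 64-bit counter that mkl_get_cpu_clocks fills in. The point it shows is that the clock difference should be converted to floating point before dividing, otherwise fractional CPE values are lost.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of MKL): turns two raw clock readings
 * into clocks per element (CPE). */
double clocks_per_element(uint64_t start_clocks, uint64_t end_clocks,
                          uint64_t elements_per_call, uint64_t reptimes)
{
    assert(end_clocks >= start_clocks);
    assert(elements_per_call > 0 && reptimes > 0);

    /* Convert to double before dividing so fractional CPE values survive. */
    double elapsed = (double)(end_clocks - start_clocks);
    return elapsed / (double)elements_per_call / (double)reptimes;
}
```

For example, 5000 elapsed clocks over 2 repetitions of a 1000-element call gives 2.5 clocks per element.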

Thanks,
Chao

sgwood
Beginner
1,944 Views
Chao,
Thanks for the response. Yes, that is the basic format that I use. However, I am seeing wild variations in performance (factors of 3x-5x) on the same machine, OS, compiler, etc. The performance varies depending on the use of static arrays vs. dynamic allocation, local arrays vs. global arrays, etc. So I am curious how you set up your benchmarks to account for these effects. For example, are there cache alignment issues to consider? I do make sure that my test arrays are allocated on 16-byte boundaries. Are there more things to consider?

-Simon
Sergey_M_Intel2
1,945 Views
Hi Simon,

Yes, VML function performance depends greatly on where the data is located. It also depends on the alignment of the input and output arrays, so please make sure you align your static or dynamic data accordingly.
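
One way to keep alignment constant across benchmark runs is to allocate every test array through the same aligned allocator. The sketch below uses the standard C11 aligned_alloc so that it does not depend on MKL. The 64-byte value is an assumption (a common cache-line size), not a documented VML requirement, and the helper names are hypothetical. MKL also provides mkl_malloc and mkl_free, which take an alignment argument and could be used instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BENCH_ALIGNMENT 64  /* assumed cache-line size, not an MKL requirement */

/* Hypothetical helper: allocate n doubles on a BENCH_ALIGNMENT-byte boundary.
 * Release the result with free(). Returns NULL on failure. */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + BENCH_ALIGNMENT - 1) / BENCH_ALIGNMENT * BENCH_ALIGNMENT;
    if (rounded == 0) {
        rounded = BENCH_ALIGNMENT;
    }
    return aligned_alloc(BENCH_ALIGNMENT, rounded);
}

/* Returns 1 if p sits on an alignment-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

Checking is_aligned on static and stack arrays as well makes it easier to see whether a 3x-5x difference between two runs coincides with a difference in alignment.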

The best performance is achievable when the data resides in cache. For benchmarking purposes you might want to have a warm-up loop with a few VML calls to ensure the data hits the cache.
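
A warm-up loop of this kind might look like the following sketch. It is not Intel's benchmark code: square_kernel is a stand-in for the VML call under test, the function and parameter names are hypothetical, and the standard clock() timer is used only to keep the example independent of MKL (in a real test mkl_get_cpu_clocks would take its place). The untimed calls bring the arrays into cache before the timed loop starts.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(size_t n, const double *in, double *out);

/* Stand-in for a VML call; replace with the function under test. */
void square_kernel(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * in[i];
    }
}

/* Runs `warmups` untimed calls, then times `reps` calls and returns the
 * average CPU time in nanoseconds per element. clock() is coarse, so use
 * enough repetitions for the timed loop to last well above its resolution. */
double ns_per_element_warm(kernel_fn f, size_t n, const double *in, double *out,
                           int warmups, int reps)
{
    assert(n > 0 && reps > 0);

    for (int i = 0; i < warmups; i++) {
        f(n, in, out);                 /* untimed: pulls the data into cache */
    }

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++) {
        f(n, in, out);                 /* timed calls on warm data */
    }
    clock_t t1 = clock();

    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)n / (double)reps;
}
```

Comparing the result with warmups set to 0 against a run with a few warm-up calls gives a rough idea of how much of the variation comes from cold caches.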

Unfortunately, there are many other factors that may affect performance and that are difficult for a programmer to control.

In our measurements we use statistical filtering techniques to get rid of timing outliers. These are applied within the executable as well as to the timing data from multiple runs of the executable. All of the above allows us to achieve more or less stable timings for benchmarking.
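
The post does not say which filtering technique is used, so the following is only one simple possibility: collect one CPE value per run and report the median, which ignores a few very slow runs caused by interrupts or other processes. The function name median_timing is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Returns the median of `count` timing samples without modifying them.
 * Returns -1.0 if the temporary buffer cannot be allocated. */
double median_timing(const double *samples, size_t count)
{
    assert(count > 0);

    double *sorted = malloc(count * sizeof *sorted);
    if (sorted == NULL) {
        return -1.0;
    }
    memcpy(sorted, samples, count * sizeof *sorted);
    qsort(sorted, count, sizeof *sorted, cmp_double);

    /* Odd count: middle sample. Even count: mean of the two middle samples. */
    double median = (count % 2 == 1)
        ? sorted[count / 2]
        : 0.5 * (sorted[count / 2 - 1] + sorted[count / 2]);

    free(sorted);
    return median;
}
```

With samples of 2.1, 2.0, 9.7, 2.2 and 2.0 clocks per element, the 9.7 outlier has no effect and the reported value is 2.1. Taking the minimum instead of the median is another common choice when the goal is to estimate best-case performance.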

Timing a VML function in isolation can give an idea of how fast or slow it is, but the only right benchmark is the end application. In other words, do not over-rely on isolated micro-benchmarks. Real-life application performance is what really matters.

I hope that helps,
Regards,
Sergey