VML Performance software

sgwood · ‎02-23-2012

To Intel,
On your web site you list the performance, in terms of clocks per element (CPE), for each of the VML functions. Do you provide the benchmarking source code that was used to generate the performance data?

Thanks,

-Simon

Chao_Y_Intel · ‎02-23-2012

Hello Simon,

We have no code to provide so far, but it would be easy to write a test program to measure the performance. For example, something like:

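/* start_clocks and end_clocks are of type unsigned MKL_INT64 */
mkl_get_cpu_clocks( &start_clocks );
for( i = 0; i < REPTIMES; i++ ) {
    /* call the VML function under test */
}

mkl_get_cpu_clocks( &end_clocks );
cpe = (double)(end_clocks - start_clocks) / total_data_elements / REPTIMES;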

Thanks,
Chao

sgwood · ‎02-24-2012

Chao,
Thanks for the response. Yes, that is the basic format that I use. However, I am seeing wild variations in performance (factors of 3x-5x) on the same machine, OS, compiler, etc. The performance varies depending on the use of static arrays vs. dynamic allocation, local arrays vs. global arrays, etc. So I am curious how you set up your benchmarks to account for these effects. For example, are there cache alignment issues to consider? I do make sure that my test arrays are allocated on 16-byte boundaries. Is there anything else to consider?

-Simon

Sergey_M_Intel2 · ‎02-26-2012

Hi Simon,

Yes, the performance of VML functions depends greatly on where the data is located. It also depends on the alignment of the input and output arrays, so please make sure you align your static or dynamic data accordingly.
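As an illustration of the alignment advice, here is a minimal sketch in standard C11 that allocates an array on a chosen boundary and checks the result. The helper names `alloc_aligned_doubles` and `is_aligned` are hypothetical, and the 64-byte figure used in the usage note is an assumption (a common x86 cache line size), not a value stated in this thread. When linking against MKL, `mkl_malloc(size, alignment)` and `mkl_free` serve the same purpose.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles on an `alignment`-byte boundary.
   `alignment` must be a power of two. Release the result with free(). */
static double *alloc_aligned_doubles(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round the request up. */
    size_t padded = (bytes + alignment - 1) / alignment * alignment;
    if (padded == 0)
        padded = alignment;

    return aligned_alloc(alignment, padded);
}

/* Returns 1 if p sits on an `alignment`-byte boundary, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Calling `alloc_aligned_doubles(n, 64)` for both the input and the output array, and confirming with `is_aligned`, removes one source of run-to-run variation between static, local, and heap-allocated buffers.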

The best performance is achievable when the data resides in cache. For benchmarking purposes you might want to have a warm-up loop with a few VML calls to ensure the data hits the cache.
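A warm-up loop of this kind might look like the sketch below. It is not Intel's benchmark code: `square_kernel` is a hypothetical stand-in so the example runs without MKL, and the timer is the standard `clock()`, which returns processor time rather than CPU clocks. In a real VML test the kernel would be the VML call and the timer could be `mkl_get_cpu_clocks`.

```c
#include <stddef.h>
#include <time.h>

/* Signature of the routine being timed; a stand-in for a VML call. */
typedef void (*kernel_fn)(size_t n, const double *in, double *out);

/* Hypothetical stand-in kernel so the sketch runs without MKL. */
static void square_kernel(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Run `warmup` untimed calls, then time `reps` calls.
   Returns the average time in seconds per element. */
static double time_per_element(kernel_fn f, size_t n, const double *in,
                               double *out, int warmup, int reps)
{
    /* Untimed calls: bring the input and output arrays into cache. */
    for (int i = 0; i < warmup; i++)
        f(n, in, out);

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        f(n, in, out);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / (double)n / (double)reps;
}
```

The warm-up only helps when the arrays fit in cache. For larger arrays the timed loop measures memory bandwidth regardless of how many warm-up calls are made.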

Unfortunately, there are many other factors that may affect the performance and that are difficult for a programmer to control.

In our measurements we use statistical filtering techniques to get rid of timing outliers. These are applied within a single executable run as well as to the timing data collected from multiple runs of the executable. Together, these measures give more or less stable timings for benchmarking.
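The exact filters are not described here, so the following is only a sketch of two common choices, the median and a trimmed mean, applied to an array of per-run timings. The function names are hypothetical, and nothing below should be read as the method Intel actually uses.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of `count` samples. Sorts the array in place. */
static double median_of(double *samples, size_t count)
{
    assert(count > 0);
    qsort(samples, count, sizeof(double), cmp_double);
    if (count % 2 == 1)
        return samples[count / 2];
    return 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
}

/* Mean after discarding the `trim` smallest and `trim` largest samples.
   Sorts the array in place. */
static double trimmed_mean(double *samples, size_t count, size_t trim)
{
    assert(count > 2 * trim);
    qsort(samples, count, sizeof(double), cmp_double);

    double sum = 0.0;
    for (size_t i = trim; i < count - trim; i++)
        sum += samples[i];
    return sum / (double)(count - 2 * trim);
}
```

With the timings {10, 12, 11, 50, 9}, the plain mean is pulled up to 18.4 by the single slow run, while both the median and the trimmed mean with `trim = 1` give 11.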

Timing a VML function can give an idea of how fast or slow it is, but the only right benchmark is the end application. In other words, do not over-rely on atomic benchmarks. Real-life application performance is what really matters.

I hope that helps,
Regards,
Sergey