To Intel,
On your web site you list the performance, in terms of clocks per element (CPE), for each of the VML functions. Do you provide the benchmarking source code that was used to generate the performance data?
Thanks,
-Simon
3 Replies
Hello Simon,
We have no benchmarking code to provide so far, but it is easy to write a small test to measure the performance. For example, something like:
mkl_get_cpu_clocks( &start_clocks );
for( i = 0; i < REPTIMES; i++ ) {
    /* call the VML function under test here */
}
mkl_get_cpu_clocks( &end_clocks );
cpe = (end_clocks - start_clocks) / total_data_elements / REPTIMES;
Thanks,
Chao
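For readers without MKL at hand, the same measurement pattern can be sketched in plain C. This is only an illustration under stated assumptions: `kernel_under_test` is a hypothetical stand-in for the VML call, the standard `clock()` timer replaces `mkl_get_cpu_clocks`, and `cpu_hz` is a core frequency you have to supply yourself to convert seconds into clocks.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the VML call being measured. */
static void kernel_under_test(size_t n, const double *a, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * a[i] + 1.0;
}

/* Run the kernel `reps` times and return the elapsed CPU time in seconds.
   clock() is an assumption here; it is coarser than a cycle counter. */
static double time_kernel(size_t n, size_t reps, const double *a, double *y)
{
    assert(n > 0 && reps > 0);
    clock_t start = clock();
    for (size_t r = 0; r < reps; r++)
        kernel_under_test(n, a, y);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Convert elapsed seconds into clocks per element (CPE),
   given an assumed core frequency in Hz. */
static double clocks_per_element(double seconds, double cpu_hz,
                                 size_t n, size_t reps)
{
    assert(n > 0 && reps > 0);
    return seconds * cpu_hz / ((double)n * (double)reps);
}
```

A typical use would be `clocks_per_element(time_kernel(n, reps, a, y), cpu_hz, n, reps)`. With a coarse timer like `clock()`, `reps` needs to be large enough that the loop runs well above the timer resolution, and an optimizing compiler may simplify a trivial stand-in kernel, so a real library call is the better subject.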
Chao,
Thanks for the response. Yes, that is the basic format that I use. However, I am seeing wild variations in performance (factors of 3x-5x) on the same machine, OS, compiler, etc. The performance varies depending on the use of static arrays vs. dynamic allocation, local arrays vs. global arrays, etc. So I am curious how you set up your benchmarks to account for these effects. For example, are there cache alignment issues to consider? I do make sure that my test arrays are allocated on 16-byte boundaries. Are there more things to consider?
-Simon
Hi Simon,
Yes, VML function performance depends greatly on where the data is placed. In particular, it depends on the alignment of the input and output arrays, so please make sure you align your static or dynamic data accordingly.
The best performance is achieved when the data resides in cache. For benchmarking purposes you might want to run a warm-up loop with a few VML calls before timing, so that the data is already in cache.
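As a rough sketch of these two points, the C fragment below allocates the arrays on a fixed boundary and runs a few untimed calls before any measurement. The 64-byte `ALIGNMENT` is an assumed cache-line size rather than a value given in this thread, and `stand_in_kernel` is a hypothetical placeholder for the VML function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* assumed cache-line size; adjust for your CPU */

/* Allocate n doubles on an ALIGNMENT-byte boundary; release with free(). */
static double *alloc_aligned_doubles(size_t n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, padded);
}

/* Return nonzero if p sits on an `alignment`-byte boundary. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical stand-in for the VML call under test. */
static void stand_in_kernel(size_t n, const double *a, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * a[i] + 1.0;
}

/* Untimed warm-up: touch a and y a few times before the timed loop. */
static void warm_up(size_t n, const double *a, double *y, int calls)
{
    for (int i = 0; i < calls; i++)
        stand_in_kernel(n, a, y);
}
```

The warm-up only helps when the working set fits in the cache level you care about; arrays larger than that will be evicted again before the timed loop reaches them.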
Unfortunately, many other factors can affect the performance, and they are difficult for a programmer to control.
In our measurements we use statistical filtering techniques to get rid of timing outliers. We apply them within a single executable run as well as to the timing data collected from multiple runs. Together, these measures give more or less stable timings for benchmarking.
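The post does not say which filtering techniques are used, so the following is only one possible approach and not a description of Intel's method: collect several timing samples, then report the median or a trimmed mean so that a few slow outliers do not skew the result. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending order. */
static int compare_doubles(const void *pa, const void *pb)
{
    double a = *(const double *)pa;
    double b = *(const double *)pb;
    return (a > b) - (a < b);
}

/* Sort the samples in place and return their median. */
static double median_in_place(double *samples, size_t count)
{
    assert(count > 0);
    qsort(samples, count, sizeof(double), compare_doubles);
    if (count % 2 == 1)
        return samples[count / 2];
    return 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
}

/* Sort the samples in place, drop floor(count * trim_fraction) values
   from each end, and return the mean of what remains. */
static double trimmed_mean(double *samples, size_t count, double trim_fraction)
{
    assert(count > 0);
    assert(trim_fraction >= 0.0 && trim_fraction < 0.5);
    qsort(samples, count, sizeof(double), compare_doubles);
    size_t drop = (size_t)((double)count * trim_fraction);
    double sum = 0.0;
    for (size_t i = drop; i < count - drop; i++)
        sum += samples[i];
    return sum / (double)(count - 2 * drop);
}
```

The same functions can be applied twice, first to the samples gathered inside one run and then to the per-run results from several launches of the executable.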
Timing a VML function gives an idea of how fast or slow it is, but the only real benchmark is the end application. In other words, do not over-rely on isolated micro-benchmarks; real-life application performance is what really matters.
I hope that helps,
Regards,
Sergey