
Cem_Savas_B_

Beginner


02-26-2013
08:37 AM


Blas Performance

Hi there,

I am running CBLAS routines on an older 64-bit Ubuntu 12.04 machine with an Intel Core 2 Duo (E6600 @ 2.4 GHz), using the latest MKL 11.0.

For data of size > 10MB, the performance of

saxpy is 0.9 Gflops, e.g. n = 16777216, t = 0.039717s, where the opcount = 2 * n.

sdot is 1.4 Gflops, e.g. n = 16777216, t = 0.024379s, where the opcount = 2 * n - 1.

sgemv is 2.5 Gflops, e.g. m,n = 4096, t = 0.021503s where the opcount = (2 * n - 1) * m.

However, in the case of

sgemm the performance exceeds 35 Gflops, e.g. m,n,k = 4096, t = 4.114639s, where the opcount = (2*k-1)*m*n.

Yet this should be impossible as the peak performance of the E6600 is 19.2 Gflops for single precision.

The call uses lda, ldb, ldc = 4096, alpha = 1, beta = 0:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 4096, 4096, 4096, 1.0, A, 4096, B, 4096, 0.0, C, 4096);
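The rates above follow from dividing each operation count by the measured time. As a rough sketch of that arithmetic (the helper names below are hypothetical and are not part of MKL or CBLAS), the formulas from this post can be written as:

```c
/* Hypothetical helpers, not part of MKL: operation counts exactly as
   defined in the post, passed as doubles to avoid integer overflow. */
static double saxpy_ops(double n) { return 2.0 * n; }
static double sdot_ops(double n) { return 2.0 * n - 1.0; }
static double sgemv_ops(double m, double n) { return (2.0 * n - 1.0) * m; }
static double sgemm_ops(double m, double n, double k) {
    return (2.0 * k - 1.0) * m * n;
}

/* Rate in Gflops, taking 1 Gflops = 1e9 floating-point operations per second. */
static double gflops(double ops, double seconds) {
    return ops / seconds / 1e9;
}
```

With the sizes and timings quoted above, `gflops(sgemm_ops(4096, 4096, 4096), 4.114639)` comes out at roughly 33.4, and the sdot example at roughly 1.4. The quoted saxpy and sgemv timings give about 0.84 and 1.56, so the rounded figures in the post may come from different runs than the example timings.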

I have verified the results for smaller sizes.

Could someone please tell me how this is possible?

Thanks a lot,

Cem


2 Replies

Sridevi_A_Intel

Employee


02-26-2013
12:38 PM


Cem,

Can you please provide me with the test case you used to verify your results? I'll test it and let you know my results and comments.

Thank you,

Sridevi

Sridevi_A_Intel

Employee


04-01-2013
02:55 PM


Cem,

I notice that you are running your test case on a dual-core machine. MKL uses both cores by default. Can you please set "export MKL_NUM_THREADS=1" to measure the performance on a single core?
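To see why the core count matters here, a back-of-the-envelope peak estimate may help. This sketch assumes 8 single-precision flops per cycle per core, which is the figure implied by the 19.2 Gflops quoted in the question at 2.4 GHz; the function name is hypothetical.

```c
/* Hypothetical helper: theoretical peak in Gflops as
   clock (GHz) x flops per cycle per core x number of cores.
   flops_per_cycle = 8 is an assumption taken from 19.2 / 2.4. */
static double peak_gflops(double clock_ghz, double flops_per_cycle, int cores) {
    return clock_ghz * flops_per_cycle * (double)cores;
}
```

Under that assumption, `peak_gflops(2.4, 8.0, 1)` gives 19.2 and `peak_gflops(2.4, 8.0, 2)` gives 38.4, so an sgemm rate in the mid-30s is below the two-core peak rather than above the machine's limit.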

-Sridevi
