
Yosef__Elad

Beginner


07-08-2019
02:33 AM


cgemm3m, cgemm_compact AND cgemm give poor results for small problem 24*64

Hi all,

I am using the sequential API with direct calls to multiply matrices.

C = 1*conj(A')*A

A is 64*24 and C is 24*24; both are complex matrices (complex8).
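For checking results, the operation above can be written as a plain triple loop. This is only a minimal reference sketch, not how MKL computes it: the function name `ref_conjtrans_gemm` is made up for illustration, and it assumes A is stored row-major with k rows and n columns (64 and 24 in this post).

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Unoptimized reference for C = A^H * A.
 * A is k-by-n, row-major (assumption); C is n-by-n, row-major.
 * The name and layout are illustrative, not part of MKL. */
void ref_conjtrans_gemm(size_t k, size_t n,
                        const float complex *A, float complex *C)
{
    assert(A != NULL && C != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float complex acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                /* (A^H)[i][p] is conj(A[p][i]) */
                acc += conjf(A[p * n + i]) * A[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```

Comparing the output of a library call against such a loop on one matrix is a cheap way to confirm that the transpose flags and leading dimensions mean what is intended.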

I have arrays of matrices: A_ARR (filled with random values) and C_ARR (filled with zeros); both arrays hold 1000 matrices.

My application is pinned to a single core and to the corresponding RAM by NUMA ID.

Build command: icc -c -g -ipo -ipp -Ofast -DMKL_DIRECT_SEQ -xCORE-AVX2 *.c

Setup is a Xeon E5-2699A v4 with 64 GB of RAM on each NUMA node.

I run cblas_cgemm/cblas_cgemm3m/mkl_cgemm_compact in a loop over A_ARR and C_ARR (only one function per run) and I get really poor results (I'm measuring only the matrix multiplication time).

I'm aware of the MKL "warm-up" issue, so I run cblas_cgemm once in advance without measuring its time.
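A warm-up followed by per-call timing might look like the sketch below. It is not the benchmark from this thread: `time_kernel`, `timing_stats`, `now_seconds` and `count_call` are hypothetical names, the kernel callback stands in for one GEMM call, and the timer is assumed to be POSIX `clock_gettime` with `CLOCK_MONOTONIC`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summary of per-call timings, in seconds. */
typedef struct {
    double avg, min, max;
} timing_stats;

/* Monotonic wall-clock time in seconds. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in kernel: counts how often it was called. */
void count_call(void *ctx)
{
    ++*(unsigned long *)ctx;
}

/* Runs kernel(ctx) `warmup` times untimed, then `reps` times with each
 * call timed individually. */
timing_stats time_kernel(void (*kernel)(void *), void *ctx,
                         size_t warmup, size_t reps)
{
    assert(kernel != NULL && reps > 0);
    for (size_t i = 0; i < warmup; ++i)
        kernel(ctx);

    timing_stats s = { 0.0, 1e300, 0.0 };
    for (size_t i = 0; i < reps; ++i) {
        double t0 = now_seconds();
        kernel(ctx);
        double dt = now_seconds() - t0;
        s.avg += dt;
        if (dt < s.min) s.min = dt;
        if (dt > s.max) s.max = dt;
    }
    s.avg /= (double)reps;
    return s;
}
```

In a real benchmark the callback would wrap a single cblas_cgemm call on one matrix pair. For kernels that run only a few microseconds, timing a batch of calls and dividing by the batch size is usually less noisy than timing each call.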

cblas_cgemm(CblasRowMajor, CblasConjTrans, CblasNoTrans, m, n, k, &alpha, &A_ARR[i], m, &A_ARR[i], n, &beta, &C_ARR[i], m)

Gives: AVG 6.5 ms, MAX 8.6 ms, MIN 6.3 ms

cblas_cgemm3m(CblasRowMajor, CblasConjTrans, CblasNoTrans, m, n, k, &alpha, &A_ARR[i], m, &A_ARR[i], n, &beta, &C_ARR[i], m)

Gives: AVG 7.5 ms, MAX 12 ms, MIN 7.3 ms

mkl_cgemm_compact(CblasRowMajor, CblasConjTrans, CblasNoTrans, m, n, k, &alpha, &a_arr_compact[i], m, &a_arr_compact[i], n, &beta, &c_arr_compact[i], m, COMPACT_FORMAT, 1)

Gives: AVG 225 ms, MAX 231 ms, MIN 224 ms

Note: COMPACT_FORMAT is obtained from mkl_get_format_compact().

Can anyone help me reduce the time this takes?

It is also not clear to me why the compact API, which should mostly vectorize the matrix multiplication, gives the slowest results.
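One way to judge whether such timings are plausible is to convert them into an achieved floating-point rate. The sketch below uses the common convention of counting 8 real flops per complex multiply-add; the function names `cgemm_flops` and `gflops` are made up for this illustration.

```c
#include <assert.h>

/* Real floating-point operations in one complex GEMM with an m-by-n result
 * and inner dimension k, counting each complex multiply-add as 8 real flops
 * (4 multiplies + 4 adds). */
double cgemm_flops(int m, int n, int k)
{
    assert(m > 0 && n > 0 && k > 0);
    return 8.0 * (double)m * (double)n * (double)k;
}

/* Achieved rate in GFLOP/s for `calls` GEMMs that took `seconds` in total. */
double gflops(double flops_per_call, double calls, double seconds)
{
    assert(seconds > 0.0);
    return flops_per_call * calls / seconds * 1e-9;
}
```

For m = n = 24 and k = 64 this gives 294,912 flops per product. The post does not say whether the figures are per call or for the whole loop: if 6.5 ms is for all 1000 matrices, the rate is about 45 GFLOP/s; if it is per call, the rate is about 0.045 GFLOP/s, which would be implausibly low for this CPU and would point at the measurement rather than the library.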

Thanks

Elad

7 Replies

Gennady_F_Intel

Moderator

07-09-2019
12:21 AM

We need to check, but the compact API has probably not been optimized for such "big" sizes. Which version of MKL do you use?

Yosef__Elad

Beginner


07-09-2019
01:47 AM


The MKL version is the latest, 2019.4.243.

Another odd thing is that cblas_cgemm shows better results than cblas_cgemm3m.

The latter should improve performance by ~25% according to the docs.
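The ~25% figure matches the idea behind the 3M method: a complex product is formed with three real multiplications instead of four. The scalar sketch below only illustrates that identity; the library applies it to whole matrices, its actual implementation is not shown in this thread, and the names `cmul_4m` and `cmul_3m` are made up here.

```c
#include <assert.h>
#include <complex.h>

/* Conventional complex product: 4 real multiplications, 2 real additions. */
float complex cmul_4m(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    return (a * c - b * d) + (a * d + b * c) * I;
}

/* 3M variant: 3 real multiplications, 5 real additions. */
float complex cmul_3m(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    float t1 = c * (a + b);
    float t2 = a * (d - c);
    float t3 = b * (c + d);
    /* real: t1 - t3 = ac - bd, imag: t1 + t2 = ad + bc */
    return (t1 - t3) + (t1 + t2) * I;
}
```

The saving in multiplications is paid for with extra additions and temporaries. Whether that trade pays off for matrices as small as 24*64 is not something this sketch can answer; that part would need measuring.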


Gennady_F_Intel

Moderator


07-09-2019
02:10 AM


Could you share your benchmark so that we can check these numbers on our side with the latest updates and CPU?


Yosef__Elad

Beginner


07-09-2019
02:53 AM


Gennady_F_Intel

Moderator


07-09-2019
10:45 PM


Thanks for the project, we will check.


Yosef__Elad

Beginner


07-09-2019
11:47 PM


I tested it on an AVX-512 setup: Xeon Platinum 8176, 2.10 GHz.

I can't see any improvement that comes from AVX-512.

Should I expect any improvement over AVX2 on the above setup?

I can't find any info in the release notes.

Elad


Yosef__Elad

Beginner


07-10-2019
06:36 AM


Closing this thread: I found the issue in my timer function.
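The thread does not say what the timer bug was. As a generic precaution, a timer function can be checked against an interval of known length before trusting benchmark numbers. The sketch below assumes POSIX `clock_gettime` and `nanosleep`; `monotonic_seconds` and `measure_sleep_ms` are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Sleeps for `ms` milliseconds and returns the interval reported by the
 * timer, in milliseconds. The result should be close to `ms`. */
double measure_sleep_ms(long ms)
{
    assert(ms > 0 && ms < 1000);
    struct timespec req = { 0, ms * 1000000L };
    double t0 = monotonic_seconds();
    nanosleep(&req, NULL);
    return (monotonic_seconds() - t0) * 1e3;
}
```

If `measure_sleep_ms(20)` reports something far from 20, for example a value off by a factor of 1000, the unit conversion or the clock source in the timer is the first thing to inspect.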
