Hello,
I am using cblas_dgemv to compute A^T*x. The matrix A has about 10000 rows and 20000 columns. I am storing A in row-major format, so A(i,j+1) is stored next to A(i,j).
My questions, with the goal of the fastest execution time, are as follows:
1. What is the better way to store A: row-major or column-major format? Does it matter?
2. Is it better to store A and set Trans = CblasTrans, or to store A^T directly and use it with Trans = CblasNoTrans?
3. If the answer to #2 is to use A^T directly, is it better to store A^T in row-major or column-major format?
A related question has to do with byte alignment. Say we store A in row-major format, with m rows and n columns. I have read that, when multithreading with OpenMP, it is better for each row of A to start at an aligned boundary in order to avoid false sharing. A common way of doing that is to pad the number of columns so that it is divisible by 8 (64 bytes for 8 doubles), so LDA = n + (8 - n%8). Does doing this help dgemv run faster?
Finally, for my calculation I need alpha=1 and beta=0. Does cblas_dgemv optimize for this trivial case, or does it still do the extra, and in this case unnecessary, calculations?
Thanks in advance for any help.
Hi Rajendra,
The performance of sgemv depends on your matrix layout. For your case:
- For RowMajor (cblas): performance is better when m > n, so it would be better to store A^T (20k x 10k) and compute A^T*x with no transpose.
- For ColMajor (sgemv): performance is better when n > m, so store A (10k x 20k) directly, use the ColMajor layout, and transpose A.
Because the shape of A is not unusual and the matrix is large, following this advice may not give an impressive improvement. Here is what I measured on my side (alpha=1, beta=0):
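To make the layout and transpose options concrete, here is a minimal plain-C sketch (not MKL code) of the two reference loops. The function names `ref_gemv_rowmajor_trans` and `ref_gemv_colmajor_notrans` are made up for illustration. It relies on one fact: a row-major m x n matrix A with leading dimension lda occupies memory exactly like a column-major n x m matrix A^T with the same leading dimension, so the same buffer can be handed to cblas_dgemv either way. Which call MKL executes faster is not something this sketch can show.

```c
#include <assert.h>
#include <stddef.h>

/* y = A^T x, with A (m x n) stored row-major and leading dimension lda >= n.
   Mirrors cblas_dgemv(CblasRowMajor, CblasTrans, m, n, 1.0, a, lda,
                       x, 1, 0.0, y, 1). */
void ref_gemv_rowmajor_trans(size_t m, size_t n, const double *a, size_t lda,
                             const double *x, double *y)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j)
        y[j] = 0.0;
    for (size_t i = 0; i < m; ++i)        /* walk A one contiguous row at a time */
        for (size_t j = 0; j < n; ++j)
            y[j] += a[i * lda + j] * x[i];
}

/* y = B x, with B (rows x cols) stored column-major, leading dimension
   ldb >= rows. Mirrors cblas_dgemv(CblasColMajor, CblasNoTrans, rows, cols,
                                    1.0, b, ldb, x, 1, 0.0, y, 1). */
void ref_gemv_colmajor_notrans(size_t rows, size_t cols, const double *b,
                               size_t ldb, const double *x, double *y)
{
    assert(ldb >= rows);
    for (size_t i = 0; i < rows; ++i)
        y[i] = 0.0;
    for (size_t j = 0; j < cols; ++j)     /* walk B one contiguous column at a time */
        for (size_t i = 0; i < rows; ++i)
            y[i] += b[j * ldb + i] * x[j];
}
```

Calling `ref_gemv_rowmajor_trans(m, n, a, lda, x, y)` and `ref_gemv_colmajor_notrans(n, m, a, lda, x, y)` on the same buffer touches memory in the same order and yields the same y.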
[root@sae-skl01 mkl-sgemv]# ./cblas-sgemv 20000 10000 10000 101 10 28
m=20000,n=10000,lda=10000 layout=RowMajor cores=28 gflop=39.40978 peak=2060.80005 efficiency=0.01912
[root@sae-skl01 mkl-sgemv]# ./cblas-gemmv 20000 10000 20000 102 10 28
m=20000,n=10000,lda=20000 layout=ColMajor cores=28 gflop=27.73358 peak=2060.80005 efficiency=0.01346
[root@sae-skl01 mkl-sgemv]# ./sgemv-trans 10000 20000 10000 102 10 28
m=10000,n=20000,lda=10000 layout=ColMajor cores=28 gflop=35.03814 peak=2060.80005 efficiency=0.01700
[root@sae-skl01 mkl-sgemv]# ./sgemv-trans 10000 20000 20000 101 10 28
m=10000,n=20000,lda=20000 layout=RowMajor cores=28 gflop=21.69320 peak=2060.80005 efficiency=0.01053
Regarding your question about false sharing: yes, you should probably make LDA a multiple of the cache line size for best performance. If LEVEL2_DCACHE_LINESIZE=64, then for single precision make LDA a multiple of 16. See this article for details: https://software.intel.com/en-us/mkl-linux-developer-guide-coding-techniques However, in your case the matrix is already large, so I am afraid there may not be much performance improvement.
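As a rough illustration of the padding discussed above, the sketch below rounds the column count up to a whole number of cache lines and allocates a 64-byte-aligned buffer. The helper names `padded_lda` and `alloc_padded_matrix` are hypothetical, and the sketch assumes a 64-byte cache line (8 doubles, or 16 floats) and a C11 runtime that provides `aligned_alloc`. Note that the formula LDA = n + (8 - n%8) from the question adds a full extra 8 elements when n is already a multiple of 8; the rounding used here leaves such an n unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Round n up to the next multiple of elems_per_line
   (8 for doubles, 16 for floats with a 64-byte cache line). */
size_t padded_lda(size_t n, size_t elems_per_line)
{
    assert(elems_per_line > 0);
    return (n + elems_per_line - 1) / elems_per_line * elems_per_line;
}

/* Allocate an m x lda row-major matrix of doubles on a 64-byte boundary.
   Since lda * sizeof(double) is a multiple of 64, every row then starts
   on a cache-line boundary. Returns NULL on failure; release with free(). */
double *alloc_padded_matrix(size_t m, size_t n, size_t *lda_out)
{
    size_t lda = padded_lda(n, 64 / sizeof(double));
    size_t bytes = m * lda * sizeof(double);   /* always a multiple of 64 */
    *lda_out = lda;
    return aligned_alloc(64, bytes);
}
```

The returned lda is what would be passed as the lda argument of cblas_dgemv for a row-major matrix; for the 10000 x 20000 matrix in this thread, n = 20000 is already a multiple of 8, so no padding is added.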
Hi Fiona,
Could you describe the command-line arguments to your programs, in particular the last four? Would it be possible to share the source code and compile settings for your test program?
Thanks
Raj
Hi Rajendra,
The arguments mean:
./cblas-sgemv m n lda layout loop cores
layout is CblasRowMajor(101) or CblasColMajor(102)
Please note that in this sample I just fill the matrix with random data line by line (row-major) to measure performance. If you want correct results, you have to store the data column by column when using CblasColMajor (102).
Best regards,
Fiona
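For readers interpreting the gflop and efficiency columns in the output earlier in the thread, here is a small sketch of how such figures are conventionally derived. The source of the test program is not shown, so this is an assumption: it uses the standard count of 2*m*n floating-point operations per gemv call, and the function names `gemv_gflops` and `efficiency` are hypothetical. The efficiency column in the posted output is consistent with gflop divided by peak.

```c
#include <assert.h>
#include <stddef.h>

/* GFLOP/s for `loops` gemv calls on an m x n matrix, assuming the
   conventional 2*m*n flops per call (one multiply and one add per element). */
double gemv_gflops(size_t m, size_t n, size_t loops, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)m * (double)n * (double)loops / seconds / 1e9;
}

/* Fraction of the machine's peak rate that was achieved. */
double efficiency(double gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return gflops / peak_gflops;
}
```

For the first run, `efficiency(39.40978, 2060.80005)` is about 0.0191, matching the reported value. Efficiencies this low are expected for gemv, which performs only two floating-point operations per matrix element read and is therefore limited by memory bandwidth.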