Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

How to store A to get the fastest performance of A^T*x using cblas_dgemv?

Rajendra_G_
Beginner

Hello,

I am using cblas_dgemv to compute A^T*x. The matrix A is about 10000 rows x 20000 columns, and I am storing A in row-major format, i.e. A(i,j+1) is stored next to A(i,j).

My questions are as follows (the goal is the fastest execution time):

  1. Is it better to store A in row-major or column-major format? Does it matter?
  2. Is it better to store A and set TransA=CblasTrans, or to store A^T directly and use it with TransA=CblasNoTrans? (See the sketch after this list.)
  3. If the answer to #2 is to use A^T directly, is it better to store A^T in row-major or column-major format?
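
For concreteness, here is a sketch of the two calling options I am comparing; the function and array names are illustrative only.

#include <mkl.h>

/* Sketch: two equivalent ways to compute y = A^T * x with cblas_dgemv.
   A is m x n; x has length m, y has length n; alpha=1, beta=0. */
void at_times_x(const double *A,   /* m x n, row-major, lda >= n              */
                const double *At,  /* n x m, row-major copy of A^T, ldat >= m */
                const double *x, double *y,
                MKL_INT m, MKL_INT n, MKL_INT lda, MKL_INT ldat)
{
    /* Option 1: keep A as stored and let BLAS handle the transpose. */
    cblas_dgemv(CblasRowMajor, CblasTrans, m, n,
                1.0, A, lda, x, 1, 0.0, y, 1);

    /* Option 2: store A^T explicitly and use it without transposition. */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, m,
                1.0, At, ldat, x, 1, 0.0, y, 1);
}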

Another related question has to do with byte alignment. Let us say we are storing A in row-major format, where A has m rows and n columns. I have read that, when multithreading with OpenMP, it is better for avoiding false sharing if each row of A starts on an aligned boundary. A common way of doing that is to pad the number of columns so that the row length is divisible by 8 (8 doubles = 64 bytes), i.e. LDA = n + (8 - n%8). Does doing this help dgemv run faster?
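
For example, the padded allocation I have in mind looks something like the sketch below (mkl_malloc gives the 64-byte-aligned base pointer; the LDA formula here is adjusted so that no padding is added when n is already a multiple of 8).

#include <stddef.h>
#include <mkl.h>

/* Sketch: allocate A row-major so that every row starts on a 64-byte boundary.
   8 doubles * 8 bytes = 64 bytes, so round the row length up to a multiple of 8. */
double *alloc_padded(MKL_INT m, MKL_INT n, MKL_INT *lda)
{
    *lda = (n % 8 == 0) ? n : n + (8 - n % 8);
    /* mkl_malloc returns a 64-byte-aligned base pointer, so row i,
       which starts at A + i*(*lda), is also 64-byte aligned. */
    return (double *)mkl_malloc((size_t)m * (size_t)(*lda) * sizeof(double), 64);
}

The returned buffer would then be passed to cblas_dgemv with *lda as the leading dimension and released with mkl_free.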

Finally, for my calculation I need alpha=1 and beta=0. Does cblas_dgemv optimize for this trivial case, or does it still perform the extra, unnecessary calculations?

Thanks in advance for any help.

Zhen_Z_Intel
Employee

Hi Rajendra,

The performance of gemv depends on your matrix layout. For your case:

  • RowMajor (cblas): performance is better when m>n, so it would be better to store A^T (20k x 10k) and compute A^T*x with no transpose of A.
  • ColMajor (sgemv): performance is better when n>m, so store A (10k x 20k) directly, use the ColMajor layout, and transpose A.

Because the size of matrix A is not particularly special and is already large, the above advice may not give an impressive improvement. Here is what I tested on my side (alpha=1, beta=0):

[root@sae-skl01 mkl-sgemv]# ./cblas-sgemv 20000 10000 10000 101 10 28
m=20000,n=10000,lda=10000 layout=RowMajor cores=28 gflop=39.40978 peak=2060.80005 efficiency=0.01912
[root@sae-skl01 mkl-sgemv]# ./cblas-gemmv 20000 10000 20000 102 10 28
m=20000,n=10000,lda=20000 layout=ColMajor cores=28 gflop=27.73358 peak=2060.80005 efficiency=0.01346


[root@sae-skl01 mkl-sgemv]# ./sgemv-trans 10000 20000 10000 102 10 28
m=10000,n=20000,lda=10000 layout=ColMajor cores=28 gflop=35.03814 peak=2060.80005 efficiency=0.01700
[root@sae-skl01 mkl-sgemv]# ./sgemv-trans 10000 20000 20000 101 10 28
m=10000,n=20000,lda=20000 layout=RowMajor cores=28 gflop=21.69320 peak=2060.80005 efficiency=0.01053

For your question about false sharing: yes, you probably need to make LDA a multiple of the cache line size to get high performance. If LEVEL2_DCACHE_LINESIZE=64, then for single precision let LDA be a multiple of 16. You could follow this article: https://software.intel.com/en-us/mkl-linux-developer-guide-coding-techniques However, for your case I am afraid the matrices are large enough that this might not bring much performance improvement.
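
As a rough sketch of that rounding (assuming a 64-byte cache line, i.e. 16 floats per line):

#include <mkl.h>

/* Round a single-precision row length up to a whole number of
   64-byte cache lines (16 floats * 4 bytes = 64 bytes). */
MKL_INT round_lda_sp(MKL_INT n)
{
    return (n + 15) / 16 * 16;
}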

 

Rajendra_G_
Beginner

Hi Fiona,

Could you describe the command-line arguments to your programs, in particular the last four? Would it be possible to share the source code and compile settings for your test program?

Thanks

Raj

 

Zhen_Z_Intel
Employee

Hi Rajendra,

The arguments mean:

./cblas-sgemv m n lda layout loop cores

layout is CblasRowMajor (101) or CblasColMajor (102)

Please note that in this sample I just fill in random data row by row (RowMajor) to measure performance. If you want a correct result, you have to store the data column by column when using CblasColMajor (102).
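
For reference, a rough sketch of what a benchmark with that argument order might look like is below; this is illustrative only, not the exact test program used above.

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

/* Assumed usage: ./cblas-sgemv m n lda layout loop cores */
int main(int argc, char **argv)
{
    if (argc < 7) { printf("usage: %s m n lda layout loop cores\n", argv[0]); return 1; }
    MKL_INT m = atoi(argv[1]), n = atoi(argv[2]), lda = atoi(argv[3]);
    int layout = atoi(argv[4]);               /* 101 = CblasRowMajor, 102 = CblasColMajor */
    int loops  = atoi(argv[5]);
    mkl_set_num_threads(atoi(argv[6]));

    CBLAS_LAYOUT L = (layout == 101) ? CblasRowMajor : CblasColMajor;
    /* RowMajor needs m*lda elements (lda >= n); ColMajor needs lda*n (lda >= m). */
    size_t asize = (size_t)lda * (size_t)((layout == 101) ? m : n);

    float *A = (float *)mkl_malloc(asize * sizeof(float), 64);
    float *x = (float *)mkl_malloc((size_t)n * sizeof(float), 64);
    float *y = (float *)mkl_malloc((size_t)m * sizeof(float), 64);
    for (size_t i = 0; i < asize; ++i)  A[i] = (float)rand() / RAND_MAX;
    for (MKL_INT j = 0; j < n; ++j)     x[j] = (float)rand() / RAND_MAX;

    double t0 = dsecnd();                     /* MKL wall-clock timer */
    for (int i = 0; i < loops; ++i)
        cblas_sgemv(L, CblasNoTrans, m, n, 1.0f, A, lda, x, 1, 0.0f, y, 1);
    double t = (dsecnd() - t0) / loops;

    /* one gemv does 2*m*n floating-point operations */
    printf("m=%lld,n=%lld,lda=%lld gflops=%.5f\n",
           (long long)m, (long long)n, (long long)lda, 2.0 * m * n / t / 1e9);

    mkl_free(A); mkl_free(x); mkl_free(y);
    return 0;
}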

Best regards,
Fiona
