Why does cblas_daxpy() execute so slowly?

yecao · ‎08-09-2021

I attach a simple program, which takes 11 seconds to run.

I run the program with mkl_set_num_threads(8), so 8 threads are used.

int dim_1 = 1, dim_2 = 3000*3000*100;
std::vector<double> v1(dim_1*dim_2, 2.0);
std::vector<double> v2(dim_1*dim_2, 1.0);
for (int i=0; i<dim_1; i++) {
    cblas_daxpy(dim_2, 4.0, &v1[i], dim_1, &v2[i], dim_1);
}
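For reference, the operation requested by the call above can be written as a plain loop. This is only a sketch of what daxpy computes (y <- alpha*x + y) for positive strides. The name axpy_reference is made up for illustration and is not part of MKL, and the loop makes no attempt to match MKL's threading or vectorization.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cblas_daxpy, positive strides only:
// y[i*incy] = alpha * x[i*incx] + y[i*incy] for i = 0 .. n-1.
void axpy_reference(std::size_t n, double alpha,
                    const double* x, std::size_t incx,
                    double* y, std::size_t incy) {
    assert(incx > 0 && incy > 0);  // negative BLAS strides are not handled here
    for (std::size_t i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}
```

With dim_1 = 1 the stride is 1, so the loop in the post runs once and makes a single pass over both vectors: two loads, one multiply-add, and one store per element.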

If I diagonalize a matrix of dimension 3000 using a LAPACK program, it should finish within 10 seconds. We know the complexity of diagonalization is O(N^3), so for a 3000*3000 matrix the cost is on the order of 3000^3 = 2.7e10 operations. That should take much longer than the attached program, whose cost is on the order of 3000^2 * 100 = 9e8 operations.

I wonder why this is the case.

VidyalathaB_Intel · ‎08-10-2021

Hi,

Thanks for reaching out to us.

>> I attach a simple program, which takes 11 seconds to run.

Could you please share a minimal reproducer that also shows how you are measuring the time?

>> If I diagonalize a matrix of dimension 3000 using a LAPACK program, it should finish within 10 seconds

Please share a sample reproducer that also shows how the time is measured for the LAPACK program, so that we can work on it from our end.

Please let us know your environment details as well:

OS and version

MKL version

Regards,

Vidya.

yecao · ‎08-10-2021

Dear Vidya,

Thank you very much for your reply.

I attached a C++ file in which I compare the diagonalization of a 3000*3000 random matrix (~1 s) with a daxpy call on two vectors of length 3000*3000*200 (~3 s). On my computer, the diagonalization is much faster. We know the cost of diagonalization should be O(3000^3), so I cannot understand why daxpy is so slow.

Yours,

Ye

VidyalathaB_Intel · ‎08-11-2021

Hi,

We tried running your sample code in Ubuntu 20.04 with MKL 2021.3 and the results are as follows.

Our system details:

 lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
Stepping:                        10
CPU MHz:                         1200.170
CPU max MHz:                     4700.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7399.70
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        12 MiB
NUMA node0 CPU(s):               0-11

Time elapsed in dsyevd: 1.517391e+00

Time elapsed in daxpy: 1.340842e+00

>> On my computer, the diagonalization is much faster

Could you please provide your computer details along with the OS details and MKL version?

Regards,

Vidya.

yecao · ‎08-12-2021

Dear Vidya,

Thank you very much for your help.

I attach the MKL and CPU information.

The OS is Ubuntu 18.04 LTS Server. The server is a Dell PowerEdge T640. Do you think it is a hardware issue?

Major version: 2021
Minor version: 0
Update version: 2
Product status: Product
Build: 20210312
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors
================================================================

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 1000.492
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 16896K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

VidyalathaB_Intel · ‎08-16-2021

Hi,

>> Update version: 2

Could you please try compiling the code with the latest version of MKL (2021.3) and let us know if the issue still persists?

Regards,

Vidya.

yecao · ‎08-16-2021

Dear Vidya,

I installed the latest version (2021.3), and the problem still persists.

If I loop the daxpy call 100 times, the time taken by each iteration varies greatly. Some of the early iterations exceed 3 seconds, while most of the later ones take about 0.9 seconds.
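A per-pass timing loop along these lines makes the variation visible. It is a sketch, not the attached file: the name time_axpy_passes is invented here, and the inner loop is a plain C++ stand-in for cblas_daxpy so that it builds without MKL.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times `repeats` consecutive axpy passes over the same two vectors and
// returns the wall-clock seconds of each pass separately.
// The inner loop is a plain C++ stand-in; swap in cblas_daxpy to test MKL.
std::vector<double> time_axpy_passes(std::vector<double>& y,
                                     const std::vector<double>& x,
                                     double alpha, int repeats) {
    assert(x.size() == y.size());
    assert(repeats >= 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < y.size(); ++i) {
            y[i] += alpha * x[i];
        }
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

Reporting the first few passes separately from the rest helps distinguish a warm-up cost from the steady-state cost per pass.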

Would you please help me to address the issue?

VidyalathaB_Intel · ‎08-25-2021

Hi,

We are working on your issue. We will get back to you soon.

Regards,

Vidya.

Khang_N_Intel · ‎09-28-2021

Hi Ye,

I apologize for the delay in response.

We have tested on many systems (thanks, Vidya!) and were able to confirm the issue on most of them. My latest run, on a dual-socket Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz, confirmed the issue:

Time elapsed in dsyevd: 6.328132e-01

Time elapsed in daxpy: 1.168640e+00

The engineer is looking into why the issue cannot be reproduced on some systems.

We will let you know once we track down the issue and when a fix will be available.

Best regards,

Khang

Khang_N_Intel · ‎10-06-2021

Hi Ye,

I analyzed the issue and discussed it with the engineer, and this is what we found:

The memory allocated for the LAPACK function dsyevd is much smaller than the memory allocated for the BLAS function daxpy:

dsyevd:

double *w = (double *)mkl_calloc(dim, sizeof(double), 64);

double *dis = (double *)mkl_calloc(buffer_size, sizeof(double), 64);

daxpy:

double *v1 = (double *)mkl_calloc(vector_size, sizeof(double), 64);

double *v2 = (double *)mkl_calloc(vector_size, sizeof(double), 64);

With buffer_size = dim x dim = 9,000,000 elements of type double (about 72 MB for dim = 3000), the dsyevd working set is small, and blocked LAPACK routines reuse data in cache many times for each load from main memory.

With vector_size = dim x dim x 200 = 1.8e+9 elements of type double, each vector takes about 14.4 GB. Moreover, you allocated 2 vectors, which doubles the memory to about 28.8 GB. Therefore, the data used by daxpy has to stay in main memory, not in cache, and moving that much data is expensive.

Since, in your code, daxpy works on data that is much larger than that of dsyevd, and that data stays in main memory instead of cache, daxpy is slower than dsyevd.
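The same argument can be put in numbers with a back-of-the-envelope cost model. The sketch below is illustrative only: the struct and function names are invented, the figures use dim = 3000 and the factor 200 from this thread, and the dim^3 count for the eigensolver is an order of magnitude with the constant factor left out.

```cpp
#include <cstdint>

// Rough cost model; all names here are illustrative.
struct Cost {
    std::int64_t flops;  // floating-point operations
    std::int64_t bytes;  // bytes that must be read or written at least once
};

// daxpy on n doubles: one multiply and one add per element;
// it reads x, reads y, and writes y, so three 8-byte transfers per element.
constexpr Cost daxpy_cost(std::int64_t n) {
    return Cost{2 * n, 3 * n * static_cast<std::int64_t>(sizeof(double))};
}

// Dense symmetric eigensolver on a dim x dim matrix: order dim^3 operations
// (constant factor deliberately omitted) on a dim*dim input matrix.
constexpr Cost eig_cost_order(std::int64_t dim) {
    return Cost{dim * dim * dim,
                dim * dim * static_cast<std::int64_t>(sizeof(double))};
}
```

For vector_size = 1.8e9, daxpy_cost gives 3.6e9 floating-point operations against about 43.2 GB of memory traffic, roughly 12 bytes moved per operation. eig_cost_order(3000) gives on the order of 2.7e10 operations on a 72 MB matrix. daxpy is therefore limited by memory bandwidth rather than by arithmetic, so comparing the two routines by operation count alone is misleading.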

Please do not hesitate to let us know, should you have more questions.

Best regards,

Khang

Khang_N_Intel · ‎10-14-2021

Hi Ye,

Since the solution has been provided, there will be no more communication on this thread.

Best regards,

Khang