
Low micprun performance on a Xeon Phi 7250

maicon_f_

Maybe I'm missing something, so I would really appreciate it if someone could point out my mistake. I have a system with a Xeon Phi 7250, CentOS 7.3, Intel Parallel Studio XE 2018, and xppsl-1.5.4 installed. Only one DIMM slot is populated, with a 32 GB RDIMM at 2400 MHz. After several attempts at running an FFT code with low performance, I suspected that something is wrong with my system. I ran the micprun suite to get synthetic tests and compare them with the reference results.

For small matrices I got good performance:

RESULT: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2475.76      GFlops

REFERENCE: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2598.68      GFlops

But for 1024 x 1024:

RESULT: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
732.5      GFlops

REFERENCE: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
1776.22      GFlops

For bigger matrices the results are even worse:

RESULT: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
638.05      GFlops

REFERENCE: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
4321.84      GFlops
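
To put the gap in numbers, here is a small sketch that divides each measured value by its reference value. The figures are copied from the RESULT/REFERENCE pairs above; the function name is only illustrative.

```python
# (matrix size, measured GFlops, reference GFlops), copied from the output above.
RUNS = [
    (512, 2475.76, 2598.68),
    (1024, 732.5, 1776.22),
    (16384, 638.05, 4321.84),
]


def fraction_of_reference(measured, reference):
    """Return measured performance as a fraction of the reference value."""
    return measured / reference


for n, measured, reference in RUNS:
    pct = 100.0 * fraction_of_reference(measured, reference)
    print(f"{n:>6} x {n:<6} {pct:5.1f}% of reference")
```

The 512 case lands close to the reference, while the 16384 case reaches only a small fraction of it.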

I'm aware that for bigger matrices I would rely on DRAM, but I thought 16 GB of MCDRAM would be enough to run 1024x1024 (1GB) matrices. I'm considering buying more RDIMMs to populate all six channels, but I'm not sure that this will solve the problem.
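
As a rough check on the memory footprint, the sketch below assumes the benchmark holds three dense n x n single-precision matrices (A, B and C) and ignores any internal MKL work buffers. Both assumptions are mine, not taken from the micprun documentation.

```python
def sgemm_footprint_bytes(n, bytes_per_element=4, num_matrices=3):
    """Bytes needed for num_matrices dense n x n matrices (assumed: A, B, C in float32)."""
    return num_matrices * n * n * bytes_per_element


MCDRAM_BYTES = 16 * 2**30  # 16 GB of MCDRAM on the 7250, taken as 16 GiB here

for n in (1024, 16384):
    total = sgemm_footprint_bytes(n)
    fits = total <= MCDRAM_BYTES
    print(f"n={n}: {total / 2**20:.0f} MiB in total, fits in MCDRAM: {fits}")
```

Under these assumptions the 1024 case needs about 12 MiB and the 16384 case about 3 GiB (1 GiB per matrix), so capacity alone would not explain the drop. Where the data is actually placed matters more.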

Do you have any suggestions? Am I doing something wrong?

Best Regards,

Maicon Faria
Abax HPC

P.S:
Full result:

benchmarking: sgemm
timer       : native
num_threads : 0
min_niters  : 3
min_t       : 3.000000
first index : 16384
last  index : 16384
step        : 16384
fixed M     : -1
fixed N     : -1
fixed K     : -1
data transf.: maybe (depends on MKL AO setting)
threads used: 68 (autodetected)
threads/core: 1
affinity    : KMP_AFFINITY (if any)
MKL         : 2017.0.2 build 20170126 (Product)
processor   : Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors
CPU freq.   : 1.48 (may float due to scaling)
# cores aval: 68
max threads : 272
# of co-proc: 0
 
#0: NN
 
testing XGEMM( 'N', 'N', n, n, ... )
 
          n        min        avg        max     stddev
      16384     634.50     638.05     641.10  2.723e+00
*     16384     634.50     638.05     641.10  2.723e+00
 
[ DESCRIPTION ] 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg 638.05 GFlops R
***********************************ROLLED UP************************************
*************************************sgemm**************************************
*****************************local__mcdram_example******************************
512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2475.76      GFlops
 
1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
732.5      GFlops
 
1536 x 1536 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1536 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1536 --s_step 1536
922.04      GFlops
 
2048 x 2048 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 2048 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 2048 --s_step 2048
1004.52      GFlops
 
2560 x 2560 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 2560 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 2560 --s_step 2560
1026.21      GFlops
 
3072 x 3072 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 3072 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 3072 --s_step 3072
916.68      GFlops
 
3584 x 3584 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 3584 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 3584 --s_step 3584
828.52      GFlops
 
4096 x 4096 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 4096 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 4096 --s_step 4096
1015.26      GFlops
 
4608 x 4608 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 4608 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 4608 --s_step 4608
1073.56      GFlops
 
5120 x 5120 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 5120 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 5120 --s_step 5120
1160.63      GFlops
 
5632 x 5632 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 5632 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 5632 --s_step 5632
1205.76      GFlops
 
6144 x 6144 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 6144 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 6144 --s_step 6144
1254.58      GFlops
 
6656 x 6656 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 6656 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 6656 --s_step 6656
1314.4      GFlops
 
7168 x 7168 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 7168 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 7168 --s_step 7168
1366.14      GFlops
 
7680 x 7680 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 7680 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 7680 --s_step 7680
1344.64      GFlops
 
8192 x 8192 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 8192 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 8192 --s_step 8192
745.78      GFlops
 
8704 x 8704 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 8704 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 8704 --s_step 8704
739.81      GFlops
 
9216 x 9216 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 9216 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 9216 --s_step 9216
701.58      GFlops
 
9728 x 9728 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 9728 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 9728 --s_step 9728
721.16      GFlops
 
10240 x 10240 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 10240 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 10240 --s_step 10240
679.92      GFlops
 
10752 x 10752 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 10752 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 10752 --s_step 10752
677.37      GFlops
 
11264 x 11264 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 11264 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 11264 --s_step 11264
684.59      GFlops
 
11776 x 11776 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 11776 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 11776 --s_step 11776
656.25      GFlops
 
12288 x 12288 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 12288 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 12288 --s_step 12288
692.47      GFlops
 
12800 x 12800 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 12800 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 12800 --s_step 12800
624.84      GFlops
 
13312 x 13312 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 13312 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 13312 --s_step 13312
558.92      GFlops
 
13824 x 13824 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 13824 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 13824 --s_step 13824
664.77      GFlops
 
14336 x 14336 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 14336 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 14336 --s_step 14336
694.58      GFlops
 
14848 x 14848 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 14848 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 14848 --s_step 14848
684.85      GFlops
 
15360 x 15360 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 15360 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 15360 --s_step 15360
678.84      GFlops
 
15872 x 15872 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 15872 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 15872 --s_step 15872
657.59      GFlops
 
16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
638.05      GFlops
 
********************************************************************************
***********local__mcdram_7250_redhat-7.2_micperf-1.5.2_local_scaling************
512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2598.68      GFlops
 
1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
1776.22      GFlops
 
1536 x 1536 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 1536 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1536 --s_step 1536
2408.19      GFlops
 
2048 x 2048 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 2048 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 2048 --s_step 2048
2753.62      GFlops
 
2560 x 2560 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 2560 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 2560 --s_step 2560
3157.43      GFlops
 
3072 x 3072 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 3072 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 3072 --s_step 3072
3324.94      GFlops
 
3584 x 3584 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 3584 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 3584 --s_step 3584
3488.82      GFlops
 
4096 x 4096 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 4096 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 4096 --s_step 4096
3810.68      GFlops
 
4608 x 4608 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 4608 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 4608 --s_step 4608
3967.6      GFlops
 
5120 x 5120 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 5120 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 5120 --s_step 5120
4023.72      GFlops
 
5632 x 5632 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 5632 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 5632 --s_step 5632
4094.68      GFlops
 
6144 x 6144 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 6144 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 6144 --s_step 6144
4132.83      GFlops
 
6656 x 6656 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 6656 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 6656 --s_step 6656
4082.72      GFlops
 
7168 x 7168 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 7168 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 7168 --s_step 7168
4147.48      GFlops
 
7680 x 7680 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 7680 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 7680 --s_step 7680
4146.49      GFlops
 
8192 x 8192 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 8192 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 8192 --s_step 8192
4195.76      GFlops
 
8704 x 8704 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 8704 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 8704 --s_step 8704
4250.19      GFlops
 
9216 x 9216 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 9216 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 9216 --s_step 9216
4263.39      GFlops
 
9728 x 9728 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 9728 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 9728 --s_step 9728
4229.29      GFlops
 
10240 x 10240 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 10240 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 10240 --s_step 10240
4255.04      GFlops
 
10752 x 10752 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 10752 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 10752 --s_step 10752
4247.74      GFlops
 
11264 x 11264 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 11264 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 11264 --s_step 11264
4274.98      GFlops
 
11776 x 11776 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 11776 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 11776 --s_step 11776
4258.92      GFlops
 
12288 x 12288 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 12288 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 12288 --s_step 12288
4299.45      GFlops
 
12800 x 12800 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 12800 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 12800 --s_step 12800
4283.56      GFlops
 
13312 x 13312 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 13312 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 13312 --s_step 13312
4295.48      GFlops
 
13824 x 13824 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 13824 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 13824 --s_step 13824
4283.7      GFlops
 
14336 x 14336 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 14336 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 14336 --s_step 14336
4316.61      GFlops
 
14848 x 14848 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 14848 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 14848 --s_step 14848
4282.81      GFlops
 
15360 x 15360 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 15360 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 15360 --s_step 15360
4286.2      GFlops
 
15872 x 15872 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 15872 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 15872 --s_step 15872
4321.02      GFlops
 
16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters:  --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
4321.84      GFlops
 
********************************************************************************
********************************************************************************
********************************************************************************
McCalpinJohn

What mode is your system configured in?   (The performance issues are likely to be different in "flat" vs "cache" modes.)

maicon_f_

McCalpin, John wrote:

What mode is your system configured in?   (The performance issues are likely to be different in "flat" vs "cache" modes.)

Hi, that result was for SNC-4, flat.
Trying cache mode, I got a nice result. I was not expecting that, since I saw in the documentation that flat was recommended for SGEMM.
 

KERNEL, OFFLOAD, TAG
sgemm, local, ddr_example

DESCRIPTION, f_first_matrix_size, i_num_rep, T_device, n_num_thread, m_mode, l_last_matrix_size, s_step, Task.Computation.Avg (GFlops)
512 x 512 MKL SGEMM with 0 threads and 3 iterations, 512, 3, -1, 0, NN, 512, 512, 2201.68
1024 x 1024 MKL SGEMM with 0 threads and 3 iterations, 1024, 3, -1, 0, NN, 1024, 1024, 1794.76
1536 x 1536 MKL SGEMM with 0 threads and 3 iterations, 1536, 3, -1, 0, NN, 1536, 1536, 2383.83
2048 x 2048 MKL SGEMM with 0 threads and 3 iterations, 2048, 3, -1, 0, NN, 2048, 2048, 2596.83
2560 x 2560 MKL SGEMM with 0 threads and 3 iterations, 2560, 3, -1, 0, NN, 2560, 2560, 3540.39
3072 x 3072 MKL SGEMM with 0 threads and 3 iterations, 3072, 3, -1, 0, NN, 3072, 3072, 3740.34
3584 x 3584 MKL SGEMM with 0 threads and 3 iterations, 3584, 3, -1, 0, NN, 3584, 3584, 3928.0
4096 x 4096 MKL SGEMM with 0 threads and 3 iterations, 4096, 3, -1, 0, NN, 4096, 4096, 3936.0
4608 x 4608 MKL SGEMM with 0 threads and 3 iterations, 4608, 3, -1, 0, NN, 4608, 4608, 4263.12
5120 x 5120 MKL SGEMM with 0 threads and 3 iterations, 5120, 3, -1, 0, NN, 5120, 5120, 4247.05
5632 x 5632 MKL SGEMM with 0 threads and 3 iterations, 5632, 3, -1, 0, NN, 5632, 5632, 4363.59
6144 x 6144 MKL SGEMM with 0 threads and 3 iterations, 6144, 3, -1, 0, NN, 6144, 6144, 4357.05
6656 x 6656 MKL SGEMM with 0 threads and 3 iterations, 6656, 3, -1, 0, NN, 6656, 6656, 4374.95
7168 x 7168 MKL SGEMM with 0 threads and 3 iterations, 7168, 3, -1, 0, NN, 7168, 7168, 4399.82
7680 x 7680 MKL SGEMM with 0 threads and 3 iterations, 7680, 3, -1, 0, NN, 7680, 7680, 4302.77
8192 x 8192 MKL SGEMM with 0 threads and 3 iterations, 8192, 3, -1, 0, NN, 8192, 8192, 4376.13
8704 x 8704 MKL SGEMM with 0 threads and 3 iterations, 8704, 3, -1, 0, NN, 8704, 8704, 4389.37
9216 x 9216 MKL SGEMM with 0 threads and 3 iterations, 9216, 3, -1, 0, NN, 9216, 9216, 4388.52
9728 x 9728 MKL SGEMM with 0 threads and 3 iterations, 9728, 3, -1, 0, NN, 9728, 9728, 4357.71
10240 x 10240 MKL SGEMM with 0 threads and 3 iterations, 10240, 3, -1, 0, NN, 10240, 10240, 4396.34
10752 x 10752 MKL SGEMM with 0 threads and 3 iterations, 10752, 3, -1, 0, NN, 10752, 10752, 4388.54
11264 x 11264 MKL SGEMM with 0 threads and 3 iterations, 11264, 3, -1, 0, NN, 11264, 11264, 4385.23
11776 x 11776 MKL SGEMM with 0 threads and 3 iterations, 11776, 3, -1, 0, NN, 11776, 11776, 4336.85
12288 x 12288 MKL SGEMM with 0 threads and 3 iterations, 12288, 3, -1, 0, NN, 12288, 12288, 4362.61
12800 x 12800 MKL SGEMM with 0 threads and 3 iterations, 12800, 3, -1, 0, NN, 12800, 12800, 4351.96
13312 x 13312 MKL SGEMM with 0 threads and 3 iterations, 13312, 3, -1, 0, NN, 13312, 13312, 4379.92
13824 x 13824 MKL SGEMM with 0 threads and 3 iterations, 13824, 3, -1, 0, NN, 13824, 13824, 1268.28
14336 x 14336 MKL SGEMM with 0 threads and 3 iterations, 14336, 3, -1, 0, NN, 14336, 14336, 4402.25
14848 x 14848 MKL SGEMM with 0 threads and 3 iterations, 14848, 3, -1, 0, NN, 14848, 14848, 4383.5
15360 x 15360 MKL SGEMM with 0 threads and 3 iterations, 15360, 3, -1, 0, NN, 15360, 15360, 4370.94
15872 x 15872 MKL SGEMM with 0 threads and 3 iterations, 15872, 3, -1, 0, NN, 15872, 15872, 4372.85
16384 x 16384 MKL SGEMM with 0 threads and 3 iterations, 16384, 3, -1, 0, NN, 16384, 16384, 4373.31

McCalpinJohn

SNC-4 mode is a pain to control properly.   It works well when you can run 4 MPI tasks per node and can launch these via a script that computes the correct NUMA node number for binding the memory to the correct MCDRAM NUMA node.  A single shared-memory executable is not going to be able to use the MCDRAM in all four quadrants without ugly explicit code to place data using the "memkind" library (http://memkind.github.io/memkind/) or the NUMA APIs (e.g., http://man7.org/linux/man-pages/man3/numa.3.html).
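
A minimal sketch of the kind of launch helper described above, assuming the usual SNC-4 flat numbering in which NUMA nodes 0-3 hold the cores and DDR4 of each cluster and nodes 4-7 hold the matching MCDRAM. That numbering is an assumption and should be checked with numactl --hardware on the actual machine. The function name is hypothetical, and obtaining the local rank from the MPI launcher is not shown because it is launcher-specific.

```python
import shlex


def snc4_flat_numactl(local_rank, executable, num_clusters=4):
    """Build a numactl command that pins one MPI rank to one SNC-4 cluster.

    ASSUMPTION: nodes 0..3 are cores + DDR4, nodes 4..7 are the MCDRAM of
    the same cluster. Verify with `numactl --hardware` before relying on it.
    """
    cluster = local_rank % num_clusters
    mcdram_node = cluster + num_clusters
    return [
        "numactl",
        f"--cpunodebind={cluster}",
        f"--membind={mcdram_node}",
        executable,
    ]


for rank in range(4):
    print(shlex.join(snc4_flat_numactl(rank, "./a.out")))
```

Each of the four ranks then runs on its own cluster and allocates only from that cluster's MCDRAM.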

For a single shared-memory executable, Flat-Quadrant mode almost always gives the best performance.  The default memory placement is DDR4 (NUMA node 0), but in this case there is only one non-default location (MCDRAM) so that is referred to as NUMA node 1.   If the job will fit entirely in MCDRAM, then it can easily be launched with

numactl --membind=1 ./a.out

If the job requires more memory than is available in MCDRAM, you either need a very detailed understanding of the access patterns and a carefully controlled explicit staging of data, or you can just use cache mode and accept whatever speedup is available for "free".
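
To confirm which NUMA node is the MCDRAM before using --membind, one option is to look for nodes that have memory but no CPUs in the output of numactl --hardware. The sketch below parses such output. The SAMPLE text is made up for illustration (a flat, quadrant-style layout with a shortened CPU list), not captured from the system in this thread.

```python
import re

# Illustrative numactl --hardware output; the values are invented.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32000 MB
node 0 free: 30000 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15900 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
"""


def memory_only_nodes(numactl_hardware_text):
    """Return {node_id: size_in_MB} for NUMA nodes that list no CPUs."""
    cpus, sizes = {}, {}
    for line in numactl_hardware_text.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)", line)
        if match:
            cpus[int(match.group(1))] = match.group(2).split()
            continue
        match = re.match(r"node (\d+) size: (\d+) MB", line)
        if match:
            sizes[int(match.group(1))] = int(match.group(2))
    return {n: sizes[n] for n, ids in cpus.items() if not ids and n in sizes}


print(memory_only_nodes(SAMPLE))
```

In flat mode the CPU-less nodes are the MCDRAM candidates. In cache mode no such node should show up, because the MCDRAM is not exposed as addressable memory.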

maicon_f_

Thank you, Dr. Bandwidth, that will help now that we are working with real application tests.

Best Regards,

 

Maicon
