Low micprun performance on a Xeon Phi 7250

maicon_f_ · ‎03-29-2018

Maybe I'm missing something, so I would appreciate it if someone could point out my mistake. I have a system with a Xeon Phi 7250, CentOS 7.3, Intel Parallel Studio XE 2018, and xppsl-1.5.4 installed. Only one DIMM slot is populated, with a 32 GB 2400 MHz RDIMM. After several attempts at running an FFT code with low performance, I suspected that something is wrong with my system, so I ran the micprun suite to get synthetic benchmarks and compare them with the reference results.

For small matrices I got good performance:

RESULT: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2475.76 GFlops

REFERENCE: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2598.68 GFlops

But for 1024 x 1024:

RESULT: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
732.5 GFlops

REFERENCE: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
1776.22 GFlops

For bigger matrices the results are even worse:

RESULT: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
638.05 GFlops

REFERENCE: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
4321.84 GFlops
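
To make the gap easier to read, the three RESULT/REFERENCE pairs quoted above can be expressed as a fraction of the reference. This is only a small arithmetic sketch; the function name `fraction_of_reference` is made up for illustration and is not part of micprun.

```python
# (matrix size n, measured GFlops, reference GFlops), copied from the runs quoted above.
runs = [
    (512, 2475.76, 2598.68),
    (1024, 732.5, 1776.22),
    (16384, 638.05, 4321.84),
]


def fraction_of_reference(measured, reference):
    """Return the measured throughput as a fraction of the reference value."""
    return measured / reference


ratios = {n: fraction_of_reference(m, r) for n, m, r in runs}
for n, ratio in ratios.items():
    print(f"{n:>6} x {n:<6} {ratio:6.1%} of reference")
```

By this measure the 512 case is close to the reference, while the two larger cases fall well below half of it.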

I'm aware that for bigger matrices I would rely on DRAM, but I thought the 16 GB of MCDRAM would be enough to run 1024x1024 (1GB) matrices. I'm considering buying more RDIMMs to populate all six channels, but I'm not sure that this will solve the problem.
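
As a rough check of the footprint estimate, here is a sketch that assumes SGEMM touches three dense single-precision n x n matrices (A, B and C) and ignores any internal buffers MKL may allocate. Under that assumption a 1024 x 1024 matrix is about 4 MiB, and the 1 GB figure corresponds to a single 16384 x 16384 matrix. Either way, the working set is far below 16 GB.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2
GIB = 1024 ** 3


def sgemm_footprint_bytes(n, matrices=3):
    """Bytes needed for `matrices` dense n x n float32 arrays (A, B and C by default)."""
    return matrices * n * n * BYTES_PER_FLOAT32


for n in (1024, 16384):
    per_matrix = sgemm_footprint_bytes(n, matrices=1)
    total = sgemm_footprint_bytes(n)
    print(f"n={n}: {per_matrix / MIB:.0f} MiB per matrix, {total / GIB:.3f} GiB for A, B and C")
```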

Do you have any suggestions? Am I doing something wrong?

Best Regards,

Maicon Faria
Abax HPC

P.S:
Full result:

benchmarking: sgemm

timer : native

num_threads : 0

min_niters : 3

min_t : 3.000000

first index : 16384

last index : 16384

step : 16384

fixed M : -1

fixed N : -1

fixed K : -1

data transf.: maybe (depends on MKL AO setting)

threads used: 68 (autodetected)

threads/core: 1

affinity : KMP_AFFINITY (if any)

MKL : 2017.0.2 build 20170126 (Product)

processor : Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors

CPU freq. : 1.48 (may float due to scaling)

# cores aval: 68

max threads : 272

# of co-proc: 0

#0: NN

testing XGEMM( 'N', 'N', n, n, ... )

n min avg max stddev

16384 634.50 638.05 641.10 2.723e+00

* 16384 634.50 638.05 641.10 2.723e+00

[ DESCRIPTION ] 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations

[ PERFORMANCE ] Task.Computation.Avg 638.05 GFlops R

***********************************ROLLED UP************************************

*************************************sgemm**************************************

Each row is an "n x n MKL SGEMM with 0 threads and 3 iterations" run with
Parameters: --f_first_matrix_size n --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size n --s_step n

n, local__mcdram_example (GFlops), local__mcdram_7250_redhat-7.2_micperf-1.5.2_local_scaling (GFlops)
512, 2475.76, 2598.68
1024, 732.5, 1776.22
1536, 922.04, 2408.19
2048, 1004.52, 2753.62
2560, 1026.21, 3157.43
3072, 916.68, 3324.94
3584, 828.52, 3488.82
4096, 1015.26, 3810.68
4608, 1073.56, 3967.6
5120, 1160.63, 4023.72
5632, 1205.76, 4094.68
6144, 1254.58, 4132.83
6656, 1314.4, 4082.72
7168, 1366.14, 4147.48
7680, 1344.64, 4146.49
8192, 745.78, 4195.76
8704, 739.81, 4250.19
9216, 701.58, 4263.39
9728, 721.16, 4229.29
10240, 679.92, 4255.04
10752, 677.37, 4247.74
11264, 684.59, 4274.98
11776, 656.25, 4258.92
12288, 692.47, 4299.45
12800, 624.84, 4283.56
13312, 558.92, 4295.48
13824, 664.77, 4283.7
14336, 694.58, 4316.61
14848, 684.85, 4282.81
15360, 678.84, 4286.2
15872, 657.59, 4321.02
16384, 638.05, 4321.84

********************************************************************************

McCalpinJohn · ‎03-30-2018

What mode is your system configured in? (The performance issues are likely to be different in "flat" vs "cache" modes.)

maicon_f_ · ‎03-31-2018

Hi, that result was for SNC-4, flat.
Trying cache mode, I got a nice result. I was not expecting that, since the documentation recommends flat mode for SGEMM.

KERNEL, OFFLOAD, TAG
sgemm, local, ddr_example

DESCRIPTION, f_first_matrix_size, i_num_rep, T_device, n_num_thread, m_mode, l_last_matrix_size, s_step, Task.Computation.Avg (GFlops)
512 x 512 MKL SGEMM with 0 threads and 3 iterations, 512, 3, -1, 0, NN, 512, 512, 2201.68
1024 x 1024 MKL SGEMM with 0 threads and 3 iterations, 1024, 3, -1, 0, NN, 1024, 1024, 1794.76
1536 x 1536 MKL SGEMM with 0 threads and 3 iterations, 1536, 3, -1, 0, NN, 1536, 1536, 2383.83
2048 x 2048 MKL SGEMM with 0 threads and 3 iterations, 2048, 3, -1, 0, NN, 2048, 2048, 2596.83
2560 x 2560 MKL SGEMM with 0 threads and 3 iterations, 2560, 3, -1, 0, NN, 2560, 2560, 3540.39
3072 x 3072 MKL SGEMM with 0 threads and 3 iterations, 3072, 3, -1, 0, NN, 3072, 3072, 3740.34
3584 x 3584 MKL SGEMM with 0 threads and 3 iterations, 3584, 3, -1, 0, NN, 3584, 3584, 3928.0
4096 x 4096 MKL SGEMM with 0 threads and 3 iterations, 4096, 3, -1, 0, NN, 4096, 4096, 3936.0
4608 x 4608 MKL SGEMM with 0 threads and 3 iterations, 4608, 3, -1, 0, NN, 4608, 4608, 4263.12
5120 x 5120 MKL SGEMM with 0 threads and 3 iterations, 5120, 3, -1, 0, NN, 5120, 5120, 4247.05
5632 x 5632 MKL SGEMM with 0 threads and 3 iterations, 5632, 3, -1, 0, NN, 5632, 5632, 4363.59
6144 x 6144 MKL SGEMM with 0 threads and 3 iterations, 6144, 3, -1, 0, NN, 6144, 6144, 4357.05
6656 x 6656 MKL SGEMM with 0 threads and 3 iterations, 6656, 3, -1, 0, NN, 6656, 6656, 4374.95
7168 x 7168 MKL SGEMM with 0 threads and 3 iterations, 7168, 3, -1, 0, NN, 7168, 7168, 4399.82
7680 x 7680 MKL SGEMM with 0 threads and 3 iterations, 7680, 3, -1, 0, NN, 7680, 7680, 4302.77
8192 x 8192 MKL SGEMM with 0 threads and 3 iterations, 8192, 3, -1, 0, NN, 8192, 8192, 4376.13
8704 x 8704 MKL SGEMM with 0 threads and 3 iterations, 8704, 3, -1, 0, NN, 8704, 8704, 4389.37
9216 x 9216 MKL SGEMM with 0 threads and 3 iterations, 9216, 3, -1, 0, NN, 9216, 9216, 4388.52
9728 x 9728 MKL SGEMM with 0 threads and 3 iterations, 9728, 3, -1, 0, NN, 9728, 9728, 4357.71
10240 x 10240 MKL SGEMM with 0 threads and 3 iterations, 10240, 3, -1, 0, NN, 10240, 10240, 4396.34
10752 x 10752 MKL SGEMM with 0 threads and 3 iterations, 10752, 3, -1, 0, NN, 10752, 10752, 4388.54
11264 x 11264 MKL SGEMM with 0 threads and 3 iterations, 11264, 3, -1, 0, NN, 11264, 11264, 4385.23
11776 x 11776 MKL SGEMM with 0 threads and 3 iterations, 11776, 3, -1, 0, NN, 11776, 11776, 4336.85
12288 x 12288 MKL SGEMM with 0 threads and 3 iterations, 12288, 3, -1, 0, NN, 12288, 12288, 4362.61
12800 x 12800 MKL SGEMM with 0 threads and 3 iterations, 12800, 3, -1, 0, NN, 12800, 12800, 4351.96
13312 x 13312 MKL SGEMM with 0 threads and 3 iterations, 13312, 3, -1, 0, NN, 13312, 13312, 4379.92
13824 x 13824 MKL SGEMM with 0 threads and 3 iterations, 13824, 3, -1, 0, NN, 13824, 13824, 1268.28
14336 x 14336 MKL SGEMM with 0 threads and 3 iterations, 14336, 3, -1, 0, NN, 14336, 14336, 4402.25
14848 x 14848 MKL SGEMM with 0 threads and 3 iterations, 14848, 3, -1, 0, NN, 14848, 14848, 4383.5
15360 x 15360 MKL SGEMM with 0 threads and 3 iterations, 15360, 3, -1, 0, NN, 15360, 15360, 4370.94
15872 x 15872 MKL SGEMM with 0 threads and 3 iterations, 15872, 3, -1, 0, NN, 15872, 15872, 4372.85
16384 x 16384 MKL SGEMM with 0 threads and 3 iterations, 16384, 3, -1, 0, NN, 16384, 16384, 4373.31

McCalpinJohn · ‎04-02-2018

SNC-4 mode is a pain to control properly. It works well when you can run 4 MPI tasks per node and can launch these via a script that computes the correct NUMA node number for binding the memory to the correct MCDRAM NUMA node. A single shared-memory executable is not going to be able to use the MCDRAM in all four quadrants without ugly explicit code to place data using the "memkind" library (http://memkind.github.io/memkind/) or the NUMA APIs (e.g., http://man7.org/linux/man-pages/man3/numa.3.html).
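
A launch wrapper of the kind described above might look roughly like the following sketch. It rests on assumptions that should be checked on the actual machine: that in SNC-4 flat mode NUMA nodes 0-3 hold the cores and DDR4 while nodes 4-7 are the MCDRAM of the matching quadrant, and that the MPI launcher exports the per-node rank in `MPI_LOCALRANKID` (Intel MPI) or `OMPI_COMM_WORLD_LOCAL_RANK` (Open MPI). The helper names are hypothetical.

```python
import os
import shlex

# Assumption: SNC-4 flat mode exposes DDR4+core nodes 0-3 and MCDRAM-only
# nodes 4-7. Confirm the layout with `numactl --hardware` before using this.
CLUSTERS = 4
MCDRAM_NODE_OFFSET = 4


def numactl_command(local_rank, program):
    """Build a numactl command line that pins one MPI task to one quadrant."""
    cluster = local_rank % CLUSTERS
    mcdram_node = cluster + MCDRAM_NODE_OFFSET
    return [
        "numactl",
        f"--cpunodebind={cluster}",
        f"--membind={mcdram_node}",
        *program,
    ]


def local_rank_from_env(environ=os.environ):
    """Read the per-node rank from launcher variables (assumed names); default to 0."""
    for name in ("MPI_LOCALRANKID", "OMPI_COMM_WORLD_LOCAL_RANK"):
        if name in environ:
            return int(environ[name])
    return 0


if __name__ == "__main__":
    # Print the command this task would run instead of executing it.
    print(shlex.join(numactl_command(local_rank_from_env(), ["./a.out"])))
```

With four tasks per node, each local rank would end up bound to its own quadrant's cores and MCDRAM, provided each task's data fits in that quadrant's share of MCDRAM.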

For a single shared-memory executable, Flat-Quadrant mode almost always gives the best performance. The default memory placement is DDR4 (NUMA node 0), but in this case there is only one non-default location (MCDRAM) so that is referred to as NUMA node 1. If the job will fit entirely in MCDRAM, then it can easily be launched with

numactl --membind=1 ./a.out

If the job requires more memory than is available in MCDRAM, you either need a very detailed understanding of the access patterns and a carefully controlled explicit staging of data, or you can just use cache mode and accept whatever speedup is available for "free".
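
Before relying on `--membind=1`, it can be worth confirming which NUMA node is the CPU-less MCDRAM node on a given system. The sketch below reads the Linux sysfs `cpulist` files and reports nodes with no CPUs attached. The function name is hypothetical, and treating "no CPUs" as "MCDRAM" is an assumption that holds for flat-mode Xeon Phi but not for every machine.

```python
from pathlib import Path


def cpuless_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the ids of NUMA nodes whose cpulist is empty (memory-only nodes)."""
    nodes = []
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        # A memory-only node has a cpulist file containing just a newline.
        if cpulist.is_file() and cpulist.read_text().strip() == "":
            nodes.append(int(node_dir.name[len("node"):]))
    return sorted(nodes)


if __name__ == "__main__":
    print("memory-only NUMA nodes:", cpuless_numa_nodes())
```

In Flat-Quadrant mode this would be expected to report a single node, which is the one to pass to `numactl --membind`.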

maicon_f_ · ‎05-04-2018

Thank you Dr. Bandwidth, that will help now that we are working with real application tests.

Best Regards,

Maicon