Maybe I'm missing something, so I would appreciate it a lot if someone could point out my mistake. I have a system with a 7250, CentOS 7.3, Intel Parallel Studio XE 2018, and xppsl-1.5.4 installed. Only one DIMM slot is populated, with a 32 GB 2400 MHz RDIMM. After several attempts at an FFT code gave low performance, I suspected that something was wrong with my system. I ran the micprun suite to get synthetic test results and compare them with the reference ones.
For small matrices I got good performance:
RESULT: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2475.76 GFlops
REFERENCE: 512 x 512 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 512 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 512 --s_step 512
2598.68 GFlops
But for 1024 x 1024:
RESULT: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
732.5 GFlops
REFERENCE: 1024 x 1024 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 1024 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 1024 --s_step 1024
1776.22 GFlops
For bigger matrices the results are even worse:
RESULT: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
638.05 GFlops
REFERENCE: 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
Parameters: --f_first_matrix_size 16384 --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size 16384 --s_step 16384
4321.84 GFlops
I'm aware that bigger matrices would have to rely on DRAM, but with 16 GB of MCDRAM I thought there would be enough room to run 1024 x 1024 matrices (about 4 MB each in single precision). I'm considering buying more RDIMMs to populate all six channels, but I'm not sure that this will solve the problem.
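As a rough check of that sizing, the sketch below estimates the SGEMM working set. It assumes three square single-precision matrices (A, B and C) at 4 bytes per element and nothing else allocated; the function name is made up for illustration, and 16 GiB is taken as the nominal MCDRAM size.

```python
def sgemm_footprint_bytes(n, bytes_per_element=4):
    """Bytes needed to hold A, B and C for an n x n single-precision GEMM."""
    return 3 * n * n * bytes_per_element


MCDRAM_BYTES = 16 * 1024**3  # assumed nominal MCDRAM capacity (16 GiB)

for n in (512, 1024, 16384):
    need = sgemm_footprint_bytes(n)
    # Report the footprint in MiB and as a share of MCDRAM.
    print(f"{n:>6} x {n:<6} {need / 1024**2:10.1f} MiB  "
          f"{100 * need / MCDRAM_BYTES:6.2f}% of MCDRAM")
```

Under these assumptions even the 16384 x 16384 case needs about 3 GiB, so capacity by itself would not force the run out of MCDRAM.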
Do you have any suggestions? Am I doing something wrong?
Best Regards,
Maicon Faria
Abax HPC
P.S:
Full result:
Last run (16384 x 16384):
benchmarking: sgemm
timer : native
num_threads : 0
min_niters : 3
min_t : 3.000000
first index : 16384
last index : 16384
step : 16384
fixed M : -1
fixed N : -1
fixed K : -1
data transf.: maybe (depends on MKL AO setting)
threads used: 68 (autodetected)
threads/core: 1
affinity : KMP_AFFINITY (if any)
MKL : 2017.0.2 build 20170126 (Product)
processor : Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors
CPU freq. : 1.48 (may float due to scaling)
# cores aval: 68
max threads : 272
# of co-proc: 0
#0: NN
testing XGEMM( 'N', 'N', n, n, ... )
n min avg max stddev
16384 634.50 638.05 641.10 2.723e+00
* 16384 634.50 638.05 641.10 2.723e+00
[ DESCRIPTION ] 16384 x 16384 MKL SGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg 638.05 GFlops R

ROLLED UP
Kernel: sgemm
RESULT: local__mcdram_example
REFERENCE: local__mcdram_7250_redhat-7.2_micperf-1.5.2_local_scaling
Each row is an "n x n MKL SGEMM with 0 threads and 3 iterations" run with Parameters: --f_first_matrix_size n --i_num_rep 3 --T_device -1 --n_num_thread 0 --m_mode NN --l_last_matrix_size n --s_step n
n, RESULT (GFlops), REFERENCE (GFlops)
512, 2475.76, 2598.68
1024, 732.5, 1776.22
1536, 922.04, 2408.19
2048, 1004.52, 2753.62
2560, 1026.21, 3157.43
3072, 916.68, 3324.94
3584, 828.52, 3488.82
4096, 1015.26, 3810.68
4608, 1073.56, 3967.6
5120, 1160.63, 4023.72
5632, 1205.76, 4094.68
6144, 1254.58, 4132.83
6656, 1314.4, 4082.72
7168, 1366.14, 4147.48
7680, 1344.64, 4146.49
8192, 745.78, 4195.76
8704, 739.81, 4250.19
9216, 701.58, 4263.39
9728, 721.16, 4229.29
10240, 679.92, 4255.04
10752, 677.37, 4247.74
11264, 684.59, 4274.98
11776, 656.25, 4258.92
12288, 692.47, 4299.45
12800, 624.84, 4283.56
13312, 558.92, 4295.48
13824, 664.77, 4283.7
14336, 694.58, 4316.61
14848, 684.85, 4282.81
15360, 678.84, 4286.2
15872, 657.59, 4321.02
16384, 638.05, 4321.84
What mode is your system configured in? (The performance issues are likely to be different in "flat" vs "cache" modes.)
McCalpin, John wrote:
What mode is your system configured in? (The performance issues are likely to be different in "flat" vs "cache" modes.)
Hi, that result was for SNC-4, flat.
Trying cache mode, I got a nice result. I was not expecting that, since I saw in the documentation that flat mode was recommended for SGEMM.
KERNEL, OFFLOAD, TAG
sgemm, local, ddr_example
DESCRIPTION, f_first_matrix_size, i_num_rep, T_device, n_num_thread, m_mode, l_last_matrix_size, s_step, Task.Computation.Avg (GFlops)
512 x 512 MKL SGEMM with 0 threads and 3 iterations, 512, 3, -1, 0, NN, 512, 512, 2201.68
1024 x 1024 MKL SGEMM with 0 threads and 3 iterations, 1024, 3, -1, 0, NN, 1024, 1024, 1794.76
1536 x 1536 MKL SGEMM with 0 threads and 3 iterations, 1536, 3, -1, 0, NN, 1536, 1536, 2383.83
2048 x 2048 MKL SGEMM with 0 threads and 3 iterations, 2048, 3, -1, 0, NN, 2048, 2048, 2596.83
2560 x 2560 MKL SGEMM with 0 threads and 3 iterations, 2560, 3, -1, 0, NN, 2560, 2560, 3540.39
3072 x 3072 MKL SGEMM with 0 threads and 3 iterations, 3072, 3, -1, 0, NN, 3072, 3072, 3740.34
3584 x 3584 MKL SGEMM with 0 threads and 3 iterations, 3584, 3, -1, 0, NN, 3584, 3584, 3928.0
4096 x 4096 MKL SGEMM with 0 threads and 3 iterations, 4096, 3, -1, 0, NN, 4096, 4096, 3936.0
4608 x 4608 MKL SGEMM with 0 threads and 3 iterations, 4608, 3, -1, 0, NN, 4608, 4608, 4263.12
5120 x 5120 MKL SGEMM with 0 threads and 3 iterations, 5120, 3, -1, 0, NN, 5120, 5120, 4247.05
5632 x 5632 MKL SGEMM with 0 threads and 3 iterations, 5632, 3, -1, 0, NN, 5632, 5632, 4363.59
6144 x 6144 MKL SGEMM with 0 threads and 3 iterations, 6144, 3, -1, 0, NN, 6144, 6144, 4357.05
6656 x 6656 MKL SGEMM with 0 threads and 3 iterations, 6656, 3, -1, 0, NN, 6656, 6656, 4374.95
7168 x 7168 MKL SGEMM with 0 threads and 3 iterations, 7168, 3, -1, 0, NN, 7168, 7168, 4399.82
7680 x 7680 MKL SGEMM with 0 threads and 3 iterations, 7680, 3, -1, 0, NN, 7680, 7680, 4302.77
8192 x 8192 MKL SGEMM with 0 threads and 3 iterations, 8192, 3, -1, 0, NN, 8192, 8192, 4376.13
8704 x 8704 MKL SGEMM with 0 threads and 3 iterations, 8704, 3, -1, 0, NN, 8704, 8704, 4389.37
9216 x 9216 MKL SGEMM with 0 threads and 3 iterations, 9216, 3, -1, 0, NN, 9216, 9216, 4388.52
9728 x 9728 MKL SGEMM with 0 threads and 3 iterations, 9728, 3, -1, 0, NN, 9728, 9728, 4357.71
10240 x 10240 MKL SGEMM with 0 threads and 3 iterations, 10240, 3, -1, 0, NN, 10240, 10240, 4396.34
10752 x 10752 MKL SGEMM with 0 threads and 3 iterations, 10752, 3, -1, 0, NN, 10752, 10752, 4388.54
11264 x 11264 MKL SGEMM with 0 threads and 3 iterations, 11264, 3, -1, 0, NN, 11264, 11264, 4385.23
11776 x 11776 MKL SGEMM with 0 threads and 3 iterations, 11776, 3, -1, 0, NN, 11776, 11776, 4336.85
12288 x 12288 MKL SGEMM with 0 threads and 3 iterations, 12288, 3, -1, 0, NN, 12288, 12288, 4362.61
12800 x 12800 MKL SGEMM with 0 threads and 3 iterations, 12800, 3, -1, 0, NN, 12800, 12800, 4351.96
13312 x 13312 MKL SGEMM with 0 threads and 3 iterations, 13312, 3, -1, 0, NN, 13312, 13312, 4379.92
13824 x 13824 MKL SGEMM with 0 threads and 3 iterations, 13824, 3, -1, 0, NN, 13824, 13824, 1268.28
14336 x 14336 MKL SGEMM with 0 threads and 3 iterations, 14336, 3, -1, 0, NN, 14336, 14336, 4402.25
14848 x 14848 MKL SGEMM with 0 threads and 3 iterations, 14848, 3, -1, 0, NN, 14848, 14848, 4383.5
15360 x 15360 MKL SGEMM with 0 threads and 3 iterations, 15360, 3, -1, 0, NN, 15360, 15360, 4370.94
15872 x 15872 MKL SGEMM with 0 threads and 3 iterations, 15872, 3, -1, 0, NN, 15872, 15872, 4372.85
16384 x 16384 MKL SGEMM with 0 threads and 3 iterations, 16384, 3, -1, 0, NN, 16384, 16384, 4373.31
SNC-4 mode is a pain to control properly. It works well when you can run 4 MPI tasks per node and can launch these via a script that computes the correct NUMA node number for binding each task's memory to the correct MCDRAM NUMA node. A single shared-memory executable is not going to be able to use the MCDRAM in all four quadrants without ugly explicit code to place data using the "memkind" library (http://memkind.github.io/memkind/) or the NUMA APIs (e.g., http://man7.org/linux/man-pages/man3/numa.3.html).
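The per-rank binding mentioned above can be sketched as a small launcher helper. It assumes the usual SNC-4 flat layout, in which NUMA nodes 0-3 hold the cores and DDR4 and nodes 4-7 are the MCDRAM of the matching quadrant; that layout should be confirmed with `numactl -H` before relying on it. The function name and the local-rank argument are illustrative, not part of any MPI launcher.

```python
import shlex

NUM_QUADRANTS = 4  # SNC-4 splits the chip into four clusters


def snc4_numactl_prefix(local_rank, mcdram_base=NUM_QUADRANTS):
    """Return a numactl argv that pins one MPI rank to its quadrant.

    Assumption: nodes 0..3 are cores + DDR4, and nodes
    mcdram_base..mcdram_base+3 are the MCDRAM of the same quadrant.
    """
    quadrant = local_rank % NUM_QUADRANTS
    return [
        "numactl",
        f"--cpunodebind={quadrant}",            # run on this quadrant's cores
        f"--membind={mcdram_base + quadrant}",  # allocate from its MCDRAM
    ]


if __name__ == "__main__":
    for rank in range(NUM_QUADRANTS):
        print(rank, shlex.join(snc4_numactl_prefix(rank) + ["./a.out"]))
```

In a real job the local rank would come from whatever environment variable the MPI launcher exports, which differs between MPI implementations.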
For a single shared-memory executable, Flat-Quadrant mode almost always gives the best performance. The default memory placement is DDR4 (NUMA node 0), but in this case there is only one non-default location (MCDRAM) so that is referred to as NUMA node 1. If the job will fit entirely in MCDRAM, then it can easily be launched with
numactl --membind=1 ./a.out
If the job requires more memory than is available in MCDRAM, you either need a very detailed understanding of the access patterns and a carefully controlled explicit staging of data, or you can just use cache mode and accept whatever speedup is available for "free".
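The choice between those two cases can be written down as a small helper. This is only a sketch: the function name, the headroom factor, and passing the footprint in by hand are assumptions, and node 1 is the MCDRAM node only in Flat-Quadrant mode as described above.

```python
def flat_quadrant_launch(footprint_bytes, mcdram_bytes=16 * 1024**3,
                         headroom=0.9, mcdram_node=1):
    """Return a numactl prefix if the job fits in MCDRAM, else None.

    None means the job does not fit; the simple fallback is then cache
    mode rather than staging data by hand.
    """
    # Leave some headroom: not all of MCDRAM is available to one job.
    if footprint_bytes <= headroom * mcdram_bytes:
        return ["numactl", f"--membind={mcdram_node}"]
    return None


if __name__ == "__main__":
    for size_gib in (3, 20):
        prefix = flat_quadrant_launch(size_gib * 1024**3)
        print(size_gib, "GiB ->", prefix if prefix else "use cache mode")
```

The 0.9 headroom is arbitrary; the free MCDRAM actually reported by `numactl -H` is the number that matters.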
Thank you, Dr. Bandwidth, that will help now that we are working with real application tests.
Best Regards,
Maicon
