My Machine
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz
Stepping: 7
CPU MHz: 1699.871
CPU max MHz: 3600.0000
CPU min MHz: 1000.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Core topology: two sockets, 24 cores per socket, 48 cores total
SMT status: enabled, but not utilized
Max clock rate: 1.7GHz (single-core and multicore)
Peak performance:
--single-core: 54.4 GFLOPS (double-precision)
--multicore: 54.4 GFLOPS/core (double-precision), 2611.2 GFLOPS across 48 cores (double-precision)
I have fixed the CPU frequency at 1.7 GHz with the commands sudo cpupower -c all frequency-set -u 1.7GHz and sudo cpupower -c all frequency-set -d 1.7GHz.
Code sample
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>     // gettimeofday
#include <unistd.h>       // sleep
#include <mkl_lapacke.h>  // LAPACKE_dlarnv, LAPACKE_dgesv (header not shown in the original post; assumed)

int main(int argc, const char *argv[])
{
    // matrix parameters: A * X = B, column major
    int N, NRHS, LDA, LDB;
    // test parameters
    int N_START = 1000, N_END = 30000, NRHS_START = 1000, NRHS_END = 1000, INC = 1000, REPEAT = 3;
    // N, NRHS, REPEAT=3
    N = N_START, LDA = N, NRHS = NRHS_START, LDB = N;
    // measured GFLOPS, indexed by [N/INC - 1][NRHS/INC - 1][repetition]
    double gflops[50][50][10];
    while (N <= N_END) {
        while (NRHS <= NRHS_END) {
            double *A = NULL, *B = NULL;
            int *IPIV = NULL;
            for (int re_count = 0; re_count < REPEAT; ++re_count)
            {
                // fresh uniform(0,1) random A and B for every repetition
                A = (double *) malloc(sizeof(double) * N * N);
                B = (double *) malloc(sizeof(double) * N * NRHS);
                int seed[] = {0, 0, 0, 1};
                LAPACKE_dlarnv(1, seed, N * N, A);
                LAPACKE_dlarnv(1, seed, N * NRHS, B);
                IPIV = (int *) malloc(sizeof(int) * N);
                // time only the dgesv call (LU factorization + solve)
                struct timeval start, finish;
                gettimeofday(&start, NULL);
                int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, N, NRHS, A, LDA, IPIV, B, LDB);
                gettimeofday(&finish, NULL);
                if (info == 0) {
                    double d_n = N, d_nrhs = NRHS;
                    // operation count in units of 1e9: factorization term + solve term
                    double ops = ((2.0*d_n*d_n*d_n/3.0 - d_n*d_n/2.0 + 5.0*d_n/6.0) + (d_nrhs * (2*d_n*d_n - d_n))) * 1.0e-9;
                    gflops[N/INC - 1][NRHS/INC - 1][re_count] = ops / ( (finish.tv_sec - start.tv_sec) * 1.0 + (finish.tv_usec - start.tv_usec) * 1.0e-6 );
                }
                else {
                    fprintf(stderr, "[ERROR]: LAPACKE_dgesv failed\n");
                    exit(EXIT_FAILURE);
                }
                free(A), free(B), free(IPIV);
                A = NULL, B = NULL, IPIV = NULL;
            }
            NRHS += INC;
        }
        N += INC, LDA = N, NRHS = NRHS_START, LDB = N;
        sleep(10);
    }
    return 0;
}
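The operation count used in the benchmark can be pulled out into small helpers, which makes it easier to check by hand for small sizes. The following is only a sketch: the names dgesv_flops and gflops_rate are hypothetical and do not appear in the original code, and the formula is simply the one from the sample above (a factorization term plus a solve term per right-hand side).

```c
/* Nominal floating-point operation count for dgesv, using the same
 * formula as the benchmark above: LU factorization of an n x n matrix
 * plus the triangular solves for nrhs right-hand sides.
 * (Hypothetical helper name, for illustration only.) */
static double dgesv_flops(double n, double nrhs)
{
    double factor = 2.0 * n * n * n / 3.0 - n * n / 2.0 + 5.0 * n / 6.0;
    double solve = nrhs * (2.0 * n * n - n);
    return factor + solve;
}

/* Convert an operation count and a wall-clock time in seconds to GFLOPS.
 * (Hypothetical helper name, for illustration only.) */
static double gflops_rate(double flops, double seconds)
{
    return flops * 1.0e-9 / seconds;
}
```

For N = 27000 and NRHS = 1000 the formula gives roughly 1.46e13 operations, almost all of them in the factorization term.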
[xx@cn0 code]$ export OMP_NUM_THREADS=48 GOMP_CPU_AFFINITY="0-47:1"
[xx@cn0 code]$ make test_dgesv_mkl.x
gcc -O2 -fopenmp -fPIC -o test_dgesv.o -c test_dgesv.c
gcc test_dgesv.o -L/home/xx/lib/intel/oneapi/mkl/2022.1.0/lib/intel64/ -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lpthread -lm -ldl -fopenmp -o test_dgesv_mkl.x -lm -fopenmp -fPIC
[xx@cn0 code]$ numactl --interleave=all ./test_dgesv_mkl.x
In the best case (N = 27,000, NRHS = 1,000), MKL reaches 65.66% (1714.609 / 2611.2 GFLOPS) of the theoretical peak. Are these results reasonable? Where can I find comparable experimental results?
Regards,
lianchen.
Hi Lianchen,
Thanks for reaching out to us.
>>Peak performance:
--single-core: 54.4 GFLOPS (double-precision)
--multicore: 54.4 GFLOPS/core (double-precision), 2611.2 GFLOPS across 48 cores (double-precision)
Could you please let us know how you calculated the single-core and multicore GFLOPS figures in this case?
Regards,
Vidya.
Hi Vidya,
single-core:
1.7 (GHz) * 8 (doubles per AVX-512 register) * 2 (flops per FMA) * 2 (FMA units per core) = 54.4 GFLOPS.
multicore:
54.4 (GFLOPS) * 48 (cores, SMT not used) = 2611.2 GFLOPS.
Regards,
Lianchen.
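The peak calculation above can be written as a small sketch in C, which also makes it easy to redo the arithmetic for another CPU or clock setting. The function names peak_gflops_per_core and efficiency are hypothetical, and the inputs (vector width in doubles, number of FMA units) are taken from the thread rather than queried from the hardware.

```c
/* Theoretical double-precision peak in GFLOPS for one core:
 * clock (GHz) x doubles per vector x 2 flops per FMA x FMA units.
 * (Hypothetical helper name; inputs are supplied by the caller.) */
static double peak_gflops_per_core(double ghz, int simd_doubles, int fma_units)
{
    return ghz * simd_doubles * 2.0 * fma_units;
}

/* Fraction of the aggregate peak reached by a measured rate,
 * assuming every core runs at the same per-core peak. */
static double efficiency(double measured_gflops, double peak_per_core, int cores)
{
    return measured_gflops / (peak_per_core * cores);
}
```

With the figures from this thread, peak_gflops_per_core(1.7, 8, 2) gives 54.4 GFLOPS per core, or 2611.2 GFLOPS over 48 cores, and efficiency(1714.609, 54.4, 48) comes out at about 0.66.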
Hi Lianchen,
I ran the code on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz and am attaching the results (the GFLOPS values reported by the code) here.
Could you please check them and confirm whether you see similar behaviour on your CPU, so that we can proceed further with this case?
Regards,
Vidya.
Hi Lianchen,
As we haven't heard back from you, we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.
Regards,
Vidya.
