Performance issue with LAPACKE_dgesv in MKL on Xeon(R) Gold 6252N

lianchen · 11-02-2022

My Machine
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz
Stepping: 7
CPU MHz: 1699.871
CPU max MHz: 3600.0000
CPU min MHz: 1000.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95

Core topology: two sockets, 24 cores per socket, 48 cores total
SMT status: enabled, but not utilized
Max clock rate: 1.7GHz (single-core and multicore)
Peak performance:
--single-core: 54.4 GFLOPS(double-precision)
--multicore: 54.4 GFLOPS/core (double-precision) 2611.2 GFLOPS/48 cores(double-precision)
I have pinned the CPU frequency to 1.7 GHz with the following commands:
sudo cpupower -c all frequency-set -u 1.7GHz
sudo cpupower -c all frequency-set -d 1.7GHz

Code sample

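#include <stdio.h>      // fprintf
#include <stdlib.h>     // malloc, free, exit
#include <sys/time.h>   // gettimeofday
#include <unistd.h>     // sleep
#include <mkl.h>        // LAPACKE_dgesv, LAPACKE_dlarnv (header assumed; the original post omitted the includes)

int main(int argc, const char *argv[])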
{
    // matrix parameters: A * X = B, column major
    int N, NRHS, LDA, LDB;
    // test parameters
    int N_START = 1000, N_END = 30000, NRHS_START = 1000, NRHS_END = 1000, INC = 1000, REPEAT = 3;

    // sweep N (outer loop) and NRHS (inner loop); each case is run REPEAT times
    N = N_START, LDA = N, NRHS = NRHS_START, LDB = N;
    double gflops[50][50][10];
    while (N <= N_END){
        while(NRHS <= NRHS_END){
            double *A = NULL, *B = NULL;
            int *IPIV = NULL;

            for(int re_count = 0; re_count < REPEAT; ++ re_count)
            {
                A = (double *) malloc (sizeof(double) * N * N);
                B = (double *) malloc (sizeof(double) * N * NRHS);
                int seed[] = {0, 0, 0, 1};
                LAPACKE_dlarnv(1, seed, N * N, A);
                LAPACKE_dlarnv(1, seed, N * NRHS, B);

                IPIV = (int *) malloc (sizeof(int) * N);       
                struct timeval start, finish;
                gettimeofday(&start, NULL);
                int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, N, NRHS, A, LDA, IPIV, B, LDB);
                gettimeofday(&finish, NULL);

                if(info == 0){
                    double d_n = N, d_nrhs = NRHS;
                    double ops = ((2.0*d_n*d_n*d_n/3.0 - d_n*d_n/2.0 + 5.0*d_n/6.0) + (d_nrhs * (2*d_n*d_n - d_n))) * 1.0e-9;

                    gflops[N/INC - 1][NRHS/INC - 1][re_count] = ops / ( (finish.tv_sec - start.tv_sec) * 1.0 + (finish.tv_usec - start.tv_usec) * 1.0e-6 );
                }
                else{
                    fprintf(stderr, "[ERROR]: LAPACKE_dgesv failed\n");
                    exit(EXIT_FAILURE);
                }

                free(A), free(B), free(IPIV);
                A = NULL, B = NULL, IPIV = NULL;
            }

            NRHS += INC;
        }
        N += INC, LDA = N, NRHS = NRHS_START, LDB = N;

        sleep(10);
    }
    
    return 0;
}

Command Line

[xx@cn0 code]$ export OMP_NUM_THREADS=48 GOMP_CPU_AFFINITY="0-47:1"
[xx@cn0 code]$ make test_dgesv_mkl.x
gcc -O2 -fopenmp -fPIC -o test_dgesv.o -c test_dgesv.c 
gcc test_dgesv.o -L/home/xx/lib/intel/oneapi/mkl/2022.1.0/lib/intel64/ -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -lpthread -lm -ldl -fopenmp -o test_dgesv_mkl.x -lm -fopenmp -fPIC
[xx@cn0 code]$ numactl --interleave=all ./test_dgesv_mkl.x

Performance on my machine (N ranges from 1000 to 30000 in steps of 1000; NRHS = 1000)

In the best case (N = 27000, NRHS = 1000), MKL reaches 65.66% (1714.609 / 2611.2 GFLOPS) of the theoretical peak. Is this the expected result? Where can I find comparable reference measurements?

Regards,

lianchen.

VidyalathaB_Intel · 11-03-2022

Hi Lianchen,

Thanks for reaching out to us.

>>Peak performance:

--single-core: 54.4 GFLOPS(double-precision)

--multicore: 54.4 GFLOPS/core (double-precision) 2611.2 GFLOPS/48 cores(double-precision)

Could you please let us know how you calculated the GFLOPS for single-core and multicore in this case?

Regards,

Vidya.

lianchen · 11-04-2022

Hi Vidya,

single-core:

1.7 (GHz) * 8 (doubles per AVX-512 register) * 2 (FMA: one multiply and one add) * 2 (FMA units per core) = 54.4 GFLOPS.

multicore:

54.4 (GFLOPS) * 48 (cores, SMT not used) = 2611.2 (GFLOPS)

Regards,

Lianchen.

VidyalathaB_Intel · 11-08-2022

Hi Lianchen,

I tried running the code on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, and I'm attaching the results (the GFLOPS values computed by the code) here.

Could you please check them and confirm whether similar behaviour is seen on this CPU model, so that we can proceed further with this case?

Regards,

Vidya.

VidyalathaB_Intel · 11-15-2022

Hi Lianchen,

As we haven't heard back from you, could you please provide us with an update regarding the issue?

Regards,

Vidya.

VidyalathaB_Intel · 11-21-2022

Hi Lianchen,

As we haven't heard back from you, we are closing this thread. Please post a new question if you need any additional assistance from Intel as this thread will no longer be monitored.

Regards,

Vidya.