Beginner

Very poor performance of 'pdgetrf'

I want to check the performance of LU factorization on a cluster. As a first step, I called the pdgetrf routine from the MKL library and ran it on my machine (Intel® Xeon Phi™ Processor 7250).

With a matrix of size 20000, pdgetrf reaches 170 Gflops when run on 4 processes.

Please help me check my code. Does my program have an error? How can I improve the performance of pdgetrf on the KNL 7250? What is the maximum performance of pdgetrf?

Thanks a lot.

  • Compile command:

mpiicc pdgetrf.c -O3 -qopenmp -lmemkind -mkl  -xMIC-AVX512 -restrict  \
      -o pdgetrf -I./  -I/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/include/  -lmkl_scalapack_lp64  -lmkl_core -lmkl_blacs_intelmpi_lp64  -lpthread -liomp5 

  • Run command: mpirun -np 4 ./pdgetrf

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"



int main(int argc, char **argv) {
   int i, j, k;
/************  MPI ***************************/
   int myrank_mpi, nprocs_mpi;
   MPI_Init( &argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank_mpi);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs_mpi);
/************  BLACS ***************************/
   int ictxt, nprow, npcol, myrow, mycol,nb;
   int info,itemp;
   int ZERO=0,ONE=1;
	 nprow = 2; npcol = 2; nb =500;
	 int M=nprow*10000;
	 int K=npcol*10000;

   Cblacs_pinfo( &myrank_mpi, &nprocs_mpi ) ;
   Cblacs_get( -1, 0, &ictxt );
   Cblacs_gridinit( &ictxt, "Row", nprow, npcol );
   Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );


   int rA = numroc_( &M, &nb, &myrow, &ZERO, &nprow );
   int cA = numroc_( &K, &nb, &mycol, &ZERO, &npcol );

   double *A = (double*) malloc(rA*cA*sizeof(double));

   int descA[9];
   int *IPIV;
	 IPIV = (int *)calloc(rA + nb, sizeof(int));
	 descinit(descA, &M,   &K,   &nb,  &nb,  &ZERO, &ZERO, &ictxt, &rA,  &info);
   
     double alpha = 1.0; double beta = 1.0;	
     double start, end, flops;
	 srand(time(NULL)*myrow + mycol);
	 #pragma simd
	 for (j=0; j<rA*cA; j++)
	 {
		 A[j]=((double)rand()-(double)(RAND_MAX)*0.5)/(double)(RAND_MAX);
	 }
     
	 MPI_Barrier(MPI_COMM_WORLD);
     start=MPI_Wtime();

	 pdgetrf(&M, &K, A, &ONE, &ONE, descA, IPIV, &info);

	 MPI_Barrier(MPI_COMM_WORLD);
     end=MPI_Wtime();
	 
	double duration = (double)(end - start); 
	 if (myrow==0 && mycol==0)
	 {
      if (M > K)
	  {
	     printf("%f Gigaflops\n", ((double)K * (double)K * (double)M - (double)K * (double)K * (double)K / (double)3) * 1.0e-9 / duration);
	  }
	  else if (K > M)
	  {
	    printf("%f Gigaflops\n", ((double)M * (double)M * (double)K - (double)M * (double)M * (double)M / (double)3) * 1.0e-9 / duration);
      }
	  else
	  {
	    printf("%f Gigaflops\n", ((double)2*(double)K *(double)K * (double)K  / (double)3) * 1.0e-9 / duration);
		
      }
	 // printf("%f Gflops\n", flops);
	 }
   Cblacs_gridexit( ictxt );
   MPI_Finalize();
   return 0;
}

 

 

11 Replies
Black Belt

I don't know if your code is correct, but this seems like a very small problem.  What is the performance like on one node?

Beginner

Dear @McCalpin, John (Blackbelt),

My project is to improve the performance of LU factorization, and I have to compare my results against pdgetrf from the MKL library. So I first need to know the performance of pdgetrf.

If I run mpirun -np 1 ./pdgetrf (also with size 20000), the performance is 107 Gflops. That is far less than LAPACKE_dgetrf (1127 Gflops with size 20000 on 1 node), which is the MKL routine for LU factorization on a single node (not distributed).

 

 

 
Black Belt

The LAPACKE_dgetrf performance looks fine on one node, so a 10x slowdown from using the ScaLAPACK version seems like a problem....

The first thing I would check is thread-level parallelism.  In addition to dumping the environment for each case for later review, I would run the program under "perf stat" to get a quick summary of the number of cores/threads used for the execution.

Beginner

Dear @McCalpin, John (Blackbelt), 

If you know where the problem is, please tell me.

Thanks.

 

 
Black Belt

I can't get the code above to compile (no definitions of the Cblacs* functions, and crashes when I tried to convert to use the blacs* interfaces) so I have not done any testing.

Beginner

Dear @McCalpin, John (Blackbelt), 

I followed an example from this link: https://software.intel.com/en-us/articles/using-cluster-mkl-pblasscalapack-fortran-routine-in-your-c...

If I replace the Cblacs* functions with the blacs* functions, the program does not run successfully. The errors look like this:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 226740 RUNNING AT ourKNL
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 226740 RUNNING AT ourKNL
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

The performance of pdgetrf is really important to me. Please help me check it (in whatever way you prefer), and please let me know if I made a mistake that reduces its performance.

 

 
Black Belt

I can't make this code work....

My installation of Intel 2018 has no header files for the Cblacs files.   Intel's documentation for the ScaLAPACK interfaces in MKL includes only the Fortran interfaces, and I can't figure out how to make them work from C.

Beginner

Dear @McCalpin, John (Blackbelt), 

Please help me check the performance with this program. I replaced the Cblacs* functions with the blacs* functions.

Thanks a lot.

Compile command:

 mpiicc pdgetrf.c -O3 -qopenmp -lmemkind -mkl  -xMIC-AVX512 -restrict  \
      -o pdgetrf   -I/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/include/  -lmkl_scalapack_lp64  -lmkl_core  -lpthread -liomp5  -lmkl_blacs_intelmpi_lp64  -lmkl_intel_lp64 -lmkl_intel_thread
 

 

mpirun -np 4 ./pdgetrf

 

 #include <stdio.h>
 #include <time.h>
 #include <string.h>
 #include <stdlib.h>
 #include "mpi.h"

 int main(int argc, char **argv) {
    int i, j, k;
 /************  MPI ***************************/
    int myrank_mpi, nprocs_mpi;
      MPI_Init( &argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank_mpi);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs_mpi);
 /************  BLACS ***************************/
    int ictxt, nprow, npcol, myrow, mycol,nb;
    int info,itemp;
    int ZERO=0,ONE=1;
      nprow = 2; npcol = 2; nb =1000;
      int M=nprow*10000;
      int K=npcol*10000;
      int what = -1;
      int val = 0;
    blacs_pinfo( &myrank_mpi, &nprocs_mpi ) ;
    blacs_get(&what, &val, &ictxt);
    blacs_gridinit(&ictxt, "Row", &nprow, &npcol );
    blacs_gridinfo(&ictxt, &nprow, &npcol, &myrow, &mycol );


    int rA = numroc_( &M, &nb, &myrow, &ZERO, &nprow );
    int cA = numroc_( &K, &nb, &mycol, &ZERO, &npcol );

    double *A = (double*) malloc(rA*cA*sizeof(double));

    int descA[9];
    int *IPIV;
      IPIV = (int *)calloc(rA + nb, sizeof(int));
      descinit(descA, &M,   &K,   &nb,  &nb,  &ZERO, &ZERO, &ictxt, &rA,  &info);

      double alpha = 1.0; double beta = 1.0;
      double start, end, flops;
      srand(time(NULL)*myrow + mycol);
      #pragma simd
      for (j=0; j<rA*cA; j++)
      {
          A[j]=((double)rand()-(double)(RAND_MAX)*0.5)/(double)(RAND_MAX);
      }

      MPI_Barrier(MPI_COMM_WORLD);
      start=MPI_Wtime();

      pdgetrf(&M, &K, A, &ONE, &ONE, descA, IPIV, &info);

      MPI_Barrier(MPI_COMM_WORLD);
      end=MPI_Wtime();

     double duration = (double)(end - start);
     if (myrow==0 && mycol==0)
     {
      if (M > K)
      {
         printf("%f Gigaflops\n", ((double)K * (double)K * (double)M - (double)K * (double)K * (double)K / (double)3) * 1.0e-9 / duration);
      }
      else if (K > M)
      {
        printf("%f Gigaflops\n", ((double)M * (double)M * (double)K - (double)M * (double)M * (double)M / (double)3) * 1.0e-9 / duration);
      }
      else
      {
        printf("%f Gigaflops\n", ((double)2*(double)K *(double)K * (double)K  / (double)3) * 1.0e-9 / duration);

      }
     // printf("%f Gflops\n", flops);
     }
   blacs_gridexit(&ictxt);
   MPI_Finalize();
   return 0;
}


 

 

 
Black Belt

The problem seems to be a difference in the way MKL automatically determines the amount of thread parallelism to use.

I ran "mpirun -np 4 ./pdgetrf" in four different configurations:

  • Xeon Phi 7250, no MKL variables set: 91 GFLOPS (58.3 seconds)
  • Xeon Phi 7250, MKL_NUM_THREADS=16: 422 GFLOPS (12.6 seconds)
  • 2s Xeon Platinum 8160, no MKL variables set: 948 GFLOPS (5.6 seconds)
  • 2s Xeon Platinum 8160, MKL_NUM_THREADS=12: 944 GFLOPS (5.6 seconds)

I used "perf stat -a -A mpirun -np 4 ./pdgetrf" (on a single node) to verify that the first case only used 4 cores, while the other cases spread the work across the cores.  (Running on Xeon Phi 7250 using MKL_NUM_THREADS=17 did not change the performance.)

Beginner

Dear @McCalpin, John (Blackbelt), 

Thanks a lot for your help.

Could you show me the details of your compiler setup for the case "Xeon Phi 7250, MKL_NUM_THREADS=16: 422 GFLOPS (12.6 seconds)"? I also ran on a Xeon Phi 7250, but my performance is lower than yours (just 190 Gflops with MKL_NUM_THREADS=16).

 

 
Black Belt

I made one fairly important change to the code -- I call pdgetrf twice and compute the performance based on the second call.  Most MKL routines run slower on the first call due to various one-time setup overheads.   The results below include the runtime for both the first and second calls.

My default compiler is an earlier revision than yours, but otherwise the setup should be very similar....

$ icc --version
icc (ICC) 18.0.2 20180210
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.
$ mpiicc pdgetrf.c -O3 -qopenmp -lmemkind -mkl  -xMIC-AVX512 -restrict -o pdgetrf.MIC-AVX512.exe  -I/opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/include/  -lmkl_scalapack_lp64  -lmkl_core  -lpthread -liomp5  -lmkl_blacs_intelmpi_lp64  -lmkl_intel_lp64 -lmkl_intel_thread 
$ export MKL_NUM_THREADS=16
$ mpirun -np 4 ./pdgetrf.MIC-AVX512.exe
Running pdgetrf once for warmup
Initial execution time 14.038583
Running pdgetrf again for time
Second execution time 12.744775
418.472143 Gigaflops
