Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Using sgemm with multiple cores


Hello! I am trying to implement sgemm matrix multiplication on multiple physical cores and I am a little confused on how to do so. 

Say I have obtained 9 physical cores from an HPC system and I want sgemm to use all of these cores to do the matrix multiplication. In this case I do not want to use multithreading on these 9 cores, only these 9 cores as a whole. So in a way I guess you could say that the 9 cores are the threads to be used by sgemm. Below is some code I have created, which I believe implements what I want to do. Is this implementation correct?  

program mkl_matrixmul

use mpi

implicit none

integer :: N,max_threads,mkl_get_max_threads

real, allocatable, dimension(:,:) :: A,B,C

integer :: ierror,num_cores,my_rank

double precision :: time1,time2

CALL MPI_Init(ierror)             !Flag for error                                                                                                                                    

CALL MPI_COMM_Size(MPI_COMM_WORLD,num_cores,ierror)   !puts in the number of cores into num_cores                                                                                    

CALL MPI_Comm_rank(MPI_COMM_WORLD,my_rank,ierror)     !defining the variable for the rank of the core                                                                    


if(my_rank == 0)then

   !starting the timer                                                                                                                                                               

   time1 = MPI_Wtime()

end if    

N = 61740


A = 1.0                                                                                                                                                                             

B = 2.0                                                                                                                                                                             

C = 0.0                                                                                                                                                                             

call mkl_set_num_threads(num_cores)

call sgemm('N','N',N,N,N,1.0,A,N,B,N,1.0,C,N)                                                                                                              


if(my_rank == 0)then

   !printing the elapsed time                                                                                                                                                        

   time2 = MPI_Wtime()

   print *, 'elapsed time' , time2 - time1

   print *, C(1,2)

end if

CALL MPI_Finalize(ierror)

end program mkl_matrixmul


Also if it helps, I am using a Sandy Bridge node with 256 GB of memory. 

Thank you, 




0 Kudos
2 Replies

Hi Brandon,

Not sure if I understand  about your question correctly.

1) In generally, as MKL was mulithreaded by OpenMP run-time library,  if you call mkl sgemm directly as the MKL fortran sample,

and compile it if with ifort  your.for -mkl.  The sgemm can run with 9 physical cores on one nodes automatically.  User don't need write threading code.

2) You can use MPI process, then let process 0 to call sgemm (please note, there is MPI version sgemm psgemm() in MKL). Then let sgemm use 9  OpenMP threads, say mkl_set_num_threads(9)

I guess the two ways should be ok.  but they looks still being using 9 threads on 9 cores.  In your case, may you want to cooridinate the OpenMP threads with MPI Process, (or we call it as OpenMP Affinity)?

If yes, you may consider the OpenMP Place or Affinity in OpenMP documentation, please see MKL user guide, like  => control MPI and OpenMP number  => control OpenMP thread affinity to core.

Best Regards,

0 Kudos

Option number 2 is the one I believe I wanted. Thank you very much for you help. 

0 Kudos