Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Brandon_R_
Beginner
67 Views

Using sgemm with multiple cores

Hello! I am trying to implement sgemm matrix multiplication on multiple physical cores and I am a little confused on how to do so. 

Say I have obtained 9 physical cores from an HPC system and I want sgemm to use all of these cores to do the matrix multiplication. In this case I do not want to use multithreading on these 9 cores, only these 9 cores as a whole. So in a way I guess you could say that the 9 cores are the threads to be used by sgemm. Below is some code I have created, which I believe implements what I want to do. Is this implementation correct?  

program mkl_matrixmul

use mpi

implicit none

integer :: N,max_threads,mkl_get_max_threads

real, allocatable, dimension(:,:) :: A,B,C

integer :: ierror,num_cores,my_rank

double precision :: time1,time2

CALL MPI_Init(ierror)             !Flag for error                                                                                                                                    

CALL MPI_COMM_Size(MPI_COMM_WORLD,num_cores,ierror)   !puts in the number of cores into num_cores                                                                                    

CALL MPI_Comm_rank(MPI_COMM_WORLD,my_rank,ierror)     !defining the variable for the rank of the core                                                                    

CALL MPI_BARRIER(MPI_COMM_WORLD,ierror)

if(my_rank == 0)then

   !starting the timer                                                                                                                                                               

   time1 = MPI_Wtime()

end if    

N = 61740

Allocate(A(N,N),B(N,N),C(N,N))

A = 1.0                                                                                                                                                                             

B = 2.0                                                                                                                                                                             

C = 0.0                                                                                                                                                                             

call mkl_set_num_threads(num_cores)

call sgemm('N','N',N,N,N,1.0,A,N,B,N,1.0,C,N)                                                                                                              

CALL MPI_BARRIER(MPI_COMM_WORLD,ierror)

if(my_rank == 0)then

   !printing the elapsed time                                                                                                                                                        

   time2 = MPI_Wtime()

   print *, 'elapsed time' , time2 - time1

   print *, C(1,2)

end if

CALL MPI_Finalize(ierror)

end program mkl_matrixmul

 

Also if it helps, I am using a Sandy Bridge node with 256 GB of memory. 

Thank you, 

Brandon 

 

 

0 Kudos
2 Replies
Ying_H_Intel
Employee
67 Views

Hi Brandon,

Not sure if I understand  about your question correctly.

1) In generally, as MKL was mulithreaded by OpenMP run-time library,  if you call mkl sgemm directly as the MKL fortran sample, https://software.intel.com/sites/default/files/mkl_fortran_samples_05162016.zip

and compile it if with ifort  your.for -mkl.  The sgemm can run with 9 physical cores on one nodes automatically.  User don't need write threading code.

2) You can use MPI process, then let process 0 to call sgemm (please note, there is MPI version sgemm psgemm() in MKL). Then let sgemm use 9  OpenMP threads, say mkl_set_num_threads(9)

I guess the two ways should be ok.  but they looks still being using 9 threads on 9 cores.  In your case, may you want to cooridinate the OpenMP threads with MPI Process, (or we call it as OpenMP Affinity)?

If yes, you may consider the OpenMP Place or Affinity in OpenMP documentation, please see MKL user guide, like

https://software.intel.com/en-us/node/599522  => control MPI and OpenMP number

 https://software.intel.com/en-us/node/528552  => control OpenMP thread affinity to core.

Best Regards,
Ying

Brandon_R_
Beginner
67 Views

Option number 2 is the one I believe I wanted. Thank you very much for you help. 

Reply