Brandon_R_ · 07-08-2016

Hello! I am trying to implement sgemm matrix multiplication on multiple physical cores and I am a little confused about how to do so.

Say I have obtained 9 physical cores from an HPC system and I want sgemm to use all of these cores to do the matrix multiplication. In this case I do not want to run multiple threads on each of these 9 cores; I only want to use the 9 cores as a whole. So in a way I guess you could say that the 9 cores are the threads to be used by sgemm. Below is some code I have written, which I believe implements what I want to do. Is this implementation correct?

    program mkl_matrixmul
      use mpi
      implicit none

      integer :: N,max_threads,mkl_get_max_threads
      real, allocatable, dimension(:,:) :: A,B,C
      integer :: ierror,num_cores,my_rank
      double precision :: time1,time2

      CALL MPI_Init(ierror) !Flag for error
      CALL MPI_COMM_Size(MPI_COMM_WORLD,num_cores,ierror) !puts in the number of cores into num_cores
      CALL MPI_Comm_rank(MPI_COMM_WORLD,my_rank,ierror) !defining the variable for the rank of the core
      CALL MPI_BARRIER(MPI_COMM_WORLD,ierror)

      if(my_rank == 0)then
        !starting the timer
        time1 = MPI_Wtime()
      end if

      N = 61740
      Allocate(A(N,N),B(N,N),C(N,N))
      A = 1.0
      B = 2.0
      C = 0.0

      call mkl_set_num_threads(num_cores)
      call sgemm('N','N',N,N,N,1.0,A,N,B,N,1.0,C,N)
      CALL MPI_BARRIER(MPI_COMM_WORLD,ierror)

      if(my_rank == 0)then
        !printing the elapsed time
        time2 = MPI_Wtime()
        print *, 'elapsed time' , time2 - time1
        print *, C(1,2)
      end if

      CALL MPI_Finalize(ierror)
    end program mkl_matrixmul

Also if it helps, I am using a Sandy Bridge node with 256 GB of memory.

Thank you,

Brandon

Ying_H_Intel · 07-10-2016

Hi Brandon,

I'm not sure I understand your question correctly.

1) In general, MKL is multithreaded through the OpenMP run-time library. If you call MKL sgemm directly, as in the MKL Fortran samples (https://software.intel.com/sites/default/files/mkl_fortran_samples_05162016.zip), and compile with ifort your.for -mkl, sgemm can run on the 9 physical cores of one node automatically. You don't need to write any threading code.

2) You can use MPI processes and let process 0 call sgemm (please note that MKL also has an MPI version of sgemm, psgemm()). Then let sgemm use 9 OpenMP threads, e.g. with mkl_set_num_threads(9).

I guess either way should be OK, but both still use 9 threads on 9 cores. In your case, do you perhaps want to coordinate the OpenMP threads with the MPI processes (what we call OpenMP affinity)?

If so, you may want to look at OpenMP places and affinity in the OpenMP documentation, and at the MKL user guide, for example:

https://software.intel.com/en-us/node/599522 => controlling the number of MPI processes and OpenMP threads

https://software.intel.com/en-us/node/528552 => controlling OpenMP thread affinity to cores
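
For reference, option 2 above could look roughly like the sketch below. It is only an illustration, not a tested or tuned implementation: the program name, the small matrix size n = 2000 and the constant n_threads are assumptions chosen so the example runs quickly. Only rank 0 allocates the matrices and calls sgemm; the other ranks just wait at the barrier.

```fortran
program sgemm_rank0   ! hypothetical program name, for illustration only
  use mpi
  implicit none

  integer, parameter :: n = 2000        ! assumed small size for a quick check
  integer, parameter :: n_threads = 9   ! assumed: one MKL thread per physical core
  real, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: ierror, my_rank
  double precision :: t1, t2

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)

  if (my_rank == 0) then
     ! Only rank 0 holds the matrices and performs the multiplication.
     allocate(a(n,n), b(n,n), c(n,n))
     a = 1.0
     b = 2.0
     c = 0.0

     ! Ask MKL to use up to n_threads OpenMP threads for this call.
     call mkl_set_num_threads(n_threads)

     t1 = MPI_Wtime()
     ! C := 1.0*A*B + 1.0*C, where C starts at zero
     call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 1.0, c, n)
     t2 = MPI_Wtime()

     print *, 'elapsed time', t2 - t1
     print *, c(1,2)   ! mathematically 2*n, since every product term is 1.0*2.0

     deallocate(a, b, c)
  end if

  ! All ranks synchronize here before shutting down MPI.
  call MPI_Barrier(MPI_COMM_WORLD, ierror)
  call MPI_Finalize(ierror)
end program sgemm_rank0
```

With this layout the matrices exist once, on rank 0, instead of once per MPI process, and sgemm is called only once. Depending on the MPI implementation, ranks waiting at the barrier may still keep their cores busy, so launching a single MPI process may be the simpler choice when only one node is used.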

Best Regards,
Ying

Brandon_R_ · 07-11-2016

Option number 2 is the one I believe I wanted. Thank you very much for your help.

Using sgemm with multiple cores