MKL can't get any scaling

joan_puig · ‎04-03-2006

#include <cstdlib>
#include <iostream>
#include <omp.h>
#include "mkl.h"

int main(){
int len = 1500;
double* m1;
double* m2;
double* m3;
double t0, tf, tm1, time;
int i, procs;

for (procs = 1; procs < 4+1; procs++){
if (procs%1==0 || procs==1){

omp_set_num_threads(procs);
m1 = (double*)malloc(len*len*sizeof(double));
m2 = (double*)malloc(len*len*sizeof(double));
m3 = (double*)malloc(len*len*sizeof(double));

#pragma omp parallel for
for (i = 0; i < len*len; i++){
m1[i] = (i%10)-5;
m2[i] = (i%7)-3.5;
m3[i] = 0;
}

t0 = omp_get_wtime();
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, len, len, len, 1.0, m1, len, m2, len, 0.0, m3, len);
tf = omp_get_wtime();
time = tf-t0;
if (procs == 1) { tm1 = time; }
std::cout << "Elapsed time: " << time << " - " << procs << " threads loop ratio:" << time/tm1 << std::endl;

free(m1);
free(m2);
free(m3);
}
}

exit(0);
}

To compile:
/opt/intel/cc/9.0/bin/icc -openmp mklTest2.cxx -lmkl -L /opt/intel/mkl/8.0/lib/32/ -I /opt/intel/mkl/8.0/include/

Timings I got on a 32p machine:
./a.out
Elapsed time: 1.19079 - 1 threads loop ratio:1
Elapsed time: 1.18762 - 2 threads loop ratio:0.997338
Elapsed time: 1.18804 - 3 threads loop ratio:0.997687
Elapsed time: 1.21605 - 4 threads loop ratio:1.02121

joan_puig · ‎04-03-2006

The code performs well on one processor, but it doesn't scale at all beyond that.

I was wondering if there is any switch that I need to enable so that MKL will be multithreaded. If there isn't, is there something simple I am missing in my code?

Thanks,

Joan

TimP · ‎04-03-2006

What settings are you using for OMP_NUM_THREADS and KMP_SERIAL?
Are you asking all the threads you created to share the same memory regions, and asking MKL to create as many additional threads as possible?

joan_puig · ‎04-03-2006

Hi Tim, thanks for your reply. It pointed me to what I needed to change to make it all work. Now, I think this might be an MKL bug:

I don't set OMP_NUM_THREADS
My code uses the omp_set_num_threads()
It seems, though, that unless OMP_NUM_THREADS is set to something at the beginning of the program, MKL won't honor any later calls to omp_set_num_threads()
Now, if I take out the call to the MKL function, the plain openmp for loop will actually be parallelized well.

[jpuig@altix jpuig]$ export -n OMP_NUM_THREADS
[jpuig@altix jpuig]$ ./a.out
Elapsed time: 1.8064 - 1 threads loop ratio:1
Elapsed time: 1.79981 - 2 threads loop ratio:0.996353
Elapsed time: 1.85461 - 3 threads loop ratio:1.02669
Elapsed time: 1.82016 - 4 threads loop ratio:1.00762
[jpuig@altix jpuig]$ export OMP_NUM_THREADS=4
[jpuig@altix jpuig]$ ./a.out
Elapsed time: 1.84285 - 1 threads loop ratio:1
Elapsed time: 0.929641 - 2 threads loop ratio:0.504457
Elapsed time: 0.62401 - 3 threads loop ratio:0.338611
Elapsed time: 0.476085 - 4 threads loop ratio:0.258341
[jpuig@altix jpuig]$

Message Edited by joan.puig@gmail.com on 04-03-2006 02:58 PM