Solved: Ubuntu mkl_set_num_threads ignored

Ion_C_ · ‎07-31-2016

Good afternoon!

I have the following code:

/*
********************************************************************************
*   Copyright(C) 2004-2011 Intel Corporation. All Rights Reserved.
*   
*   The source code, information  and  material ("Material") contained herein is
*   owned  by Intel Corporation or its suppliers or licensors, and title to such
*   Material remains  with Intel Corporation  or its suppliers or licensors. The
*   Material  contains proprietary information  of  Intel or  its  suppliers and
*   licensors. The  Material is protected by worldwide copyright laws and treaty
*   provisions. No  part  of  the  Material  may  be  used,  copied, reproduced,
*   modified, published, uploaded, posted, transmitted, distributed or disclosed
*   in any way  without Intel's  prior  express written  permission. No  license
*   under  any patent, copyright  or  other intellectual property rights  in the
*   Material  is  granted  to  or  conferred  upon  you,  either  expressly,  by
*   implication, inducement,  estoppel or  otherwise.  Any  license  under  such
*   intellectual  property  rights must  be express  and  approved  by  Intel in
*   writing.
*   
*   *Third Party trademarks are the property of their respective owners.
*   
*   Unless otherwise  agreed  by Intel  in writing, you may not remove  or alter
*   this  notice or  any other notice embedded  in Materials by Intel or Intel's
*   suppliers or licensors in any way.
*
********************************************************************************
*   Content : Simple MKL Matrix Multiply C example
*
********************************************************************************/

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include "mkl.h"


void print_arr(int N, char * name, double* array);
void init_arr(int N, double* a);
void Dgemm_multiply(double* a,double*  b,double*  c, int N);
int num_threads=1;

int main(int argc, char* argv[])
{
        
 clock_t start, stop;
 int N;
 double* a;
 double* b;
 double* c;
 if(argc < 3)  //both N and the thread count must be given, since argv[2] is read below
 {
  printf("No. of threads available is :%d\n",mkl_get_max_threads());
  printf("Enter matrix size N=");
  //please enter small number first to ensure that the 
  //multiplication is correct! and then you may enter 
  //a "reasonably" large number say like 500 or even 1000
  scanf("%d",&N);
  printf("Enter number of threads:");
  scanf("%d",&num_threads);
 
 }
 else
 {
  N = atoi(argv[1]);
  num_threads = atoi(argv[2]);

 }
 mkl_set_num_threads(num_threads);

 a=(double*) malloc( sizeof(double)*N*N );
 b=(double*) malloc( sizeof(double)*N*N );
 c=(double*) malloc( sizeof(double)*N*N );

 init_arr(N,a);
 init_arr(N,b);

 //DGEMM Multiply
 //reallocate to force cache to be flushed
 //(free the first buffers so they are not leaked)
 free(a);
 free(b);
 free(c);
 a=(double*) malloc( sizeof(double)*N*N );
 b=(double*) malloc( sizeof(double)*N*N );
 c=(double*) malloc( sizeof(double)*N*N );
 init_arr(N,a);
 init_arr(N,b);

 start = clock();
 Dgemm_multiply(a,b,c,N);
 stop = clock();

 printf("Dgemm_multiply(). Elapsed time = %g seconds using %d threads\n",
  ((double)(stop - start)) / CLOCKS_PER_SEC, num_threads);
 //print simple test case of data to be sure multiplication is correct
 if (N < 7) {
  print_arr(N,"a", a);
  print_arr(N,"b", b);
  print_arr(N,"c", c);
 }

 free(a);
 free(b);
 free(c);

 return 0;
}


//DGEMM way. The PREFERRED way, especially for large matrices
void Dgemm_multiply(double* a,double*  b,double*  c, int N)
{ 

 double alpha = 1.0, beta = 0.;
 cblas_dgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,N,N,N,alpha,b,N,a,N,beta,c,N);
}

//initialize array with random data
void init_arr(int N, double* a)
{ 
 int i,j;
 for (i=0; i< N;i++) {
  for (j=0; j<N;j++) {
   a[i*N+j] = (i+j+1)%10; //keep all entries less than 10. pleasing to the eye!
  }
 }
}

//print array to std out
void print_arr(int N, char * name, double* array)
{ 
 int i,j; 
 printf("\n%s\n",name);
 for (i=0;i<N;i++){
  for (j=0;j<N;j++) {
   printf("%g\t",array[N*i+j]);
  }
  printf("\n");
 }
}

The compilation command is the following:

gcc -m64 -O2 -mpc80 -march=core2 mkl_lab_solution2.cpp -o test1  -I/opt/intel/mkl/include -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_intel_thread.a /opt/intel/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/compiler/lib/intel64/libiomp5.a  -Wl,--end-group -lm -ldl -lpthread -static-libstdc++ -w

When I run the program I obtain the following results:

user@ubuntu1:~/cpp/openmp$ ./test1 4096 1
Dgemm_multiply(). Elapsed time = 6.79888 seconds using 1 threads
user@ubuntu1:~/cpp/openmp$ ./test1 4096 2
Dgemm_multiply(). Elapsed time = 6.84065 seconds using 2 threads
user@ubuntu1:~/cpp/openmp$ ./test1 4096 2

On Windows 10, building the same program with Visual C++ from Visual Studio 2015, I obtain:

No. of threads available is :6
Enter matrix size N=4096
Enter number of threads:1
Dgemm_multiply(). Elapsed time = 10.803 seconds using 1 threads


No. of threads available is :6
Enter matrix size N=4096
Enter number of threads:2
Dgemm_multiply(). Elapsed time = 5.647 seconds using 2 threads

The output of gcc -v is:

user@ubuntu1:~/cpp/openmp$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.1)

How can I obtain the same scaling on Ubuntu as on Windows?

Gennady_F_Intel · ‎08-01-2016

Is that Ubuntu 5.4.0? If yes, please have a look at the system requirements: https://software.intel.com/en-us/articles/intel-mkl-113-system-requirements. The list of supported operating systems there includes ... Ubuntu* 12.04, 13.10 and 14.04 ...

asd__asdqwe · ‎08-01-2016

It's Ubuntu 16.04 with gcc 5.4.0

Ion_C_ · ‎08-01-2016

It is Ubuntu 16.04 server with gcc 5.4.0. But how do I control the CPU scaling? I have a C++ program that uses the Intel MKL library on the Amazon AWS cloud, and I need multicore support.

George_Silva_Intel · ‎08-04-2016

Hello,

I just compiled with exactly the same configuration and got the same result from the command line: 2 threads take the same time as 1 thread, or even longer.

So I used VTune to compare both executions. VTune showed that the run with 2 threads is almost twice as fast as the run with 1 thread, which led me to the answer:

It looks like the clock function on Linux counts the CPU time used on each individual core and adds it up, so the "real" (wall-clock) time and the user time are different things. You can see this in the results below:

$> time ./test1 5000 1

Dgemm_multiply(). Elapsed time = 11.8785 seconds using 1 threads

real    0m12.222s
user    0m12.076s
sys    0m0.144s

$> time ./test1 5000 2
Dgemm_multiply(). Elapsed time = 14.0519 seconds using 2 threads

real    0m7.468s
user    0m14.276s
sys    0m0.156s