Good afternoon!
I have the following code:
/*
********************************************************************************
*   Copyright(C) 2004-2011 Intel Corporation. All Rights Reserved.
*
*   The source code, information and material ("Material") contained herein is
*   owned by Intel Corporation or its suppliers or licensors, and title to such
*   Material remains with Intel Corporation or its suppliers or licensors. The
*   Material contains proprietary information of Intel or its suppliers and
*   licensors. The Material is protected by worldwide copyright laws and treaty
*   provisions. No part of the Material may be used, copied, reproduced,
*   modified, published, uploaded, posted, transmitted, distributed or disclosed
*   in any way without Intel's prior express written permission. No license
*   under any patent, copyright or other intellectual property rights in the
*   Material is granted to or conferred upon you, either expressly, by
*   implication, inducement, estoppel or otherwise. Any license under such
*   intellectual property rights must be express and approved by Intel in
*   writing.
*
*   *Third Party trademarks are the property of their respective owners.
*
*   Unless otherwise agreed by Intel in writing, you may not remove or alter
*   this notice or any other notice embedded in Materials by Intel or Intel's
*   suppliers or licensors in any way.
*
********************************************************************************
*   Content : Simple MKL Matrix Multiply C example
*
********************************************************************************/
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include "mkl.h"

void print_arr(int N, char * name, double* array);
void init_arr(int N, double* a);
void Dgemm_multiply(double* a,double* b,double* c, int N);

int num_threads=1;

int main(int argc, char* argv[])
{
    clock_t start, stop;
    int N;
    double* a;
    double* b;
    double* c;

    if(argc < 2)
    {
        printf("No. of threads available is :%d\n",mkl_get_max_threads());
        printf("Enter matrix size N=");
        //please enter small number first to ensure that the
        //multiplication is correct! and then you may enter
        //a "reasonably" large number say like 500 or even 1000
        scanf("%d",&N);
        printf("Enter number of threads:");
        scanf("%d",&num_threads);
    }
    else
    {
        N = atoi(argv[1]);
        num_threads = atoi(argv[2]);
    }

    mkl_set_num_threads(num_threads);

    a=(double*) malloc( sizeof(double)*N*N );
    b=(double*) malloc( sizeof(double)*N*N );
    c=(double*) malloc( sizeof(double)*N*N );
    init_arr(N,a);
    init_arr(N,b);

    //DGEMM Multiply
    //reallocate to force cache to be flushed
    a=(double*) malloc( sizeof(double)*N*N );
    b=(double*) malloc( sizeof(double)*N*N );
    c=(double*) malloc( sizeof(double)*N*N );
    init_arr(N,a);
    init_arr(N,b);

    //note: clock() is the timer under discussion in this thread
    start = clock();
    Dgemm_multiply(a,b,c,N);
    stop = clock();

    printf("Dgemm_multiply(). Elapsed time = %g seconds using %d threads\n",
           ((double)(stop - start)) / CLOCKS_PER_SEC, num_threads);

    //print simple test case of data to be sure multiplication is correct
    if (N < 7)
    {
        print_arr(N,"a", a);
        print_arr(N,"b", b);
        print_arr(N,"c", c);
    }

    free(a);
    free(b);
    free(c);
    return 0;
}

//DGEMM way. The PREFERRED way, especially for large matrices
void Dgemm_multiply(double* a,double* b,double* c, int N)
{
    double alpha = 1.0, beta = 0.;
    int incx = 1;
    int incy = N;
    cblas_dgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,N,N,N,alpha,b,N,a,N,beta,c,N);
}

//initialize array with simple test data (not actually random)
void init_arr(int N, double* a)
{
    int i,j;
    for (i=0; i< N;i++)
    {
        for (j=0; j<N;j++)
        {
            a[i*N+j] = (i+j+1)%10; //keep all entries less than 10. pleasing to the eye!
        }
    }
}

//print array to std out
void print_arr(int N, char * name, double* array)
{
    int i,j;
    printf("\n%s\n",name);
    for (i=0;i<N;i++)
    {
        for (j=0;j<N;j++)
        {
            printf("%g\t",array[N*i+j]);
        }
        printf("\n");
    }
}
The compilation command is as follows:
gcc -m64 -O2 -mpc80 -march=core2 mkl_lab_solution2.cpp -o test1 -I/opt/intel/mkl/include -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_intel_thread.a /opt/intel/parallel_studio_xe_2016.3.067/compilers_and_libraries_2016/linux/compiler/lib/intel64/libiomp5.a -Wl,--end-group -lm -ldl -lpthread -static-libstdc++ -w
When I run the program I obtain the following results:
user@ubuntu1:~/cpp/openmp$ ./test1 4096 1
Dgemm_multiply(). Elapsed time = 6.79888 seconds using 1 threads
user@ubuntu1:~/cpp/openmp$ ./test1 4096 2
Dgemm_multiply(). Elapsed time = 6.84065 seconds using 2 threads
user@ubuntu1:~/cpp/openmp$ ./test1 4096 2
On Windows 10, building the same program with Visual C++ from Visual Studio 2015, I obtain:
No. of threads available is :6
Enter matrix size N=4096
Enter number of threads:1
Dgemm_multiply(). Elapsed time = 10.803 seconds using 1 threads

No. of threads available is :6
Enter matrix size N=4096
Enter number of threads:2
Dgemm_multiply(). Elapsed time = 5.647 seconds using 2 threads
The output of gcc -v is:
user@ubuntu1:~/cpp/openmp$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.1' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.1)
How can I obtain the same scaling on Ubuntu as on Windows?
Hello,
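I just compiled with exactly the same configuration and got the same result from the command line: 2 threads take the same time as 1 thread, or even longer.
So I used VTune to compare both executions. VTune showed that the run with 2 threads is almost twice as fast as the run with 1 thread, which led me to the answer:
It looks like clock() on Linux returns the CPU time consumed by the whole process, summed over all of its threads, so it does not measure the "real" (wall-clock) time. With 2 threads the user time stays about the same or grows while the real time drops. The Microsoft C runtime's clock() reports elapsed wall-clock time instead, which would explain why the Windows numbers scale. You can see the difference in the results below: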
$> time ./test1 5000 1
Dgemm_multiply(). Elapsed time = 11.8785 seconds using 1 threads
real 0m12.222s
user 0m12.076s
sys 0m0.144s

$> time ./test1 5000 2
Dgemm_multiply(). Elapsed time = 14.0519 seconds using 2 threads
real 0m7.468s
user 0m14.276s
sys 0m0.156s
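To see the scaling from inside the program, the timer has to measure wall-clock time rather than CPU time. Below is a minimal sketch, assuming a POSIX system where clock_gettime() with CLOCK_MONOTONIC is available; the helper names wall_seconds() and cpu_seconds() are made up for illustration and are not part of MKL or of the original example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime() in strict ISO C modes */
#endif
#include <time.h>

/* Hypothetical helper: wall-clock seconds read from a monotonic clock.
   Differences between two calls give the elapsed "real" time. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: CPU seconds used by the process so far, which is
   what the original program measures. On Linux this adds up the time of
   all threads, so it does not shrink when more threads are used. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

In the program above, this would mean replacing the two clock() calls around Dgemm_multiply() with wall_seconds() calls and printing the difference directly, without dividing by CLOCKS_PER_SEC. On older glibc versions you may need to add -lrt when linking. MKL's own dsecnd() or OpenMP's omp_get_wtime() should serve the same purpose if you prefer not to add a helper.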
Is that Ubuntu 5.4.0? If so, please have a look at the system requirements: https://software.intel.com/en-us/articles/intel-mkl-113-system-requirements. The list of supported operating systems there includes: ... Ubuntu* 12.04, 13.10 and 14.04 ...
It's Ubuntu 16.04 with gcc 5.4.0