- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear all,
I am new to the forum, and of course, to MKL (though I've used TBB before). I am using the MKL Link Helper to compile and link the first C example dgemm_threading_effect_example.c, but I cannot figure how to use TBB.
I know it is possible to use just TBB without OpenMP (which I don't have, being on a Mac), but it seems that I need to link the mkl_sequential library, and it seems no threads can be used.
Below you can find the example with my few added lines of code, and here are my linker switches:
-L/usr/local/lib -ltbb -ltbbmalloc -L/opt/intel/compilers_and_libraries_2016/mac/mkl/lib -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential
Thanks for any help you can give me!
Franco
#include <stdio.h> #include <stdlib.h> #include "mkl.h" #include <tbb/task_scheduler_init.h> /* Consider adjusting LOOP_COUNT based on the performance of your computer */ /* to make sure that total run time is at least 1 second */ #define LOOP_COUNT 10 int main() { double *A, *B, *C; int m, n, p, i, j, r, max_threads; double alpha, beta; double s_initial, s_elapsed; printf ("\n This example demonstrates threading impact on computing real matrix product \n" " C=alpha*A*B+beta*C using Intel(R) MKL function dgemm, where A, B, and C are \n" " matrices and alpha and beta are double precision scalars \n\n"); m = 2000, p = 200, n = 1000; printf (" Initializing data for matrix multiplication C=A*B for matrix \n" " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n); alpha = 1.0; beta = 0.0; printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n" " performance \n\n"); A = (double *)mkl_malloc( m*p*sizeof( double ), 64 ); B = (double *)mkl_malloc( p*n*sizeof( double ), 64 ); C = (double *)mkl_malloc( m*n*sizeof( double ), 64 ); if (A == NULL || B == NULL || C == NULL) { printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n"); mkl_free(A); mkl_free(B); mkl_free(C); return 1; } printf (" Intializing matrix data \n\n"); for (i = 0; i < (m*p); i++) { A = (double)(i+1); } for (i = 0; i < (p*n); i++) { B = (double)(-i-1); } for (i = 0; i < (m*n); i++) { C = 0.0; } // HERE I TRY BUT IT'S ALWAYS ONE SINGLE THREAD tbb::task_scheduler_init scheduler(4); mkl_set_num_threads(4); mkl_set_num_threads_local(4); printf (" Finding max number of threads Intel(R) MKL can use for parallel runs \n\n"); // HERE I ALWAYS GET ONE max_threads = mkl_get_max_threads(); printf (" Running Intel(R) MKL from 1 to %i threads \n\n", max_threads); for (i = 1; i <= max_threads; i++) { for (j = 0; j < (m*n); j++) C= 0.0; printf (" Requesting Intel(R) MKL to use %i thread(s) \n\n", i); mkl_set_num_threads(i); printf (" Making the first run of matrix product using Intel(R) MKL dgemm function \n" " via CBLAS interface to get stable run time measurements \n\n"); cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, alpha, A, p, B, n, beta, C, n); printf (" Measuring performance of matrix product using Intel(R) MKL dgemm function \n" " via CBLAS interface on %i thread(s) \n\n", i); s_initial = dsecnd(); for (r = 0; r < LOOP_COUNT; r++) { cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, alpha, A, p, B, n, beta, C, n); } s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT; printf (" == Matrix multiplication using Intel(R) MKL dgemm completed ==\n" " == at %.5f milliseconds using %d thread(s) ==\n\n", (s_elapsed * 1000), i); } printf (" Deallocating memory \n\n"); mkl_free(A); mkl_free(B); mkl_free(C); if (s_elapsed < 0.9/LOOP_COUNT) { s_elapsed=1.0/LOOP_COUNT/s_elapsed; i=(int)(s_elapsed*LOOP_COUNT)+1; printf(" It is highly recommended to define LOOP_COUNT for this example on your \n" " computer as %i to have total execution time about 1 second for reliability \n" " of measurements\n\n", i); } printf (" Example completed. \n\n"); return 0; }
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Franco,
Right, MKL add TBB threading in latest 2016 version. But as you are using Clang compiler, there may be two issues.
1. Regarding clang Compiler, from MKL Link Helper , there is only OpenMP support the command line is like
-m64 -I${MKLROOT}/include -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
2. Regarding the sample, dgemm_threading_effect_example.c,which was sample to demo OpenMP thread originally. like the functions
mkl_set_num_threads(4); |
056 |
mkl_set_num_threads_local(4); only effect for OpenMP threads. |
as you know, TBB haven't direct API control the threading number. the TBB setting may not have effect here.
So if you'd like to see the threading effect, you can compile the sample like
source /opt/intel/compilerxxxxxx. /compilervars.sh intel64
clang dgemm_threading_effect_example.c -m64 -I${MKLROOT}/include -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread.
Then
>./a.out
Best Regards,
Ying
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What am I doing wrong?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am not able to give you OSX-specific advice, but here is one of your choices that has me puzzled. You specified -lmkl_intel_ilp64. Why? You did not pass any 8-byte integer arguments (scalar or array) to MKL routines in your code.
I suggest that you use the simpler options -mkl -tbb when you use icc to compile and link your code. Later, depending on your needs, you can tinker with the options to make things more suitable for your requirements.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sorry, I should have mentioned I am using clang on OSX. All those flags are needed to avoid linker errors.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Clang? Ah, OK.
My comment regarding your using ILP64 instead of LP64 still applies, unless you have Clang configured so that default int-s are 8-bytes long.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the answer, I've removed the ILP reference and it seems to link. However, I get no additional threads. If I remove the sequential library the linker complains.
Here's my last linker setting:
-L/opt/intel/compilers_and_libraries_2016/mac/mkl/lib -lmkl_intel -lmkl_core -lmkl_sequential -L/usr/local/lib -ltbb -ltbbmalloc
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Franco,
Right, MKL add TBB threading in latest 2016 version. But as you are using Clang compiler, there may be two issues.
1. Regarding clang Compiler, from MKL Link Helper , there is only OpenMP support the command line is like
-m64 -I${MKLROOT}/include -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
2. Regarding the sample, dgemm_threading_effect_example.c,which was sample to demo OpenMP thread originally. like the functions
mkl_set_num_threads(4); |
056 |
mkl_set_num_threads_local(4); only effect for OpenMP threads. |
as you know, TBB haven't direct API control the threading number. the TBB setting may not have effect here.
So if you'd like to see the threading effect, you can compile the sample like
source /opt/intel/compilerxxxxxx. /compilervars.sh intel64
clang dgemm_threading_effect_example.c -m64 -I${MKLROOT}/include -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread.
Then
>./a.out
Best Regards,
Ying
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Franco,
You can add some modification to make clang + tbb work.
for example,
055 |
// mkl_set_num_threads(4); |
056 |
// mkl_set_num_threads_local(4); |
057 |
|
058 |
printf ( " Finding max number of threads Intel(R) MKL can use for parallel runs \n\n" ); |
059 |
060 |
// HERE I ALWAYS GET ONE |
061 |
max_threads = 4; |
062 |
|
063 |
printf ( " Running Intel(R) MKL from 1 to %i threads \n\n" , max_threads); |
064 |
for (i = 1; i <= max_threads; i++) { |
065 |
for (j = 0; j < (m*n); j++) |
066 |
C |
067 |
|
068 |
printf ( " Requesting Intel(R) MKL to use %i thread(s) \n\n" , i); |
069 |
tbb::task_scheduler_init scheduler(i); |
Then compile it with the below command line
>source /opt/intel/compilers_and_libraries_2016.2.146/mac/bin/compilervars.sh intel64
>clang dgemm_threading_effect_example.cpp -I${MKLROOT}/include -L{MKLROOT}/lib -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -o tbb_dgemm_threading
The run result should be almost same as above png ( OpenMP result).
Best Regards,
Ying
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, Ying, sorry for the delay, I had too many papers to correct and grade.
Requesting Intel(R) MKL to use 1 thread(s) of 4 Making the first run of matrix product using Intel(R) MKL dgemm function via CBLAS interface to get stable run time measurements [...]
It works now!
Thank you!

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page