Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

MKL with TBB on OSX

Franco_M_
Beginner

Dear all,

I am new to the forum and, of course, to MKL (though I've used TBB before). I am using the MKL Link Helper to compile and link the first C example, dgemm_threading_effect_example.c, but I cannot figure out how to use TBB.

I know it is possible to use just TBB without OpenMP (which I don't have, being on a Mac), but it seems that I need to link the mkl_sequential library, and with it no threads can be used.

Below you can find the example with my few added lines of code, and here are my linker switches:

-L/usr/local/lib -ltbb -ltbbmalloc -L/opt/intel/compilers_and_libraries_2016/mac/mkl/lib -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential

Thanks for any help you can give me!
     Franco

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

#include <tbb/task_scheduler_init.h>

/* Consider adjusting LOOP_COUNT based on the performance of your computer */
/* to make sure that total run time is at least 1 second */
#define LOOP_COUNT 10

int main()
{
    double *A, *B, *C;
    int m, n, p, i, j, r, max_threads;
    double alpha, beta;
    double s_initial, s_elapsed;
    
    printf ("\n This example demonstrates threading impact on computing real matrix product \n"
            " C=alpha*A*B+beta*C using Intel(R) MKL function dgemm, where A, B, and C are \n"
            " matrices and alpha and beta are double precision scalars \n\n");
    
    m = 2000, p = 200, n = 1000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n);
    alpha = 1.0; beta = 0.0;
    
    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m*p*sizeof( double ), 64 );
    B = (double *)mkl_malloc( p*n*sizeof( double ), 64 );
    C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
    if (A == NULL || B == NULL || C == NULL) {
        printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
        return 1;
    }
    
    printf (" Initializing matrix data \n\n");
    for (i = 0; i < (m*p); i++) {
        A[i] = (double)(i+1);
    }
    
    for (i = 0; i < (p*n); i++) {
        B[i] = (double)(-i-1);
    }
    
    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
    }
    
    // HERE I TRY BUT IT'S ALWAYS ONE SINGLE THREAD
    tbb::task_scheduler_init scheduler(4);
    mkl_set_num_threads(4);
    mkl_set_num_threads_local(4);
    
    printf (" Finding max number of threads Intel(R) MKL can use for parallel runs \n\n");

    // HERE I ALWAYS GET ONE
    max_threads = mkl_get_max_threads();
    
    printf (" Running Intel(R) MKL from 1 to %i threads \n\n", max_threads);
    for (i = 1; i <= max_threads; i++) {
        for (j = 0; j < (m*n); j++)
            C[j] = 0.0;
        
        printf (" Requesting Intel(R) MKL to use %i thread(s) \n\n", i);
        mkl_set_num_threads(i);
        
        printf (" Making the first run of matrix product using Intel(R) MKL dgemm function \n"
                " via CBLAS interface to get stable run time measurements \n\n");
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, alpha, A, p, B, n, beta, C, n);
        
        printf (" Measuring performance of matrix product using Intel(R) MKL dgemm function \n"
                " via CBLAS interface on %i thread(s) \n\n", i);
        s_initial = dsecnd();
        for (r = 0; r < LOOP_COUNT; r++) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, p, alpha, A, p, B, n, beta, C, n);
        }
        s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;
        
        printf (" == Matrix multiplication using Intel(R) MKL dgemm completed ==\n"
                " == at %.5f milliseconds using %d thread(s) ==\n\n", (s_elapsed * 1000), i);
    }
    
    printf (" Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);
    
    if (s_elapsed < 0.9/LOOP_COUNT) {
        s_elapsed=1.0/LOOP_COUNT/s_elapsed;
        i=(int)(s_elapsed*LOOP_COUNT)+1;
        printf(" It is highly recommended to define LOOP_COUNT for this example on your \n"
               " computer as %i to have total execution time about 1 second for reliability \n"
               " of measurements\n\n", i);
    }
    
    printf (" Example completed. \n\n");
    return 0;
}

 

1 Solution
Ying_H_Intel
Employee

Hi Franco, 

Right, MKL added TBB threading in the latest 2016 version. But since you are using the Clang compiler, there may be two issues.

1. Regarding the Clang compiler: the MKL Link Helper offers only OpenMP support for it. The command line looks like this:

 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl 

2. Regarding the sample, dgemm_threading_effect_example.c: it was originally written to demonstrate OpenMP threading. Functions such as

mkl_set_num_threads(4);
mkl_set_num_threads_local(4);

only take effect for OpenMP threads.

As you know, TBB has no direct API for controlling the number of threads, so the TBB setting may not have any effect here.

So if you'd like to see the threading effect, you can compile the sample like this:

source /opt/intel/compilerxxxxxx.  /compilervars.sh intel64

clang dgemm_threading_effect_example.c -m64 -I${MKLROOT}/include -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread

Then run:

>./a.out

Best Regards,

Ying

Franco_M_
Beginner

What am I doing wrong?

mecej4
Honored Contributor III

I am not able to give you OSX-specific advice, but here is one of your choices that has me puzzled. You specified -lmkl_intel_ilp64. Why? You did not pass any 8-byte integer arguments (scalar or array) to MKL routines in your code.

I suggest that you use the simpler options -mkl -tbb when you use icc to compile and link your code. Later, depending on your needs, you can tinker with the options to make things more suitable for your requirements.

Franco_M_
Beginner

Sorry, I should have mentioned I am using clang on OSX. All those flags are needed to avoid linker errors.

Thanks!

mecej4
Honored Contributor III

Clang? Ah, OK.

My comment about your using ILP64 instead of LP64 still applies, unless you have Clang configured so that default ints are 8 bytes long.
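
To make the LP64/ILP64 point concrete, here is a minimal C++ sketch. It assumes that the LP64 interface layer expects 32-bit integer arguments and the ILP64 layer expects 64-bit ones. The names `Interface`, `expected_int_bytes`, and `int_matches` are hypothetical helpers for illustration and are not part of MKL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the integer type each interface layer expects.
// Assumption: LP64 reads 32-bit integers, ILP64 reads 64-bit integers.
using lp64_int  = std::int32_t;
using ilp64_int = std::int64_t;

enum class Interface { LP64, ILP64 };

// Number of bytes the chosen interface layer reads for each integer argument.
constexpr std::size_t expected_int_bytes(Interface iface) {
    return iface == Interface::LP64 ? sizeof(lp64_int) : sizeof(ilp64_int);
}

// True when integers of type T can be passed to the chosen interface unchanged.
template <typename T>
constexpr bool int_matches(Interface iface) {
    return sizeof(T) == expected_int_bytes(iface);
}

// A 32-bit integer fits the LP64 layer but not the ILP64 layer.
static_assert(int_matches<std::int32_t>(Interface::LP64), "32-bit ints match LP64");
static_assert(!int_matches<std::int32_t>(Interface::ILP64), "32-bit ints do not match ILP64");
```

Calling `int_matches<int>(Interface::ILP64)` on a typical Clang setup should report a mismatch, which is why code that passes plain `int` dimensions, as the example in this thread does, is normally linked against the LP64 library.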

Franco_M_
Beginner


Thanks for the answer, I've removed the ILP reference and it seems to link. However, I get no additional threads. If I remove the sequential library the linker complains.

Here's my last linker setting:

-L/opt/intel/compilers_and_libraries_2016/mac/mkl/lib -lmkl_intel -lmkl_core -lmkl_sequential -L/usr/local/lib -ltbb -ltbbmalloc

Thanks!

Ying_H_Intel
Employee

Attached is a screenshot showing the compile command line and the results for 4 threads vs. 1 thread.

Ying_H_Intel
Employee

Hi Franco,

You can make a few modifications to get clang + TBB working.

For example:

    // mkl_set_num_threads(4);
    // mkl_set_num_threads_local(4);

    printf (" Finding max number of threads Intel(R) MKL can use for parallel runs \n\n");

    // Set the upper bound by hand instead of calling mkl_get_max_threads()
    max_threads = 4;

    printf (" Running Intel(R) MKL from 1 to %i threads \n\n", max_threads);
    for (i = 1; i <= max_threads; i++) {
        for (j = 0; j < (m*n); j++)
            C[j] = 0.0;

        printf (" Requesting Intel(R) MKL to use %i thread(s) \n\n", i);
        tbb::task_scheduler_init scheduler(i);
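
In the snippet above the scheduler object is declared inside the loop body, so under normal C++ scoping rules it is constructed at the start of each iteration and destroyed at the end of it. The sketch below mimics only that scoping pattern with a hypothetical `ScopedThreadLimit` class. It is not a TBB class and makes no claim about how TBB behaves internally.

```cpp
#include <cassert>

// Hypothetical stand-in for a scoped thread-limit object (not part of TBB).
// The constructor installs a limit; the destructor restores the previous one.
class ScopedThreadLimit {
public:
    explicit ScopedThreadLimit(int n) : previous_(limit_) {
        assert(n > 0);
        limit_ = n;
    }
    ~ScopedThreadLimit() { limit_ = previous_; }

    ScopedThreadLimit(const ScopedThreadLimit&) = delete;
    ScopedThreadLimit& operator=(const ScopedThreadLimit&) = delete;

    // Limit currently in effect.
    static int current() { return limit_; }

private:
    int previous_;
    static int limit_;
};

int ScopedThreadLimit::limit_ = 1;

// Mirrors the benchmark loop: one scoped object per iteration.
// Returns the limit that was in effect during the last iteration.
int last_limit_seen(int max_threads) {
    int seen = 0;
    for (int i = 1; i <= max_threads; ++i) {
        ScopedThreadLimit limit(i);            // constructed for this iteration only
        seen = ScopedThreadLimit::current();   // the timed work would run here
    }                                          // destroyed here, previous limit restored
    return seen;
}
```

Declaring the object inside the loop, rather than once before it, is what lets each iteration request a different limit.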

 

Then compile it with the following command line:

>source /opt/intel/compilers_and_libraries_2016.2.146/mac/bin/compilervars.sh intel64

>clang dgemm_threading_effect_example.cpp -I${MKLROOT}/include -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -o tbb_dgemm_threading

The results should be almost the same as in the screenshot above (the OpenMP result).
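
As a compact summary of the link lines that came up in this thread, here is a small C++ sketch that maps each threading layer to its library flags. The `Threading` enum and the `link_flags` function are hypothetical names used only for illustration. The OpenMP and TBB flags are copied from the commands in this thread, while pairing `-lmkl_sequential` with the LP64 interface is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration; neither is part of MKL or TBB.
enum class Threading { Sequential, OpenMP, TBB };

// Returns the library flags for the given threading layer.
std::string link_flags(Threading layer) {
    switch (layer) {
        case Threading::Sequential:
            // Assumption: LP64 interface with the sequential layer (no threads).
            return "-lmkl_intel_lp64 -lmkl_core -lmkl_sequential";
        case Threading::OpenMP:
            // From the OpenMP command line quoted in this thread.
            return "-lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread";
        case Threading::TBB:
            // From the TBB command line quoted in this thread.
            return "-lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread";
    }
    assert(false && "unknown threading layer");
    return std::string();
}
```

Only one threading layer library appears in each line, which matches the experience reported here: linking `-lmkl_sequential` gives a single thread no matter what the program requests at run time.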

Best Regards,

Ying 

Franco_M_
Beginner

Thanks, Ying, sorry for the delay, I had too many papers to correct and grade.

 Requesting Intel(R) MKL to use 1 thread(s) of 4 

 Making the first run of matrix product using Intel(R) MKL dgemm function 
 via CBLAS interface to get stable run time measurements 
 [...]

It works now! 

Thank you!
