Intel® oneAPI Math Kernel Library

MKL Batch GEMM with TBB threading solution gives no performance improvements

Garigipati__Pradeep

As part of the open source library ArrayFire, Intel MKL is used for GEMM operations, and we recently updated the code to use the batch version of GEMM. We have noticed that using GNU OpenMP or Intel OpenMP as the threading layer gives the expected speedups, but TBB does not, and we wanted to bring this to your attention. Given below is the ArrayFire benchmark code used to time the GEMM operations.

#include <arrayfire.h>
#include <stdio.h>
#include <math.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A; // populated before each timing
static void fn()
{
    array B = matmul(A, A);  // matrix multiply
    B.eval();                // ensure evaluated
}

int main(int argc, char ** argv)
{
    double peak = 0;
    try {
        int device = argc > 1 ? atoi(argv[1]) : 0;
        setDevice(device);
        info();

        printf("Benchmark N-by-N matrix multiply\n");
        for (int n = 128; n <= 2048; n += 128) {

            printf("%4d x %4d: ", n, n);
            A = constant(1, n, n, 3); // batch of 3 n-by-n matrices of ones
            double time = timeit(fn); // time in seconds
            double gflops = 2.0 * pow((double)n, 3) / (time * 1e9); // FLOPs of a single n^3 multiply (A holds 3 slices)
            if (gflops > peak)
                peak = gflops;

            printf("%4.2f\n", gflops);
            fflush(stdout);
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }


    printf(" ### peak %g GFLOPS\n", peak);

    return 0;
}

The benchmark results are provided in the form of an interactive chart at this URL.

The usage of the batch GEMM call inside ArrayFire can be found in the following source file.

https://github.com/9prady9/arrayfire/blob/57eb26d03a738c8a99b664dcbe374bcefdb8572c/src/backend/cpu/blas.cpp
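
For reference, here is a minimal sketch of what a batched multiply looks like through MKL's cblas_dgemm_batch interface (the function and its signature are MKL's; the wrapper around it is illustrative, not the exact ArrayFire code):

#include <mkl.h>

// Multiply `batch` pairs of n-by-n column-major matrices in one MKL call.
// A single group: every multiply in the batch shares the same parameters.
static void batchedGemm(int n, MKL_INT batch,
                        const double **A, const double **B, double **C)
{
    CBLAS_TRANSPOSE trans = CblasNoTrans;
    MKL_INT dim = n, ld = n;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_count = 1;
    MKL_INT group_size = batch; // number of multiplies in the (only) group

    cblas_dgemm_batch(CblasColMajor, &trans, &trans,
                      &dim, &dim, &dim,          // m, n, k arrays (square case)
                      &alpha, A, &ld, B, &ld,
                      &beta, C, &ld,
                      group_count, &group_size);
}

In the benchmark above this corresponds to one group of three n-by-n multiplies, and MKL distributes the work across whichever threading layer it was linked against.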

Thank you,

Pradeep.

Ying_H_Intel
Employee

Hi Pradeep, 

Thank you for integrating MKL into ArrayFire and reporting the issue.

We will look into the problem. By the way, could you please tell us how you link MKL and TBB, along with your MKL version, compiler, and test machine, as in the batch GEMM article online?

https://software.intel.com/en-us/articles/introducing-batch-gemm-operations

Best Regards,

Ying

Garigipati__Pradeep

Hi Ying,

Here are the details you asked for, from the machine I tested on.

We link dynamically to MKL, using the following flags:

-L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
-Wl,-rpath,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64:
-lmkl_core -ldl -lmkl_tbb_thread -lmkl_intel_lp64 -ltbb

For the Intel OpenMP configuration, the TBB threading flags above (-lmkl_tbb_thread -ltbb) are swapped for the OpenMP ones (-lmkl_intel_thread -liomp5).
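
In case it helps reproduce the scaling behaviour, the TBB worker pool can be capped from the application via tbb::global_control (a minimal sketch, assuming a TBB version that ships this API):

#include <tbb/global_control.h>

int main()
{
    // Cap the worker pool that MKL's TBB threading layer draws from,
    // e.g. to compare batch GEMM throughput at different thread counts.
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 8);

    // ... run the GEMM benchmark from the first post here ...
    return 0;
}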

MKL Version: 2018.1.163

Compiler: GCC 8.1.1

Yes, I followed that article when writing my code.

Thank you for looking into it.

Regards,

Pradeep.

Ying_H_Intel
Employee

Hi Pradeep,

Thank you for your reply.

I have escalated the problem and will let you know when there are any updates.

Thanks,
Ying

tambellini__william

Hi,

Strangely, the Intel nGraph team has implemented batched matmul using batch GEMM and, unless they have changed it since, they use TBB and have reported good speedups:

https://github.com/NervanaSystems/ngraph/commit/dbd767994fff79d32988d8823271868d38fd3fdf

Kind regards,

William T.
