
MKL Batch GEMM with TBB threading solution gives no performance improvements

As part of the open source library ArrayFire, Intel MKL is used for GEMM operations, and we recently updated the code to use the batched version of GEMM. We have noticed that using GNU OpenMP or Intel OpenMP as the threading layer gives the expected speedups, but TBB does not, and we wanted to bring this to your attention. Below is the ArrayFire benchmark code used to time the GEMM operations.

#include <arrayfire.h>
#include <stdio.h>
#include <math.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A; // populated before each timing
static void fn()
{
    array B = matmul(A, A);  // matrix multiply
    B.eval();                // ensure evaluated
}

int main(int argc, char ** argv)
{
    double peak = 0;
    try {
        int device = argc > 1 ? atoi(argv[1]) : 0;
        setDevice(device);
        info();

        printf("Benchmark N-by-N matrix multiply\n");
        for (int n = 128; n <= 2048; n += 128) {

            //printf("%4d x %4d: ", n, n);
            A = constant(1,n,n,3);
            double time = timeit(fn); // time in seconds
            double gflops = 2.0 * pow((double)n, 3) / (time * 1e9); // FLOPs per n-by-n multiply
            if (gflops > peak)
                peak = gflops;

            printf("%4.2f\n", gflops);
            fflush(stdout);
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }


    printf(" ### peak %g GFLOPS\n", peak);

    return 0;
}

The benchmark results are provided in the form of an interactive chart at this URL.

The usage of the batch GEMM call inside ArrayFire can be found in the following source file:

https://github.com/9prady9/arrayfire/blob/57eb26d03a738c8a99b664dcbe374bcefdb8572c/src/backend/cpu/b...

Thank you,

Pradeep.

4 Replies
Ying_H_Intel
Employee

Hi Pradeep,

Thank you for integrating MKL into ArrayFire and reporting the issue.

We will look into the problem. By the way, could you please tell us how you link MKL and TBB, along with your MKL version, compiler, and test machine, as in the batch GEMM article below?

https://software.intel.com/en-us/articles/introducing-batch-gemm-operations

 

Best Regards,

Ying 


Hi Ying,

Here are the details you asked for, from the machine I tested on.

We link MKL dynamically, with the following linker flags:

-L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
-Wl,-rpath,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64:
-lmkl_core -ldl -lmkl_tbb_thread -lmkl_intel_lp64 -ltbb

The flags above are for the TBB build; for the Intel OpenMP runs, mkl_tbb_thread and tbb are replaced by mkl_intel_thread and iomp5.
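For comparison, the usual MKL dynamic link lines for the three threading layers (LP64 interface, roughly what the MKL Link Line Advisor suggests; library search paths omitted) look like this:

```shell
# Intel OpenMP threading layer
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
# GNU OpenMP threading layer
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
# TBB threading layer
-lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm -ldl
```

Only the threading-layer library and its runtime differ between the three builds; the interface and core libraries stay the same.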

MKL Version: 2018.1.163

Compiler: GCC 8.1.1

Yes, I followed that article when writing my code.

Thank you for looking into it.

Regards,

Pradeep.

Ying_H_Intel
Employee

Hi Pradeep,

Thank you for your reply.

I have escalated the problem and will update you when there is any news.

Thanks,
Ying


Hi,

Strangely, the Intel nGraph team has implemented batched matmul using batch GEMM and, unless they have changed it since, they are using TBB and reported good speedups:

https://github.com/NervanaSystems/ngraph/commit/dbd767994fff79d32988d8823271868d38fd3fdf

Kind regards,

William T.
