The open source library ArrayFire uses Intel MKL for GEMM operations, and we recently updated the code to use the batch version of GEMM. We have noticed that using GNU OpenMP or Intel OpenMP as the threading layer gives the expected speedups, but TBB does not. We wanted to bring this to your attention. Below is the ArrayFire benchmark code used to time the GEMM operations.
#include <arrayfire.h>
#include <stdio.h>
#include <math.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A; // populated before each timing

static void fn() {
    array B = matmul(A, A); // matrix multiply
    B.eval();               // ensure evaluated
}

int main(int argc, char** argv) {
    double peak = 0;
    try {
        int device = argc > 1 ? atoi(argv[1]) : 0;
        setDevice(device);
        info();
        printf("Benchmark N-by-N matrix multiply\n");
        for (int n = 128; n <= 2048; n += 128) {
            //printf("%4d x %4d: ", n, n);
            A = constant(1, n, n, 3); // batch of 3 n-by-n matrices
            double time   = timeit(fn); // time in seconds
            double gflops = 2.0 * pow(n, 3) / (time * 1e9);
            if (gflops > peak) peak = gflops;
            printf("%4.2f\n", gflops);
            fflush(stdout);
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
    printf(" ### peak %g GFLOPS\n", peak);
    return 0;
}
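The benchmark is built against the ArrayFire CPU backend; roughly as follows (the library name -lafcpu is assumed from a default ArrayFire install):

g++ -std=c++11 bench.cpp -o bench -lafcpu
./bench        # an optional first argument selects the ArrayFire device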
The benchmark results are provided in the form of an interactive chart at the following URL.
The usage of the batch GEMM call inside ArrayFire can be found in the following source file.
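For context, the batched path goes through MKL's cblas_dgemm_batch interface. Below is a minimal sketch of that call with a single group of identical n-by-n multiplies; it is illustrative only, not the actual ArrayFire source.

#include <mkl.h>

// One "group" of batch_count identical n-by-n multiplies: C_i = A_i * B_i.
// a, b, c are arrays of batch_count pointers to column-major n*n buffers.
void batched_multiply(MKL_INT n, MKL_INT batch_count,
                      const double** a, const double** b, double** c) {
    CBLAS_TRANSPOSE trans = CblasNoTrans;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_size = batch_count;
    // Per-group parameter arrays all have length group_count (= 1 here).
    cblas_dgemm_batch(CblasColMajor,
                      &trans, &trans,
                      &n, &n, &n,      // m, n, k
                      &alpha,
                      a, &n,           // A pointers and lda
                      b, &n,           // B pointers and ldb
                      &beta,
                      c, &n,           // C pointers and ldc
                      1, &group_size); // group_count, sizes per group
}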
Thank you,
Pradeep.
Hi Pradeep,
Thank you very much for integrating MKL into ArrayFire and for reporting the issue.
We will look into the problem. In the meantime, could you please tell us how you link MKL and TBB, along with your MKL version, compiler, and test machine, and whether you followed the batch GEMM article below?
https://software.intel.com/en-us/articles/introducing-batch-gemm-operations
Best Regards,
Ying
Hi Ying,
Here are the details you asked for, from the machine I tested on.
We link to MKL dynamically; the linking flags are as follows:
-L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,-rpath,/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64: -lmkl_core -ldl -lmkl_tbb_thread -lmkl_intel_lp64 -ltbb
The flags above are for the TBB build; for the Intel OpenMP build, mkl_tbb_thread and tbb are replaced by mkl_intel_thread and iomp5.
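For reference, the full link lines for the OpenMP builds follow the standard MKL link-line patterns; roughly (a sketch, not copied verbatim from our build scripts):

# Intel OpenMP threading layer
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl

# GNU OpenMP threading layer
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl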
MKL Version: 2018.1.163
Compiler: GCC 8.1.1
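To double-check which MKL build is actually loaded at run time, the version banner can be queried through MKL's support API (setting the environment variable MKL_VERBOSE=1 also prints the version plus every BLAS call):

#include <mkl.h>
#include <cstdio>

int main() {
    char buf[256];
    mkl_get_version_string(buf, sizeof(buf)); // fills buf with the MKL version banner
    printf("%s\n", buf);
    return 0;
}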
Yes, I followed that article when writing my code.
Thank you for looking into it.
Regards,
Pradeep.
Hi Pradeep ,
Thank you for your reply.
I have escalated the problem and will update you as soon as there is any news.
Thanks
Ying
Hi,
Strangely, the Intel nGraph team has implemented batched matmul using batch GEMM, and unless they have changed it since, they are using TBB and reported good speedups:
https://github.com/NervanaSystems/ngraph/commit/dbd767994fff79d32988d8823271868d38fd3fdf
Kind regards,
William T.