Hi,
We are working on RNN kernel optimization and are trying to run two SGEMM calls in parallel on a 2-socket SKX 6148 server (20 cores per socket).
The SGEMM size is M = 20, N = 2400, K = 800.
Our target is to map the first SGEMM to socket 0 and the second SGEMM to socket 1, with each call using the 20 cores of its own socket.
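To make the intended layout concrete, here is a minimal sketch of the nested "2 x 20" structure. It is not the benchmark's actual code: `naive_sgemm`, `GemmTask`, and `run_two_gemms_nested` are hypothetical names, and the naive loop only stands in for the MKL SGEMM call so that the sketch builds without MKL.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Naive row-major C = A * B, standing in for the MKL SGEMM call (assumption).
// A is M x K, B is K x N, C is M x N. Columns of C are split across the inner team.
void naive_sgemm(int M, int N, int K, const float* A, const float* B,
                 float* C, int inner_threads) {
    assert(inner_threads > 0);
    #pragma omp parallel for num_threads(inner_threads)
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}

// One independent GEMM problem (hypothetical helper type).
struct GemmTask {
    int M, N, K;
    const float* A;
    const float* B;
    float* C;
};

// Outer level: two threads, one per GEMM. Inner level: a team per GEMM.
void run_two_gemms_nested(const GemmTask& t0, const GemmTask& t1, int inner_threads) {
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // allow the inner parallel region to be active
#endif
    const GemmTask* tasks[2] = {&t0, &t1};
    #pragma omp parallel for num_threads(2) schedule(static, 1)
    for (int s = 0; s < 2; ++s) {
        const GemmTask& t = *tasks[s];
        naive_sgemm(t.M, t.N, t.K, t.A, t.B, t.C, inner_threads);
    }
}
```

In our case `inner_threads` would be 20. The code itself does not pin anything to a socket; one option, assuming the OpenMP runtime honors the standard affinity variables, is to set `OMP_PLACES=sockets` and `OMP_PROC_BIND=spread,close` so that the two outer threads land on different sockets and each inner team stays close to its parent.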
We measured GFLOPS with this benchmark (https://github.com/xhzhao/GemmEfficiency/tree/tbb) and got the following results:
- OMP, 1 x 40 cores: 2261 GFLOPS (code: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L120)
- Pthread, 2 x 20 cores: 3550 GFLOPS (code: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L291)
- OMP nested, 2 x 20 cores: 1068 GFLOPS (code: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_omp.cpp#L336)
- TBB nested, 2 x 20 cores: 752 GFLOPS (code: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159)
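For anyone who wants to relate these figures to wall time, the sketch below uses the conventional GEMM count of 2 x M x N x K floating-point operations. The helper names `gemm_flops` and `gemm_gflops` are hypothetical, and the benchmark may aggregate its numbers differently.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one GEMM: one multiply and one add
// per (m, n, k) triple.
std::int64_t gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K) {
    return 2 * M * N * K;
}

// GFLOPS for `calls` GEMMs of the same shape that finish in `seconds`
// of wall time.
double gemm_gflops(std::int64_t M, std::int64_t N, std::int64_t K,
                   int calls, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(gemm_flops(M, N, K)) * calls / seconds / 1e9;
}
```

For M = 20, N = 2400, K = 800, `gemm_flops` gives 76,800,000 operations per call, so a single SGEMM is well under a millisecond of work at any of the rates listed above.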
The performance of OMP + MKL and TBB + MKL in the nested configurations is well below what we expected (1068 and 752 GFLOPS versus 3550 GFLOPS with pthreads), and I'm not sure whether I am missing something about using MKL in a threaded application.
BTW, the pthread + MKL solution is not suitable for our real case, as it would double the number of threads and make performance even worse.
Thanks in advance.