- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

For matrix mul A(m,k) * B(k,n):

m=9, k=256, n=256 is faster than m=9, k=512, n=512 and all larger k and n.

On my E5-2630v3 (16 cores, HT disabled), k,n=256 get 850 GFLOPS while k,n=512 only get 257 GFLOPS.

Here is my testing code. I am doing 64 gemms here:

#include <cstdio> #include <cstdlib> #include <chrono> #include <algorithm> #include <functional> #include <random> #include <omp.h> #include <mkl.h> #include <unistd.h> #define ITERATION 10000 int main(int argc, char *argv[]) { int opt; int n = 9; int c = 256; int c_block = 256; int k = 256; int k_block = 256; int t = 1; while ((opt = getopt(argc, argv, "n:c:k:t:")) != -1) { switch (opt) { case 'n': n = strtol(optarg, NULL, 10); break; case 'c': c = strtol(optarg, NULL, 10); break; case 'k': k = strtol(optarg, NULL, 10); break; case 't': t = strtol(optarg, NULL, 10); break; default: printf("unknown option\n"); } } omp_set_dynamic(0); omp_set_num_threads(t); float *AS[64], *BS[64], *CS[64]; for (int i = 0; i < 64; ++i) { AS= (float*)mkl_malloc(sizeof(float)*n*c, 64); BS= (float*)mkl_malloc(sizeof(float)*c*k, 64); CS= (float*)mkl_malloc(sizeof(float)*n*k, 64); } auto randgen = std::bind(std::uniform_real_distribution<float>(), std::mt19937(0)); for (int i = 0; i < 64; ++i) { std::generate(AS, AS+n*c, std::ref(randgen)); std::generate(BS, BS+c*k, std::ref(randgen)); // std::generate(CS, CS+n*k, std::ref(randgen)); } using Clock = std::chrono::high_resolution_clock; auto t1 = Clock::now(); for (int iter = 0; iter < ITERATION; ++iter) { #pragma omp parallel { const int nthreads = omp_get_num_threads(); const int mythread = omp_get_thread_num(); const int start = mythread*64/nthreads; const int finish = (mythread+1)*64/nthreads; mkl_set_num_threads_local(1); for (int i = start; i < finish; ++i) { float * A = AS; float * B = BS; float * C = CS; cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k); } } } auto t2 = Clock::now(); auto elapsed = t2 - t1; auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count(); // printf("%.1lfs\n", 1e-9 * time); printf("%.lfGFLOPS\n", 1.0 * ITERATION * 64 * 2 * n * c * k / time); for (int i = 0; i < 64; ++i) { mkl_free(AS); mkl_free(BS); mkl_free(CS); } return 0; }

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Youwei,

When the size is the same as L2 cache (k=n=256) the performance will improve a lot.

Once the problem size is large (k=n=512), all data are from LLC and the performance is lower.

Hope this helps!

Khang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Nguyen, Khang T (Intel) wrote:

Hi Youwei,

When the size is the same as L2 cache (k=n=256) the performance will improve a lot.

Once the problem size is large (k=n=512), all data are from LLC and the performance is lower.

Hope this helps!

Khang

Thanks Khang. An interesting followup question is: will tilling work in this case? If intel mkl can do tiling into 256x256, there will not be significant FLOPS degradation, right?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Tim P. wrote:

The way you use omp parallel doesn't make sense. It appears you are forcing each thread to perform the entire problem and turning off mkl local threading. As you appear to be using the same data on each iieration while not taking advantage of mkl internal partitioning, you will see the effect Khang describes.

Hi Tim,

I think the openmp here is correct. Here is the intel manual: https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications. Setting mkl_thread to 1 is good practice.

Moreover, the gemm I do in each thread is small. Even if I set mkl_thread_num to a larger one (say 16), mkl is free to determine the actual number of threads (1 in my case).

In fact, in my first attempt, I did not do openmp threading and let mkl do its internal threading. I got much lower FLOPS.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Following the advice from Khang and Tim, I try to do the tiling.

// replace this line // cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k); // with // where c_block, k_block is 256, 256 for (int c_i = 0; c_i < c; c_i += c_block) { float *TA = A + n * c_i; float beta = c_i ? 0 : 1; for (int k_i = 0; k_i < k; k_i += k_block) { float *TB = B + c_block * k_i + c_i * k; float *TC = C + n * k_i; cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k_block, c_block, 1, TA, c_block, TB, k_block, beta, TC, k_block); } }

In this way, even for a larger matrix (9,512,512), it will keep the working set of the matrix the same size as in (9,256,256) case. **However, I can not see any speedup.**

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Youwei,

The dimension n = 9 might be too small for the Intel(R) MKL to reuse the data. Try to increase the size of n and see happen.

Best Regards,

Khang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Nguyen, Khang T (Intel) wrote:

Hi Youwei,

The dimension n = 9 might be too small for the Intel(R) MKL to reuse the data. Try to increase the size of n and see happen.

Best Regards,

Khang

Thanks Khang. I know that increasing N will increase the use. However, the question here is: why is there a large performance gap in the relative performance?

After tiling, the working set of both programs are the same. The only difference is that whether all the data can be held in L3 cache. I can expect lower FLOPS because of LLC replacement.

I used to believe that if I keep the **working set** the same, the total data size will not matter that much. In this example, if I can do 64 9x256x256 gemm fast, I can also do 256 9x256x256 gemm (after tiling 64 9x512x512 gemm). Could you please comment on that?

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page