topic Hi Youwei, in Intel® oneAPI Math Kernel Library

MKL GEMM slower for larger matrices

Youwei_Z_ — Tue, 26 Jun 2018 00:56:09 GMT

For matrix mul A(m,k) * B(k,n):

m=9, k=256, n=256 is faster than m=9, k=512, n=512 and all larger k and n.

On my E5-2630v3 (16 cores, HT disabled), k,n=256 get 850 GFLOPS while k,n=512 only get 257 GFLOPS.

Here is my testing code. I am doing 64 gemms here:

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <algorithm>
#include <functional>
#include <random>
#include <omp.h>
#include <mkl.h>
#include <unistd.h>


#define ITERATION 10000

int main(int argc, char *argv[])
{
	int opt;
	int n = 9;
	int c = 256; int c_block = 256;
	int k = 256; int k_block = 256;
	int t = 1;
	while ((opt = getopt(argc, argv, "n:c:k:t:")) != -1) {
		switch (opt) {
			case 'n': n = strtol(optarg, NULL, 10); break;
			case 'c': c = strtol(optarg, NULL, 10); break;
			case 'k': k = strtol(optarg, NULL, 10); break;
			case 't': t = strtol(optarg, NULL, 10); break;
			default: printf("unknown option\n");
		}
	}

	omp_set_dynamic(0);
	omp_set_num_threads(t);
	
	float *AS[64], *BS[64], *CS[64];
	for (int i = 0; i < 64; ++i) {
		AS = (float*)mkl_malloc(sizeof(float)*n*c, 64);
		BS = (float*)mkl_malloc(sizeof(float)*c*k, 64);
		CS = (float*)mkl_malloc(sizeof(float)*n*k, 64);
	} 
	
	auto randgen = std::bind(std::uniform_real_distribution<float>(), std::mt19937(0));
	for (int i = 0; i < 64; ++i) {
		std::generate(AS, AS+n*c, std::ref(randgen));
		std::generate(BS, BS+c*k, std::ref(randgen));
		// std::generate(CS, CS+n*k, std::ref(randgen));
	}

	using Clock = std::chrono::high_resolution_clock;
	auto t1 = Clock::now();
	for (int iter = 0; iter < ITERATION; ++iter) {
		#pragma omp parallel
		{
			const int nthreads = omp_get_num_threads();
    		const int mythread = omp_get_thread_num();
    		const int start = mythread*64/nthreads;
   			const int finish = (mythread+1)*64/nthreads;  
			mkl_set_num_threads_local(1);
			for (int i = start; i < finish; ++i)
			{
				float * A = AS;
				float * B = BS;
				float * C = CS;
				cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k);
			}
		}
		
	}
	auto t2 = Clock::now();
	auto elapsed = t2 - t1;
	auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
	// printf("%.1lfs\n", 1e-9 * time);
	printf("%.lfGFLOPS\n", 1.0 * ITERATION * 64 * 2 * n * c * k / time);
	
	for (int i = 0; i < 64; ++i) {
		mkl_free(AS);
		mkl_free(BS);
		mkl_free(CS);
	} 
	return 0;
}

Hi Youwei,

Khang_N_Intel — Thu, 28 Jun 2018 18:24:07 GMT

Hi Youwei,

When the size is the same as L2 cache (k=n=256) the performance will improve a lot.

Once the problem size is large (k=n=512), all data are from LLC and the performance is lower.

Hope this helps!

Khang

The way you use omp parallel

TimP — Thu, 28 Jun 2018 20:36:00 GMT

The way you use omp parallel doesn't make sense. It appears you are forcing each thread to perform the entire problem and turning off mkl local threading. As you appear to be using the same data on each iieration while not taking advantage of mkl internal partitioning, you will see the effect Khang describes.

Quote:Nguyen, Khang T (Intel)

Youwei_Z_ — Thu, 28 Jun 2018 20:42:25 GMT

Nguyen, Khang T (Intel) wrote:

Hi Youwei,

When the size is the same as L2 cache (k=n=256) the performance will improve a lot.

Once the problem size is large (k=n=512), all data are from LLC and the performance is lower.

Hope this helps!

Khang

Thanks Khang. An interesting followup question is: will tilling work in this case? If intel mkl can do tiling into 256x256, there will not be significant FLOPS degradation, right?

Quote:Tim P. wrote:

Youwei_Z_ — Thu, 28 Jun 2018 20:48:59 GMT

Tim P. wrote:

The way you use omp parallel doesn't make sense. It appears you are forcing each thread to perform the entire problem and turning off mkl local threading. As you appear to be using the same data on each iieration while not taking advantage of mkl internal partitioning, you will see the effect Khang describes.

Hi Tim,

I think the openmp here is correct. Here is the intel manual: https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications. Setting mkl_thread to 1 is good practice.

Moreover, the gemm I do in each thread is small. Even if I set mkl_thread_num to a larger one (say 16), mkl is free to determine the actual number of threads (1 in my case).

In fact, in my first attempt, I did not do openmp threading and let mkl do its internal threading. I got much lower FLOPS.

Following the advice from

Youwei_Z_ — Thu, 28 Jun 2018 20:57:34 GMT

Following the advice from Khang and Tim, I try to do the tiling.

// replace this line
// cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k);
// with
// where c_block, k_block is 256, 256
for (int c_i = 0; c_i < c; c_i += c_block)
{
	float *TA = A + n * c_i;
	float beta = c_i ? 0 : 1;
	for (int k_i = 0; k_i < k; k_i += k_block)
	{
		float *TB = B + c_block * k_i + c_i * k;
		float *TC = C + n * k_i;
		cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k_block, c_block, 1, TA, c_block, TB, k_block, beta, TC, k_block);
	}
}

In this way, even for a larger matrix (9,512,512), it will keep the working set of the matrix the same size as in (9,256,256) case. However, I can not see any speedup.

Hi Youwei,

Khang_N_Intel — Mon, 02 Jul 2018 19:25:46 GMT

Hi Youwei,

The dimension n = 9 might be too small for the Intel(R) MKL to reuse the data. Try to increase the size of n and see happen.

Best Regards,

Khang

Quote:Nguyen, Khang T (Intel)

Youwei_Z_ — Thu, 05 Jul 2018 18:20:03 GMT

Nguyen, Khang T (Intel) wrote:

Hi Youwei,

The dimension n = 9 might be too small for the Intel(R) MKL to reuse the data. Try to increase the size of n and see happen.

Best Regards,

Khang

Thanks Khang. I know that increasing N will increase the use. However, the question here is: why is there a large performance gap in the relative performance?

After tiling, the working set of both programs are the same. The only difference is that whether all the data can be held in L3 cache. I can expect lower FLOPS because of LLC replacement.

I used to believe that if I keep the working set the same, the total data size will not matter that much. In this example, if I can do 64 9x256x256 gemm fast, I can also do 256 9x256x256 gemm (after tiling 64 9x512x512 gemm). Could you please comment on that?