Solved: using TBB thread local storage with MKL

bohan_w_ · ‎02-21-2017

Suppose I have the following code:

tbb::enumerable_thread_specific<DataType> tls;

tbb::parallel_for ( [&] () {

  DataType &data = tls.local();

  cblas_....(data);

});

When the MKL is compiled using OpenMP, the above code can compute the correct answer. When the MKL is compiled used TBB, however, the above code cannot produce the correct result. I think the computation is in parallel in the cblas function.

I think the problem may be related to the thread local storage, but I don't know why it does not work. Any ideas? How to solve the problem?

bohan_w_ · ‎02-27-2017

Finally, I have found out why my problem happens. I have a block of memory in my TLS. This block memory is managed by std::vector. When I used a default allocator for the std::vector, the problem disappeared. When I used TBB's allocator (tbb::cache_aligned_allocator), however, the problem happened. A more concrete code example about my problem is listed below. MKL is compiled using TBB. From what I observed, it means that TBB's cblas is using tbb::cache_aligned_allocator somehow. There must be some memory contention inside TBB's allocator and the memory resource is not protected properly. I hope this is not a bug inside TBB and MKL but my incorrect usage.

Update: The above is partially correct. I finally found out that no matter which allocator is used for TLS, the result is not correct at all. If I don't use TLS, I will get the same answer as the serial loop.

double A[100][16];
double B[100][16];
double D[100][16];

typedef std::allocator Allocator;

tbb::enumerable_thread_specific<std::vector<double, Allocator>> tls(std::vector<double, Allocator>(16));

tbb::parallel_for (tbb::blocked_range(0, 100), [&] (const tbb::blocked_range<int> &rng) {
  std::vector<double, Allocator> &C = tls.local();

  for (int i = rng.begin(); i != rng.end(); i++) {
    // C = Bi * Ai
    cblas_dgemm(A, A, C.data());
    // D = Ai^T * C
    cblas_dgemm(A, C.data(), D);
  }
});

View solution in original post

Zhen_Z_Intel · ‎02-21-2017

Dear customer,

I don't know which cblas function you are using. And how you split matrix/vector during the calculation. For example, if you are using dgemm(c=a*b) for two N size array of matrix(M*M) multiplication that sizeof(a) & sizeof(b) actually N*M*M, you could set a loop for N times that each step is M*M. And you will get N size of array c, each element of this array is actually one M*M matrix.

But if you are calculating two M*M matrix multiplication by calling it within the M size tbb loop, and each step of a is M*1, of b is 1*M. The result would be wrong,because each time you get 1*1 c. And you just get diagonal element of matrix c during the calculation.

Best regards,
Fiona

bohan_w_ · ‎02-21-2017

Fiona Z. (Intel) wrote:

Dear customer,

I don't know which cblas function you are using. And how you split matrix/vector during the calculation. For example, if you are using dgemm(c=a*b) for two N size array of matrix(M*M) multiplication that sizeof(a) & sizeof(b) actually N*M*M, you could set a loop for N times that each step is M*M. And you will get N size of array c, each element of this array is actually one M*M matrix.

But if you are calculating two M*M matrix multiplication by calling it within the M size tbb loop, and each step of a is M*1, of b is 1*M. The result would be wrong,because each time you get 1*1 c. And you just get diagonal element of matrix c during the calculation.

Best regards,
Fiona

Thanks for the reply! I'm not splitting a single matrix multiplication. I'm doing several matrix multiplications in each tbb thread. Let me make the example more concrete.

double A[100][16];
double B[100][16];
double D[100][16];

tbb::enumerable_thread_specific<std::array<double,16>> tls;

tbb::parallel_for (tbb::blocked_range(0, 100), [&] (const tbb::blocked_range<int> &rng) {
  std::array<double,16> &C = tls.local();

  for (int i = rng.begin(); i != rng.end(); i++) {
    // C = Bi * Ai
    cblas_dgemm(A, A, C.data());
    // D = Ai^T * C
    cblas_dgemm(A, C.data(), D);
  }
});

The problem I described happened to the above code. If I use tbb version MKL, the above code cannot compute the correct D.

bohan_w_ · ‎02-27-2017

Finally, I have found out why my problem happens. I have a block of memory in my TLS. This block memory is managed by std::vector. When I used a default allocator for the std::vector, the problem disappeared. When I used TBB's allocator (tbb::cache_aligned_allocator), however, the problem happened. A more concrete code example about my problem is listed below. MKL is compiled using TBB. From what I observed, it means that TBB's cblas is using tbb::cache_aligned_allocator somehow. There must be some memory contention inside TBB's allocator and the memory resource is not protected properly. I hope this is not a bug inside TBB and MKL but my incorrect usage.

Update: The above is partially correct. I finally found out that no matter which allocator is used for TLS, the result is not correct at all. If I don't use TLS, I will get the same answer as the serial loop.

double A[100][16];
double B[100][16];
double D[100][16];

typedef std::allocator Allocator;

tbb::enumerable_thread_specific<std::vector<double, Allocator>> tls(std::vector<double, Allocator>(16));

tbb::parallel_for (tbb::blocked_range(0, 100), [&] (const tbb::blocked_range<int> &rng) {
  std::vector<double, Allocator> &C = tls.local();

  for (int i = rng.begin(); i != rng.end(); i++) {
    // C = Bi * Ai
    cblas_dgemm(A, A, C.data());
    // D = Ai^T * C
    cblas_dgemm(A, C.data(), D);
  }
});

Zhen_Z_Intel · ‎02-27-2017

Dear customer,

I could not reproduce your problem. The result are same what ever I used TBB cache_align_allocator or std allocator. The cache_align allocator is false sharing that the cache line could be moved from one processor to another while one processor is going to modify and the other is going to read. It's just need more costs for memory moving but will not affect value. I tried with an example in following code. Uncomment 21 line & comment 22 to use tbb cache align allocator. The result is same.

int main() {
	double** A = new double*[9];
	double** B = new double*[9];
	double** D = new double*[9];
	for(int i = 0; i < 9; ++i)
	{
    	A = new double[4];
    	B = new double[4];
    	D = new double[4];
	}

	for(int i=0;i<4;i++)
	{
		for(int j=0;j<9;j++)
		{
			A=1;
			B=1;
			D=0;
		}
	}
	//typedef std::allocator<double> Allocator;
	typedef tbb::cache_aligned_allocator<double> Allocator;
	tbb::enumerable_thread_specific<std::vector<double, Allocator>> tls(std::vector<double, Allocator>(4));

	tbb::parallel_for (tbb::blocked_range<int>(0, 9), [&] (const tbb::blocked_range<int> &rng) {
	  std::vector<double, Allocator> &C = tls.local();

	  for (int i = rng.begin(); i != rng.end(); i++) {
	    // C = Bi * Ai
	    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2, 1, B, 2, A, 2, 0, C.data(), 2);
	    // D = Ai^T * C
	    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, 2, 2, 2, 1, A, 2, C.data(), 2, 0, D, 2);
	  }
	});  
	for(int i=0;i<9;i++)
	{
		printf("matrix %d:\n",(i+1));
		for(int j=0;j<2;j++)
		{
			for(int k=0;k<2;k++)
			{
				printf("%f\t", D[j*2+k]);
			}
			printf("\n");
		}
		printf("\n\n");
	}
    return 0;
}