Two parallel cblas_dgemm calls on Xeon Phi 7210

TBane · 01-12-2018

I want to run two cblas_dgemm calls in parallel, and used the following code snippet:

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        mkl_set_num_threads(NUM_OF_THREADS / 2);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pA, p, pB, n, 0, pC, n);
    } else {
        mkl_set_num_threads(NUM_OF_THREADS / 2);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pA, p, pB, n, 0, pD, n);
    }
}

I compared the time taken by the above with that of two cblas_dgemm calls made one after the other, as follows:

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pA, p, pB, n, 0, pC, n);
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pA, p, pB, n, 0, pD, n);

Here each cblas_dgemm call uses the full set of NUM_OF_THREADS threads.

On the Xeon Phi, the parallel version takes 0.272649 seconds, while the serial version takes 0.164373 seconds. On a CPU, by contrast, the parallel version takes about half the time of the serial version. Any feedback is greatly appreciated. The code for the parallel and serial versions is attached.

Among the various settings: KMP_AFFINITY was "compact,1,0,granularity=fine", NUM_OF_THREADS was 32, and the matrix size was 100x100 for all matrices.