Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

mkl_?imatcopy single-threaded?

Randy_Clepper
Beginner
447 Views
It appears that mkl_?imatcopy uses only one thread to transpose the matrix, though maybe that's just the nature of the problem; perhaps an in-place transpose can't be threaded. The out-of-place version also appears to run on only one thread, which I would think could be parallelized: grab a row in the source, make it a column in the target, and at the least divvy the rows up among threads. So this makes me think I'm doing something else wrong.
Am I doing something wrong in setup, or is the function truly sequential? I'm linking against mkl_intel_thread.lib and compiling with MSVC 2008.
Thanks!
0 Kudos
1 Solution
Vladimir_Petrov__Int
New Contributor III
447 Views
Hi Randy,

Threading in mkl_?imatcopy (for certain square sizes) will be available in one of our upcoming update releases.

As to the out-of-place transpose, at MKL's level it is not obvious how a user expects the data in both the input and output matrices to be distributed across threads (cores). On the other hand, using the existing mkl_?omatcopy inside a parallel section is pretty straightforward (similar to what you described above):

#pragma omp parallel
{
// A - input matrix in row-major layout (rows x cols, leading dimension lda)
// B - output matrix in row-major layout (cols x rows, leading dimension ldb)
// rows - number of rows in A
// cols - number of columns in A
int t_id = omp_get_thread_num();       // omp thread id
int n_threads = omp_get_num_threads(); // number of omp threads
// split the rows of A evenly across the threads
size_t my_start = (size_t)t_id * rows / n_threads;
size_t my_part  = (size_t)(t_id + 1) * rows / n_threads - my_start;
... // user code which works with the part of A starting at row my_start and consisting of my_part rows
mkl_?omatcopy('R', 'T', my_part, cols, 1.0, A + my_start*lda, lda, B + my_start, ldb);
... // user code which works with B
}

Best regards,
-Vladimir

View solution in original post

0 Kudos
2 Replies
Randy_Clepper
Beginner
447 Views
Vladimir,
I can't ask for more than that! That's actually what I did, using TBB, for an out-of-place parallel algorithm.
It's wonderful news that the in-place version will receive a threaded update. I'll leave my out-of-place parallel version in for now and gladly switch when that update arrives.
I appreciate your response and your time.
Thanks!
0 Kudos