slow out-of-place matrix transposition using mkl_domatcopy

Ján_K_ · ‎10-03-2015

I created test program to test speed of matrix transposition for various scenarios, in particular:

1. in-place transposition of square matrix using mkl_dimatcopy method
2. out-of-place transposition of square matrix using mkl_domatcopy method
3. in-place transposition of rectangular matrix using mkl_dimatcopy method
4. out-of-place transposition of rectangular matrix using mkl_domatcopy method
5. in-place transposition of square matrix using naive solution in C
6. out-of-place transposition of square matrix using naive solution in C
7. out-of-place transposition of rectangular matrix using naive solution in C

I didnt implement in-place transposition of rectangular matrix in C because of its complexity. My solution was single-threaded (i linked mkl_sequential_dll.lib to my binary), version of MKL used 11.3. This is output of my test program containing times needed for one matrix transposition. These times were averaged using 101 loops:

== Square in-place matrix (1000x1000) transposition using mkl_dimatcopy completed ==
== at 3.35496 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using mkl_domatcopy completed ==
== at 9.11405 milliseconds ==

== Rectangular in-place matrix (1500x230) transposition using mkl_dimatcopy completed ==
== at 58.50708 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed ==
== at 4.34861 milliseconds ==

== Square in-place matrix (1000x1000) transposition using C completed ==
== at 5.41542 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using C completed ==
== at 12.04492 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using C completed ==
== at 1.51498 milliseconds ==

In all cases but one MKL solution was better. This exception is out-of-place transposition of rectangular matrix with size 1500x230. Very simple C solution which didnt care about optimal using of cache memory is 2.87 times faster ! The data type used was double. C solution was faster than MKL also for larger rectangular matrix 1500x600 which didnt fit into L3 cache. For this case C solution was 1.25 times faster than MKL. Even when MKL does additional operation (multiplication by parameter alpha) it doesnt explain the difference. There are 1500x230 = 345 000 multiplications. According to Intel optimization manual throughput of mulpd instruction for my architecture is 1 instruction per cycle (i.e. 2 double multiplication per cycle) if I interpret the manual correctly. To do 345 000 multiplication in ideal case I need 345 000 / 2 = 172 500 cycles or around 0.1 ms.

Configuration of my PC is as follows:

Windows 7 Home Premium 64-bit SP1

Intel Core i3 370M @ 2.40GHz
Arrandale 32nm Technology
Cores   2
Threads   4
L1 Data Cache Size   2 x 32 KBytes
L1 Instructions Cache Size   2 x 32 KBytes
L2 Unified Cache Size   2 x 256 KBytes
L3 Unified Cache Size   3072 KBytes

6,00GB Dual-Channel DDR3 @ 531MHz (7-7-7-20)

Visual Studio Express 2012 for Windows Desktop

Source code follows:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include "mkl.h"

// Parameters of square matrices
int m1 = 1000;
int n1 = 1000;
int loop_count1 = 101;

// Parameters of rectangular matrices
int m2 = 1500;
int n2 = 230;
int loop_count2 = 101;

void TestTransMklSqIn();
void TestTransMklSqOut();
void TestTransMklRecIn();
void TestTransMklRecOut();
void TestTransCSqIn();
void TestTransCSqOut();
void TestTransCRecOut();
void CheckTranspose(double *A, double *B, int m, int n);

int main()
{
   TestTransMklSqIn();
   TestTransMklSqOut();
   TestTransMklRecIn();
   TestTransMklRecOut();
   TestTransCSqIn();
   TestTransCSqOut();
   TestTransCRecOut();
}

double *CreateMatrix(int m, int n, bool zero = false)
{
   double *A = (double*)mkl_malloc(m * n * sizeof(double), 16);
   if (zero)
   {
      memset((void*)A, 0, m * n *sizeof(double));
   }
   else
   {
      int count = m * n;
      for (int i = 0; i < count; i++)
      {
         A = (double)(i+1);
      }
   }

   return A;
}

// In-place transposition of square matrix using MKL
void TestTransMklSqIn()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count1; i++)
   {
      mkl_dimatcopy('R', 'T', m1, n1, 1.0, A, n1, m1);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square in-place matrix (%dx%d) transposition using mkl_dimatcopy completed == \n"
      " == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of square matrix using MKL
void TestTransMklSqOut()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double *B = CreateMatrix(m1, n1);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count1; i++)
   {
      mkl_domatcopy('R', 'T', m1, n1, 1.0, A, n1, B, m1);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square out-of-place matrix (%dx%d) transposition using mkl_domatcopy completed == \n"
      " == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// In-place transposition of rectangular matrix using MKL
void TestTransMklRecIn()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count2; i++)
   {
      mkl_dimatcopy('R', 'T', m2, n2, 1.0, A, n2, m2);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   printf (" == Rectangular in-place matrix (%dx%d) transposition using mkl_dimatcopy completed == \n"
      " == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of rectangular matrix using MKL
void TestTransMklRecOut()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }

   double *B = CreateMatrix(m2, n2, true);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count2; i++)
   {
      mkl_domatcopy('R', 'T', m2, n2, 1.0, A, n2, B, m2);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   CheckTranspose(A, B, m2, n2);

   printf (" == Rectangular out-of-place matrix (%dx%d) transposition using mkl_domatcopy completed == \n"
      " == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// In-place transposition of square matrix using naive solution in C
void TestTransCSqIn()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   double temp;
   for (int k = 0; k < loop_count1; k++)
   {
      for (int i = 0; i < m1; i++)
      {
         int idx1 = i * n1;
         int idx2 = i;
         for (int j = 0; j < i; j++)
         {
            temp = A[idx1];
            A[idx1] = A[idx2];
            A[idx2] = temp;
            idx1++;
            idx2+=n1;
         }
      }
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square in-place matrix (%dx%d) transposition using C completed == \n"
      " == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of square matrix using naive solution in C
void TestTransCSqOut()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }
   double *B = CreateMatrix(m1, n1);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int k = 0; k < loop_count1; k++)
   {
      for (int i = 0; i < m1; i++)
      {
         int idx1 = i * n1;
         int idx2 = i;
         for (int j = 0; j < n1; j++)
         {
            B[idx2] = A[idx1];
            idx1++;
            idx2+=n1;
         }
      }
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square out-of-place matrix (%dx%d) transposition using C completed == \n"
      " == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// Out-of-place transposition of rectangular matrix using naive solution in C
void TestTransCRecOut()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }
   double *B = CreateMatrix(m2, n2, true);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int k = 0; k < loop_count2; k++)
   {
      for (int i = 0; i < m2; i++)
      {
         int idx1 = i * n2;
         int idx2 = i;
         for (int j = 0; j < n2; j++)
         {
            B[idx2] = A[idx1];
            idx1++;
            idx2+=m2;
         }
      }
   } 

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   CheckTranspose(A, B, m2, n2);

   printf (" == Rectangular out-of-place matrix (%dx%d) transposition using C completed == \n"
      " == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

void CheckTranspose(double *A, double *B, int m, int n)
{
   bool res = true;
   for (int i = 0; i < m; i++)
   {
      for (int j = 0; j < n; j++)
      {
         if (A[i*n + j] != B[j*m + i])
         {
            res = false;
            break;
         }
      }
      if (!res)
      {
         break;
      }
   }

   if (!res)
   {
      printf("Matrix tranpose error detected !\n\n");
   }
}

TimP · ‎10-03-2015

I suppose an optimized matrix transpose would use streaming stores, which can't be done when your inner loops are backwards. VS2012 did include some 128-bit wide auto-vectorization but not auto-streaming stores and probably not a mixed stride. So it seems that MKL has an opportunity for 2x speedup over VS2012 compiler even when cache blocking is not a factor, and the streaming stores could cut the cache demand in half.

Still, the VS2012 compiler may be smart enough to shortcut your dummy loop. If so, it would be almost embarrassing if you saw only 3x speedup in your version.

I suppose the improvements in auto-vectorization which Intel apparently negotiated for the VS2015 compiler may apply primarily to AVX and AVX2, so may not be of interest to you.

I don't know how much practical use there is for physical transposition of moderately large matrices. Much optimization of widely quoted benchmarks has been done by eliminating unnecessary transposition. e.g. making use of the DGEMM options to include transposition in the code.

Gennady_F_Intel · ‎02-28-2016

Jan, pls check the latest version of MKL 11.3 update ( released one week ago ) and let us know the performance results you see on your side. thanks.

McCalpinJohn · ‎02-29-2016

Naive matrix transpose operations often have serious troubles with cache associativity conflicts in the L1 and L2 caches.

The associativity of the L1 and L2 caches is 8, so I have typically used an 8x8 unrolling of the loops to minimize the chance of kicking cache lines out of the cache before I have read all the data items in the cache line. One example that worked well was simply:

#pragma unroll(8)
   for (i=0; i<N1; i++) {
#pragma unroll(8)
      for (j=0; j<N2; j++) {
         a = b;
      }
   }

This was about 5x-6x times as fast as the naive code for 1024x1024 arrays on a Xeon E5-2680 (3.1 GHz max turbo for a single core, with 20 MiB L3, so both arrays were L3-contained). The same approach should work fine for the "flattened" arrays above.

On this Xeon E5-2680 system I got ~18.4 GB/s for a straight copy of the same arrays, while the 8x8 unrolled version corresponded to about 11.6 GB/s (1.48 milliseconds). The transpose took a bit less than 1.6 times as long to execute as the copy, which (in my experience) is a good result. It is probably possible to improve on this result slightly, but I would expect to start hitting additional complexities before too long....

In your case it might be worthwhile to try to figure out how to implement the in-place transpose for the rectangular matrix (with loop unrolling), since one 1500x230 array of doubles will fit in your L3 cache, while two copies will not....

Ján_K_ · ‎03-24-2016

Hello Gennady, unfortunately witk MKL 11.3.2 the times didnt change significantly. Same operating system, same compiler. For check this is version i tested:

#define __INTEL_MKL_BUILD_DATE 20160121

#define __INTEL_MKL__ 11
#define __INTEL_MKL_MINOR__ 3
#define __INTEL_MKL_UPDATE__ 2

#define INTEL_MKL_VERSION 110302

And these are the results:

== Square in-place matrix (1000x1000) transposition using mkl_dimatcopy completed ==
== at 3.21081 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using mkl_domatcopy completed ==
== at 8.70370 milliseconds ==

== Rectangular in-place matrix (1500x230) transposition using mkl_dimatcopy completed ==
== at 57.78071 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed ==
== at 4.10067 milliseconds ==

== Square in-place matrix (1000x1000) transposition using C completed ==
== at 5.03611 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using C completed ==
== at 11.63497 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using C completed ==
== at 1.41246 milliseconds ==

Gennady_F_Intel · ‎03-26-2016

Jan, but what I see on my side 1 and 2 threads ( win64, static linking):

u594997_DPD200574983>_ser.exe
Major version: 11
Minor version: 3
Update version: 2
Product status: Product
Build: 20160120
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

**** mkl #threads == 1
== Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed ==
== at 1.07731 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using C completed ==
== at 1.23810 milliseconds ==

u594997_DPD200574983>_thr.exe
Major version: 11
Minor version: 3
Update version: 2
Product status: Product
Build: 20160120
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

**** mkl #threads == 2
== Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed ==
== at 0.91667 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using C completed ==
== at 1.20657 milliseconds ==

the test is attached.

McCalpinJohn · ‎03-28-2016

On a newer platform (Core i7-4960HQ, quad-core Haswell) I see significantly better performance with the C version of the code.

Compiled with compiler version: icpc (ICC) 16.0.1 20151020

Compiled with command line: icpc -O3 -mkl -xCORE-AVX2 transposition.cpp -o transpose

The results show that the C code is almost twice as fast as the MKL code, but the MKL code shows effectively perfect parallel speedup.
Adding a "#pragma omp parallel for" to the outer loop of the C code gives almost perfect parallel speedup in that case as well, making the C code between 1.8x and 2x the speed of the MKL code for this specific example. The C code continues to get faster when using additional threads (HyperThreading is enabled on this system), but the timings become more erratic when using all 8 logical processors on this box....

Results with serial version of the C code:

 **** mkl #threads == 1 
 == Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed == 
 == at 1.23188 milliseconds == 
 == Rectangular out-of-place matrix (1500x230) transposition using C completed == 
 == at 0.65431 milliseconds == 

 **** mkl #threads == 2 
 == Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed == 
 == at 0.59453 milliseconds == 
 == Rectangular out-of-place matrix (1500x230) transposition using C completed == 
 == at 0.65186 milliseconds == 

 **** mkl #threads == 3 
 == Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed == 
 == at 0.42065 milliseconds == 
 == Rectangular out-of-place matrix (1500x230) transposition using C completed == 
 == at 0.65242 milliseconds == 

 **** mkl #threads == 4 
 == Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed == 
 == at 0.31615 milliseconds == 
 == Rectangular out-of-place matrix (1500x230) transposition using C completed == 
 == at 0.66491 milliseconds ==