
## Matrix product with MKL

Hi,

I want to do a complicated matrix product which is defined as

C[i,j]  = \sum_{n,m} A[i,n]*A[j,n]* B[n,m] *A[i,m]*A[j,m]

where A and B are two matrices of size NxN (N ~ 1000).

To accomplish this task I just run four nested loops to create the matrix C:

```c
for (int i = 0; i < ndim; i++)
    for (int j = 0; j < ndim; j++)
        for (int n = 0; n < ndim; n++)
            for (int m = 0; m < ndim; m++)
                C[i][j] += A[i][n] * A[j][n] * B[n][m] * A[i][m] * A[j][m];
```

Can anyone suggest any better implementation using MKL matrix product function?
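For what it's worth, most of the arithmetic can be funneled through a BLAS-3 kernel even though no single MKL routine computes this directly: for each fixed j, build W[i][n] = A[i][n]*A[j][n], compute V = W*B^T with one GEMM, and then C[i][j] = sum_n W[i][n]*V[i][n]. The total operation count is still O(N^4), but the dominant step per column becomes a GEMM. A sketch (my own construction, for real `double` data in flattened row-major arrays; a naive loop stands in where `cblas_dgemm` would be called):

```c
#include <stdlib.h>

/* Naive stand-in for the BLAS call
 * cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, ...):
 * computes V = W * B^T for row-major N x N matrices. */
static void gemm_nt(int N, const double *W, const double *B, double *V)
{
    for (int i = 0; i < N; i++)
        for (int n = 0; n < N; n++) {
            double s = 0.0;
            for (int m = 0; m < N; m++)
                s += W[i*N + m] * B[n*N + m];
            V[i*N + n] = s;
        }
}

/* C[i][j] = sum_{n,m} A[i][n]*A[j][n]*B[n][m]*A[i][m]*A[j][m] */
void product(int N, const double *A, const double *B, double *C)
{
    double *W = malloc((size_t)N * N * sizeof(double));
    double *V = malloc((size_t)N * N * sizeof(double));
    for (int j = 0; j < N; j++) {
        /* W[i][n] = A[i][n] * A[j][n] */
        for (int i = 0; i < N; i++)
            for (int n = 0; n < N; n++)
                W[i*N + n] = A[i*N + n] * A[j*N + n];
        gemm_nt(N, W, B, V);               /* the GEMM: V = W * B^T */
        for (int i = 0; i < N; i++) {      /* C[i][j] = sum_n W[i][n]*V[i][n] */
            double s = 0.0;
            for (int n = 0; n < N; n++)
                s += W[i*N + n] * V[i*N + n];
            C[i*N + j] = s;
        }
    }
    free(W);
    free(V);
}
```

With MKL, replacing `gemm_nt` by `cblas_dgemm` (or `cblas_zgemm` for complex data) turns the O(N^3)-per-column step into a tuned kernel; since the total flop count is unchanged, whether this beats a well-parallelized loop nest would need to be measured.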

7 Replies

**Reply (Employee):**

Hi Saurabh,

It seems there is no direct MKL matrix product function that can be called for this (something in BLAS/LAPACK may help, like A*A^T). In general, when there are many array operations, instead of constructing arrays to fit the MKL BLAS or LAPACK input format, I would recommend using OpenMP to optimize the four-level loop directly. You can find examples in the Intel C++ compiler documentation or at http://stackoverflow.com/questions/6780075/openmp-optimizations

or use array notation:

https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/

https://www.cilkplus.org/tutorial-array-notation

Best Regards,

Ying

**Reply (Beginner):**

Thanks for the suggestions.

**Reply (Black Belt):**

I built a simple test code to perform this operation on array sizes of 512x512 (to keep the runtime short for the single-core case).

The Intel 16 compiler has no trouble vectorizing the code, and my manual operation count combined with the fixed-function "actual cycles not halted" counter shows that I am getting just under 12 FP operations per unhalted core cycle on a single KNL core. Since the code is dominated by multiplications, the peak performance should be slightly over 16 FP operations per cycle, so this is pretty good overall efficiency.

Adding an OpenMP parallelization pragma on the outermost loop does not hurt the single-thread performance.   The resulting code gets a speedup of more than 40x on 64 cores on a Xeon Phi 7210 (Knights Landing, 64 core, nominal 1.3 GHz).   The small problem size helps scaling by making the problem cache-contained -- the three arrays only use 6 MiB for my slightly smaller test.   The small problem size hurts scaling by giving each thread less work to do.
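The parallelized loop nest described above looks roughly like the following (my reconstruction, assuming flattened row-major `double` arrays; the common factor A[i][n]*A[j][n] is hoisted out of the inner m loop, which does not change the result but gives the compiler a clean reduction to vectorize):

```c
/* C[i][j] = sum_{n,m} A[i][n]*A[j][n]*B[n][m]*A[i][m]*A[j][m],
 * with A, B, C flattened row-major ndim x ndim arrays. */
void quad_loop(int ndim, const double *A, const double *B, double *C)
{
    #pragma omp parallel for            /* parallelize over rows of C */
    for (int i = 0; i < ndim; i++)
        for (int j = 0; j < ndim; j++) {
            double cij = 0.0;
            for (int n = 0; n < ndim; n++) {
                double sum = 0.0;
                for (int m = 0; m < ndim; m++)   /* innermost loop vectorizes */
                    sum += B[n*ndim + m] * A[i*ndim + m] * A[j*ndim + m];
                cij += A[i*ndim + n] * A[j*ndim + n] * sum;
            }
            C[i*ndim + j] = cij;
        }
}
```

Compile with `-qopenmp` (icc) or `-fopenmp` (gcc) to activate the pragma; without it the code still runs serially.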

My first thought was that this problem might need extensive cache blocking to run effectively, but this is not the case for cache-containable sizes. The original problem size of 1000x1000 should fit in L3 on any single 10-core or larger Xeon E5 processor, or in the aggregate L3 of any 2-socket Xeon E5 system.

**Reply (Beginner):**

Thanks John.

Actually, all of my arrays are of type complex&lt;double&gt;. That is probably why it is taking much more time. I have already implemented the OpenMP parallelization.

**Reply (Black Belt):**

Switching to double complex hurts the efficiency moderately in my tests.  Using icc 16.0.3 I get about 8 FLOPS per cycle, compared to about 12 FLOPS per cycle for the non-complex version.  (Both cases run using 64 cores on a KNL Xeon Phi 7210, a matrix size of N=512, compiled with "icc -qopenmp -std=c99 -O3 -xMIC-AVX512" and run with "OMP_NUM_THREADS=64; KMP_AFFINITY=verbose,scatter".)  The FLOP counts are manual in each case, assuming each complex multiplication generates 4 multiplies and 2 adds and each complex addition generates 2 adds.   Increasing the problem size to N=1000 does not have much impact on the effective FLOP rate -- the A and B arrays are still cacheable on KNL at this size.

Sometimes it helps a lot to switch from the default interleaved storage format of complex variables to separate arrays for the real and imaginary parts. The compiler has to do a lot less lane swizzling and can use full-width vector operations almost all the time. You have to implement the complex arithmetic manually, but for a simple loop like this that should not take too long...

**Reply (Beginner):**

Thanks John.

I had that thought once, but multiplying five complex numbers generates 32 terms (16 real and 16 imaginary). On top of that, my original problem contains many terms like this. Is there a better way to do this analytical multiplication?
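One way to avoid expanding the 32 terms by hand (my own sketch, not from this thread): keep the data in separate real/imaginary arrays, but fold in one factor at a time with a small complex-multiply helper, letting the compiler do the algebraic expansion:

```c
/* Multiply the running value (*pr, *pi) by (xr, xi) in place:
 * four real multiplies and two real adds, with no interleaved
 * storage and therefore no cross-lane shuffles when the inputs
 * come from separate real/imaginary arrays. */
static inline void cmul(double *pr, double *pi, double xr, double xi)
{
    double r = *pr * xr - *pi * xi;
    double i = *pr * xi + *pi * xr;
    *pr = r;
    *pi = i;
}
```

One term of the original sum then takes four `cmul` calls on a running (real, imaginary) pair: start from the split components of A[i][n] and multiply in A[j][n], B[n][m], A[i][m], and A[j][m] one after the other.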
**Reply (Black Belt):**

With separate (contiguous) arrays for the real and imaginary parts, the direct implementation of matrix multiplication and addition are typically vectorized very effectively.

With the interleaved format, the multiplication operation requires some ugly data motion to vectorize, so it is not surprising that separate contiguous arrays work better.

Complex addition does not require cross-lane data motion, but the 16.0.3 compiler includes some unnecessary re-loading of pointers in the inner loop that adds a bit of overhead.

My test code (below) was compiled with icc 16.0.3 and the compile line:

"icc -O3 -xMIC-AVX512 simple_complex_test.c tacc_timers.c -o simple_complex_test.AVX512"

Typical output shows the difference in timing and instruction counts. On a Xeon Phi 7210, "native" complex multiplication requires more than twice as many instructions per element as the manually implemented version (with separate real and imaginary arrays) and, for L2-contained data, takes about 1.7 times as long to execute. Addition is closer, with the "native" version requiring about 1.64 times as many instructions and taking (up to) 1.22 times as long to execute.

```
Native Complex multiplication requires 2.827000 cycles per element, 2.691500 instructions per element
Manual Complex multiplication requires 1.668750 cycles per element, 1.254750 instructions per element
Native Complex multiplication sample result 8000000.000000 4000000.000000
Manual Complex multiplication sample result 8000000.000000 4000000.000000
Native Complex addition requires 2.532500 cycles per element, 1.441500 instructions per element
Manual Complex addition requires 2.070000 cycles per element, 0.879750 instructions per element
Native Complex addition sample result 6000.000000 2000.000000
Manual Complex addition sample result 6000.000000 2000.000000
```

```c
#include <stdio.h>
#include <complex.h>
#include "tacc_timers.h"

#define N 4000

int main()
{
    int i;
    double complex a[N], b[N], c[N];
    double a_r[N], a_i[N], b_r[N], b_i[N], c_r[N], c_i[N];
    unsigned long instr0, instr1, instr2;
    unsigned long cycles0, cycles1, cycles2;
    double cycles_per_elem, instr_per_elem;

    for (i=0; i<N; i++) {
        a[i] = (double) (i+N/2) + ((double) (N-i))*_Complex_I;
        b[i] = (double) (N-i) + ((double) (i-N/2))*_Complex_I;
        c[i] = 0.0 + 0.0*_Complex_I;
        a_r[i] = (double) (i+N/2);
        a_i[i] = (double) (N-i);
        b_r[i] = (double) (N-i);
        b_i[i] = (double) (i-N/2);
        c_r[i] = 0.0;
        c_i[i] = 0.0;
    }

    cycles0 = rdpmc_actual_cycles();
    instr0 = rdpmc_instructions();
    // native complex multiplication
    for (i=0; i<N; i++) c[i] = a[i]*b[i];
    cycles1 = rdpmc_actual_cycles();
    instr1 = rdpmc_instructions();

    // equivalent on separate real and imaginary components
    for (i=0; i<N; i++) {
        c_r[i] = a_r[i]*b_r[i] - a_i[i]*b_i[i];
        c_i[i] = a_r[i]*b_i[i] + a_i[i]*b_r[i];
    }
    cycles2 = rdpmc_actual_cycles();
    instr2 = rdpmc_instructions();

    cycles_per_elem = (double)(cycles1-cycles0)/(double)N;
    instr_per_elem = (double)(instr1-instr0)/(double)N;
    printf("Native Complex multiplication requires %f cycles per element, %f instructions per element\n",cycles_per_elem,instr_per_elem);

    cycles_per_elem = (double)(cycles2-cycles1)/(double)N;
    instr_per_elem = (double)(instr2-instr1)/(double)N;
    printf("Manual Complex multiplication requires %f cycles per element, %f instructions per element\n",cycles_per_elem,instr_per_elem);

    printf("Native Complex multiplication sample result %f %f\n",creal(c[N/2]),cimag(c[N/2]));
    printf("Manual Complex multiplication sample result %f %f\n",c_r[N/2],c_i[N/2]);

    cycles0 = rdpmc_actual_cycles();
    instr0 = rdpmc_instructions();
    // native complex addition
    for (i=0; i<N; i++) c[i] = a[i]+b[i];
    cycles1 = rdpmc_actual_cycles();
    instr1 = rdpmc_instructions();

    // equivalent on separate real and imaginary components
    for (i=0; i<N; i++) {
        c_r[i] = a_r[i] + b_r[i];
        c_i[i] = a_i[i] + b_i[i];
    }
    cycles2 = rdpmc_actual_cycles();
    instr2 = rdpmc_instructions();

    cycles_per_elem = (double)(cycles1-cycles0)/(double)N;
    instr_per_elem = (double)(instr1-instr0)/(double)N;
    printf("Native Complex addition requires %f cycles per element, %f instructions per element\n",cycles_per_elem,instr_per_elem);

    cycles_per_elem = (double)(cycles2-cycles1)/(double)N;
    instr_per_elem = (double)(instr2-instr1)/(double)N;
    printf("Manual Complex addition requires %f cycles per element, %f instructions per element\n",cycles_per_elem,instr_per_elem);

    printf("Native Complex addition sample result %f %f\n",creal(c[N/2]),cimag(c[N/2]));
    printf("Manual Complex addition sample result %f %f\n",c_r[N/2],c_i[N/2]);
}
```


For completeness, the auxiliary timer routines use the "tacc_timers.h" header file:

```c
unsigned long rdpmc_instructions();
unsigned long rdpmc_actual_cycles();
```


and the "tacc_timers.c" file with the inline assembly routines:

```c
unsigned long rdpmc_instructions()
{
    unsigned a, d, c;

    c = (1<<30);        // fixed-function counter 0: instructions retired
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

unsigned long rdpmc_actual_cycles()
{
    unsigned a, d, c;

    c = (1<<30)+1;      // fixed-function counter 1: actual cycles not halted
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
```