MKL is slow comparing with own functions. Seems it is not vectorized or parallelized

Ilia_S_1 · ‎09-16-2015

Hello, everybody.

I decided to compare MKL functions with my own implementations and I was wondered, when I found, that MKL is much slower. I will post my code below and may be some of you could help me to find, what I did wrong. I'm searching already for 3 days but the answers what I found just helped partially. Several questions are remaining.

-------------------------- Problem N 1 v3=a*v1+v2 -------------------------------------

First of all, I found, that cblas_zaxpy is several times slower:
The code is

MKL part:

	int n=1000000;
	vector<MKL_Complex16> v1(n,1), v2(n,2), v3(n,0);

	MKL_Complex16 val=3.;

memcpy(v3.data(),v2.data(),sizeof(MKL_Complex16)*n);
		cblas_zaxpy (n, &val, v1.data(), 1, v3.data(), 1);

Own part:

#pragma omp parallel for simd
		for(int j=0;j<n;j++)
		{
			v3=val*v1+v2;
		}

Test performance times with 1 omp thread (in ms): OMP_NUM_THREADS 1; MKL_NUM_THREADS 1;

MKL time: 22299;
OWN time: 2548;
MKL time: 9224;
OWN time: 2243;

Tests with 2 omp threads: OMP_NUM_THREADS 2; MKL_NUM_THREADS 2;

MKL time: 25855;
OWN time: 1370;
MKL time: 8058;
OWN time: 986;

It seems, that MKL doesn't use vectorization and parallelization too. What could be the reason? Another question, why first time values in test iterations are so big?

-------------------------- Problem N 2 v2=a*H*v1+b*v1 ------------------------------------

Here I consider nultiplication of matrix and vector and a sum with another vector.
I implemented own procedure in two ways. The code

MKL part:

	int n=2000;
	vector<MKL_Complex16> v1(n,1.0), v2(n,0.0), H(n*n,1.0), tmp(n,0.), v3(n,1.0);

	MKL_Complex16 a(1,2), b(2,1);

	MKL_Complex16* v1data= v1.data();
	MKL_Complex16* mdat = H.data();
	MKL_Complex16* tmpdat = tmp.data();

	memcpy(v2.data(),v1.data(),sizeof(MKL_Complex16)*n);
		cblas_zhemv(CBLAS_ORDER::CblasRowMajor, CBLAS_UPLO::CblasLower, n, &a, H.data(), n, v1.data(), 1, &b, v2.data(), 1);

Own part 1:

#pragma omp parallel for
		for(int i=0;i<n;i++)
		{
			MKL_Complex16 *subv = &(mdat[i*n]);

// it was not vectorized by the vec-report (Why?)
			MKL_Complex16 val= __sec_reduce(complex<double>(0,0), subv[0:n] * v1data[0:n], operator+ );

			v2= a*val+b*v1;
		}

Own part 2:

#pragma omp parallel for
		for(int i=0;i<n;i++)
		{
			MKL_Complex16 *subv = &(mdat[i*n]);

			double valr = 0.0, vali=0.0;

#pragma simd  reduction(+:valr,vali) // the problem with passing complex variable in reduction
			for(int j=0;j<n;j++)
			{
				MKL_Complex16 val = subv * v1data;
				valr +=  val.real();
				vali +=  val.imag();
			}

			v2= a*MKL_Complex16(valr,vali)+b*v1;

		}

Test performance times with 1 omp thread:

MKL time: 5328
OWN time 1st variant: 11588
OWN time 2nd variant: 15835

Tests with 6 threads (peak performance):

MKL time: 5663
OWN time 1st variant: 4207; testval (2002,4001)
OWN time 1st variant: 4274; testval (2002,4001)

After I changed external for to cilk_for and removed openMP, I got for 6 threads more optimization

MKL time: 5574
OWN time 1st variant: 2858; testval (2002,4001)
OWN time 1st variant: 2657; testval (2002,4001)

Here we see, that for 1 thread MKL is much faster, but for 6 threads it seems to be non-parallelized, or it has even overhead from parallelization. On the other hand my own implementations got optimized.

There are also another questions about vectorization and complex numbers. I could not pass complex variable in reduction. And
__sec_reduce(), which is taken from the manual, is not vectorized because "unsupported data type".

Thank you in advance for the reply.

Ilia Sivkov.

TimP · ‎09-16-2015

Much of your question is answered explicitly or implicitly in the documentation.

Level 1 BLAS functions are better implemented in line, as you have done, at least for moderate vector sizes. It will require a vector length of 1000 or more for the MKL call to be competitive with single thread vectorization, probably 10000 for a combination of vectorization and threading. If the MKL does implement threading, it will not kick in for smaller vectors, and also will require setting OMP_NESTED and stuff like OMP_PLACES=cores, OMP_PROC_BIND=spread,close if you are calling from an OpenMP parallel region, as in effect it's OpenMP inside MKL.

If you are calling MKL from a non-OpenMP threaded region, you have additional considerations to deal with.

Cilk(tm) Plus documentation may not be clear, but the present versions don't invoke threading for CEAN or reducer syntax, so those can be combined with OpenMP without problem (other than potential future incompatibility). I haven't seen expert discussion of it, but I think C99 vector types would be the most likely to work with omp simd reduction or Cilk reducer.

I believe the original reason for providing Level 1 BLAS in MKL was for use with non-vectorizing compilers (definitely prior to the concept of omp simd). With current compilers, you would not even need the omp simd for auto-vectorization in your example if you were willing to use C99 restrict or the __restrict extension.

Ilia_S_1 · ‎09-17-2015

Thank you for the reply.

Tim Prince wrote:
.

If you are calling MKL from a non-OpenMP threaded region, you have additional considerations to deal with.

Yes, In my example I call MKL from non-OpenMP region. So, it is unclear, why I don't have parallel in MKL. Sizes of vector in first case is 1000000 and dimension of matrix and vectors in the second case is 2000.

You also said to use C99, but the problem is that I use c++11. When I tried to use complex numbers from <complex.h>, then I found, that arithmetic of complex numbers doesn't work correctly. May be you know how to deal with complex numbers, vectorization in c++?

Thanks,

Ilia Sivkov

TimP · ‎09-17-2015

Optimization of complex arithmetic is a "complex" subject, and I'm not aware of a thorough reference. It's not helped by the divergence between C and C++. Intel used to recommend setting -std=c99 in order to use C99 complex, even inside C++. This seems difficult to reconcile now that -std=c++11 and the like are so often needed.

Reading documentation about complex in Intel compilers is difficult. If you try the search window in the documentation installed with the compiler and MKL, you will get a mixture of references to useful fragments and stuff which is unrelated.

If you allow your compilation to default to SSE2, it goes almost without saying, there is no simd optimization of complex arithmetic. SSE3 is the only ISA option supporting complex arithmetic optimization which is valid for all CPUs sold in the last decade, but it's not a default. SSE3 simd optimization of complex double is not reported as vectorization (because it is still basically serial) even though it should double the performance of basic arithmetic. AVX simd optimization would be reported as vectorization unless it happened to drop back to AVX128.

If you turn on your search engine, you will find admissions of cases where past Intel compilers required pragmas such as simd vectorlength(2) to apply simd to std::complex, but no indication whether that problem was fixed later. It would seem that the adoption of std::complex as MKL data types would endorse the usage.

In order to use complex correctly (depending in part on what you mean by correct), you may need to be aware of the limited-range options such as gcc -fcx-limited-range (included in -ffast-math) and icc -complex-limited-range (included in -fast and -fp-model fast=2). Without those options, complex double abs(), sqrt(), and division aren't vectorized, even though reported as vectorized. complex float is implemented by promoting those intrinsics to double, so it is possible to have simd optimization with reduced performance. With those options, there is no protection against over- and under-flow resulting from squaring the real and imaginary parts, so the useful exponent range is cut roughly in half, and your application is likely to break if you exceed the limits.

I've asked whether the reporting of complex simd optimization could be improved, and the answer was no.

Ilia_S_1 · ‎09-17-2015

But as I know, Is Fortran complex type is vectorized?

Another problem, which is still unclear, why is MKL not parallelized? I set both MKL_NUM_THREADS and OMP_NUM_THREADS. But it has no affect.

Regards,

Ilia Sivkov

TimP · ‎09-17-2015

The limited range complex options are the same for the companion Fortran, C, and C++ compilers. In Fortran, you don't have to deal with __restrict qualifiers or the C vs. C++ complex data type variations.

If you mean that __sec_reduce doesn't optimize your case, you might ask on the Cilk(tm) Plus forum. Priority of Cilk(tm) Plus support seemed to be cut back when OpenMP 4 neared release. Intel 16.0 compilers did more to implement the "vector always" aspect of Cilk array notation; earlier compilers might choose not to vectorize when that was expected to reduce performance. Note that the more typical complex dot product (as in Fortran) is sum(conjg(x)*y).

MKL looks at the size of the problem, as well as the OMP_NESTED and HyperThreads situation, before choosing a threaded code path, even if you have linked with MKL:parallel.

Ilia_S_1 · ‎09-20-2015

If MKL looks the size of the problem, there should be a way to disable this "smart" feature and push MKL to do a parallel calculation. Because with parallelization my code runs faster even without vectorization

jimdempseyatthecove · ‎09-20-2015

Seeing that you have observed that your code using OpenMP, in this case, exceeds the default behavior of MKL (in this case), it is not unreasonable to think your application itself, may include OpenMP parallel regions that are not serviced by MKL available functions. This being the case, you should be aware that when you use MKL from an application that is using a threading paradigm (OpenMP, TBB, CilkPlus, pthreads, boost threads, etc...), that the generally correct way to use MKL is to link in the serial (sequential but thread safe) version of MKL (not the parallel version of MKL). The reason for this is to not have competing thread pools that cause oversubscription of threads.

Does your application, exclusive of MKL, use parallelization?

Jim Dempsey

Ilia_S_1 · ‎09-20-2015

In this region, where I call MKL functions - it doesn't. I call these MKL functions from sequential region.

jimdempseyatthecove · ‎09-21-2015

It does not matter (generally) if within a parallized application if MKL is called from within or outside a parallel region, you should (generally) link in the serial version of MKL with your parallel application. The reason for this is (generally) OpenMP applications (generally) run best with the KMP_BLOCKTIME (or OMP_BLOCKTIME) set at a few 100ms. Under this circumstance (generally) your program will exit some parallel region and run up to the MKL call while the other threads of the prior parallel region are in a compute intensive SpinWait (waiting for entry to next parallel region). When this happens, and you enter MKL (parallel version), upon decision to go parallel, MKL starts its thread pool (assuming the other threads of its thread pool were not also SpinWaiting), and now you have an oversubscription of threads with one pool competing for CPU time with the other pool competing for CPU time. This leads to at best, half performance, but in practice (generally) much worse.

The reason behind all the (generally)'s is, that with careful control of calling kmp_set_blocktime you can cause the prior parallel region to not enter SpinWait after exit of the prior parallel region. But this requires you to make this setting prior to entry to the parallel region that precedes the MKL call, then reset it after the MKL call, unless you have a subsequent MKL call after the parallel region, and in that case you would not set the threads to SpinWait. While you could introduce a "null" parallel region in front of the MKL call (preceded by setting the SpinWait time to 0), this encounters the overhead of a thread barrier. You might be able to make use of the nowait clause, however you may have some unresolved places in your code where you did not get your no-SpinWait/SpinWait setup correctly.

Jim Dempsey

Ilia_S_1 · ‎09-21-2015

Probably, I explained not very well.
So, my code is:

int void main()
{                
// array initializations...

      long int t = getTick();


		memcpy(v2.data(),v1.data(),sizeof(MKL_Complex16)*n);
		cblas_zhemv(CBLAS_ORDER::CblasRowMajor, CBLAS_UPLO::CblasLower, n, &a, H.data(), n, v1.data(), 1, &b, v2.data(), 1);


		cout<<"MKL time: "<< getTick()-t<<endl;
}

That's it. I have no prior "#pragma omp parallel region" a all. I just call cblas_zhemv

TimP · ‎09-21-2015

Now you've changed the subject, as far as which MKL facilities you are interested in. What is your comparison basis to say that zhemv is slow?

If zhemv doesn't have a threaded implementation in MKL, you might gain something by building the public source with parallelization.

I note that current ifort assumes the problem size is too small (40x40?) to be worth parallelizing. You could over-ride that decision if your problem size is big enough. You could also build explicitly for SSE3 or AVX2 and test which is better for you, and you could try gfortran, where the AVX2 instruction choice may be significantly different. Curiously, ifort says SSE3 may be marginally faster than SSE4.2.

We once had a preliminary discussion about whether zhemv might be worth running on MIC. I didn't hear of it going far enough to draw any conclusions.

Ilia_S_1 · ‎09-21-2015

I compared zhemv with my threaded implementation. Matrix is 2000x2000 in this example.
I assumed, that zhemv has threaded implementation. I didn't even think, that it could be built without threading. May be this is the reason, why I don't see the increasing of performance. I will ask administrators.

jimdempseyatthecove · ‎09-21-2015

MKL has multiple libraries of which there are two major design differences (threaded and non-threaded). The choice of which is a Link time option. As stressed earlier, unless you know what you are doing,

For a parallel application, tile your arrays in a parallel region, then have each thread in the parallel region call the sequential version of MKL (which is Linked with your program).

For a sequential application, have the only (visible) thread of the application call the parallel version of MKL (which is Linked with your program).

This eliminates trying to coordinate the two thread pools. Using both application parallel together with parallel version of MKL is difficult to do.

There are other threads on this forum where we went trough the exercise of parallelizing the application across NUMA nodes, then within each NUMA node calling the parallel version of MKL. You may find these threads of interest if you wish to go this route.

Jim Dempsey

TimP · ‎09-21-2015

Your admins don't have a way of knowing what isn't in the installed mkl documentation which you can read. Where MKL functions haven't been threaded, those would appear identically in the threaded and sequential library versions.

In principle, it should be easy to control nested parallelism, using OpenMP 4 support in the library. For example, if you want 4 threads each calling MKL using 4 threads, you ought to be able to get there by setting OMP_NUM_THREADS=16, OMP_NESTED, OMP_PROC_BIND=spread,close I gather that's not what you want any longer; your problem seems big enough to use 16 threads or more in a single instance of zhemv.

The quoted simd speedups when using ifort to compile zhemv aren't large; in fact many of the minor loops should drop vectorization due to the lack of suitable support in AVX2 (tell that to the marketers!), which could account for some advantage of SSE3 over AVX. You also have the semantic question of whether to call simd optimization vectorization in SSE3 which sets double complex vector length to 1.