VTune results for mkl dot product call

dehvidc1 · ‎01-31-2011

The VTune hotspots analysis shows that the majority of my application time is being spent in the call to mkl_blas_mc3_ddot. The VTune General Exploration analysis shows that, for this call, the CPI is 1.124, Retire stalls is 0.718, LLC miss is 1.335 and Exec stalls is 0.327. These event results seem very high.

The application is running single-threaded on a node reserved for timing runs.

Any suggestions on what might be causing this performance?

David

Konstantin_A_Intel · ‎01-31-2011

Hi David,

Thank you for using Intel tools for the analysis of your application's performance.

Let's take a look at your example. The ddot operation is memory limited: the number of LOAD/STORE instructions is about the same as the number of FP instructions. So, if your program performs ddot on large arrays of data, the results look reasonable, because the data always has to be pulled from memory into the caches. In that case it is not possible to achieve significantly better CPI numbers or to reduce LLC misses. The only way to improve performance is to organize the computation so that ddot is performed on data that is already cached, but that depends on the details of the algorithm and is not always possible.

Regards,

Konstantin

Gennady_F_Intel · ‎01-31-2011

David, it would be helpful if you compared the same event results with those for dgemm.

dehvidc1 · ‎02-01-2011

Thanks for the reply, Konstantin. A couple of things:

a/ Without stepping into the Intel MKL assembler I'm guessing that the Intel code implementing the MKL dot product call is optimised to use as much prefetching as possible? The dataset I'm working with for timing is only 100MB. Production datasets are up to tens of GB and likely to exceed that fairly soon. Prefetching would seem to be a potentially useful strategy.

b/ The data arrays are using Blitz maths library definitions. I haven't investigated if Blitz is allocating the data buffers within its arrays on 16byte boundaries. Is there a quick way to determine this?

David

dehvidc1 · ‎02-01-2011

Hi Gennady,

Apologies if I'm missing something but how would I use gemm if the calculation is a vector x vector dot product giving the result as a scalar? Isn't gemm for matrix-matrix calculations giving a matrix as the result?

Thanks

David

Konstantin_A_Intel · ‎02-01-2011

Hi David,

Thanks for your questions. I'll try to answer:

a) Prefetching was most likely being done, but it does not help much in memory-limited operations like ddot, because prefetching cannot increase memory bandwidth.

b) About the Blitz library: is it from here? http://www.oonumerics.org/blitz/manual/blitz.html

I've just looked through sources and found this definition in tuning.h header:

#undef BZ_ALIGN_BLOCKS_ON_CACHELINE_BOUNDARY

So, if you want Blitz++ to allocate arrays aligned to a cache line, define it:

#define BZ_ALIGN_BLOCKS_ON_CACHELINE_BOUNDARY 1

and rebuild the library.

BTW, which MKL version and which processor do you use?

Regards,

Konstantin

Konstantin_A_Intel · ‎02-01-2011

About Gennady's recommendation - I believe he was just pointing you to a routine that is presumably much more efficient theoretically, because ddot has ~2N memory reads and N FP ops, whereas dgemm has ~2N^2 memory reads and N^3 FP operations. It's just a comparison of the theoretical peak of the two functions.

Of course, you should not use dgemm instead of ddot in your calculations.

Regards,

Konstantin
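Editorial note: the comparison in the post above can be made concrete with a rough flops-per-byte estimate. The sketch below is only an illustration under stated assumptions: the names OpCount, ddot_count, dgemm_count and flops_per_byte are hypothetical, the counts are textbook estimates rather than measurements of MKL, and a multiply and an add are counted as two operations (counting a multiply-add as one gives the N and N^3 figures quoted above).

```cpp
#include <cassert>
#include <cstddef>

// Rough operation counts for double-precision data.
// Textbook estimates only; nothing here is measured from MKL.
struct OpCount {
    double flops;  // floating-point operations (multiply and add counted separately)
    double bytes;  // bytes read from memory, assuming each input element is read once
};

// ddot on two vectors of length n: n multiplies + n adds, 2n doubles read.
OpCount ddot_count(std::size_t n) {
    const double d = static_cast<double>(n);
    return {2.0 * d, 2.0 * d * sizeof(double)};
}

// dgemm on two n-by-n matrices: 2n^3 flops, 2n^2 doubles read for A and B
// (traffic for the result matrix C is ignored in this estimate).
OpCount dgemm_count(std::size_t n) {
    const double d = static_cast<double>(n);
    return {2.0 * d * d * d, 2.0 * d * d * sizeof(double)};
}

// Arithmetic intensity: how much computation is done per byte fetched.
double flops_per_byte(const OpCount& c) {
    assert(c.bytes > 0.0);
    return c.flops / c.bytes;
}
```

Under these assumptions ddot stays at a fixed 0.125 flops per byte whatever the vector length, while the dgemm figure grows in proportion to n. That is the sense in which ddot is bound by memory bandwidth and dgemm is not.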

dehvidc1 · ‎02-01-2011

Thanks, Konstantin.

a/ But efficiently used, prefetching should mean that bandwidth between main memory and the cache hierarchy was the gating constraint? I think you're saying that the VTune results I gave indicate that this is the case?

b/ But how can I tell that Blitz is doing this correctly on 16byte boundaries?

From the MKL version query example:

Intel Math Kernel Library Version 10.3.0 Product Build 20100927 for Intel 64 architecture applications

Major version: 10
Minor version: 3
Update version: 0
Product status: Product
Build: 20100927
Processor optimization: Intel Core i7 Processor

The node I profiled on is:

cpu family : 6
model : 44
model name : Intel Xeon CPU E5620 @ 2.40GHz
stepping : 2
cpu MHz : 2394.063
cache size : 12288 KB

and the nodes on the production cluster are 5670's @ 2.93GHz

David

Konstantin_A_Intel · ‎02-03-2011

Hi David,

Thanks for the information you provided. I will try to make some experiments with MKL ddot just to be sure.

a) You're right, memory-to-cache bandwidth should be the limiting factor here. But what looks strange to me in your numbers is the LLC miss rate of ~1.0; I would look into it a bit more closely.

b) From the Blitz code it seems that they align to a 128-byte boundary:

const int cacheBlockSize = 128; // Will work for 32, 16 also
dataBlockAddress_ = reinterpret_cast<T_type*>(new char[numBytes + cacheBlockSize - 1]);
// Shift to the next cache line boundary
ptrdiff_t offset = ptrdiff_t(dataBlockAddress_) % cacheBlockSize;
ptrdiff_t shift = (offset == 0) ? 0 : (cacheBlockSize - offset);

I will update you if I find anything valuable during my experiments.

Regards,

Konstantin
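Editorial note: the offset/shift arithmetic in the Blitz snippet quoted in the post above can be tried in isolation. This is a minimal sketch, not Blitz code: the function name shift_to_boundary is hypothetical, and it works on a plain integer address so that no allocation is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes to skip from `address` to reach the next multiple of
// `alignment` (0 if the address is already aligned). This follows the same
// offset/shift steps as the quoted snippet.
std::size_t shift_to_boundary(std::uintptr_t address, std::size_t alignment) {
    assert(alignment > 0);
    const std::size_t offset = address % alignment;
    return offset == 0 ? 0 : alignment - offset;
}
```

For example, shift_to_boundary(257, 128) gives 127, so the data would start at address 384. Because the over-allocation is cacheBlockSize - 1 bytes, the shifted block always fits. An address that is a multiple of 128 is also a multiple of 16, which is what the "will work for 32, 16 also" comment refers to.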

dehvidc1 · ‎02-03-2011

Thanks, Konstantin.

a/ What would you recommend as a strategy to look into the possible anomaly? The code is the Intel MKL dot product implementation so I can't do much about that :) The recommendations in the Intel presentation at software.intel.com/file/15529 suggest using:

i/ pre-fetching (as we've been discussing),
ii/ data blocking. I've tried using O3 but the application overall runs slower. The data array sizes are not known at compile time which would interfere with the ability of the compiler to block/tile. I could hand block/tile but this means hacking the code around quite a bit and as the datasets change in length I would have to put effort into making the loop prologue and epilogue flexible.
iii/ local variables for threads. Single threaded.
iv/ padding structures to cacheline boundaries. Why I'm keen to check if the alignment is working correctly.
v/ Changing your algorithm to reduce data storage. Don't think that's possible in this situation.

b/ I've been through this code in Blitz as well. But what I'm after is a diagnostic I can use to quickly analyse if the data portion of the arrays is being correctly allocated on an optimal boundary. I guess one way is to step through the assembler and eyeball the address of the data item. Is that the only way? Can I infer whether or not the alignment is correct from some other tool?

Regards

David
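Editorial note: one quick runtime diagnostic for question b/ in the post above is to test the address of the array's data pointer directly, with no need to step through assembler. The helper below is a minimal sketch; the name is_aligned is hypothetical and not part of Blitz or MKL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when pointer `p` sits on a multiple of `alignment` bytes.
// The pointer is only inspected, never dereferenced.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Assuming the Blitz arrays expose their buffer through data(), as in the calls posted later in this thread, a check such as is_aligned(g.data(), 16) placed just before the MKL call would report whether that buffer starts on a 16-byte boundary. Passing 64 or 128 instead would test cache-line alignment.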

dehvidc1 · ‎02-03-2011

I'm trying to call gemm as per:

cblas_dgemv(CblasRowMajor, CblasNoTrans, x.rows(), x.columns(), 1, &x(0,0), x.rows(), g.data(), 1, 0, dotProdxg.data(), 1);

but I'm getting a runtime error message:

MKL ERROR: Parameter 7 was incorrect on entry to cblas_dgemv

Assuming that 7 is the correct number and that this is numbering from 1 this would indicate that something is wrong with the number returned by x.rows(). Which would be a bit odd as this is also being used for parameter 3 with no complaints.

Regards

David
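Editorial note: a possible explanation for the error in the post above, not confirmed anywhere in this thread, is that parameters 3 and 7 are checked against different rules. Parameter 3 is M, which only has to be non-negative. Parameter 7 is lda, which in the CBLAS interface must be at least max(1, N) for CblasRowMajor and at least max(1, M) for CblasColMajor. With x.rows() passed as lda for a row-major matrix, the call would be rejected whenever x has fewer rows than columns. The sketch below only mirrors that rule. The names Layout, min_lda and lda_is_valid are hypothetical stand-ins, not MKL identifiers.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for CBLAS_ORDER so the sketch needs no MKL headers.
enum class Layout { RowMajor, ColMajor };

// Smallest leading dimension accepted for an m-by-n matrix A in gemv:
// row-major storage strides over columns, column-major over rows.
int min_lda(Layout layout, int m, int n) {
    assert(m >= 0 && n >= 0);
    return std::max(1, layout == Layout::RowMajor ? n : m);
}

// True when `lda` would pass the leading-dimension check.
bool lda_is_valid(Layout layout, int m, int n, int lda) {
    return lda >= min_lda(layout, m, n);
}
```

If this is the cause, and assuming x is stored contiguously in row-major order, passing x.columns() as the seventh argument would be the thing to try. Reporting the actual values of x.rows() and x.columns() would confirm or rule it out.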

timintel · ‎02-04-2011

You can't call dgemm via dgemv. Your call doesn't match the prototype for dgemv. Don't you get a compile error (assuming you have the header in scope)?
I don't understand why you have (partly) switched the topic to dgemv.

dehvidc1 · ‎02-06-2011

Typo in the previous post where I should have written dgemv rather than dgemm.

I'm calling gemv:

cblas_dgemv(CblasRowMajor, CblasNoTrans, x.rows(), x.columns(), 1, &x(0,0), x.rows(), g.data(), 1, 0, dotProdxg.data(), 1);

as per:

void cblas_dgemv(const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const double alpha, const double *A, const int lda, const double *X, const int incX, const double beta, double *Y, const int incY);


I tried this (dgemv) as one of the other Intel engineers suggested trying dgemm. Presumably dgemm exercises some different code on the Intel side to the vec-vec dot product call for which I reported the original VTune event results.
I don't have any mat x mat product calls in the code but I do have a mat-vec call. Presumably the dgemv call also exercises different code to the original vec x vec call.

regards

David

timintel · ‎02-06-2011

dgemv ought in principle to be appropriate for a matrix vector product, but you could use the header as well as the documentation in an effort to get your syntax straight and let the compiler begin to catch your errors.

dehvidc1 · ‎02-06-2011

???? What? I don't understand your comment, Tim. Of course I'm including the mkl_cblas.h header file.

Just to recap. I'm getting a runtime error that says one of the parameters to the MKL call is wrong. I'm not seeing any compile time errors. The parameter reported in the MKL error is used as a parameter twice in the call. The prototype type for both these parameters is MKL_INT. The MKL runtime error reporting is saying the parameter is wrong once.

The include file has the prototype as:

void cblas_dgemv(const CBLAS_ORDER order,
const CBLAS_TRANSPOSE TransA, const MKL_INT M, const MKL_INT N,
const double alpha, const double *A, const MKL_INT lda,
const double *X, const MKL_INT incX, const double beta,
double *Y, const MKL_INT incY);

I guess I could explicitly cast all the parameters I'm using but given the compiler isn't complaining I haven't.

Last week I wondered if it might be an 32/64bit int size issue as per (from the mkl types header):

#ifdef MKL_ILP64
#define MKL_INT MKL_INT64
#define MKL_LONG MKL_INT64
#else
#define MKL_INT int
#define MKL_LONG long int
#endif

but I don't have MKL_ILP64 defined. And I tried linking with lp64 and ilp64, with no difference.

David

Konstantin_A_Intel · ‎02-11-2011

Hi David,

Could you please report the values of passed parameters (which were reported as incorrect by LAPACK)?

And please do not try to use ilp64 without clear need of it :)

Regards,

Konstantin