BLAS performance vs. naive C code

jakerr · 04-13-2011

Hi

I am a new BLAS user, trying to improve C code for solving a time-dependent 2D wave equation (with PML absorbing boundaries) by replacing some of my loops with cBLAS functions. To get a feel for it, I started with a single code block within the program: the block that updates p, shown below.

struct grid {
    double dt;
    int ny;
    int nx;
    /* ... and other grid info ... */
};

struct pmlwaves {
    double *p;
    double *pdot;
    /* ... other pointers to double needed for the PML absorbing boundary method ... */
};

int main(void) {

    struct pmlwaves w;
    struct grid g;
    int k, tstep;

    /* ... different initializations ... */

    for (tstep = 0; tstep < num_tsteps; tstep++) {
        /* ... different calculations to find pdot ... */

        /* option 1: naive code */
        for (k = 0; k < g.ny*g.nx; k++)
            w.p[k] = w.p[k] + g.dt * w.pdot[k]; /* advance solution one time step */

        /* option 2: cBLAS */
        cblas_daxpy(g.ny*g.nx, g.dt, w.pdot, 1, w.p, 1); /* advance solution one time step */

        /* ... different calculations arising from the PML absorbing boundary method ... */

    } /* end time stepping loop */

} /* end main() */

This was compiled with icc on a multi-core (Xeon 7550) machine. The exact compilation command was

icc myfile.c -O2 -mkl -openmp

The values used were g.nx = g.ny = 481, num_tsteps = 7500.

I used the OpenMP function omp_get_wtime() to measure the wall-clock execution time of this code block and accumulated these times over the whole time stepping loop.

The following (surprising?) results were obtained:
1) Option 1 accumulated time: around 3 sec.
2) Option 2 accumulated time: around 90 sec!!
3) When w.p and w.pdot were replaced with "regular" pointers (which are not fields of a structure), the time of option 2 was around 3 sec (just like the naive loop in option 1).

My questions:
1) Why the incredibly long running time when the pointer arguments to cblas_daxpy are fields in a structure? They just pass an address to daxpy, don't they? Why doesn't cblas_daxpy regard w.p and w.pdot simply as pointers to double (which they are)?

2) Comparing items 1 and 3 in the results, why doesn't cblas_daxpy offer any advantage over the naive loop? Did I omit any compiler flags/options?

3) There seems to be a whole lot to know about running BLAS/cBLAS properly, mainly about compiler options/flags, makefile issues, and their compatibility with a specific machine. Where can all this be learned? Is there a good resource/website/book?

Thanks a lot

Jake

TimP · 04-14-2011

daxpy typically gives a performance advantage in two cases: for the reputed 90% of programmers using a compiler without auto-vectorization, or when vectorization fails (possibly because the compiler is concerned about operand overlap). It's not possible for daxpy to outperform correctly auto-vectorized code, at least not when there is no opportunity to gain by vector parallel execution. Even on the Xeon 7550, to see a significant gain from threading, the structs would have to be much larger and accessed consistently (from first touch) by threads affinitized to the 2 sockets.
I suppose you'd have to analyze the actual example to try to find out what optimizations are missed with the struct of pointers. As far as I know, it's not a common programming model which would appear in a corpus of code for which the compiler is performance tested. You could experiment by making an explicit local plain pointer copy just ahead of the loop.
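The local-pointer experiment could look roughly like the sketch below. The helper name advance_p is hypothetical, the struct is trimmed to the two fields from the original post, and the restrict qualifiers are an added assumption that p and pdot never overlap (which is what lets a compiler drop its overlap concern).

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed copy of the struct from the question (only the two fields used here). */
struct pmlwaves {
    double *p;
    double *pdot;
};

/* Hypothetical helper: computes p[k] += dt * pdot[k] for k in [0, n).
   The struct fields are copied into plain local pointers once, ahead of
   the loop. restrict is an assumption: it promises that p and pdot do
   not alias each other. */
void advance_p(struct pmlwaves *w, double dt, size_t n)
{
    assert(w != NULL);

    double *restrict p = w->p;
    const double *restrict pdot = w->pdot;

    for (size_t k = 0; k < n; ++k)
        p[k] += dt * pdot[k];
}
```

The same local copies can be handed to cblas_daxpy in place of w.pdot and w.p, which makes it easy to time both variants against the struct-field version.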

Murat_G_Intel · 04-14-2011

Hi Jake,

As you mentioned, one would expect that using pointers or pointers in structs would not make a dramatic performance difference. I verified this using your code. For me, there was no performance difference between pointers and pointers in structs. One thing to pay attention to here is how the pointers are allocated/initialized. Proper alignment of the pointer addresses usually improves performance. You could try using mkl_malloc to allocate aligned memory.
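A minimal sketch of aligned allocation follows. So that it builds without MKL it uses C11 aligned_alloc as a stand-in; with MKL linked, mkl_malloc(bytes, 64) paired with mkl_free(ptr) plays the same role. The helper names and the 64-byte boundary are assumptions for illustration, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns nonzero if ptr lies on an `alignment`-byte boundary. */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Hypothetical helper: allocates room for n doubles on a 64-byte boundary
   (assumed alignment). Uses C11 aligned_alloc as a stand-in for mkl_malloc.
   Release the result with free(); memory from mkl_malloc needs mkl_free(). */
double *alloc_aligned_doubles(size_t n)
{
    const size_t alignment = 64;
    size_t bytes = n * sizeof(double);

    /* Round the size up to a multiple of the alignment, as C11 expects. */
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0)
        bytes = alignment;

    return aligned_alloc(alignment, bytes);
}
```

For the grid in the question, w.p and w.pdot would each be allocated with n = g.ny * g.nx.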

MKL daxpy binaries are compiled with the optimal compiler flags. Therefore, the compiler flags you use in your program should not make a dramatic difference on the MKL daxpy performance.

The Intel Math Kernel Library for Linux* OS User's Guide provides guidelines for using MKL and summarizes the factors that affect performance.

Thanks,

Efe