Showing results for 
Search instead for 
Did you mean: 

BLAS performance vs naive c code


I am a new BLAS user, trying to improve c code for solving a time dependent 2D wave equation (PML absorbing boundaries) by replacing some of my loops with cBLAS functions. Just to get a feel, I started by concentrating on one code block within the program, the block for updating p below.

struct grid {
double dt;
int ny;
int nx;
and other grid info ....

struct pmlwaves{
double *p;
double *pdot;
otherpointers to double neededfor the PML absorbing boundary method...


struct pmlwaves w;
struct grid g;
int k;

different initializations....

for(tstep=0; tstep
different calculations to find pdot...

/*option 1:naive code */
for(k=0; k w.p =w.p + g.dt* w.pdot; /* advance solution one time step */

/* option 2: cBLAS */
cblas_daxpy(g.ny*g.nx, g.dt, w.pdot,1,w.p,1); /* advance solution one time step */

different cacluations arising from PML absorbing boundary method....

} /*end time stepping loop */

} /* end main() */

This was compiled with icc on a multi-core (Xeon 7550) machine, the exact compilation command was

icc myfile.c -O2 -mkl -openmp

The values used were g.nx = g.ny = 481, num_tsteps = 7500.

I used the openmp function omp_get_wtime() to measure the wall clock execution time of this code block and to accumulate these times throughout the time stepping loop.

The following (surprising?) results were obtained:
1) Option 1 accumulated time: around 3 sec.
2) Option 2 accumulated time: around 90 sec!!
3) When w.p and w.pdot were replaced with "regular" pointers (which are not fields of a structure) the time
of option 2 was around 3 sec (just like the naive loop in option 1).

My questions:
1) Why the incredibly long running time when the pointer arguments to cblas_daxpy were fields in a structure? They just pass an address to daxpy dont they? Why doesent cblas_daxpy regard w.p and w.pdot just as pointers to double (as they are)?

2) When comparing items 1 and 3 in the results, why doesent cblas_daxpy offer any advantage over the naive loop? Did I ommitt any flags/options to the compiler?

3) There seems to be a whole lot of things to know about proper running of BLAS/cBLAS, mainly about compiler options/flags and makefile issues and their compatibility to a specific machine. Where can all this be learned? Is there some good resource/website/book?

Thanks a lot


0 Kudos
2 Replies
Black Belt

The usual case for daxpy to give a performance advantage is for the reputed 90% of programmers using a compiler without auto-vectorization, or a case of failed vectorization (possibly due to a compiler being concerned about operand overlap). It's not possible for daxpy to out-perform correctly auto-vectorized code, at least not when there is no opportunity to gain by vector parallel execution. Even on the Xeon 7550, to see a significant gain from threading, the structs would have to be much larger and accessed consistently (from first touch) by threads affinitized to the 2 sockets.
I suppose you'd have to analyze the actual example to try to find out what optimizations are missed with the struct of pointers. As far as I know, it's not a common programming model which would appear in a corpus of code for which the compiler is performance tested. You could experiment by making an explicit local plain pointer copy just ahead of the loop.

Hi Jake,

As you mentioned, one would expect that using pointers or pointers in structs would not make a dramatic performance difference. I verified this using your code. For me, there was no performance difference between pointers and pointers in structs. One thing to pay attantion here is how the pointers are allocated/initialized. Proper allignment of the pointer addresses usually improves the performance. You could try using mkl_malloc for allocating alligned memory.

MKL daxpy binaries are compiled with the optimal compiler flags. Therefore, the compiler flags you use in your program should not make a dramatic difference on the MKL daxpy performance.

Intel Math Kernel Library for Linux* OS Users Guide provides guidelines for using MKL and summarizes factors that affect performance.