Re: Unexplained result with mkl_cspblas_dcsrsymv

jay_oswald · ‎04-10-2009

I think that this is a linking error, maybe someone can help.

I'm using VS2008 to compile a 64-bit app and calling mkl_cspblas_dcsrsymv in a C++ routine, and I'm getting incorrect results. The relevant libraries I link are mkl_em64t.lib, mkl_intel_lp64.lib, mkl_intel_thread.lib, mkl_solver_lp64.lib, and libiomp5md.lib. (I use the PARDISO solver elsewhere and it works perfectly.)

If I replace mkl_intel_thread.lib with mkl_sequential.lib, everything works fine, except of course I don't get any speedup. Whether or not the compiler option for OpenMP support is enabled doesn't seem to matter. I'm compiling with the /MD (multi-threaded DLL runtime library) option set.
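One way to pin down whether the threaded build really returns wrong values is to compare its output against a plain serial reference. Below is a minimal sketch of such a reference, assuming the matrix is stored as its upper triangle in zero-based 3-array CSR (values, row pointers, column indices), which is the layout I understand mkl_cspblas_dcsrsymv to expect when called with uplo = 'U'. The names sym_csr_matvec_upper and max_abs_diff are made up for illustration and are not part of MKL.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference for y = A*x, where A is symmetric and only its upper
// triangle is stored in zero-based CSR: a = values, ia = row pointers
// (size n+1), ja = column indices. Hypothetical helper, not an MKL call.
std::vector<double> sym_csr_matvec_upper(const std::vector<double>& a,
                                         const std::vector<int>& ia,
                                         const std::vector<int>& ja,
                                         const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(ia.size() == n + 1);
    assert(a.size() == ja.size());
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (int k = ia[i]; k < ia[i + 1]; ++k) {
            const std::size_t j = static_cast<std::size_t>(ja[k]);
            y[i] += a[k] * x[j];              // stored upper-triangle entry
            if (j != i) y[j] += a[k] * x[i];  // mirrored lower-triangle entry
        }
    }
    return y;
}

// Largest absolute element-wise difference between two result vectors,
// e.g. the reference result and whatever the library routine returned.
double max_abs_diff(const std::vector<double>& u, const std::vector<double>& v) {
    assert(u.size() == v.size());
    double d = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i)
        d = std::max(d, std::fabs(u[i] - v[i]));
    return d;
}
```

Running the same a/ia/ja/x through the library routine with both mkl_sequential.lib and mkl_intel_thread.lib linked, and checking max_abs_diff against this reference, should show whether the difference is beyond round-off (threaded summation order can legitimately change the last few digits).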

jay_oswald · ‎05-28-2009

Here are the results on Linux with 8 cores. These results are kind of interesting: the random sparsity pattern actually gets slightly better scaling. Also, quad-core Xeons running Linux seem to scale much better than my Core 2 Duo running Windows (granted, these are E5405s w/ 12 MB cache vs. an old E6600 w/ 4 MB cache).

Regardless, I'm fairly happy with the results now, as I run most of my simulations on the Xeon CPUs. I may get an i7 system to tinker around with soon too. If anyone is interested in seeing the scaling performance on it, I'd be happy to try it.

BANDED matrix

********************************************************
********************* SPEED TEST *********************
********************************************************
# rows = 100000, max threads = 8
   B    f2    f3    f4    f5    f6    f7    f8    nonzeros
   2  0.86  0.87  0.83  0.75  0.68  0.63  0.58      199999
   4     1   1.1   1.1  0.97   0.9  0.78  0.76      399994
   8   1.3   1.5   1.5   1.4   1.4   1.2   1.2      799972
  16   1.4   1.7   2.1   1.8   1.7   1.5   1.5     1599880
  32   1.5   1.8     2   1.8   1.8   1.7   1.7     3199504
  64   1.8     2   2.2     2   2.1     2   2.1     6397984
 128   1.7   2.1   2.5   2.3   2.4   2.4   2.5    12791872
 256   1.9   2.3   2.8   2.6   2.7   2.7   2.8    25567360
 512   1.9   2.4   2.9   2.7   2.8   2.9   2.9    51069184

RANDOM sparsity
********************************************************
********************* SPEED TEST *********************
********************************************************
# rows = 100000, max threads = 8
   B    f2    f3    f4    f5    f6    f7    f8    nonzeros
   2  0.76  0.78  0.75  0.69  0.65   0.6  0.54      199999
   4     1   1.1   1.1  0.98  0.93  0.84  0.74      399994
   8   1.4   1.6   1.7   1.5   1.5   1.4   1.3      799972
  16   1.7   2.1   2.4   2.2   2.2   2.1   2.1     1599880
  32   1.6   2.1   2.5   2.4   2.4   2.5   2.5     3199504
  64   1.8   2.4   2.8   2.7     3     3   3.2     6397984
 128   1.7   2.2   2.8   2.5   2.8     3   3.2    12791872
 256   1.8   2.5   2.9   2.9   3.1   3.3   3.5    25567360
 512   1.8   2.4   3.2   2.7   3.4   3.2   3.8    51069184
1024   1.9   2.3   3.1   2.9   3.1   3.3   3.5   101876224

*edit: I should explain these tables. B is the number of stored nonzeros per row (the matrix is symmetric, so a full row actually holds 2B-1 entries). fx is the speedup factor going from 1 to x cores, i.e. f8 is the time it takes to run with 1 core divided by the time it takes with 8 cores. nonzeros is the total number of stored nonzeros in the matrix (again, this counts only the upper triangular part, so the actual number is 2*nonzeros - rows).
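For anyone who wants to redo the arithmetic behind the table columns, here is a small sketch. The speedup and full-matrix formulas are the ones described above. The banded count is an inference rather than something taken from the benchmark source: it assumes every row stores B upper-triangle entries except the last B-1 rows, which are cut off at the matrix edge. That assumption happens to reproduce the nonzeros column (e.g. 199999 for B = 2), but the actual generator isn't shown in this thread. All function names are made up for illustration.

```cpp
#include <cassert>

// fx: time with 1 core divided by time with x cores.
double speedup(double t1, double tx) {
    assert(t1 > 0.0 && tx > 0.0);
    return t1 / tx;
}

// Entries in the full symmetric matrix, given the stored upper-triangle
// count. Assumes every diagonal entry is stored, so the diagonal (one entry
// per row) must not be counted twice: 2*nonzeros - rows.
long long full_nonzeros(long long upper_nonzeros, long long rows) {
    return 2 * upper_nonzeros - rows;
}

// Assumed stored count for the banded case: rows*B entries, minus the
// 1 + 2 + ... + (B-1) entries that would fall past the last column.
long long banded_upper_nonzeros(long long rows, long long B) {
    assert(B >= 1 && B <= rows);
    return rows * B - B * (B - 1) / 2;
}
```

As an example, B = 2 with 100000 rows gives 199999 stored entries and 2*199999 - 100000 = 299998 entries in the full matrix, which is what you'd expect for a tridiagonal matrix of that size.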

Gennady_F_Intel · ‎05-29-2009

Hi Jay,
Thanks for the results and for the benchmarks.
We have already encountered different Linux/Windows scaling results for this routine and are still investigating the problem. We will get back to you if there is any news.

--GIF