different CPU leads to different results

Azua_Garcia__Giovann · 06-29-2012

Hello,

I have been developing an iterative algorithm in which most of the computation involves matrix-matrix multiplication (MMM), matrix-vector multiplication (MVM), forward and backward solves, and several other BLAS/LAPACK functions available in MKL.

For big problem sizes I get diverging results on two different CPUs. The software stack is exactly the same on both:

OS Linux Ubuntu 11.10 kernel version 3.0.0-22-generic
Intel parallel_studio_xe_2011_sp1_update2_intel64.tgz (MKL 10.2)
Intel l_mkl_10.3.10.319_intel64.tgz update
icc (ICC) 12.1.3 20120212

The two systems I have:

Intel Core 2 Duo T9900 on a MacBook Pro 17'' Mid 2009 (dual boot Ubuntu 11.10 kernel 3.0.0-22-generic)
Intel i7 3930K C2 stepping Desktop on an ASUS Rampage Extreme IV (Ubuntu 11.10 kernel 3.0.0-22-generic)

Basically, the Intel Core 2 Duo MBP produces correct results, whereas on the Intel i7 3930K the results differ greatly (final result, number of iterations, etc.). To rule out possible causes I started dialing back the icc settings, e.g. I removed -no-prec-div, and this improved the situation for the n=2000 problem size, but for larger problem sizes it still fails to converge correctly. I switched to g++ instead of icpc and the non-reproducibility problem persists. Hence, all signs point to different MKL behavior depending on the processor.

I came across the article below. Is this a solution to my problem, or is there a way to ensure reproducibility using the current MKL release? http://software.intel.com/en-us/articles/intro-to-CBWR-in-intel-mkl/

Many TIA,

Best regards,

Giovanni

TimP · 06-29-2012

I suppose CBWR is the most likely means to confirm or eliminate MKL architecture dependencies. At least, you might link against the same MKL libraries, if you are testing intel64 Linux in both cases. It looks like you should try identical icc options such as -xSSE4.1 -fp-model source, if you aren't doing so.
I suppose the MKL has been updated in the more recent releases.

Gennady_F_Intel · 06-30-2012

Yes, this new functionality (CBWR) will help you get identical results when you use these routines.

These functions are available in MKL version 11.0 beta.

Azua_Garcia__Giovann · 06-30-2012

Hello,

Thank you, TimP. I wasn't using the settings you suggest; in fact I was using -xHost, which AFAIK exploits all optimizations and features natively available on the CPU. I changed the compiler settings to what you suggested and it helped a lot. I now get divergent results for only one problem size, and it does look like a bug in my code. I am testing it now using valgrind. Thank you.

Gennady, thanks. I am using MKL 11.0 now; the only bit that worries me is memory alignment. I use a central buffer pool that preallocates all the memory needed for my algorithm once, at startup. My matrices are all page-size aligned and the vectors are all aligned to 16-byte memory addresses (SSE). However, sometimes I need to pass MKL memory addresses that were not themselves allocated with alignment, e.g. a column vector within one of the matrices, and in cases like this I am wondering what the outcome would be. "To ensure MKL calls return the same results on all Intel or Intel compatible CPUs supporting SSE2 instructions or later make sure your application uses a fixed number of threads, in/output arrays in Intel MKL function calls are aligned properly, and"

Page size alignment:
double* buffer = NULL;
posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double));

SSE alignment:
double* buffer = NULL;
posix_memalign((void**) &buffer, 16, size*sizeof(double));

Best regards,
Giovanni

Azua_Garcia__Giovann · 07-01-2012

Okay, now I have predictable results even with the most aggressive icc compiler options:
-align
-finline-functions
-malign-double
-O3
-no-prec-div
-openmp
-xHost
-opt-multi-version-aggressive
-scalar-rep
-unroll-aggressive

To my surprise, the problem was memory misalignment of some of the matrices/vectors used as input to MKL. This only affected the reproducibility of the results on the i7 3930K, not on the older Core 2 Duo processor. So my problem was due to alignment. With MKL 11.0 beta, the environment variable MKL_CBWR does have an effect: setting it to COMPATIBLE ensures correct results for all problem sizes, while leaving it unset produces slightly different results for large problem sizes. I still have to try setting it to AVX. What is the default when it is unset?

The gcc/g++/gfortran compilers produce better performance than icc with the options:
-mtune=native
-march=native
-fopenmp
-O3
-fomit-frame-pointer
-funroll-loops
-ffast-math
-funsafe-math-optimizations

In conclusion, a combination of 1) aligning all the inputs to MKL and 2) switching to MKL 11.0 beta solved my reproducibility problems. I wonder whether switching back to the latest MKL 10.x would still work, and whether the MKL 11.0 beta I downloaded a couple of days ago is as fast as the latest MKL 10.x update. I can find this out, but apart from the time-consuming reinstallation, it also takes some time to run all problem sizes and the slight variations that depend on performance parameters: NB block sizes, single- vs multi-threaded, etc.

Many TIA,
Best regards,
Giovanni

TimP · 07-01-2012

MKL default (in the absence of MKL_CBWR) should be to pick the code matching the CPU detected at run time, so it would use AVX on the Intel AVX-capable CPU. If you go back to an early 10.x version, you are likely to lose optimizations, particularly in AVX mode.
AVX mode may take advantage of alignment up to 32 bytes.
When you request multi-threaded, MKL may still choose single threaded if the problem isn't large enough to benefit from multiple threads.
Your gcc/gfortran options include the equivalent of icc -complex-limited-range which could make a big difference if you have complex arithmetic.
It will make a difference which versions of icc and gcc you use, particularly for AVX at -O3.

Azua_Garcia__Giovann · 07-02-2012

Hello TimP,

Thank you. I aligned all MKL input vectors to 32 bytes and it produces perfectly accurate results. Indeed, a quick check reveals a substantial speedup moving from MKL 10.x to MKL 11.0 beta. Great work!

Best regards,
Giovanni

Victor_K_Intel1 · 07-02-2012

Giovanni,

Actually, I am a little bit concerned by your statement:
"Basically the Intel Core 2 Duo MBP produces correct results whereas the Intel i7 3930K the results differ greatly (final result, number of iterations etc)."
However, this is a rather common misunderstanding. Why do you think that the result obtained on one processor is correct whereas the result on the other is incorrect? They are all incorrect, right? And if they differ greatly, the algorithm is not quite stable.
Although the CBWR feature can draw a veil over a numerical stability issue, it does not resolve it. Actually, the CBWR feature is intended for situations where you know that your calculations are intentionally unstable and this acts as some kind of regularization method (as in ill-posed problems).
So you should probably investigate the stability of your method (if possible).

Thanks
Victor

Azua_Garcia__Giovann · 07-03-2012

Hello Victor,

Thank you for your support. Actually, I'm working on an iterative optimization algorithm that converges depending on an epsilon threshold. The results on the Intel Core 2 Duo were consistent for all versions of this algorithm (with and without MKL). However, when I moved to the i7 architecture I noticed differences in the number of iterations for the big problem sizes. Note that the algorithm would still converge, just not with exactly the same number of iterations for the big problem sizes. At the time I posted I also had an issue stemming from a broken Ubuntu kernel update.

After researching the issue, it boils down to whether the AVX optimizations are enabled at the compiler or MKL level, and the different behavior is documented in the article cited in the OP. You suggest I might have some ill-conditioned problems, but I do not think so. I observed that for my algorithm, toggling AVX at the MKL level has a stronger effect than toggling it at the icc level, i.e. via -xHost. The bulk of the computation that I do manually outside MKL consists of orthogonal transformations (Givens rotations), which are known to be numerically very stable.

Best regards,
Giovanni