Intel MKL 8.1 and memory alignment at 16 byte boundary

evgeny1 · ‎09-02-2006

Hi,

I am a developer for Applied Wave Research, Inc. and would appreciate feedback on the following topic:

The MKL manual says that arrays should be aligned on a 16-byte boundary for optimal performance. In C/C++ code, memory is usually allocated with malloc or new, so it is not guaranteed to be aligned on a 16-byte boundary by default. However, one can align it on a 16-byte boundary by using a suitable allocator.
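
For readers who want something concrete, here is a minimal sketch of such an allocator. It is not taken from the MKL manual or from our code base: the names aligned_alloc_sketch and aligned_free_sketch are hypothetical, and the sketch assumes the requested alignment is a power of two and at least sizeof(void*). It over-allocates with malloc, rounds the address up, and stores the original pointer just below the aligned block so it can be freed later.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (illustration only): returns at least `size` bytes whose
// address is a multiple of `alignment`, or nullptr on failure.
// Assumption: `alignment` is a power of two and >= sizeof(void*).
void* aligned_alloc_sketch(std::size_t size, std::size_t alignment) {
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);

    // Over-allocate: worst-case padding plus one slot for the raw pointer.
    void* raw = std::malloc(size + alignment - 1 + sizeof(void*));
    if (raw == nullptr) return nullptr;

    // Leave room for the stashed pointer, then round up to the boundary.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;

    // Remember the pointer malloc returned, just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Hypothetical counterpart: releases a block from aligned_alloc_sketch.
void aligned_free_sketch(void* p) {
    if (p != nullptr) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

A call such as aligned_alloc_sketch(n * sizeof(double), 16) would then give a buffer suitable for passing to an MKL routine, and it must be released with aligned_free_sketch rather than free.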

I saw numerical differences between aligned and non-aligned arrays. Has anybody experienced the same problem?

Has anyone had problems with the ZHEEV or ZGELSY routines in MKL 8.1? We are getting intermittent overflows (the same test sometimes works and sometimes fails), all in ZGELSY or ZHEEV. However, it's hard to create a small test case because the problem is intermittent.

We switched from MKL 6.1 to MKL 8.1, and these functions ceased to work. We are using the mkl_c.lib library.

Thank you in advance,

Evgeny

TimP · ‎09-02-2006

The order of additions in sum reduction is affected by non-aligned data. If your data are such that overflow can be produced by changing the order, you have a dilemma when attempting to assure both reliability and performance. Could you consider scaling the data to avoid extreme magnitudes?
According to your use of mkl_c.lib, I suspect you are using an older compiler which employs only x87 code, helping you avoid overflow of intermediate results outside MKL.
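
The point about summation order can be shown with a small sketch. This is not MKL code: the function names are made up for illustration, and it assumes IEEE 754 double arithmetic with no extended-precision intermediates (for example SSE2 code rather than x87). A vectorized reduction typically keeps several partial sums, and which elements land in which partial sum can depend on where the array starts relative to a 16-byte boundary. The two functions below add the same numbers in two different orders.

```cpp
#include <cstddef>

// Left-to-right sum, as a plain scalar loop would compute it.
double sum_sequential(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Two interleaved partial sums combined at the end, mimicking a
// two-lane SIMD reduction over doubles.
double sum_two_lanes(const double* x, std::size_t n) {
    double lane0 = 0.0, lane1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += x[i];      // even-indexed elements
        lane1 += x[i + 1];  // odd-indexed elements
    }
    if (i < n) lane0 += x[i];  // leftover element when n is odd
    return lane0 + lane1;
}
```

For the input {1e308, 1e308, -1e308, -1e308}, the sequential order overflows to infinity on the second addition, while the two-lane order cancels within each lane and returns zero. Real data are rarely this extreme, but the same mechanism produces the small result differences seen between aligned and non-aligned arrays.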

No doubt, working with double data types with less than 8-byte alignment would be a particularly difficult case. Even Microsoft saw the light when designing their 64-bit ABI, so they have joined the vast majority of 64-bit operating systems in fixing malloc() and new so as to assure default alignment. It is certainly possible, in my opinion, that MKL 8.1 became more vulnerable to misalignment, with much of the testing being performed on 64-bit OS, or on Linux, where even the 32-bit OS gives 8-byte minimum alignment.

We have struggled for years with alignment-dependent results (outside of MKL), and never fully solved the problem for 32-bit Windows or Linux. Intel compilers were just recently changed so as to remove many of the alignment-dependent numerical inconsistencies; maybe that will help the next MKL. If you are interested, I would suggest filing a problem report on premier.intel.com, asking whether there are any new developments that could help you resolve the problem with alignment-dependent results, even if you are not able to submit a reproducer.

evgeny1 · ‎09-02-2006

Tim,

Many thanks for taking a look at my post. When I call ZHEEV (eigenvalues of a Hermitian matrix C), I have C=S^H*S, where S is the scattering matrix and ^H denotes the Hermitian conjugate of a matrix. The magnitude of the elements of the S matrix is small (no greater than 1 for passive devices, or no greater than 100 or so for active devices). So there is no reason to expect any overflows, as the scaling of the problem is really good. Overflows happen intermittently (I can run a case and it works fine, but it may crash with an overflow in another run).

We are using Microsoft Visual C++ .NET on 32-bit Windows XP with Pentium IV processors. When I call ZHEEV, the arrays are aligned on a 32-byte boundary by a specialized allocator that guarantees alignment. However, prior to that MKL call there are several hundred calls to MKL/BLAS routines for LU decomposition, ZGEMV (matrix-vector multiplication) and back substitution, made by my colleague's code. That code does not align the arrays at all (it uses whatever operator new returns), so I was speculating that if I align those arrays on a 32-byte boundary, the code may become more stable and also perform better for large matrices.

Do you think it's a reasonable idea?

What alignment (16-, 32-, or 64-byte boundary) is best in your opinion? I am not worried about using 64 extra bytes; it's just a matter of being as robust and accurate as possible.

It is possible that we have a dormant bug somewhere, but everything was fine before we switched from MKL 6.1 to MKL 8.1.

Best regards,

Evgeny

TimP · ‎09-02-2006

16-byte alignment should be sufficient to eliminate numerical variations due to alignment and gain most of the available performance. So there should be no difference in results between 16, 32, and 64-byte alignment. Performance variations are possible, but normally should be small, since all are multiples of 16 and should take the identical code paths.
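
To see which of your arrays actually meet that requirement before they reach MKL, a check along these lines could be dropped into a debug build. It is only a sketch: is_aligned is a hypothetical helper name, not an MKL function.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address `p` is a multiple of `alignment`.
// Assumption: `alignment` is nonzero.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Any pointer for which is_aligned(p, 32) or is_aligned(p, 64) holds also satisfies is_aligned(p, 16), which is why the larger alignments should not change the results.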