MKL Memory and Managed Interop ??

Deleted_U_Intel · ‎06-25-2008

I'm throwing this one out to the forum to glean tips and techniques regarding C++/CLI, memory functions, and blittable data structures, so that I can interface better with MKL functions.

NOTE: I'm currently using managed arrays and pinning pointers (which implicitly convert to native pointers when passed to MKL functions).

So to my questions:

//////////////

Q1) Regarding aligning arrays to 16-byte boundaries, as suggested in the MKL Manuals.

The MSDN documentation suggests that {__declspec(align(#)), StructLayoutAttribute, or FieldOffsetAttribute} should be used to align types from the primitive level up, whereas the MKL reference manual outlines using the low-level MKL_malloc functions.

e.g: a = (double*)MKL_malloc(n*n*sizeof(double),128);

The MKL_malloc example then appears to index[] into these buffers in the following way:

a[i] = (double)(i+1);

Does this mean that MKL_malloc is simply creating a native array under the covers, which would be equivalent to constructing a managed array from aligned primitives, or is there something else happening here ??

//////////////

Q2) As the manual states that all arrays should be aligned to 16-byte boundaries (i.e. made up of 128-bit units), how should this be interpreted when comparing arrays of the base 4 mkl types [s,d,c,z] ??

To better illustrate this question I'll use the following diagram (bit reversed for simplicity).

/* The following is illustrating an array of aligned mkl types of length [2], in each of the current mkl primitives. An asterisk represents 1-byte of data and a dash signifies 1-byte of padding */

Array of singles padded to make up 16-bytes:
|0_______|8_______||16______|24______|
|****----|--------||****----|--------|

Array of doubles padded to make up 16-bytes:
|0_______|8_______||16______|24______|
|********|--------||********|--------|

Array of single-precision complex values, with the real and imaginary components taking up 4 bytes each and padded to make up 16-byte units:
|0_______|8_______||16______|24______|
|****----|****----||****----|****----|

Array of double-precision complex values, with the real and imaginary components taking up 8 bytes each to make up 16 bytes with no padding:
|0_______|8_______||16______|24______|
|********|********||********|********|

Is this interpretation correct ??

Or is this on half scale whereby a single complex data type takes up 256-bits to normalize all memory units to accommodate x86 and x64 in one common form ??

//////////////

Q3) 'If' the previous question's illustration is correct, how should I be treating the first 2 scenarios i.e. single and double arrays?

As in, would I have to define custom structures that consist solely of a single field aligned to the 16-byte (128-bit) boundary that replace the CLI primitive float and double types ??

//////////////

Q4) Anything I might have missed or not considered, regarding memory/interop, that could make a system more robust (relating to mkl memory functions) ??

TimP · ‎06-25-2008

OK, I'll take a stab at this, although I may have missed a few of your points.

1. I don't believe memory allocated by MKL_malloc() is used any differently from memory allocated by standard malloc(). Unfortunately, standard malloc() for 32-bit Windows doesn't guarantee 16-byte alignment. In those contexts where __declspec(align(16)) works, it is a good way to obtain alignment.
Although the concept of dgemm() involves 2-d Fortran arrays, those are simple linear arrays from the C point of view, with the columns beginning at intervals of NN in the example.
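
As a rough illustration of the "simple linear array" point, here is a minimal sketch in standard C++ (not MKL-specific). The names col_major_index and make_sequential_matrix are hypothetical helpers invented for this example, and a std::vector stands in for whatever buffer is actually passed to MKL.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (row, col) in a column-major
// (Fortran-style) matrix whose columns start every `ld` elements.
inline std::size_t col_major_index(std::size_t row, std::size_t col,
                                   std::size_t ld) {
    return row + col * ld;
}

// Hypothetical helper: build an n-by-n matrix as one flat buffer and
// fill it with 1, 2, 3, ... in storage order.
inline std::vector<double> make_sequential_matrix(std::size_t n) {
    std::vector<double> a(n * n);
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = static_cast<double>(i + 1);
    }
    return a;
}
```

With n = 3, the element at row 0, column 1 sits at offset 3 of the flat buffer, so a plain double* and a leading dimension are all that a dgemm()-style routine needs to see.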

2. There's no padding in the arrays generally used by MKL. In a 128-bit data segment, you could have 4 float elements, 2 double or float complex elements, or 1 double complex element.
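
The element counts above can be checked with a small sketch. It assumes std::complex<float> and std::complex<double> as stand-ins for the MKL complex types, and the usual 4-byte float and 8-byte double; elements_per_128_bits and element_stride_bytes are hypothetical helper names.

```cpp
#include <complex>
#include <cstddef>

// Number of bytes in one 128-bit unit.
constexpr std::size_t kVectorBytes = 16;

// Hypothetical helper: how many tightly packed elements of type T
// fit in one 128-bit unit.
template <typename T>
constexpr std::size_t elements_per_128_bits() {
    return kVectorBytes / sizeof(T);
}

// Hypothetical helper: distance in bytes between neighbouring
// elements of an ordinary array of T.
template <typename T>
std::size_t element_stride_bytes() {
    T pair[2] = {};
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&pair[1]) -
        reinterpret_cast<const char*>(&pair[0]));
}

// Assumption: mainstream platforms where float is 4 bytes and
// double is 8 bytes.
static_assert(sizeof(float) == 4, "expected 4-byte float");
static_assert(sizeof(double) == 8, "expected 8-byte double");
```

The stride between neighbouring elements equals sizeof(T) in each case, which is another way of saying that the alignment recommendation applies to the start of the array, not to each element.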

TimP · ‎06-25-2008

You're correct on the order of data in a 2-d array.
The suggestion of 16-byte aligned arrays is to improve efficiency of vectorized MKL functions, and to avoid minor numerical differences due to alignment-dependent order of evaluation in those functions.
I'm concerned about unjustified generalization if I try to answer about memory management. The main reason for offering various replacements for malloc() is the varying alignment of the standard 32-bit malloc() and new[]. One might use the MKL versions for compatibility between 32- and 64-bit builds; otherwise, the 64-bit malloc() has satisfactory alignment by default. Both the OpenMP library (used with mkl_thread) and MKL_malloc() can be expected to maintain compatibility with the standard malloc(), provided that you don't mix inconsistent malloc(), free(), and realloc() calls. As you imply, if you are using some other memory management at the top level, and it supports passing aligned data regions down to MKL, you might choose that.
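
One way to avoid mixing allocators is to tie each buffer to its matching free function. The sketch below is only an illustration in portable C++: demo_aligned_malloc and demo_aligned_free are hypothetical stand-ins written for this example (in real code the pair would be MKL_malloc and its matching free function), and is_aligned, AlignedFree and make_aligned_doubles are likewise invented names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical stand-in for an aligned allocator. Over-allocates with
// malloc, rounds the address up, and stores the original pointer just
// before the aligned block. `alignment` must be a power of two.
inline void* demo_aligned_malloc(std::size_t bytes, std::size_t alignment) {
    void* raw = std::malloc(bytes + alignment + sizeof(void*));
    if (raw == nullptr) return nullptr;
    std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Matching free for demo_aligned_malloc; never pass these pointers
// to plain free().
inline void demo_aligned_free(void* p) {
    if (p != nullptr) std::free(reinterpret_cast<void**>(p)[-1]);
}

// Deleter that ties each buffer to the matching free function.
struct AlignedFree {
    void operator()(double* p) const { demo_aligned_free(p); }
};

using AlignedDoubles = std::unique_ptr<double[], AlignedFree>;

// Hypothetical helper: allocate `count` doubles on an aligned boundary.
inline AlignedDoubles make_aligned_doubles(std::size_t count,
                                           std::size_t alignment = 16) {
    return AlignedDoubles(static_cast<double*>(
        demo_aligned_malloc(count * sizeof(double), alignment)));
}
```

Because the deleter is part of the pointer type, a buffer obtained this way cannot accidentally be released with a different free(), and is_aligned() gives a cheap check before handing the pointer to a vectorized routine.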