Question on BLAS const arguments

Petros_Mamales · ‎01-13-2016

Hi,

I have a couple of general questions concerning multiple const arguments in blas and blas extension calls (C++ user).

1) When one has 2 const data (i.e. pointer) arguments, is there any harm in using the same data ?

For example, in dgemm, can the A and B matrices be the same, obtaining so, A^2, A*A^t and A^t*A ?

Is it a case by case answer? If so, how can one know?

2) In the mkl_?imatcopy which is an in-place operation why does one need 2 leading dimension arguments?

Why, in the documentation, a symbol AB is used instead of simply A ? Are you trying to indicate something that eludes me ?

TIA,

Petros

TimP · ‎01-13-2016

There should be no problem withsetting 2 input pointers to the same data, e.g. A and B in ?gemm.

The related case, where an output pointer matches an input, seems not to be discussed in the MKL document. The code may be written to allow for it (as some customers expect), even though in general (e.g. in the Fortran reference code), this would violate language standards and lead to optimization errors when compiled with full optimization.

mecej4 · ‎01-13-2016

2) In the mkl_?imatcopy which is an in-place operation why does one need 2 leading dimension arguments?

Why, in the documentation, a symbol AB is used instead of simply A ? Are you trying to indicate something that eludes me ?

In general, A and B are not square, and the allocated size may be larger than the working size. So the leading dimension of each matrix is needed to make sure that the input array elements are gathered correctly, and the output array elements are put into their proper places. The designation 'AB' seems to bridge the gap between the mathematical view, according to which A and B are distinct, and the programmer's view point, in which a certain block of memory initially contains A + unused space and, after the calculation is done, the same block contains B + unused space. Therefore, there is a gradual progression from A -> B, which could be indicated by A -> A B -> B. In other words, in the intermediate stages, the block of memory contains elements of A in some parts and B in other parts, and if there is any more space that is left over and may be safely written into, that part could even be used as a scratch pad.

Petros_Mamales · ‎01-13-2016

Tim P.

Thank you for confirming. I understand that mkl expects non-aliased buffers to avoid temporaries.

mecej4,

I am now totally confused by your response. This is an in-place operation. Why are you talking of two matrices ( A and B)?

Also, I was told in the past to assume that mkl is written in FORTRAN. Can u please confirm that this does not do internal allocations?

My understanding is that, the leading dimension is providing with the ability to operate on (contiguous) ranges of matrices. Then why would we care for what is the buffer of which the range we transpose ? Are you implying that the transposing extends further than the range?

(In C++, in a naive single threaded implementation one would traverse the lower triangle and for each element would interchange (after properly scaling) the elements with the upper triangle using a couple of temporaries. Therefore, one only needs logically only one leading dimension to implement the subscript operator for the lookup)

Finally, I have a couple more questions:

a) As an example of general behavior, can I assume that when I use dscal with scale 0, it is optimized internally to do a simple set, or mkl performs the multiplications anyway ? (or if it is 1 will it skip them?)

b) when I do dcopy and set incy = 0 do I obtain a fill?

Thank both of you for helping out.

Petros

TimP · ‎01-13-2016

Even for blas and lapack functions where the public reference is Fortran source, chances are proprietary optimized versions have been translated to compile with Intel c++ simd intrinsics and openmp. The expectation would be that performance and correctness tests would be run which exercise each of the original source code paths and corresponding optimized code paths. I wouldn't expect dcopy to perform a fill efficiently (unless there is an explicit case for it in reference source) but you're welcome to check it.

Intel invested significant effort in the automatic fast_memset and fast_memcpy function substitution feature of their compilers so a range of cases should be covered with minimum additional development effort.

Petros_Mamales · ‎01-14-2016

mecej4

or, Somebody else,

Could you please clarify the second (leading dimension) argument of the in-place copy mkl_imatcopy and answer my points?

Tim P.,

Thank you again.

mecej4 · ‎01-18-2016

I held back from responding because you have not given a specific example with details regarding the actual arguments for a situation where you feel that the design of the interface is lacking or over-specified.

With dgemm(), for example, in general the operation performed is C := α*op(A)*op(B) + β*C. If you choose to use dgemm to calculate A*A for a square matrix A, some of the function arguments are, no doubt, redundant. Nevertheless, they have to be given, because there is only one subroutine that has to cover all possible inputs. In Fortran 77, the number of subroutine arguments is fixed and their order is fixed. Even if the values of some arguments are not needed in a specific call, they have to be present and in their preset places.

In the special case of A := A*A that you mentioned, the aliasing rules of Fortran 77 require that you not use A as the second to last argument. You have to pass C in that place, and copy the results back to A after the call is completed.

With regard to your specific question regarding the LdA, LdB arguments to mkl_?imatcopy(), I have to defer to the Intel people for a response. The example file provided, simatcopy.c, has LdA = LdB = 6. Had they provided nother example with LdA not equal LdB, it would have been helpful.

Petros_Mamales · ‎01-19-2016

mecej4

Thank you for coming back to me.

The question for dgemm is pretty clear to me (Tim P.). Even in c++ if you wanted to calculate the square of a matrix, you are forced to create a temporary to avoid aliases.

I just wanted to make sure that there is no other rule with the const arguments of the functions (hence the name of the topic).

As far as the in-place transposition (with possible scaling), I see no reason for the second leading dimension argument, since we are talking for a single buffer, as the sketch for naive implementation that I showed above shows.

Let's wait the Intel people then. Any idea for when we would be having an answer?

Thank you again.

Dmitry_B_Intel · ‎01-20-2016

Hi Petros,

Leading dimension (LD) exists not just to allow for a contiguous range of matrices. This is just a parameter needed to specify indexing in a general way. In other words, element A(i,j) of matrix A may be located at offset i*s1 + j*s2, where s1 and s2 are 'strides'. In Fortran's column-major world we have s1=1, s2=LD. In C/C++ row-major layouts we have s1=LD, s2=1. If a two-dimensional array is a slice of multi-dimensional matrix, both s1 and s2 may be greater than 1 (this is supported by FFT functionality for example). With in-place transposition you have a matrix stored in an array as described by some strides on input, and the transposed matrix stored in the same array but described generally by different strides on output.

I hope this justifies the need for the second LD argument.

Thanks
Dima

Petros_Mamales · ‎01-21-2016

Dima,

Thank you very much for the answer.

Yes, it does explain the second argument !

However, there is nowhere mention in the documentation that this function applies to slices of matrices as well - maybe something to be added?

Hence, the confusion.

Thank you,

Petros

PS: which kind of begs the question: what other functions apply to matrix slices? ;-))