Intel® Fortran Compiler

MATMUL causing stack overflow

Roman1
New Contributor I
304 Views

I noticed that for large matrices, MATMUL crashes with a stack overflow. I can fix this with /heap-arrays0. The program does not crash when calling dgemm from MKL. I ran some tests, and the results from MATMUL and dgemm are identical. However, MATMUL needs a large stack, and dgemm doesn't. Is this the correct behavior, or is there a bug somewhere?

Roman

 

Lorri_M_Intel
Employee

Most of the time, the MATMUL intrinsic is implemented by in-line code. That is, the actual instructions to multiply the elements of the matrices are generated by the compiler (including the loops to go from element to element, etc.). That's different from a call to dgemm, which is a simple routine call.
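To make "in-line code" concrete, here is a minimal sketch of the kind of loop nest that a matrix product amounts to. It is only an illustration: the code the compiler actually generates for MATMUL is not shown in this thread and will differ (blocking, vectorization, and so on). The program name, the array names, and the size n = 4 are all made up for the example.

```fortran
program inline_matmul_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4            ! hypothetical, small on purpose
  real(dp) :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k

  call random_number(b)
  call random_number(c)

  ! Hand-written equivalent of a = matmul(b, c).
  ! The innermost loop runs down a column, which matches Fortran's
  ! column-major storage order.
  a = 0.0_dp
  do j = 1, n
    do k = 1, n
      do i = 1, n
        a(i,j) = a(i,j) + b(i,k) * c(k,j)
      end do
    end do
  end do

  ! Compare against the intrinsic; small rounding differences are possible.
  print *, 'max abs difference vs MATMUL:', maxval(abs(a - matmul(b, c)))
end program inline_matmul_sketch
```

A call to dgemm, by contrast, hands the three arrays to a precompiled library routine, so no loop nest like this appears in the calling code.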

That said, because using /heap-arrays resolves the stack overflow, I'm going to guess that the compiler cannot prove that the result variable does not overlap the two operands, so for safety it creates a temporary array to hold the product. By default, this temporary is on the stack.
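A rough sketch of the two call styles side by side may help. It assumes the program is linked against MKL or another BLAS library that provides dgemm; the program name, the array names, and the size n = 1500 are hypothetical. Whether the MATMUL line really needs a temporary depends on the compiler and its options, so treat the comments as the guess above spelled out, not as a statement of what the compiler does.

```fortran
program matmul_vs_dgemm_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1500         ! hypothetical; one n x n array is about 18 MB
  real(dp), allocatable :: a(:,:), a_ref(:,:), b(:,:), c(:,:)
  external :: dgemm                      ! assumed to come from MKL or another BLAS

  ! The named arrays live on the heap because they are allocatable.
  allocate(a(n,n), a_ref(n,n), b(n,n), c(n,n))
  call random_number(b)
  call random_number(c)

  ! If the compiler builds the product in a temporary first, that
  ! temporary is another n x n array. Placed on the stack, it can
  ! exceed the stack limit; with /heap-arrays it goes on the heap.
  a_ref = matmul(b, c)

  ! dgemm computes a = 1.0*b*c + 0.0*a and writes straight into a,
  ! so the caller does not create a temporary for the result.
  call dgemm('N', 'N', n, n, n, 1.0_dp, b, n, c, n, 0.0_dp, a, n)

  print *, 'max abs difference:', maxval(abs(a - a_ref))
end program matmul_vs_dgemm_sketch
```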

To answer your question, I would characterize this as "expected behavior", and not likely a bug.

--Lorri

 

Roman1
New Contributor I

I ran a test, and the performance of dgemm is slightly better than matmul.  Based on your reply, this might be because there is an extra step where values are copied from the temporary memory to the result variable.

 

TimP
Black Belt
In the case where the matmul result is stored explicitly,

    A = matmul(B,C)

some compilers are able to avoid allocation of a temporary. Usually, the difference in performance would come mostly from efficiency of cache usage. In the more general case,

    A = b*matmul(C,D) + E

it seems unlikely for a compiler to optimize the temporary away.
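For the general case, dgemm can express the whole update in one call, since it computes alpha*op(A)*op(B) + beta*C in place. The sketch below copies E into A first and then calls dgemm with alpha = b and beta = 1. It assumes linking against MKL or another BLAS; the program name, the size n = 200, and the value of b are made up for illustration.

```fortran
program fused_update_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200          ! hypothetical size
  real(dp), allocatable :: a(:,:), a_ref(:,:), c(:,:), d(:,:), e(:,:)
  real(dp) :: b
  external :: dgemm                      ! assumed to come from MKL or another BLAS

  allocate(a(n,n), a_ref(n,n), c(n,n), d(n,n), e(n,n))
  call random_number(c)
  call random_number(d)
  call random_number(e)
  b = 0.5_dp                             ! hypothetical scalar

  ! Array-expression form; the compiler may need one or more temporaries.
  a_ref = b*matmul(c, d) + e

  ! dgemm form: start from a = e, then accumulate b*c*d into a in place.
  a = e
  call dgemm('N', 'N', n, n, n, b, c, n, d, n, 1.0_dp, a, n)

  ! The two results should agree up to rounding.
  print *, 'max abs difference:', maxval(abs(a - a_ref))
end program fused_update_sketch
```

The only extra work in the dgemm form is the copy a = e, which writes into an array that already exists and needs no temporary.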