We recently upgraded our MKL to version 18.3 and TBB to version 18.4. When running a debug version of our code, we noticed heap corruption when calling the DGEEV MKL function. It appears that the work vector passed into the function was overwritten past its declared length. Importantly, one can query for the optimum length of the work vector by passing in -1 as the length of the work vector. In this particular case, we have a 60x60 input matrix and the optimum length of the work vector was returned as 2040. However, we were passing in a length of 6x60 = 360 for the work vector. After running DGEEV in debug mode, it appears that the function assumed the work vector had the optimum length, because all of the data up to that length (2040) had been modified.
When running a release version of our application, the DGEEV function did not write beyond the specified work vector length of 360.
I then created a console application that loads the input matrix from a file and calls the DGEEV function. It reproduced the behavior described above, showing the potential heap corruption on the work vector.
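For anyone who wants to check this kind of overrun in their own reproducer, here is a minimal sketch of a guard-band test in C. It pads the work vector with sentinel values and counts how many were changed after the call. Everything named here (`count_guard_overwrites`, `work_user_fn`, `well_behaved`, `overrunning`) is hypothetical and not part of MKL; the two stand-in routines only imitate a routine that respects or ignores `lwork`, and a real test would wrap the actual DGEEV call in a function with the same shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define GUARD_LEN 64            /* extra elements placed after the workspace */
#define GUARD_VALUE (-12345.5)  /* sentinel; exactly representable as a double */

/* Hypothetical callback shape: any routine that is handed a workspace
   of lwork doubles. A real test would call DGEEV inside such a wrapper. */
typedef void (*work_user_fn)(double *work, int lwork);

/* Runs fn on a workspace of lwork doubles followed by GUARD_LEN sentinel
   values. Returns how many sentinels were modified (0 means no overrun
   was detected), or -1 if the allocation failed. */
int count_guard_overwrites(work_user_fn fn, int lwork)
{
    assert(fn != NULL && lwork > 0);
    size_t n = (size_t)lwork;
    size_t total = n + GUARD_LEN;
    double *buf = malloc(total * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (size_t i = n; i < total; ++i)   /* fill the guard band */
        buf[i] = GUARD_VALUE;

    fn(buf, lwork);                      /* routine under test */

    int damaged = 0;
    for (size_t i = n; i < total; ++i)   /* inspect the guard band */
        if (buf[i] != GUARD_VALUE)
            ++damaged;

    free(buf);
    return damaged;
}

/* Stand-in that stays within the declared workspace length. */
void well_behaved(double *work, int lwork)
{
    for (int i = 0; i < lwork; ++i)
        work[i] = 1.0;
}

/* Stand-in that writes 10 elements past the declared workspace length. */
void overrunning(double *work, int lwork)
{
    for (int i = 0; i < lwork + 10; ++i)
        work[i] = 1.0;
}
```

Calling `count_guard_overwrites(overrunning, 360)` reports 10 damaged guard elements, while `well_behaved` reports 0. The check only sees writes that land inside the guard band and that change the sentinel value, so a larger `GUARD_LEN` (for example, the difference between the optimal and the minimal length) gives better coverage.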
Is this a known issue? If so, are there any other MKL functions that we must worry about?
Thanks in advance for your response.
P.S. I realize that querying the function for the optimum work vector length is a workaround.
This issue was (coincidentally) fixed in MKL 2019, after we integrated a new function into this particular routine. It will also be fixed in MKL 2018 update 4. Unfortunately, for now the best workaround is to use the optimal workspace query size.
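The workspace query workaround can be sketched roughly as follows in C. This is an illustration, not code from MKL: the helper `dgeev_with_query` and the `geev_fn` typedef are hypothetical names, and the typedef is assumed to match the Fortran-style LP64 `dgeev` interface (please check it against your `mkl_lapack.h`). `stub_geev` is a stand-in that only imitates the query protocol so the sketch compiles without MKL; it does not compute eigenvalues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed shape of the Fortran-style dgeev interface (LP64, int indices). */
typedef void (*geev_fn)(const char *jobvl, const char *jobvr, const int *n,
                        double *a, const int *lda, double *wr, double *wi,
                        double *vl, const int *ldvl, double *vr, const int *ldvr,
                        double *work, const int *lwork, int *info);

/* Hypothetical helper: computes eigenvalues and right eigenvectors of the
   n x n column-major matrix a, using the workspace length that the routine
   itself reports. Returns the length used, or -1 if the query or the
   allocation failed. The LAPACK status is stored in *info. */
int dgeev_with_query(geev_fn geev, int n, double *a,
                     double *wr, double *wi, double *vr, int *info)
{
    int query = -1, ldvl = 1;
    double wkopt = 0.0, vl_unused = 0.0;

    /* First call: lwork = -1 performs no computation and returns the
       optimal workspace length in work[0]. */
    geev("N", "V", &n, a, &n, wr, wi, &vl_unused, &ldvl, vr, &n,
         &wkopt, &query, info);
    if (*info != 0)
        return -1;

    int lwork = (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL)
        return -1;

    /* Second call: the actual computation with the reported length. */
    geev("N", "V", &n, a, &n, wr, wi, &vl_unused, &ldvl, vr, &n,
         work, &lwork, info);
    free(work);
    return lwork;
}

/* Stand-in for dgeev: answers the query with 34*n (2040 for n = 60, the
   value reported in this thread) and otherwise does nothing. */
void stub_geev(const char *jobvl, const char *jobvr, const int *n,
               double *a, const int *lda, double *wr, double *wi,
               double *vl, const int *ldvl, double *vr, const int *ldvr,
               double *work, const int *lwork, int *info)
{
    (void)jobvl; (void)jobvr; (void)a; (void)lda; (void)wr; (void)wi;
    (void)vl; (void)ldvl; (void)vr; (void)ldvr;
    if (*lwork == -1)
        work[0] = 34.0 * (double)*n;
    *info = 0;
}
```

With MKL linked, the idea would be to pass the library's own `dgeev` in place of `stub_geev`, assuming its prototype matches the typedef above. Keeping the query and the real call in one helper means the caller never hard-codes a workspace length such as 6*n.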
Can you be a bit more specific about which MKL functions are affected? Changing only our DGEEV call to query for the optimal workspace is relatively easy. However, doing the same for every MKL function that takes a work vector is a much more daunting task.
Do we need to worry about this ONLY for debug versions, or can the issue also occur for Release versions of our software?
Thanks in advance for the reply.
Sure. There was a bug in an internal function used by the ?geev (all precisions) functions that resulted in an incorrect optimal workspace size being returned. As a result, the ?geev functions used more workspace than was provided when only the minimal workspace was passed in. The internal function in question is ?gehrd.
We fixed the issue for this function, which fixed the issue in ?geev. Looking at the caller graph for ?gehrd (http://www.netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga2611cc9dfdc84e2a08ec57a5dd6cdd2e.html#ga2611cc9dfdc84e2a08ec57a5dd6cdd2e), it seems like there are around 9 functions that could potentially be impacted by this issue (some of the "callers" in that graph are testing functions for Netlib). I've looked over each of them quickly, and it seems like setting lwork to the optimal value should be a workaround for each one of these functions until MKL 2019 and MKL 2018u4 are released.
I can't positively guarantee that the rest of our functions are free from similar bugs, but the majority of our functions that use work arrays shouldn't have this issue. It was a bug affecting a particular function (and its callers), not a bug in the way workspace is used in LAPACK in general. Using the minimal workspace for other functions should work just fine. I'm not exactly sure why running your application in release mode rather than debug mode avoided the issue, but that is most likely a quirk particular to either this bug (probably) or your application. In general, running in release mode won't fix bugs of this type. But, again, it was a bug in a particular function, so generally you shouldn't be too worried about running into this type of problem in other MKL functions.
Let me know if I can provide any more information. Hopefully this helps!