Unfortunately, there is no in-place cblas_dgemm in MKL. What are the typical dimensions for the A and B matrices? If you have sufficiently large matrices, the overhead of copying to a temporary buffer will be insignificant compared to the dgemm execution time.
Not sure if it is still interested. I get news from our developers. After investigation, we think the way you did ( blocking and partial overwrite) would be still beneficial. We don't expect performance improvement by implementing the faction natively. So please go ahead to use your self implementation,