I need to compute a large number of products of the form
C = A*B^T
where all matrices are n-by-n. n is typically 256 but could be smaller or larger. Both A and B are used multiple times. Moreover, C is used in later multiplications, i.e. C replaces A or B.
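To make the operation concrete, here is a minimal plain C sketch of C = A*B^T with no MKL involved. It assumes row-major storage with no padding (my assumption, the layout is not stated above), and the name `matmul_abt_ref` is made up for illustration. In that layout each entry of C is the dot product of a row of A with a row of B, so both operands are read contiguously.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = A * B^T for n-by-n matrices.
   Assumption: row-major storage, leading dimension n.
   C[i][j] is the dot product of row i of A with row j of B. */
void matmul_abt_ref(size_t n, const double *A, const double *B, double *C)
{
    assert(A != C && B != C);  /* the output must not alias an input */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[j * n + k];
            C[i * n + j] = sum;
        }
    }
}
```

This is only useful as a correctness check against whichever MKL variant is chosen. Because C later replaces A or B, the result goes into a separate buffer and the pointers can be swapped afterwards rather than multiplying in place.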
Note: I am only interested in the sequential case. I do not want MKL to parallelize anything.
It seems the matrices are too large for the compact type. In any case, compact seems to be meant for many matrices at once, i.e. like batched.
With the packed type, C does not come out packed, so I would have to pack it myself before reusing it.
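For concreteness, the packed route I have in mind looks roughly like the sketch below: pack A once with `cblas_dgemm_pack`, then reuse the packed copy in several `cblas_dgemm_compute` calls. The function name `abt_reuse_a`, the `USE_MKL` build switch, the row-major layout and the contiguous storage of the B and C matrices are all my own assumptions for illustration. Without `-DUSE_MKL` the code falls back to a plain loop, so it compiles without MKL installed.

```c
#include <assert.h>
#include <stddef.h>
#ifdef USE_MKL               /* hypothetical build switch, not an MKL macro */
#include <mkl.h>
#endif

/* Computes C_t = A * B_t^T for t = 0..count-1.
   Assumptions: all matrices are n-by-n, row-major, leading dimension n;
   the B_t (and C_t) are stored back to back in Bs (and Cs).
   Returns 0 on success, -1 if the pack buffer cannot be allocated. */
int abt_reuse_a(int n, const double *A, const double *Bs, double *Cs, int count)
{
    assert(n > 0 && count >= 0);
    const size_t stride = (size_t)n * (size_t)n;
#ifdef USE_MKL
    /* The pack size is reported in bytes. */
    double *Ap = (double *)mkl_malloc(
        cblas_dgemm_pack_get_size(CblasAMatrix, n, n, n), 64);
    if (Ap == NULL)
        return -1;
    /* Pack A once; alpha (here 1.0) is applied at pack time. */
    cblas_dgemm_pack(CblasRowMajor, CblasAMatrix, CblasNoTrans,
                     n, n, n, 1.0, A, n, Ap);
    /* Reuse the packed A for every B_t; the result C_t is not packed. */
    for (int t = 0; t < count; ++t)
        cblas_dgemm_compute(CblasRowMajor, (MKL_INT)CblasPacked,
                            (MKL_INT)CblasTrans, n, n, n, Ap, n,
                            Bs + t * stride, n, 0.0, Cs + t * stride, n);
    mkl_free(Ap);
#else
    /* Portable fallback: plain triple loop per matrix. */
    for (int t = 0; t < count; ++t)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * Bs[t * stride + j * n + k];
                Cs[t * stride + i * n + j] = sum;
            }
#endif
    return 0;
}
```

The packing cost is only amortized when the packed operand is reused enough times. A result C that later takes the place of A would need its own `cblas_dgemm_pack` call first, which is the extra step mentioned above.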
There are also the mkl_jit_create* routines.
Now my question is: which of the possible matrix multiplication methods should I go for?
PS: An interesting alternative is BLASFEO (https://github.com/giaf/blasfeo), which of course you cannot comment on, but it gives an idea of my use case.
Let me answer my own question. It seems all the methods except packed matrices are meant for very small matrices.
My computational tests show that using packed matrices reduces run time by 20% in the best case. For smallish matrices there is no benefit.