Greetings Intel Math Kernel Library Development Team,
I was disappointed with the packaging of the MKL Java JNI. Firstly, the files were in an examples folder, and the JNI interface really does not belong there.
Secondly, it required an MS Visual Studio or Intel compiler, which should not be necessary. I'd expect a .jar of the compiled classes and three or four of the standard processor/platform library files (mkl_java_stubs_winx64.dll, mkl_java_stubs_winx86.dll or mkl_java_stubs.so).
I was able to get my hands a bit dirty with the source, though, and found that I could significantly reduce JNI call runtime simply by changing the array access calls to use GetPrimitiveArrayCritical.
This asks the Java VM to give the DLL a direct pointer to the array contents whenever possible (typically by pinning the array or holding off garbage collection) rather than a copy of the data.
For instance, this could be done in CBLAS_gemm.h like this (the changes are the calls to GetPrimitiveArrayCritical):
/*
 * Class:  com_intel_mkl_CBLAS
 * Method: sgemm, or dgemm
 */
JNIEXPORT void JNICALL NAME_JNI_REAL_CRITICAL(JNIEnv *env, jclass klass,
    jint Order, jint TransA, jint TransB, jint M, jint N, jint K,
    DATA_SCALAR alpha, DATA_VECTOR A, jint lda, DATA_VECTOR B, jint ldb,
    DATA_SCALAR beta, DATA_VECTOR C, jint ldc)
{
    DATA_SCALAR *aElems, *bElems, *cElems;

    /* Request direct access to the Java arrays instead of copies. */
    aElems = (DATA_SCALAR*)(*env)->GetPrimitiveArrayCritical(env,A,NULL);
    bElems = (DATA_SCALAR*)(*env)->GetPrimitiveArrayCritical(env,B,NULL);
    cElems = (DATA_SCALAR*)(*env)->GetPrimitiveArrayCritical(env,C,NULL);
    assert(aElems && bElems && cElems);

    /* Rest of the function not shown here: the CBLAS call, after which each
       array must be released with ReleasePrimitiveArrayCritical. */
}
Depending on the matrix size, I achieved a reduction in dgemm runtime of up to 25%. GetPrimitiveArrayCritical may stall garbage collection for as long as the arrays are held, so in theory it should be avoided for really long calculations. It may therefore be better to offer both versions, something like dgemm and dgemmCRITICAL.
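To make the "offer both versions" idea concrete, here is a rough Java-side sketch of how a caller might pick between the two entry points based on problem size. Everything in it is hypothetical: the class name, the Variant enum, the flop threshold, and the pure-Java referenceDgemm, which stands in for both native calls only so that the sketch runs without MKL. It is not MKL's API.

```java
import java.util.Arrays;

final class GemmDispatch {
    /** Which (hypothetical) native entry point a call would be routed to. */
    enum Variant { COPYING, CRITICAL }

    /** Hypothetical cut-off: above this many flops the call is treated as "long". */
    static final long CRITICAL_FLOP_LIMIT = 2_000_000_000L;

    /** Short calls use the critical variant; long ones avoid stalling the GC. */
    static Variant choose(int m, int n, int k) {
        long flops = 2L * m * n * k; // one multiply and one add per inner step
        return flops <= CRITICAL_FLOP_LIMIT ? Variant.CRITICAL : Variant.COPYING;
    }

    /** Pure-Java stand-in for the native routine: C = alpha*A*B + beta*C, row-major. */
    static void referenceDgemm(int m, int n, int k, double alpha,
                               double[] a, double[] b, double beta, double[] c) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = alpha * sum + beta * c[i * n + j];
            }
        }
    }

    /** Routes the call; in a real binding each branch would invoke a different native method. */
    static Variant dgemmAuto(int m, int n, int k, double alpha,
                             double[] a, double[] b, double beta, double[] c) {
        Variant v = choose(m, n, k);
        switch (v) {
            case CRITICAL:
                referenceDgemm(m, n, k, alpha, a, b, beta, c); // would be dgemmCRITICAL
                break;
            default:
                referenceDgemm(m, n, k, alpha, a, b, beta, c); // would be dgemm
                break;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {5, 6, 7, 8};
        double[] c = new double[4];
        Variant used = dgemmAuto(2, 2, 2, 1.0, a, b, 0.0, c);
        System.out.println(used + " " + Arrays.toString(c));
        // prints "CRITICAL [19.0, 22.0, 43.0, 50.0]"
    }
}
```

The threshold would need to be tuned by measurement. The point is only that small, fast calls benefit most from skipping the copy, while very long calls are the ones where blocking the garbage collector is a concern.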
If a Java coder has sought out the MKL JNI, she is probably seeking better performance than the open source Java implementations provide. Eliminating the memory copy in MKL's JNI parameter passing is low-hanging fruit that will improve performance and help close the Java-to-C performance gap when using these libraries.