topic MIC offload support in Intel® oneAPI Math Kernel Library

Couple of Intel MKL automatic parallelisation/Phi coprocessor questions

dehvidc1 — Thu, 23 Jan 2014 03:42:20 GMT

Hi,

I'm evaluating the MKL library with a Phi coprocessor for possible use in production at a bioinformatics institute. The production datasets are very large, but I'm starting with some test data of about 6GB so that it fits in the 8GB Phi coprocessor memory. I've used the MKL Automatic Offload capability (very appealing in terms of rolling this hardware out while minimising code changes) to do a Cholesky factorisation with LAPACKE_dpotrf. Compared with one Xeon thread that does not use the Phi coprocessor, a single Xeon thread using Automatic Offload to the Phi coprocessor gives about a 65% reduction in wallclock time. That runtime includes the data download to the Phi coprocessor. The speedup would probably improve if I could also do the inverse on the Phi coprocessor, using the LAPACKE_dpotri MKL call on the result from the Cholesky call, but I don't think dpotri is supported for Automatic Offload as yet. To my queries:

a/ Perhaps someone from Intel could give some guidance on when the dpotri call might be supported for Automatic Offload? I could have a crack at implementing the function but would prefer to use an optimised Intel version.

b/ Is there a list of what MKL calls are currently supported on the Phi environment and which are optimised? The release notes have incremental details so I guess I could put this together but it would be handy to have this already collated.

c/ As part of this work we used the MKL automatic parallelisation on just the Xeon cores (2 CPUs, 10 cores per CPU) with no coprocessor involvement (MKL_MIC_ENABLE=0). The runtimes dropped off nicely from 1 to 8 cores, the runtime for 9 cores was higher than for 8 cores, runtimes dropped nicely again up to 12 cores, increased at 13 cores, and then dropped for 14 cores, after which they were about the same as the 14-core case. For the Xeon core work we are using

KMP_AFFINITY=verbose,granularity=fine,compact,1,1

I'm a bit perplexed about the runtime increases at 9 and 13 cores.

Thanks in advance for any help

David

TimP — Thu, 23 Jan 2014 13:05:00 GMT

For the question about scaling vs. number of threads on the host, can we assume you have hyperthreading enabled but are trying to place each thread on a separate physical core? For comparison, you may want to try KMP_AFFINITY settings which explicitly balance the work between CPUs. Unfortunately, such settings must take into account whether hyperthreading is active. If you must use a single affinity setting, scatter, or OMP_PROC_BIND=spread, might make more sense for such a scaling study.

In my experience, straightforward BLAS implementation with the open source and Intel OpenMP compiler can be effective for relatively small cases which don't need the full capability of the coprocessor. It would take much expert development to optimize for large cases such as yours.

dehvidc1 — Thu, 30 Jan 2014 08:08:46 GMT

Thanks, Tim. Hyperthreading isn't enabled for this work. With respect to affinity settings, I think what I've set, as listed in the first post, will bind the threads to physical cores on alternate CPUs, i.e. it is inherently balanced. So I'm still perplexed about why we are seeing the anomalous behaviour at 9 and 13 cores.

TimP — Thu, 30 Jan 2014 13:33:10 GMT

If you have hyperthreading off in BIOS and are binding your host threads to odd numbered cores, that's probably an excellent setting for 10 threads split evenly between CPUs; I think you'd need to study the verbose output about what is happening beyond 10 threads. With an odd number of threads, you necessarily unbalance the work load between CPUs so might expect less performance per thread than with neighboring even numbers of threads.

At 13 threads, on the face of it, you are assigning 2 threads each to cores 1,3,5. Maybe this automatically uses neighboring even numbered cores, but you appear to have 8 threads on one CPU and 5 on the other.
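The arithmetic behind the 8-and-5 split can be sketched in a few lines of C. This is only an illustration of the reading above, not what the OpenMP runtime is known to do: it assumes threads are bound to the odd-numbered cores 1, 3, ..., 19, wrap back to core 1 after ten threads, and that cores 0-9 sit on CPU 0 and cores 10-19 on CPU 1. The function names are hypothetical.

```c
#include <assert.h>

enum { CORES_PER_CPU = 10, NUM_CPUS = 2 };

/* Core id for thread t under the ASSUMED placement: odd-numbered cores
   1, 3, ..., 19, wrapping back to core 1 after ten threads. */
int assumed_core_for_thread(int t)
{
    assert(t >= 0);
    int odd_cores = (CORES_PER_CPU * NUM_CPUS) / 2;  /* ten odd-numbered cores */
    return 2 * (t % odd_cores) + 1;
}

/* Count threads per CPU, ASSUMING cores 0-9 are on CPU 0 and
   cores 10-19 are on CPU 1. */
void threads_per_cpu(int nthreads, int counts[NUM_CPUS])
{
    assert(nthreads >= 0);
    for (int c = 0; c < NUM_CPUS; ++c)
        counts[c] = 0;
    for (int t = 0; t < nthreads; ++t)
        counts[assumed_core_for_thread(t) / CORES_PER_CPU]++;
}
```

Under these assumptions, 10 threads give counts of 5 and 5, and 13 threads give 8 and 5. The verbose output from KMP_AFFINITY shows the binding the runtime actually chose, so that is the thing to check against.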

nsmeds — Wed, 03 Dec 2014 12:44:45 GMT

MIC offload support

dpotrf and dpotri are now supported for MIC offload (as of MKL 11.2 update 1).

A quick-and-dirty trick to find out if a function is offload enabled is the following:

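1) Create a (very short) stub program such as the following:
void dpotrf();
int main(void){
dpotrf();
return 0;
}
Declare the function yourself rather than including the MKL header file, so that you can have a dummy call to the function without having to match the calling interface. The program only needs to compile and link; the steps below inspect the binary and never run it.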

2) Compile statically against the Intel MKL library
icc -mkl -static-intel x.c

3) Look for any appearance of the MKL offload environment variables in the generated binary
strings -a a.out | grep MKL_MIC_
MKL_MIC_THRESHOLDS_DGEMM
MKL_MIC_ENABLE
MKL_MIC_DISABLE_HOST_FALLBACK
MKL_MIC_RESOURCE_LIMIT
MKL_MIC_REGISTER_MEMORY

The above strings do not seem to appear if you call a function that is not offload enabled.

Gennady_F_Intel — Thu, 04 Dec 2014 03:44:59 GMT

The quickest way to see whether a computation is offloaded is just to set the OFFLOAD_REPORT environment variable. You will see a lot of information about the offloading process, and you don't need to change the original code at all in that case. See the compiler documentation for more about OFFLOAD_REPORT.