Hi,
I am fairly new to OpenMP and Phi programming, but I am working on a program that performs model fitting over a large number of independent models. The model fitting routine boils down to a few calls to dgemm and dgesv on small (~50-100 x 1000) matrices. I've implemented it with MKL and use OpenMP to parallelize over the possible models, and it is pretty sporty on the dual E5-2609 machine I am using. I wanted to see how it performed on the installed Phi card (3120A), so I compiled it to run as a native application with -mmic, transferred it to mic0, and was surprised that it was significantly slower than on the host. With OMP_NUM_THREADS=244 and KMP_AFFINITY=compact,granularity=fine, the result is about 4x slower than the host with OMP_NUM_THREADS=8 and KMP_AFFINITY left at its default.
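For concreteness, the structure described above might look like the following sketch. This is an assumption about the code, not Andrew's actual program; matmul_naive is a stand-in for the MKL cblas_dgemm call so the example is self-contained, and fit_all_models is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in for cblas_dgemm: c = a * b, with a (m x k), b (k x n), c (m x n),
   all row-major. In the real program this would be an MKL call. */
static void matmul_naive(const double *a, const double *b, double *c,
                         size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (size_t p = 0; p < k; ++p)
                s += a[i * k + p] * b[p * n + j];
            c[i * n + j] = s;
        }
}

/* Fit n_models independent models: each model multiplies the shared
   design matrix a (m x k) by its own data block in b, writing its
   result into its own slice of c. */
void fit_all_models(const double *a, const double *b, double *c,
                    size_t m, size_t k, size_t n, int n_models)
{
    /* The iterations are independent, so the loop parallelizes cleanly.
       The pragma is simply ignored if not compiled with -fopenmp/-qopenmp. */
    #pragma omp parallel for schedule(dynamic)
    for (int mod = 0; mod < n_models; ++mod)
        matmul_naive(a, b + (size_t)mod * k * n,
                     c + (size_t)mod * m * n, m, k, n);
}
```

With threaded MKL swapped in for matmul_naive, each OpenMP thread would also spawn its own MKL threads unless MKL is restricted to one thread, which is exactly the interaction asked about in question (3) below.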
So in order to get some idea of what's going on, I compiled with "-debug inline-debug-info", fired up the VTune GUI, and set up a project that runs micnativeloadex (with my application as the parameter and all the environment variables set up properly). I ran a Knights Corner Hotspot analysis and found that, by far, most of the time is spent in "[vmlinux]", which I assume is the Linux kernel running on the Phi card: more than 10x the time spent in the MKL dgemm routines. My questions are the following:
(1) Is the [vmlinux] result an artifact of how I set up the VTune hot spot analysis? If so, how should I set it up so that I get useful information? If it really is spending all its time in the kernel, what does that tell me and how do I fix it?
(2) Generally speaking, why is my native MIC application so much slower than the host application with so many fewer cores? Where should I look to improve performance?
(3) Are there any best practices for using MKL functions inside OpenMP threads? I wonder if the multi-threading in MKL is competing with the independent model threads.
Thanks,
Andrew
I've been playing with it some more and when I don't use multiple OpenMP threads to test independent models, I find that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield... When I set OMP_NUM_THREADS=114, my code runs about 30% faster. What is going on here? How do I reduce all this overhead?
Welcome, Andrew. You have a lot of new technology to discover and a lot of questions. Let me try at least to start helping you discover some answers. Hopefully you'll discover that there are a lot of articles written in the Intel Developer Zone that can help answer some of your questions.
Taking your last question first: yes, you may have a problem calling Intel Math Kernel Library (Intel MKL) functions that have been threaded from code that is also threaded. The result can be an explosion of threads that may just bog down the machine. There is a special Intel MKL library layer with single-threaded versions that can be used in cases where the parallelism originates in the user code. The basic problem scenario you describe--a lot of parallel work trying alternatives that each amount to a smallish matrix computation using gemm and gesv--would probably be best served by this model. You can find out more about using Intel MKL in a threaded environment in this article about parallelism in Intel MKL.
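A minimal sketch of that advice in practice: keep MKL single-threaded inside the user-level OpenMP threads, either at run time via the environment or at link time via the sequential layer. The thread counts here are only illustrative.

```shell
# One MKL thread per dgemm/dgesv call; parallelism comes from the model loop.
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=244

# Alternatively, link against the sequential MKL layer instead of the
# threaded one when building the native binary:
#   icc model_fit.c -mmic -mkl=sequential -qopenmp
```

The equivalent can also be done per-call-site in code with mkl_set_num_threads(1) before entering the OpenMP region.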
One of the biggest problems people encounter when porting to the coprocessor is providing enough parallel slack to take advantage of the bandwidth and computational power available there. Often a first VTune Amplifier run reveals most of the activity in functions that represent idle time. Another contender for the top position is [vmlinux], the module comprising the coprocessor kernel. But the kernel does many different things, and the easiest way to disambiguate it and divide it by function is to supply VTune Amplifier with the location of the kernel symbol maps. For Intel MPSS 2 the location is /lib/firmware/mic (NOTE: this location changes with Intel MPSS 3.0). Adding this path to the symbols and objects tab in the VTune Amplifier project properties lets the program attribute those events to the various kernel functions, which should remove [vmlinux] (the notation for modules without available symbols) and distribute those event counts to actual functions lower in the list.
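For the command-line collector, the same symbol location can be supplied with -search-dir. This is a sketch only: the analysis-type name and the symbol path depend on your VTune Amplifier and MPSS versions, and ./myapp stands in for the actual binary.

```shell
# Collect a KNC hotspots profile, telling VTune Amplifier where the
# coprocessor kernel symbols live so [vmlinux] time gets attributed
# to real kernel functions (MPSS 2.x path shown):
amplxe-cl -collect knc-hotspots \
          -search-dir /lib/firmware/mic \
          -- micnativeloadex ./myapp
```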
The processor design on the Intel Xeon Phi coprocessor currently uses an architecture designed to be low power and to drive the vector units; success with the coprocessor depends on maximizing parallel slack and minimizing serial time, since by Amdahl's Law even a tiny fraction of serial time will seriously inhibit peak parallel performance. Hopefully the hints I've provided here will get you started, but there's lots of information about the coprocessor available on the web at http://software.intel.com/mic-developer. Various manuals and articles on specifics are available there to guide you through the porting and optimization processes.
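The Amdahl's Law point can be made concrete with a one-line calculation (illustrative numbers, not measurements from Andrew's program):

```c
/* Predicted speedup for a code with serial fraction s on n processing
   units, per Amdahl's Law: S(n) = 1 / (s + (1 - s) / n). */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / (double)n);
}

/* For example, with 244 hardware threads even a 1% serial fraction
   caps the achievable speedup at roughly 71x, not 244x. */
```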
Hi,
I think Andrew's problem is much simpler (if I am right, of course...):
It is just that libiomp5.so is not found, so the program exits immediately, and that's why it doesn't take much time to run!
In any case, I found results similar to Andrew's. I realised that when I used the command line "amplxe-cl" I obtained the following error:
/home/jofre/mic0fs/a.out: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory
In any case, that's my issue (I just found this thread while trying to solve it...)
Have a good day,
Jofre
No, Andrew wouldn't have got as far as he reports if he hadn't made compiler/lib/mic/libiomp5.so visible on the coprocessor, e.g. by mounting and setting LD_LIBRARY_PATH, or copying it to /lib64/.
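One common way to make the runtime visible, sketched below; the library path varies by compiler version, and micnativeloadex's -e option passes environment variables to the native process.

```shell
# Copy the MIC-native OpenMP runtime to the coprocessor (path is an
# example; use the lib/mic directory of your compiler installation):
scp /opt/intel/composerxe/lib/mic/libiomp5.so mic0:/tmp/

# Run the native binary, telling the loader where to find it:
micnativeloadex ./a.out -e "LD_LIBRARY_PATH=/tmp"
```

NFS-mounting the host's compiler lib/mic directory on the card works just as well and avoids re-copying after compiler updates.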
You may be right...
(I just explained my problem in a different post at http://software.intel.com/en-us/forums/topic/507646)
Jofre