Hi,
I was wondering how an application works on the MIC in native mode. Specifically, how does the program end up on each core, and more precisely on each thread?
Thanks in advance.
You compile your native programs on the host processor. You copy the files to the MIC using scp or another remote file-copy program. The MIC appears as a device on a network.
Then you ssh to the MIC (you can keep this session open). This is no different from ssh-ing to any remote system.
When you copy your program, you may also need to copy the libraries that the program uses. Any libraries must be compiled for the MIC environment and its instruction set extensions/limitations.
From the ssh session, you start the application from the command line as you would on the host.
Look at the Intel Xeon Phi Coprocessor Developer's Quick Start Guide for an overview.
The application you write is programmed essentially like an application for your host processor. Instead of, say, 4 cores and 8 hardware threads, you may have 60 cores and 240 hardware threads, and your SIMD instructions can be extended to 512 bits (16 floats or 8 doubles).
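Putting those points together, a minimal native program might look like the sketch below. The compiler flags, the coprocessor hostname (mic0), and the runtime library name are typical values and may differ on your system; as noted above, the MIC build of any runtime library the program links against (for example libiomp5.so for OpenMP) also has to be copied to the card.

```c
#include <stdio.h>
#include <omp.h>

/* Build on the host and run natively on the coprocessor, for example
 * (flags and hostname are assumptions, adjust for your setup):
 *   icc -openmp -mmic hello_mic.c -o hello_mic
 *   scp hello_mic mic0:        # plus the MIC build of libiomp5.so if needed
 *   ssh mic0 ./hello_mic
 */
#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* The same OpenMP you would write for the host; on the coprocessor
       this can fan out over up to 240 hardware threads.               */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];        /* a candidate for 512-bit SIMD */

    printf("max threads = %d, c[10] = %.1f\n", omp_get_max_threads(), c[10]);
    return 0;
}
```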
An exceptionally good book to read is Intel Xeon Phi Coprocessor High-Performance Programming by Jim Jeffers and James Reinders, Morgan Kaufmann publishers, ISBN: 978-0-12-410414-3.
Jim Dempsey
The details of controlling the placement of threads on physical resources depend on which programming language you are using.
For OpenMP programs, the Intel compilers provide run-time support through the KMP_AFFINITY and KMP_PLACE_THREADS environment variables that make it very easy to specify the most commonly desired thread placement schemes.
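A small OpenMP program like the sketch below lets you see where each thread actually lands under different settings (sched_getcpu() is glibc-specific, and the environment-variable values in the comment are only illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() (glibc) */
#include <stdio.h>
#include <omp.h>

/* Report which logical processor each OpenMP thread is running on.
 * Try it under different affinity settings, e.g. (illustrative only):
 *   KMP_AFFINITY=compact ./where
 *   KMP_AFFINITY=scatter ./where
 *   KMP_PLACE_THREADS=60c,3t KMP_AFFINITY=compact ./where
 */
int main(void)
{
    #pragma omp parallel
    {
        #pragma omp critical
        printf("OpenMP thread %3d of %3d on logical CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```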
For serial programs the Linux "taskset" command is available. This is sometimes useful for testing.
At a lower level, the Linux "sched_setaffinity()" call allows you to specify a completely arbitrary set of "logical processors" where each thread is allowed to run. Most people won't need this, but it is sometimes useful for testing purposes.
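A minimal sketch of that lower-level interface, pinning the calling thread to a single (arbitrarily chosen) logical processor:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);               /* allow only logical CPU 1 */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on logical CPU %d\n", sched_getcpu());

    /* ... the serial work goes here ... */
    return 0;
}
```

For a whole process launched from the shell, the taskset equivalent would be something like `taskset -c 1 ./a.out`.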
Dear Jim Dempsey and John D.
Thanks again for the fast response.
Actually, I am looking for information about how the program code gets onto the cores of the MIC.
For example, in the SCC system, each core loads its executable image (the program plus libraries) into its own local L2 cache memory, and then each core executes its code based on the hosting core's ID.
Is it a similar approach on the MIC?
On MIC, you would use KMP_AFFINITY=... or KMP_PLACE_THREADS=... (read the documentation for the ... arguments).
There are other settings and/or APIs you can use.
What you are essentially doing is pinning a software thread to a specific hardware thread (within a core) or to any of a selected set of hardware threads in one or more cores. Typically on MIC you would test your program using 2, 3, or 4 threads per core, with each software thread pinned to a single hardware thread, preferably scheduling tasks that share data within the same core.
Unlike a host processor such as a Xeon or Core i7, the Xeon Phi requires at least two hardware threads per core to be active. You will have to test with 3 and 4 threads per core as well, and experiment with how to partition your work at various thread counts per core in order to obtain maximum performance.
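A rough way to run that experiment is to time the same kernel at several total thread counts. The sketch below assumes a 60-core part and an affinity setting that spreads threads evenly over the cores (for example KMP_AFFINITY=balanced), so that 120/180/240 threads correspond to 2/3/4 threads per core; the kernel and counts are placeholders.

```c
#include <stdio.h>
#include <omp.h>

#define N (1 << 24)
static float a[N], b[N];

int main(void)
{
    /* Assumes a 60-core coprocessor: 120/180/240 total threads give
       2/3/4 threads per core when the affinity setting spreads them
       evenly (e.g. run with KMP_AFFINITY=balanced).                  */
    int counts[] = { 120, 180, 240 };

    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    #pragma omp parallel
    {
        /* warm-up: create the thread pool before timing */
    }

    for (int t = 0; t < 3; t++) {
        omp_set_num_threads(counts[t]);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 1.0001f + b[i];
        printf("%3d threads: %.3f ms\n", counts[t],
               (omp_get_wtime() - t0) * 1e3);
    }
    return 0;
}
```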
If you have multiple Xeon Phi cards, you would normally use OpenMPI or the (asynchronous) offload mode of programming.
On Xeon Phi, each core has its own L1 and L2 cache, and at a longer latency it can fetch data from other cores' caches as well as from RAM. You as the programmer do not specify which chunk of RAM maps to which core's cache; rather, it is the execution history of the thread running on the core, and its memory references, that determine what data gets cached and when (it may steal data from a different core's cache in the process).
Jim Dempsey
Xeon Phi is not very much like the SCC system; it is much better to think of Xeon Phi as an ordinary SMP. On Xeon Phi, all of the cores are fully cache coherent and there is one OS running on the Xeon Phi that manages all of the cores, exactly the same as on any other multicore system.
Multi-threaded programs can be written using OpenMP, pthreads, or other programming languages that support shared-memory threading. OpenMP runs very well for programs with relatively coarse granularity and static scheduling.
The controls for mapping application threads to hardware resources are similar to those on other systems, but Xeon Phi has a few extra features that make it easier to use. The KMP_PLACE_THREADS environment variable was developed especially for Xeon Phi and it is quite useful for controlling the set of cores and threads that an OpenMP program can use.
