Hello.
I'm trying to run the precompiled Intel AO (Automatic Offload) benchmark. For some reason, when running it the MIC never goes beyond 50% utilization, as if HT were disabled on it.
By default all cores should be used. Am I not setting up my environment variables properly?
Running 2x Haswell 2690 CPUs in a Dell R730 with a B1PRQ-5110P/5120D MIC card.
Hello Paulius,
Can you please give some more details about your MIC-related environment variables, and let me know what affinity/threading configuration you are trying to achieve?
You may also refer to the following link for further information about setting the environment variables for the Intel Xeon Phi:
https://software.intel.com/en-us/articles/openmp-thread-affinity-control
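For example, for an offload run you could start from something like the following (just a sketch; 240 assumes a 60-core card with 4 threads per core, so adjust for your model):
export MIC_ENV_PREFIX=MIC          # forward MIC_*-prefixed variables to the coprocessor
export MIC_OMP_NUM_THREADS=240     # 60 cores x 4 threads/core; some users reserve the last core for the offload daemon and use 236
export MIC_KMP_AFFINITY=balanced   # spread threads evenly across the cores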
I'm following the Intel guide on running the precompiled mp_linpack HPL. According to the documentation, the default behaviour should exploit all the cores without you needing to set any environment variables other than MKL_MIC_ENABLE=1.
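To be exact, the only MIC-related variable I export before launching is:
export MKL_MIC_ENABLE=1    # enable MKL Automatic Offload to the coprocessor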
Hi Paulius,
I see that you are using the Automatic Offload version of mp_linpack. 50% core utilization is expected behavior: the configuration is hard-coded for optimal performance in the Automatic Offload version, and it won't honor any OMP-related environment variables. To see 100% utilization, you can try running Linpack natively on the card using SMP Linpack rather than the MP offload Linpack.
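A rough sketch of a native run (the binary and input file names here are illustrative; take the MIC-native SMP Linpack from the benchmarks directory of your MKL installation):
scp xlinpack_mic lininput_mic mic0:/tmp/           # copy the MIC-native binary and its input to the card
ssh mic0 'cd /tmp && ./xlinpack_mic lininput_mic'  # run it directly on the coprocessor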
Let me know if you have any questions.
I compiled my own code for the MIC to run natively through MPI, in collaboration with the host Xeon cores. I cannot get above 50% that way either. Do you need to be running a threaded program to get past that, or would MPI processes work? Also, the card crashes when I try to go above 120 processes.
If you're trying to duplicate methods used at major computer centers to quote linpack performance, you must read up on how it was done. For sure, an MKL threaded image was run on each coprocessor, with data set size optimizations and a lot more.
There's unlikely to be any advantage in attempting multiple MPI ranks per core. Even for applications which don't scale with threads to the extent this benchmark does, choices such as 6 ranks of 30 or 40 threads each are likely to perform much better than an excessive number of ranks on a single coprocessor, with attention paid to balancing work between host and coprocessor.
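For instance, with Intel MPI a launch along those lines might look like this (a sketch only; app.host and app.mic are placeholder names for binaries built for each architecture):
export I_MPI_MIC=enable    # allow ranks to run on the coprocessor
mpirun -host localhost -n 2 -env OMP_NUM_THREADS 12 ./app.host : \
       -host mic0 -n 6 -env OMP_NUM_THREADS 30 ./app.mic    # 6 ranks x 30 threads on the card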
I am trying to set up MICs for an HPC cluster. I'm trying to run a hybrid version of Linpack by spawning one process per node and one process per MIC, and setting the number of threads on each accordingly. The problem is that the threads are not spawning on the MIC properly, and it seems that an excessive number of threads gets spawned on the host. When using MPI processes on the MIC I cannot get past the 50% load mark.
In addition, is there a big difference between HPL offload and HPL hybrid? In offload, 50% is not satisfactory; I would like the card to be utilized completely.
Running a molecular dynamics program with FFTW/MKL, I managed to get the card loaded to 100%.
Another thing: one of the codes I will need to run is not threaded. I managed to compile it for the MIC and it runs now, but of course when I run on the Haswell and the MIC in tandem the performance is bad because of load imbalance.
Thank you for your prompt responses.
Hi,
just to add a "me too" but with a twist. This is for a single Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz host with two B1PRQ-5110P Xeon Phi cards.
The benchmark results I get for the mp_linpack xhpl_offload_intel64 binary with 16 Xeon cores + 2 Phi cards are usually around 1100 Gflops, and in those cases "micsmc" reports the cores as being only 50% busy. That is only 47% of the theoretical peak (which is 333 + 2*1010 = 2353 Gflops for this configuration).
However, every now and then (1 out of 10 identical runs) a job keeps the cards busy and I get up to 1926 Gflops, about the expected result (82% of peak).
I did some digging but have not been able to figure out why the offload code sometimes launches more and sometimes fewer hyperthreads per core (suspects: a bug in the offload code? autodetection that depends on the card's sleep state?). However, all the benchmark data show that DGEMM performance is best with 4 hyperthreads/core. Is there a hidden environment variable I could try to force more threads on the card?
Regards,
Bart
FYI, we call them threads, not hyperthreads, on the Intel Xeon Phi coprocessor. You can find an interesting discussion of why in this forum post.
But enough trivia. Now to the real issue.
Paulius, Bart, you are both using micsmc to measure the coprocessor utilization? Even if a core is only 50% utilized, that does not mean there are fewer than 4 threads 'running' on the core. Some of the threads may be waiting on cache/memory, a synchronization point, or whatever. Are you using any other method to determine the number of threads being spawned? I don't necessarily believe it is varying.
Paulius, when you use MPI on the coprocessor, each rank requires its own memory. If you have 120 MPI ranks, each processing a hefty chunk of data, you can use up all the memory, and since there is no swap on the coprocessor, when you run out of memory, you crash.
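You can watch for this from the host while the job runs, e.g.:
ssh mic0 'head -2 /proc/meminfo'    # MemTotal / MemFree on the coprocessor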
As Tim pointed out, there is not likely to be an advantage to running multiple MPI ranks per core. If nothing else, you will make poor use of your cache (if you don't run out of memory first). Depending on where the parallelism is in the code, you may find that a fairly small number of MPI ranks, each using OpenMP, will give better results. For a code like mp_linpack, given a data set large enough to make good use of the coprocessor, using straight offload with OpenMP (in this case the automatic offload) can be enough for good results.
Again as Tim pointed out, getting every last tick out of the system is an art. But getting every last tick out of the system isn't really the goal. The goal is to have your codes run enough better and faster, without you having to stand on your head, that you can do things you couldn't do before.
So, the question is why mp_linpack gets 50% core utilization and, more interestingly, why Bart's results occasionally jump to 82% of peak for no apparent reason. Bart, do you have any other tools (like VTune) that you can use to get some more details on the running code?
Frances,
thanks for clarifying the thread/hyperthread terminology.
Our cluster license does not include VTune, but I could try it out with a trial license. Anyway, for now I have two runs following each other on the same node, with some analysis:
Run 1: 1.77042e+03 Gflops, power mic0/mic1: 253/248 Watts
Run 2: 1.10924e+03 Gflops, power mic0/mic1: 218/213 Watts
but everything else is pretty much the same: both have 121 threads running on the MIC (unlike what I suspected, it really is 2 threads per core in both runs), both show 50% core utilisation according to a "micsmc -a" snapshot taken 2 minutes after the start, and temperatures are also similar despite the higher wattage.
I have attached the output of the runs, collected like this:
for i in 1 2; do
    xhpl_offload_intel64 &
    sleep 120
    ssh mic0 'ps -eLf | grep xhpl_mic | grep -v grep | wc -l'
    ssh mic1 'ps -eLf | grep xhpl_mic | grep -v grep | wc -l'
    micsmc -a
    wait
done
In the end I found the reason for this mystery: the Phis were fine, but the Sandy Bridge host had its frequency scaled back from 3.0 GHz to 1.2 GHz between two xhpl runs, which obviously caused the slowdown. This is still really odd, especially since cpufreq was not active on the host (the nodes are set to run at 3.0 GHz continuously, only flipping between C0 and C1 states), but it's certainly not the Phi cards that were to blame. My apologies.
Hi Bart,
a 'scale down' to 1.2 GHz sounds normal for a Sandy Bridge machine: it's not cpufreq that does this, but the ACPI module; ACPI is needed for things like Turbo Boost as well. The best way to measure the real clock speed during an application run is the 'turbostat' tool from the pm-tools package; it will display the actual frequencies used by the CPU (not the Phi!).
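For example, turbostat can wrap a measurement window and print the actual MHz per core; here 'sleep 60' just defines the sampling window, so you would run it alongside your benchmark:
turbostat sleep 60    # report actual core frequencies while 'sleep 60' runs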
The other mechanism that will slow down the Sandy Bridge systems is the C1E state. If all of the cores on a chip are in the C1 idle state, the chip will transition to C1E, which includes dropping the core and uncore frequencies to 1.2 GHz. This will happen even if cpufreq (or some other mechanism) has requested that the cores stay at maximum frequency.
The easiest way to work around this problem is to run a "dummy" process on each socket that just sits and spins. This will keep one core active, and allow the uncore to run at the same frequency as the active core. If you are not sure whether the dummy process might conflict with "real" user processes, you can use "nice" to set its priority to the minimum value.
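A minimal sketch (the core numbers are illustrative; pick one core on each socket for your topology):
nice -n 19 taskset -c 0 sh -c 'while :; do :; done' &    # low-priority spinner pinned to a core on socket 0
nice -n 19 taskset -c 8 sh -c 'while :; do :; done' &    # low-priority spinner pinned to a core on socket 1 (assuming core 8 lives there)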
Hi,
We have a similar cluster to yours, but I can only get 10% of peak. Could you tell us which environment variables you used?
