
Webinar: Optimizing LAMMPS* for Intel Xeon Phi coprocessors - Oct 29

BelindaLiviero

A new webinar has been scheduled. Details below:

 

Optimizing LAMMPS* for Intel® Xeon Phi™ Coprocessors


Wednesday, October 29, 2014 11:00 AM - 12:00 PM PDT

LAMMPS* is a large-scale atomic/molecular massively parallel simulator distributed by Sandia National Laboratories. The package is used in a variety of areas in Life Sciences research and development (and anywhere else concerned with molecular dynamics). Come learn what Intel specifically did to optimize LAMMPS to take advantage of Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors, and about the resulting performance from those optimizations.

Sign up here! >> https://www1.gotomeeting.com/register/530570520

 

 

1 Reply

BelindaLiviero

 

This webinar was recorded -- we now have the recording posted here:

https://software.intel.com/en-us/videos/optimizing-lammps-for-intel-xeon-phi-coprocessors

Below is a list of the questions that came up during the talk, with their respective answers--

---Q/A -----

Q: On systems like Stampede, is it best to use 16 MPI tasks on the CPU -- 1 task per core (no OpenMP)?

 

With and without offload, in the current LAMMPS, running all-MPI (one task per core, no OpenMP) is generally best on a single node. For scaling to large simulations with long-range electrostatics, using more OpenMP threads can often improve performance. With hyperthreading off, it can currently help to leave some cores free for the extra threads that the offload runtime generates for asynchronous offload. This might change in the future.

 

Q: On slide 39, how was that bar graph generated? Which tool was used for that?

 

In LAMMPS, I have added timers into the code that are printed at the end of every run. This is always the case, not just for debugging, which helps make root-causing issues reported by users easy. The graph was created in Excel from the timer output. The one exception was the “Imbalance” time; this is only available as a debug compile option and is basically just a timer around MPI_Barrier().
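As a rough illustration of that last point (a minimal sketch, not the actual LAMMPS timer code; compute_forces() is a hypothetical stand-in for the per-rank work), the idea is to time the work and then time a barrier; the time a rank spends waiting in MPI_Barrier() approximates how far ahead of the slowest rank it finished:

#include <mpi.h>
#include <cstdio>

// Hypothetical stand-in for the per-rank work being timed.
static void compute_forces() {
  volatile double sum = 0.0;
  for (int i = 0; i < 1000000; ++i) sum += i * 0.5;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double t0 = MPI_Wtime();
  compute_forces();                    // the timed section of the timestep
  double work = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);         // wait here for the slowest rank
  double imbalance = MPI_Wtime() - t0; // waiting time ~ load imbalance

  printf("rank %d: work %.6f s, imbalance %.6f s\n", rank, work, imbalance);
  MPI_Finalize();
  return 0;
}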

 

Q: Can you use the Intel Package on a system without any Phis?

 

Yes. The same code is used for execution on the CPUs and the Phis. If not using Intel coprocessors, you can even compile with non-Intel compilers and run anywhere that LAMMPS will run.

 

Q: How does the Intel v15 compiler eliminate the need to zero out calcs for code with a lot of conditionals?

 

It does not eliminate this; it just removes the need to alter the structure of the branches. With earlier compilers, using for example:

double eng = 0.0;  // must be zeroed for iterations where the branch is not taken
if (r < cutoff) {
  eng = calculate_energy(…);
}
tot_eng += eng;

would perform worse than:

eng = calculate_energy(…);  // always compute; discard the result outside the cutoff
if (r < cutoff) tot_eng += eng;

for the AVX target.

 

Q: Do you have results with two Phis per node?

 

I haven’t collected a lot of data using this configuration, but it is supported by LAMMPS.

 

Q: Can you explain the neighbor list padding a little more? Does aligning on a 64-byte boundary not handle this already?

 

If, for example, there are 18 neighbors for a single atom and the vector width is 16, the last iteration of the vectorized loop only uses 2 of the 16 data lanes. There are different ways this can be handled, but generally the compiler will generate separate code for the case where not all 16 data lanes are used. Running this separate “remainder” code can sometimes cause a performance hit for multiple reasons. By padding, all 16 data lanes are always used, and we do this in a way that cannot influence the result.

 

This will often be unnecessary, and newer versions of the Intel® VTune™ Amplifier will tell you if this is a problem for your code.
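As a rough sketch of the padding idea (this is not the actual USER-INTEL code; pad_neighbor_list and the dummy-atom scheme are illustrative assumptions), each atom's list is grown to a multiple of the vector width with the index of a dummy atom positioned beyond the cutoff, so the padded lanes contribute exactly zero:

#include <vector>

const int VECTOR_WIDTH = 16;  // data lanes assumed for the coprocessor

// Pad one atom's neighbor list up to a multiple of the vector width.
// dummy_atom indexes an atom placed outside the cutoff, so the extra
// lanes compute forces and energies of exactly zero.
void pad_neighbor_list(std::vector<int> &neighbors, int dummy_atom) {
  while (neighbors.size() % VECTOR_WIDTH != 0)
    neighbors.push_back(dummy_atom);
}

With the list length always a multiple of the vector width, the compiler's remainder code is never executed for these loops.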

 

Q: How is the LAMMPS code optimized for handling the random access inside the force calculation? Specifically, I would expect poor performance when trying to load the neighboring particles' data, since this is non-contiguous. You mention vectorization with unit stride, but isn't your x,y,z,type data structure implying a 4-unit stride, in which case I would guess the alignment shouldn't matter?

 

There is a performance penalty for random access and without changing the algorithm, gather instructions will be used on Xeon Phi regardless of the atom data structure. Periodically sorting the atom data in memory so that atoms that are nearby in space are also nearby in memory helps to improve data locality and can reduce the number of instructions retired for gather. This was already used in LAMMPS before the optimizations presented here.
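A simplified sketch of that sorting idea (a stand-in for the reordering LAMMPS already performs, which bins in three dimensions rather than one):

#include <algorithm>
#include <vector>

struct Atom { float x, y, z; int type; };

// Reorder atoms by a coarse spatial bin along x so that atoms nearby in
// space become nearby in memory, shrinking the address range touched by
// each gather.
void sort_by_bin(std::vector<Atom> &atoms, float bin_size) {
  std::sort(atoms.begin(), atoms.end(),
            [bin_size](const Atom &a, const Atom &b) {
              return static_cast<int>(a.x / bin_size) <
                     static_cast<int>(b.x / bin_size);
            });
}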

 

Approaches to mitigate the penalty for gathering data into vector registers have been developed by others, including the use of explicit vector intrinsics to perform fast gathers for atom data, and approaches that build neighbor lists using groups of atoms with data packed to improve vector performance for each group. Because the vector intrinsics are architecture- and floating-point-precision specific, we have chosen to omit this optimization in favor of portable code with reduced complexity. For the same reasons, we have not explored neighbor groups, due to the large variance in typical neighbor list sizes for the many simulation models available in LAMMPS. We note, however, that both optimizations have the potential to improve performance further over the measurements presented here, and that new features in version 15 of the Intel® compilers can potentially be exploited for explicit vectorization of only the atom data gather.

 

Alignment and data structure do play some role, however. For the AVX target (on CPUs) it is better to use {x,y,z,type} rather than {x,y,z}. This also prevents the data for a single atom from being split across cache lines.
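In code, that layout looks roughly like this (a sketch assuming single precision, not the actual LAMMPS declaration):

// 16 bytes per atom on common ABIs: four atoms per 64-byte cache line,
// so a single atom's data never straddles a cache-line boundary.
struct AtomPacked {
  float x, y, z;  // position
  int   type;     // atom type carried in the fourth lane
};
static_assert(sizeof(AtomPacked) == 16, "expected 16-byte packed atoms");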

 

Q: Can you explain the precision handling a little more? Are particle positions stored in single or double precision? How do you prevent loss of precision when summing the energy or virial?

 

In mixed precision, the atom data and most calculations are stored and performed in single precision (in memory and registers). For any accumulated value (forces, torques, virials, energy), the results are stored in double precision (in memory and registers) and the summation is performed in double precision. For the vectorized force calculation, this means that the results are converted to double precision at the end of a loop iteration and only 8 data lanes are used for the vector computation (instead of 16).
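A minimal sketch of that pattern (not the actual LAMMPS kernel; the pair term is a placeholder): the positions and the distance test stay in single precision, while the accumulator is held in double precision:

#include <vector>

struct AtomPacked { float x, y, z; int type; };

// Energy for atom i over its neighbor list: single-precision math for
// the pair computation, double-precision accumulation of the results.
double accumulate_energy(const std::vector<AtomPacked> &atoms,
                         const std::vector<int> &neighbors,
                         int i, float cutsq) {
  double eng = 0.0;  // accumulator kept in double precision
  for (size_t jj = 0; jj < neighbors.size(); ++jj) {
    const AtomPacked &a = atoms[i];
    const AtomPacked &b = atoms[neighbors[jj]];
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float rsq = dx * dx + dy * dy + dz * dz;  // single precision
    if (rsq < cutsq)
      eng += 1.0f / rsq;  // placeholder pair term, widened on the add
  }
  return eng;
}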

 

Q: What is the easiest way to use the Intel MIC force calculation in other in-house codes? Can the Intel code be called as a library as well?

 

Off the top of my head, I don’t see any problems with using the Intel Package as part of a LAMMPS library called from other codes, unless routines outside of LAMMPS try to set or alter the thread affinity on the coprocessor (if used). Also, if using offload, the Intel compiler might be needed for linking with the LAMMPS library in order to ensure that the correct libraries are included.
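For reference, driving LAMMPS through its C library interface looks roughly like this (a sketch based on the src/library.h API of this era; check the signatures against your LAMMPS version, and note that the input file name is a placeholder):

#include <mpi.h>
#include "library.h"  // LAMMPS C library interface (src/library.h)

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  void *lmp = 0;
  lammps_open(0, 0, MPI_COMM_WORLD, &lmp);      // create a LAMMPS instance

  lammps_command(lmp, (char *)"suffix intel");  // select the intel/* styles
  lammps_file(lmp, (char *)"in.lj");            // placeholder input script

  lammps_close(lmp);                            // destroy the instance
  MPI_Finalize();
  return 0;
}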

 

Q: Is this optimized/modified LAMMPS distribution available for download?

 

Yes, it is available in the current LAMMPS downloads.

 

Q: (1) Can I understand the USER-INTEL package in this way: if I need to run LAMMPS with the ReaxFF force field, then during the compile stage I need to include both the USER-INTEL package and the ReaxFF library?

 

Yes, although the Intel package does not currently have any optimizations for ReaxFF.

 

Q: (2) Could you please share a Makefile in the end?

 

There are several example Makefiles included with the LAMMPS download:

 

src/MAKE/OPTIONS/Makefile.intel_cpu    # Build without support for offload to a coprocessor
src/MAKE/OPTIONS/Makefile.intel_phi    # Build with support for offload
src/MAKE/MACHINES/Makefile.stampede    # Example makefile for TACC Stampede
src/MAKE/MACHINES/Makefile.beacon      # Example makefile for ORNL Beacon

 

 

 
