Hello...
I am a retired theoretical physical chemist with a long association with computers and computing.
As briefly as possible, my interests are in the behavior of fluids at a phase boundary, such as a real gas at a solid surface: the attractive forces of the solid cause an increased concentration (density) of the gas in the region near the surface, a measurable phenomenon called "adsorption". Thermodynamics requires that, at equilibrium at constant temperature and pressure, all parts of the system have the same "chemical potential", and that the distribution of densities also yield the minimum value of the free energy for the system.
A fairly recent way of treating such problems is "Density Functional Theory". The governing integral equation can be expressed as a series of terms, each of which depends on the density at the system coordinates (hence a functional).
One of the terms in the relationship is a type of N-body problem, in which it is necessary to calculate the potential (a scalar quantity) at each of N volume elements due to all other N volume elements in the system; an O(N^2) problem. Usually, the Lennard-Jones potential is used.
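In my own shorthand, with rho(r_j) the density in volume element j and Delta V_j its volume, that term is roughly of the form

\[
\Phi(\mathbf{r}_i) \;\approx\; \sum_{j \ne i} u_{\mathrm{LJ}}\!\big(\lvert \mathbf{r}_i - \mathbf{r}_j \rvert\big)\, \rho(\mathbf{r}_j)\,\Delta V_j,
\qquad
u_{\mathrm{LJ}}(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right],
\]

evaluated at every one of the N volume elements.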
Since the fluid (gas) is inhomogeneous, every volume element is characterized by its own density. The solution to the problem is the distribution of densities that minimizes the free energy of the system. This can be done by a numerical iterative method, but convergence is slow, so many iterations are required. Naturally, the Lennard-Jones term uses well over 98% of the time. A reasonable value for N in a small (nanoscale) 3-D model is a grid with a few million points.
The fastest CPU system I have at my disposal is an Intel 5960X (8 cores / 16 threads) on an X99 chipset, running happily at 4.3 GHz (lspci reports it as a Xeon E5 v3 / Core i7). Using Intel's latest C tools, and after a lot of tuning, the inhomogeneous L-J problem I have outlined maxes out at 342 seconds for 1e6 points (1e12 interactions). I estimate about 30 double-precision ops per interaction, so roughly 87 Gflop/s. (This is on a Scientific Linux 6.6 platform.) The system monitor shows all threads at 100%.
This would not really be intolerable for exploring a small but meaningful model ... I'm retired, after all. But it is frustratingly slow for finishing the task, since developing the heuristics to get efficient and reliable convergence takes a lot of experimentation. As far as I know, a fully 3-D non-local DFT has never been explored.
Briefly, I rasterize the 3-D grid into an array of structs of float x, y, z coordinates (call it vecA) and a similar vector with an additional member, float dens (vecB). The calculation is a nested loop: each element of A looks at each element of B, calculates a function of their separation times the density at B, and sums it into a temporary variable. After all of B has been processed, the sum is stored in a third vector (vecC) at the index of the A element. So no data is changed during the calculation; the result is stored in vecC, and only the densities are changed between calls to the function.
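In outline, the kernel looks roughly like the C sketch below (the names vecA, vecB, vecC, npts, sigma, and eps are illustrative, and the 12-6 Lennard-Jones form is assumed):

/* Compile with OpenMP enabled, e.g. icc -qopenmp.                        */
/* Array-of-structures layout as described above; names are illustrative. */
typedef struct { float x, y, z; } Point;          /* vecA: grid coordinates      */
typedef struct { float x, y, z, dens; } Source;   /* vecB: coordinates + density */

/* For every element of A, sum a Lennard-Jones-weighted contribution from
   every element of B. Nothing in A or B is modified; only the densities
   in vecB change between calls.                                          */
void lj_field(const Point *vecA, const Source *vecB, float *vecC,
              long npts, float sigma, float eps)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < npts; ++i) {
        float sum = 0.0f;
        for (long j = 0; j < npts; ++j) {
            float dx = vecA[i].x - vecB[j].x;
            float dy = vecA[i].y - vecB[j].y;
            float dz = vecA[i].z - vecB[j].z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 > 0.0f) {                      /* skip the self term */
                float s2 = (sigma * sigma) / r2;
                float s6 = s2 * s2 * s2;
                sum += 4.0f * eps * (s6 * s6 - s6) * vecB[j].dens;
            }
        }
        vecC[i] = sum;
    }
}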
So my questions are:
Do you think I'm getting about the maximum from my present CPU? (That is already about 92x an unoptimized single thread.)
How much more could I expect from a Xeon Phi, and which one? (I'd really like a factor of 10...)
J. P. Olivier
James,
you may find this case study useful: http://research.colfaxinternational.com/file.axd?file=2014%2f11%2fColfax-Optimization.pdf
Depending on how your code is implemented, there may be low-hanging fruit for optimization: converting your data structures from arrays of structures of coordinates to arrays of coordinates (structure of arrays). Feel free to contact me if you would like a deeper study of your code on a modern CPU or a Xeon Phi.
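As a rough illustration of what I mean (the type and variable names below are placeholders, not taken from your code), the conversion might look like this:

/* Array-of-structures element, as in your description. */
typedef struct { float x, y, z, dens; } Source;

/* Structure-of-arrays layout: each coordinate (and the density) lives in
   its own contiguous array, so the compiler can issue unit-stride vector
   loads instead of gathering x, y, z out of interleaved structs.         */
typedef struct { float *x, *y, *z, *dens; } GridSoA;

/* Copy from AoS to SoA; the four arrays are assumed to be pre-allocated
   with room for npts elements.                                           */
void aos_to_soa(const Source *aos, GridSoA *soa, long npts)
{
    for (long j = 0; j < npts; ++j) {
        soa->x[j]    = aos[j].x;
        soa->y[j]    = aos[j].y;
        soa->z[j]    = aos[j].z;
        soa->dens[j] = aos[j].dens;
    }
}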
Andrey
The Colfax International PDF contains a good overview of how to restructure an N-body program to achieve good performance on Xeon Phi. It also provides hints on improving scalar tuning. That said, their N-body calculation requirements may be somewhat different from yours.
The first thing I suggest you do is read the Colfax International PDF and see whether you can apply some (or all) of the techniques to your 5960X. You might see an additional 2x if you missed something.
The second thing I suggest is to look at the problem in light of maintaining good vectorization while reducing the number of computations. Some fluid dynamics codes see good improvements by tiling the domain and then using the equivalent of a barycenter for distant tiles, though this may increase the error/uncertainty of the results; a rough sketch of that idea follows below. If you have time on your hands, look at the fluidanimate benchmark program in the PARSEC benchmark suite (http://parsec.cs.princeton.edu/download.htm). Unfortunately this is a 3 GB download. This program (fluidanimate) is not targeted at Xeon Phi, but it does illustrate one method for tiling a fluid dynamics problem.
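The sketch below is only the distant-tile idea, not code from the PDF or from fluidanimate; all names, the 12-6 Lennard-Jones form, and the summary structure are my own illustrative assumptions:

/* Hypothetical tile summary: density-weighted centre and total density.
   For tiles far from the target point, one evaluation against this
   summary replaces the per-point sum, at some cost in accuracy.          */
typedef struct {
    float cx, cy, cz;   /* density-weighted centre ("barycenter") */
    float dens_sum;     /* total density in the tile              */
} TileSummary;

/* Build the summary for one tile from its member points. */
TileSummary summarize_tile(const float *x, const float *y, const float *z,
                           const float *dens, long n)
{
    TileSummary s = {0.0f, 0.0f, 0.0f, 0.0f};
    for (long j = 0; j < n; ++j) {
        s.cx += x[j] * dens[j];
        s.cy += y[j] * dens[j];
        s.cz += z[j] * dens[j];
        s.dens_sum += dens[j];
    }
    if (s.dens_sum > 0.0f) {
        s.cx /= s.dens_sum;  s.cy /= s.dens_sum;  s.cz /= s.dens_sum;
    }
    return s;
}

/* 12-6 Lennard-Jones evaluated from the squared separation. */
static float lj(float r2, float sigma, float eps)
{
    float s2 = (sigma * sigma) / r2;
    float s6 = s2 * s2 * s2;
    return 4.0f * eps * (s6 * s6 - s6);
}

/* Far-field contribution at one target point: nearby tiles would still be
   summed exactly (not shown); distant tiles collapse to their barycenter. */
float far_field(float px, float py, float pz,
                const TileSummary *far_tiles, int ntiles,
                float sigma, float eps)
{
    float sum = 0.0f;
    for (int t = 0; t < ntiles; ++t) {
        float dx = px - far_tiles[t].cx;
        float dy = py - far_tiles[t].cy;
        float dz = pz - far_tiles[t].cz;
        sum += lj(dx*dx + dy*dy + dz*dz, sigma, eps) * far_tiles[t].dens_sum;
    }
    return sum;
}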
Third, re-review the code to assure it is making use of the full width of the AVX2 registers on your 5960X. Although this may not have as significant a performance impact on the host (and it may require moving particles from one domain/tile to another), the return on your investment in programming time will likely come back when you migrate to a system with wider vectors (Xeon Phi).
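For example, with a structure-of-arrays layout the inner sum can be written so the compiler reports full-width vectorization. The sketch below is only illustrative (names and the 12-6 form are assumed); compile with OpenMP SIMD support (e.g. -qopenmp-simd) and check the optimization report:

/* Inner sum over all source points for one target point, written so the
   compiler can vectorize it across the full register width: unit-stride
   loads from a structure-of-arrays layout, no aliasing (restrict), and a
   reduction on 'sum'.                                                    */
float point_sum(float px, float py, float pz,
                const float *restrict bx, const float *restrict by,
                const float *restrict bz, const float *restrict bd,
                long npts, float sigma, float eps)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (long j = 0; j < npts; ++j) {
        float dx = px - bx[j];
        float dy = py - by[j];
        float dz = pz - bz[j];
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 > 0.0f) {                       /* skip the self term */
            float s2 = (sigma * sigma) / r2;
            float s6 = s2 * s2 * s2;
            sum += 4.0f * eps * (s6 * s6 - s6) * bd[j];
        }
    }
    return sum;
}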
Jim Dempsey
