MPI on Larrabee is certainly an interesting idea -- it opens up, in principle, access to a portfolio of existing code, and gets us experimenting with a message-passing model (in contrast to shared-state threading models such as OpenMP or pthreads).
Analogous work is already out there: MPI runs just fine on multi-core CPU systems, typically with one MPI process per core. While modern MPI implementations recognize and exploit the shared memory -- and are thus reasonably efficient at message passing in that context -- there are reports of scaling challenges for typical applications. In particular, MPI codes moving in from clusters are usually designed with large node memories in mind; downsizing those memory expectations to a single "node" presents obvious design constraints. Taking things still further, down to the local memory present on a single Larrabee processor, will amplify those constraints.
I wonder if this is the issue wateenellende had in mind when anticipating what might or might not be "optimal"?
Agree, managing data locality is going to be a challenge for certain workloads, and also agree there are likely lessons to be harvested from the cluster community.
One interesting place to start -- "interesting" in the sense that we'd have a better chance of succeeding -- would be codes that were redesigned to accommodate cluster interconnect bandwidth limitations. Any suggestions?
This application may be interesting: http://openlb.org/
This is the so-called "Open Lattice Boltzmann" package. The lattice Boltzmann method is a relatively new approach to computational fluid dynamics. From a computational standpoint, it is very closely related to cellular automata, except that the calculations are on real (floating-point) numbers, not integers. Each iteration of the "machine" requires two steps: some internal computation per cell, and then an exchange of information with the neighbouring cells.
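To make the two-step structure concrete, here is a minimal sketch of such an iteration on a toy 1-D lattice with two populations per cell. All names and the simplified "equilibrium" are my own illustration, not OpenLB's actual API or physics:

```python
# Toy 1-D lattice with two populations per cell: f_plus (right-moving)
# and f_minus (left-moving). This is an illustrative simplification.

def collide(f_plus, f_minus, omega=1.0):
    """Step 1: purely local computation in each cell (collision)."""
    for i in range(len(f_plus)):
        rho = f_plus[i] + f_minus[i]           # local density
        eq = rho / 2.0                          # toy equilibrium: equal split
        f_plus[i]  += omega * (eq - f_plus[i])  # relax toward equilibrium
        f_minus[i] += omega * (eq - f_minus[i])

def stream(f_plus, f_minus):
    """Step 2: exchange information with the neighbouring cells (streaming)."""
    n = len(f_plus)
    # right-moving populations shift right, left-moving shift left (periodic)
    f_plus[:]  = [f_plus[(i - 1) % n] for i in range(n)]
    f_minus[:] = [f_minus[(i + 1) % n] for i in range(n)]

# One iteration of the "machine": collide, then stream.
f_plus, f_minus = [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]
collide(f_plus, f_minus)
stream(f_plus, f_minus)
```

The parallelization-relevant point is visible even in this toy: the collision step touches only one cell, while the streaming step is the only place where data crosses cell (and hence process) boundaries.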
These cellular automata / lattice Boltzmann applications are almost embarrassingly parallel, so Larrabee would probably be the perfect platform for them. Please do note that I am not involved in this project, so I might have some of the details wrong, but anyway, here goes:
- It is open-source and has an active community, so it's easily available
- There are 3 versions: serial, OpenMP and MPI, so it's usable as a benchmark
In the serial and MPI versions, the two steps of each iteration can be done in one pass over the dataset; with OpenMP, each iteration requires two passes over the dataset. To my knowledge (i.e. according to what I've heard), the OpenMP version is hardly any faster than the serial version, because any computational gain is offset by the increased memory bandwidth usage.
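The one-pass vs. two-pass distinction can be sketched as follows; this is my own toy example (a generic per-cell computation plus a neighbour shift), not OpenLB code, but it shows why the split form sweeps memory twice while the fused form sweeps it once:

```python
def step_two_passes(cells):
    """Split form: two separate sweeps over the dataset."""
    # Pass 1: local computation per cell (placeholder: halve each value)
    tmp = [c * 0.5 for c in cells]
    # Pass 2: neighbour exchange (periodic right shift) -- a second sweep
    n = len(tmp)
    return [tmp[(i - 1) % n] for i in range(n)]

def step_one_pass(cells):
    """Fused form: compute the neighbour's update on the fly, one sweep only."""
    n = len(cells)
    return [cells[(i - 1) % n] * 0.5 for i in range(n)]

# Both organisations produce the same result; they differ only in how
# many times the dataset travels across the memory bus.
result = step_one_pass([1.0, 2.0, 3.0, 4.0])
```

When the working set does not fit in cache, the split form roughly doubles the memory traffic, which matches the report that the OpenMP version's compute gains are eaten by bandwidth.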
The question, then, is: how would the three versions compare on Larrabee?