No doubt using MPI on larrabee isn't "optimal", but given the opportunity to acquire a decent speedup with little effort and at low cost, it would still be interesting.
For example, NVidia's approach - CUDA - is fairly disastrous: everything would need to be rewritten, it's very complicated, and it's not very stable. They promise a great speedup, but it's simply not worth it.
We have code here that we are happy with - maintainable, extensible. We just need to be able to run it as it is. Having a C++ compiler, OpenMP and MPI would make many people here happy.
MPI on Larrabee is certainly an interesting idea -- it opens up, in principle, access to a portfolio of existing code, and gets us experimenting with a message-passing model (in contrast to shared-state threads, such as OpenMP or pthreads).
Analogous work is already out there: MPI runs just fine on multi-core CPU systems, typically with one MPI process per core. While modern MPI implementations recognize and exploit the shared memory - and are thus reasonably efficient at message passing in that context - there are reports of scaling challenges for typical applications. In particular: MPI codes moving in from clusters are usually designed with large node memories in mind; downsizing those memory expectations to a single "node" presents obvious design constraints. Taking things still further, down to the local memory present on a single Larrabee processor, will amplify those constraints.
I wonder if this is the issue wateenellende had in mind when anticipating what might or might not be "optimal"?
If we assume that the data of the program fits into the memory of one node with a larrabee in it, then certainly the larrabee has computational power in abundance. The question is whether we can get the data there fast enough to keep all those resources occupied.
On a cluster, it is already problematic to keep 1 cpu busy due to the memory bandwidth; I would expect it to be 24 or 48 times worse on a larrabee.
I guess this question is equally valid for OpenMP or pthreads.
Agree, managing data locality is going to be a challenge for certain workloads, and also agree there are likely lessons to be harvested from the cluster community.
One interesting place to start -- "interesting" in the sense that we'll have better chance to succeed -- would be codes which were redesigned to accomodate cluster interconnect B/W limitations. Any suggestions?
This application may be interesting: http://openlb.org/
This is the so-called "Open Lattice Boltzmann" package. This lattice Boltzmann thing is a relatively new method of doing fluid dynamics. From a computational standpoint, it is very closely related to cellular automata, except that the calculations are on real numbers, not ints. Each iteration of the "machine" requires 2 steps: some internal computations per cell, and then exchanging information with the neighbour cells.
These cellular automata/lattice Boltzmann applications are almost embarrassingly parallel, so Larrabee would probably be the perfect platform for it. Please do note that I am not involved in this project, so I might have some of the details wrong, but anyway, here goes:
- It is open-source and has an active community, so it's easily available
- There are 3 versions: serial, OpenMP and MPI, so it's usable as a benchmark
In the serial and MPI versions, the 2 steps of each iteration can be done with one pass over the dataset; with OpenMP, each iteration requires 2 passes over the dataset. To my knowledge (i.e. according to what I've heard), the OpenMP version is hardly any faster than the serial version, because any computational gain is offset by the increased memory bandwidth usage.
The question is then, how do/would the 3 versions compare on a Larrabee ?