Re: mpi and larrabee

wateenellende · ‎08-30-2008

Will there be MPI on Larrabee? That is, will Intel's MPI library and compiler, or maybe another platform, allow one to run programs in parallel on the Larrabee architecture?

No doubt using MPI on Larrabee isn't "optimal", but given the opportunity to get a decent speedup with little effort and at low cost, it would still be interesting.

TimP · ‎09-02-2008

The announcement already committed to C (and C++) compiler support. It doesn't say specifically in what way parallelism would be supported, but it looks like OpenMP would have more priority than MPI, in view of the initial market.

wateenellende · ‎09-02-2008

I see. OpenMP is certainly useful, as is a C compiler, but in my work environment (University computer science department) most code is C++ and many things that need to be fast are written for clusters - and hence parallelized with MPI.

For example, NVIDIA's approach - CUDA - is fairly disastrous. Everything would need to be rewritten, and it's very complicated and not very stable. They promise a great speedup, but it's simply not worth it.

We have code here that we are happy with - maintainable, extensible. We just need to be able to run it as it is. Having a C++ compiler, OpenMP and MPI would make many people here happy.

Michael_W_Intel1 · ‎09-16-2008

MPI on Larrabee is certainly an interesting idea -- it opens up, in principle, access to a portfolio of existing code, and gets us experimenting with a message-passing model (in contrast to shared-state threads, such as OpenMP or pthreads).

Analogous work is already out there: MPI runs just fine on multi-core CPU systems, typically with one MPI process per core. While modern MPI implementations recognize and exploit the shared memory - and are thus reasonably efficient at message passing in that context - there are reports of scaling challenges for typical applications. In particular, MPI codes moving in from clusters are usually designed with large node memories in mind; downsizing those memory expectations to a single "node" presents obvious design constraints. Taking things still further, down to the local memory present on a single Larrabee processor, will amplify those constraints.

I wonder if this was the issue wateenellende had in mind when anticipating what might or might not be "optimal"?

wateenellende · ‎09-17-2008

Actually, my first concern regarding MPI on Larrabee was about memory bandwidth. Datasets in simulations that are usually computed on clusters with MPI are definitely larger than the cache. Memory bandwidth is a limiting factor for speed, even if there are quite a lot of calculations per element.

If we assume that the program's data fits into the memory of one node with a Larrabee in it, then the Larrabee certainly has computational power in abundance. The question is whether we can get the data there fast enough to keep all those resources occupied.

On a cluster, it is already problematic to keep one CPU busy because of memory bandwidth; I would expect it to be 24 or 48 times worse on a Larrabee.

I guess this question is equally valid for OpenMP or pthreads.
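
A rough back-of-the-envelope sketch of that worry, in Python. Every number here is a made-up placeholder, not a Larrabee or CPU specification, and the model assumes (pessimistically) that all cores share the same total memory bandwidth that a single CPU had; a real card would likely have more. The function name `sustained_fraction` is just for illustration.

```python
def sustained_fraction(total_bandwidth_gbs, cores, peak_gflops_per_core, flops_per_byte):
    """Fraction of one core's peak compute that its share of memory bandwidth can feed."""
    bandwidth_per_core = total_bandwidth_gbs / cores        # GB/s available to each core
    feedable_gflops = bandwidth_per_core * flops_per_byte   # GFLOP/s the memory can sustain
    return min(1.0, feedable_gflops / peak_gflops_per_core)

# Hypothetical values: 10 GB/s total bandwidth, 10 GFLOP/s peak per core,
# and a kernel that performs 0.5 floating-point operations per byte loaded.
one_core = sustained_fraction(10.0, 1, 10.0, 0.5)
many_core = sustained_fraction(10.0, 24, 10.0, 0.5)

print(f"1 core:   {one_core:.3f} of peak per core")
print(f"24 cores: {many_core:.3f} of peak per core")
print(f"ratio:    {one_core / many_core:.1f}x worse per core")
```

Under these assumptions the per-core starvation scales with the core count, which is the "24 or 48 times worse" estimate. More bandwidth on the card, or more flops per byte through cache reuse, would shrink that factor.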

Michael_W_Intel1 · ‎09-19-2008

Agree, managing data locality is going to be a challenge for certain workloads, and also agree there are likely lessons to be harvested from the cluster community.

One interesting place to start -- "interesting" in the sense that we'll have a better chance to succeed -- would be codes which were redesigned to accommodate cluster interconnect bandwidth limitations. Any suggestions?

wateenellende · ‎09-27-2008

Quoting - MICHAEL WRINN (Intel)

One interesting place to start -- "interesting" in the sense that we'll have a better chance to succeed -- would be codes which were redesigned to accommodate cluster interconnect bandwidth limitations. Any suggestions?

This application may be interesting: http://openlb.org/

This is the so-called "Open Lattice Boltzmann" package. The lattice Boltzmann method is a relatively new way of doing fluid dynamics. From a computational standpoint, it is very closely related to cellular automata, except that the calculations are on real numbers, not integers. Each iteration of the "machine" requires 2 steps: some internal computations per cell, and then exchanging information with the neighbour cells.
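
To make the two steps concrete, here is a minimal sketch in plain Python. It is not OpenLB code: the 1D three-velocity lattice (D1Q3), the BGK relaxation rule, the periodic boundary and the value of `tau` are all assumptions picked to keep the example small.

```python
# D1Q3 lattice: each cell holds three real numbers, one per velocity (rest, right, left).
C = (0, 1, -1)             # lattice velocities
W = (2 / 3, 1 / 6, 1 / 6)  # matching lattice weights

def equilibrium(rho, u):
    """Equilibrium populations for density rho and velocity u."""
    return [w * rho * (1 + 3 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
            for c, w in zip(C, W)]

def collide(cells, tau):
    """Step 1: purely local computation, each cell relaxes towards its equilibrium."""
    out = []
    for f in cells:
        rho = sum(f)
        u = (f[1] - f[2]) / rho
        feq = equilibrium(rho, u)
        out.append([fi - (fi - fe) / tau for fi, fe in zip(f, feq)])
    return out

def stream(cells):
    """Step 2: exchange with neighbours, each population moves one cell along its velocity."""
    n = len(cells)
    out = [[0.0] * 3 for _ in range(n)]
    for x, f in enumerate(cells):
        for i, c in enumerate(C):
            out[(x + c) % n][i] = f[i]  # periodic wrap-around
    return out

def step(cells, tau=0.8):
    return stream(collide(cells, tau))

# Fluid at rest with a small density bump in the middle.
cells = [equilibrium(1.0, 0.0) for _ in range(16)]
cells[8] = equilibrium(1.5, 0.0)

mass_before = sum(sum(f) for f in cells)
for _ in range(20):
    cells = step(cells)
mass_after = sum(sum(f) for f in cells)
print(mass_before, mass_after)  # total mass should be conserved up to rounding
```

The collide step touches only one cell at a time, while the stream step only copies values to nearest neighbours, which is why the method parallelizes so readily.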

These cellular automata/lattice Boltzmann applications are almost embarrassingly parallel, so Larrabee would probably be the perfect platform for them. Please do note that I am not involved in this project, so I might have some of the details wrong, but anyway, here goes:

-It is open-source and has an active community, so it's easily available

-There are 3 versions: serial, OpenMP and MPI, so it's usable as a benchmark

In the serial and MPI versions, the 2 steps of each iteration can be done in one pass over the dataset; with OpenMP, each iteration requires 2 passes over the dataset. To my knowledge (i.e. according to what I've heard) the OpenMP version is hardly any faster than the serial version, because any computational gain is offset by the increased memory bandwidth usage.

The question is then, how do/would the 3 versions compare on a Larrabee?
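
A small Python sketch of what "one pass" versus "two passes" means here. Again this is not OpenLB's implementation; the D1Q3 lattice, the relaxation rule, `TAU` and the function names are assumptions for illustration only.

```python
C = (0, 1, -1)             # lattice velocities (rest, right, left)
W = (2 / 3, 1 / 6, 1 / 6)  # lattice weights
TAU = 0.8                  # relaxation time, arbitrary illustrative value

def relax(f):
    """Local per-cell computation (BGK-style relaxation)."""
    rho = sum(f)
    u = (f[1] - f[2]) / rho
    return [fi - (fi - w * rho * (1 + 3 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)) / TAU
            for fi, c, w in zip(f, C, W)]

def two_pass(cells):
    """Sweep the whole dataset twice: first compute, then exchange."""
    n = len(cells)
    collided = [relax(f) for f in cells]      # pass 1: read and write every cell
    out = [[0.0] * 3 for _ in range(n)]
    for x in range(n):                        # pass 2: read and write every cell again
        for i, c in enumerate(C):
            out[(x + c) % n][i] = collided[x][i]
    return out

def one_pass(cells):
    """Fused sweep: compute a cell and push its results to the neighbours at once."""
    n = len(cells)
    out = [[0.0] * 3 for _ in range(n)]       # second buffer, so no cell is overwritten early
    for x in range(n):
        post = relax(cells[x])
        for i, c in enumerate(C):
            out[(x + c) % n][i] = post[i]
    return out

cells = [[w * (1.0 + 0.1 * (x % 4)) for w in W] for x in range(12)]
a = two_pass(cells)
b = one_pass(cells)
print(a == b)
```

Both variants compute the same thing, but the two-pass one streams the dataset through memory twice per iteration, so on a bandwidth-bound machine it does roughly double the memory traffic for the same arithmetic.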

Michael_W_Intel1 · ‎10-30-2008

Thanks! OpenLB is indeed an interesting case; I've finally got some bandwidth to look into this, compiling the codes on some manycore systems here...

decapsi · ‎04-14-2009

Quoting - wateenellende

Will there be MPI on Larrabee? That is, will Intel's MPI library and compiler, or maybe another platform, allow one to run programs in parallel on the Larrabee architecture?

No doubt using MPI on Larrabee isn't "optimal", but given the opportunity to get a decent speedup with little effort and at low cost, it would still be interesting.

Hi everyone. I just wanted to add my interest in getting MPI working on GPGPU chips like Larrabee. For fluid dynamics, being able to take existing C code with an MPI implementation and compile it easily would be amazing. Also, when it comes to benchmarking runs: since GPGPU is so new and the memory allocation issues are harder to grasp, it would be really great if somebody released a compact benchmark utility that shows where your bottlenecks are after running code. That way you could more easily tell whether your wall-clock times would improve with lower-resolution data sets etc., or more generally how to tweak your input and scenarios to best use your GPGPU's potential.